Dev Diary #6: Dealing with Catastrophic Failure
Hi everyone, this is Ian. I’m the lead designer for Tripping Whale and Duple Dragon.
If you saw our email last month, you already know that something terrible happened on the Duple Dragon project. When we sent that email the problem was ongoing; we didn’t have the full picture yet so we didn’t feel ready to tell you about it. We’ve fixed things up now, so it’s time to share the gory details with you.
tl;dr: Skip to the end to see how this impacts our release.
Look for the terminology boxes like this one where I explain some key terms as they come up in the story.
Timeline of a Disaster
Setting the Stage
Terminology – Unity, Game Engine:
Duple Dragon is made in Unity, a “game engine” used by many developers. A game engine is a specialized tool that gives you quick access to many of the common needs for a game like displaying graphics, playing audio, and receiving player input.
Terminology – Builds, Building:
A “build” is a version of the game. Android gets a build, iOS gets a build. “Build” is the noun, and “building” the game is the verb, meaning to create a build.
November 4, 6:35 PM – Project is updated to a new version of Unity to enable support for Android 11 phones. Strange errors that break phone builds of the game start appearing. We don’t think anything of them yet because there are lots of bugs left to fix that could be causing them. Spoiler: they weren’t.
November 11, 12:23 PM – We email our followers announcing our release date of February 1.
There’s Something in the Water
November 25, 2:42 PM – Caleb discovers the new version of Unity we’ve been working with for nearly a month has known issues creating phone builds. This creates a dilemma, as there would be a lot of lost work if we reverted the project to the old version from November 3. Caleb begins looking for a workaround.
December 1, 9:45 AM – After days of trying to solve the issues, Caleb notices a growing number of “null references” in files throughout the project. A null reference usually means that a file is trying to access something that is corrupt or missing.
December 1, 2:00 PM – Caleb determines that his copy of the project is in a very broken state and reaches out to see if anyone has a more intact copy. By dumb luck I hadn’t downloaded the latest updates to the project yet, and mine was still working. This will prove to be a critical piece of recovering the project later on. I zip the entire project and send it to Caleb. It works!
December 2, 5:14 PM – It doesn’t work. Many issues are fixed using my older copy of the project but the project still won’t build. We realize the project could have been decaying for longer than we thought. Caleb takes the weekend to deliberate on how we can salvage things.
Making more changes to the project will only make things worse, and Souren and I are forced to wait and hope Caleb can find an automated solution. Things look bad, and we mentally prepare ourselves to redo most of our work from November.
The Water is Poison and Must Be Purged
Terminology – Prefabs, Nested Prefabs:
A core feature in Unity, a Prefab is a collection of objects that can be reused, such as player characters or items to collect. By making something a prefab, you can make updates to it in one place and those changes will automatically apply to all the copies of that prefab. Nearly everything in the game is a Prefab.
A Prefab can also be put inside another prefab, which is called nesting prefabs. Our menus and dialog boxes are all prefabs, and the buttons, text, and borders used in them are also prefabs.
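For readers who code, here is a toy Python model of the idea. It is only a sketch and not Unity's actual API: the `Prefab` class and the names `button`, `confirm_dialog`, and `pause_menu` are made up for illustration. It shows why nesting is so handy: one edit to the shared button shows up in every menu that contains it.

```python
from dataclasses import dataclass, field


@dataclass
class Prefab:
    """Toy stand-in for a prefab: a named bundle of settings plus nested prefabs."""
    name: str
    settings: dict = field(default_factory=dict)
    children: list = field(default_factory=list)  # nested prefabs, held by reference


# One shared button prefab, nested inside two different menu prefabs.
button = Prefab("Button", {"color": "blue"})
confirm_dialog = Prefab("ConfirmDialog", children=[button])
pause_menu = Prefab("PauseMenu", children=[button])

# Edit the source prefab once...
button.settings["color"] = "green"

# ...and both parents see the change, because they point at the same object.
colors = [parent.children[0].settings["color"] for parent in (confirm_dialog, pause_menu)]
print(colors)  # ['green', 'green']
```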
December 3, 11:35 AM – Caleb realizes the decaying state of the project is due to a major bug in Unity’s Prefabs, specifically Nested Prefabs. Despite nesting being officially supported by Unity, nested prefabs can sometimes lose track of themselves. When the game tries to load them, this creates a null reference and the game crashes.
With a problem this fundamental, how did we not notice it sooner? Because of another insidious part of the problem: caches.
Terminology – Cache:
Used in software, web browsers, and apps across the world, a cache is a temporary file that saves the state of other files on a per-user basis. Every time you open a website your browser “caches” all of the images and content on that page. This speeds up reloading that page later because you don’t have to download all of it again.
Unity caches the state of most files in your project to speed up loading times. That’s a good thing, but it can also create problems because Unity always loads a cache if one exists instead of the real file. If something changes in the original file but not in the cache, you’ll start to see issues arise. In our case the only thing keeping the nested prefabs in our project working was the caches. As soon as those caches go away (which always happens when you build the project), the whole thing falls apart. To make matters worse, the cached files aren’t complete versions of the files they represent, so it’s impossible to resurrect working versions of the files from the caches.
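The trap described above can be sketched in a few lines of Python. This is a simplified illustration, not how Unity's cache actually works internally: the `AssetLoader` class, its methods, and the `"Button"` file are all hypothetical. The point is only that a cache-first loader keeps serving a good copy after the real file has broken, and the failure appears the moment the cache is cleared.

```python
class AssetLoader:
    """Toy loader that prefers a cached copy over the real file."""

    def __init__(self, files):
        self.files = files  # name -> contents; None stands for a broken file
        self.cache = {}

    def load(self, name):
        # Cache first: if a cached copy exists, the real file is never checked.
        if name in self.cache:
            return self.cache[name]
        contents = self.files.get(name)
        if contents is None:
            raise LookupError(f"null reference: {name} is missing or broken")
        self.cache[name] = contents
        return contents

    def clear_cache(self):
        self.cache.clear()


files = {"Button": "button data"}
loader = AssetLoader(files)

loader.load("Button")                # first load fills the cache
files["Button"] = None               # the real file silently breaks
still_works = loader.load("Button")  # served from the cache, so nothing looks wrong

loader.clear_cache()                 # stands in for the cache reset on a build
try:
    loader.load("Button")
    failed_after_clear = False
except LookupError:
    failed_after_clear = True

print(still_works, failed_after_clear)  # button data True
```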
As if this weren’t bad enough, Caleb strongly suspects the project has been invisibly and silently falling apart for several months. Even if we could find an intact version of the project in our version history, we would lose months of work by reverting to it. He decides the best course of action is to build a custom tool that attempts to repair all the broken references in the nested prefabs, of which there are almost a thousand. It won’t be easy, it’ll take some time, and no more work can be done in Unity until the project is repaired.
December 5, 1:27 PM – The recovery tool has been built and a first pass with it is done. There will still be a lot of things to fix by hand, and with Caleb’s go-ahead I begin to assess the damage.
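To give a feel for what "broken references" means in practice, here is a minimal Python sketch of just the detection step: scanning a project for references that point at nothing. This is not Caleb's tool, and it does not reflect how Unity stores prefabs. The `find_broken_references` function and the tiny `project` dictionary are assumptions invented for the example, and the actual repair work was far harder than this.

```python
def find_broken_references(prefabs):
    """Return (parent, missing_child) pairs for every reference that resolves to nothing."""
    broken = []
    for parent_id, child_ids in prefabs.items():
        for child_id in child_ids:
            if child_id not in prefabs:
                broken.append((parent_id, child_id))
    return broken


# Hypothetical project: each prefab lists the prefabs nested inside it.
project = {
    "PauseMenu": ["Button", "Border"],
    "ConfirmDialog": ["Button", "TitleText"],
    "Button": [],
    "Border": [],
    # "TitleText" has gone missing, so ConfirmDialog now holds a dangling reference.
}

broken = find_broken_references(project)
print(broken)  # [('ConfirmDialog', 'TitleText')]
```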
December 5, 6:36 PM – I discover that all nested prefabs have become “unpacked”. Unpacking a prefab means it’s no longer a prefab. It still has all the same objects and settings, but is no longer connected to the source prefab. This is a huge problem. The main benefit of using prefabs, that you only need to make changes in one place to change many things, doesn’t work on unpacked prefabs.
If the recovery tool can’t be improved to fix this, we’ll have to rebuild nearly every prefab from scratch. Imagine you have several hundred LEGO sets. Someone took them all apart and mixed the pieces up. You have the pieces, you have the instructions, but now you have to figure out what goes where and rebuild the LEGO sets how they were before. This should give you a sense of the problem I was facing. I relay the bad news to Caleb and he goes back to the drawing board on improving the recovery tool.
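In code terms, an unpacked prefab behaves like an independent copy rather than a link. The short Python sketch below is an analogy only, with made-up names (`source_button`, `linked`, `unpacked`), and is not how Unity represents prefabs: the copy starts out identical but stops following the source.

```python
import copy

source_button = {"color": "blue", "label": "OK"}

linked = source_button                    # still connected to the source
unpacked = copy.deepcopy(source_button)   # same settings, but the connection is gone

# Change the source in one place.
source_button["color"] = "green"

print(linked["color"])    # green: the linked one follows the source
print(unpacked["color"])  # blue: the unpacked one is stuck with the old value
```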
The Long and Winding Road to Recovery
December 15, 12:51 PM – After stressing over the problem for days while balancing a day job, Caleb has done what seemed impossible. Connections to nested prefabs have been restored and nearly everything is working. Several hours of manual cleanup lie ahead, but the lion’s share has been taken care of by the recovery tool. Caleb tells me this tool was the hardest piece of programming he has ever done, and he’s done a lot.
Souren and I take over for the manual cleanup. We go through every object in the project and cross-reference its settings with the equivalent object in the old, broken copy. It’s a painstaking process, but nothing compared to what it could have been. It’s a relief for both of us to be able to work again.
December 22, 6:09 PM – Souren and I finish fixing all objects in the project and the rebuild is complete. There is much rejoicing!
Back on Track
It can’t be overstated how critical Caleb, our programmer, was to getting us back on track. He spent dozens of hours creating and debugging a custom-built tool that would painstakingly salvage and rebuild as much of the project as possible. Without this we would have been looking at literally months of work rebuilding everything by hand.
The project is back to working order, but it took us about a month to get it that way. Most other work on the game had to be paused during this time to avoid causing more damage, which means we lost all that development time two months before our intended release date.
We’ve created systems to prevent this nightmare from ever happening again, but the damage is done. We aren’t getting that time back, and we’ve come to terms with the fact that we won’t be able to make our release date as planned. That brings us to our next announcement…
Early Access and Delayed Release
While we’re back to making steady progress, we won’t be able to release on February 1 anymore. Instead, we’re planning to release the game in early access on March 1, followed by a full release a month later on April 1.
This new early access period will be used to fill out the variety in the gameplay as well as fix any lingering bugs that may show up with a larger player base.
It’s been a tumultuous few months for us, but the end is in sight. We can’t wait for you to finally get your hands on the game soon.
– Caleb, Ian, and Souren