This is the true story of the app that was fixed by a crash. We travel back to join fish mid-debugging:
I'm dumbfounded. The thread is just gone. And not just any thread: the main thread.
I don't even know how you make the main thread go away. exit() gets rid of the whole app. Will pthread_exit() do it? Maybe it's some crazy exception thing?
See, I've got a bug report. Surf's Up: Epic Waves doesn't work on this prelease OS. Last year, the game giddily ran through its half-dozen splash screens before landing on its main menu, full of surf boards and power chords. Today, it just sits there silently, showing nothing but black, getting all aggro. The game hasn't changed, but the OS did, and I've got to figure out what's different.
So far this is familiar territory. A black screen could be any old hang. I break out sample, crack open the Surfing carapace, peer inside. It's a big app, with over a dozen threads each doing their separate thing. Nothing unusual at all; in particular nothing that looks like a hang. Except...wait, where's the main thread? It's missing. There is NO MAIN THREAD.
I don't know that the main thread's disappearance is connected to the hang, but it's all I have to go on, so let's investigate.
Does the main thread just dry up and fall off, a sort of programmatic umbilical cord? Maybe this program doesn't need it, instead cobbling together an assortment of other threads into working software.
Or maybe it's violent. Maybe the app is forcibly decapitated but still running, unaware of its state, like some lumbering twelve-legged headless chicken, flapping uselessly, bumping into things.
This is easy to check. I boot it up on last year's OS and take a sample. The main thread is there! So it's decapitation. Now we just have to find the butcher.
My debugging toolbox overfloweth. I start the app in gdb - ah, no, it doesn't like that. Complains and crashes.
No problem. I try dtrace - complains and crashes.
One by one, my tools are blunted against the deceptively upbeat Surf's Up. Maybe it intentionally defeats debugging, maybe it's just some accident of design; either way it's nearly impenetrable. Only sample works. So I take sample after sample, slogging through them, looking for something, anything, to explain the empty black screen, which now seems to be growing, filling my vision...
I head home, frustated.
The next day came with no new ideas, so I decide to consult with my imaginary go-to guy. I close my eyes and picture him: the scruffy beard, the sardonic smirk...
"You're an idiot," says House by way of hello. I think it over, but it doesn't help. I say "The main thread is not calling exit, it's not returning, it's not crashing, so where is it going?"
House says, "You don't know that. That's just what it's telling you. And like I said, you're an idiot, because you believe it. Everybody lies."
"The patient," says House, "starts out fine. And then, from nowhere, WHACK! Off with its head! But the body keeps on going. Pretty cool!"
House continues, "But how is that possible? Most patients don't survive headless. Somehow, this one does, and we need to figure out how. We need to catch it in the act."
"And how would I do that?" I muse.
"Chop off its head."
It's darkly ingenious. I don't have the code for this app, and most of my debugging tools are neutered, but there is one inescapable dependency, one avenue into its underbelly: the OS itself. That's code that I control. I can make the OS treacherous, turn it against the app.
Let's try to crash the main thread and see what happens. I check out the source for a dynamic library the app uses, choose my line, and add a dose of poison:
*(int*)NULL = 0;
I try my weaponized library against a hapless guinea pig, and it dies immediately. So far so good. I set the framework path, and launch Surf's Up...
Power chords! My NULL dereference was medicine, not poison!
I feel my sanity begin to give. "Crashing the app, fixes the app? As if asking for help from an imaginary version of a fake doctor wasn't crazy enough. What does this mean?"
"The head can't fall off," House answers, "if it's already gone."
"You mean, our crash somehow pre-empted a different, bloodier crash? But the app doesn't report any crashes at all."
"Everybody lies," says House.
Hmm. Perhaps the app is surviving through setting a signal handler? It could elide a crash that way, catch it and continue on. If so, then clearing the signal handler should make the app visibly crash, like every other app. I change my weaponized library to clear the signal handler for SIGBUS. Surf's Up launches, then once again hangs at a black screen. No crash.
"It didn't work. I don't think the app is really crashing," I say.
"Everybody lies," repeats House.
I change the code to loop, to reset the handler for ALL signals. I start up the program. Black screen.
"This isn't it," I say. But I'm out of ideas.
"EVERYBODY LIES" insists House.
Desperate, frantic, I change the code to spawn a thread that does nothing except loop, loop, loop, constantly resetting all signal handlers. Then I launch the app. Black screen. Try again. Black. Again. Black. Again.
It seems simple in retrospect. Surf's Up went to heroics to swallow signals: its background threads continually set signal handlers that cause only the signalling thread to exit, but leave the remaining threads unharmed. And the app didn't really need its main thread after all.
Except when it did, because when the main thread crashed of its own accord, it did so while holding a lock. And when other threads tried to acquire that lock, they got stuck, because the thread which was supposed to release it was gone. The lock was forever locked! But the problem could be averted by stopping the main thread preemptively. And the simplest way to stop it was to trigger the app's own murderous machinery: dereference NULL.
The crash itself was just a mundane, run-of-the-mill regression, one that was already known and already being investigated. But the signal shenanigans of the app manifested the crash in a totally different way, and its debugging defenses hid its misdeeds. Two minutes thus became two days, but the end result was the same.
Surf's Up was treacherous, smiling as it swallowed bus errors and segmentation faults, as if nothing had gone wrong. fish responded to treachery in kind by temporarily recruiting the OS itself to aid in debugging and ultimately defeat the app's defenses. But these same defenses had a remarkable consequence: attempts to trigger a crash instead precluded a hang. One app's poison is another app's surf boards and power chords.
(Discussion threads on reddit and Hacker News)