Let It Crash the Right Way (2009)

erikb · on May 5, 2015

I'd say in a test environment you should let it crash a lot. But in production, facing users (which can be another programmer, who suddenly forgot that he is a programmer and knows what you are dealing with) there are few situations were you should really crash.

Think about a very dumb user application like Word. Let's say word crashed with a Stacktrace, some log messages, the best error message ever, maybe even saving the user data in a backup file somewhere. The user will think Word is unusable.

Now let's consider a scenario in which Word recovered under the hood by losing the users data but never losing the GUI. The user still thinks it's bad, but he won't think it's unusable. To some degree he will blame himself for not saving the data himself.

That's why I think I would be way less willing to crash in front of the user and would apply Pokemon Error Handling (Gotta Catch them All) in all user facing software as a feature, not a bug.

SEMW · on May 5, 2015

In the context of the blog post - Erlang - "Let it crash" almost never means "let the entire application crash". It means "let the current [micro-]process crash, then its supervisor will restart it in a known-good state" (and if the supervisor crashes then its supervisor will restart it, and so on).

davidw · on May 5, 2015

> and if the supervisor crashes then its supervisor will restart it, and so on

Enough of this and it will crash the node. You need to design for this in an Erlang system.

The ever-helpful jlouis has some useful writing on the subject: http://jlouisramblings.blogspot.it/2010/11/on-erlang-state-a...

As well as these: https://github.com/jlouis/fuse

Sadly, this is not discussed as much as it should be in Erlang land.

bontoJR · on May 5, 2015

I agree. AKKA is doing the same in the JVM. Letting an app to crash doesn't mean "crash the whole system", but just a process. It's very handy when dealing with a lot of parallelism and asynchronous code, sometimes is better leave a process to crash instead of trying recover it with error handling. Obviously is not always the case, but for some processes, the crash and a later recover by the system is a better solution in terms of performances, system state and logic.

erikb · on May 5, 2015

That's very interesting information. The article itself doesn't say much Erlang specific so I didn't even put it in that context. But you are right, the article is an Erlang article. Thanks for the Erlang lesson!

brudgers · on May 5, 2015

For context, Erlang is designed for building highly reliable fault tolerant systems, things like telecom switches that may run for many years without rebooting or powering down or failing. "Fail" like "error" and "fault" have specific technical meanings when talking about such systems. A fault causes the system to enter a problematic state, the problematic state is an error, if the problematic state causes the system not to perform its intended function then the system fails.

What the article is getting at is adding an error handler that's not in the specification keeps the system in an unspecified state [or transitions the system into a different unspecified state depending on how you look at it].

Well, I guess there's another piece of context: it's probably best to treat systems as algebraically closed in that combining two systems produces another system and so forth [1]. This means that when a specific system crashes (and "crashing" is not a technical term in the context) the larger system may be designed to handle the error. With proper encapsulation, there's no way for the crashed system to "know" what's going on above it or what's best...without a specification.

In other words, the assumption for such systems is that if a system is supposed to handle an error then the specification would say so. High reliability fault tolerant systems aren't built with cowboy coding. The software expresses the design and while some systems are such that failing results in failing to delight someone with a piano playing cat, other systems are such that when they fail 911 calls are missed and people die. In those cases, the author is suggesting when you don't know, don't guess.

[1]: 52. Systems have sub-systems and sub-systems have sub- systems and so on ad infinitum - which is why we're always starting over. --Alan J. Perlis

erikb · on May 5, 2015

Yes, we agree that adding unspecified handling results in the system being in an unspecified state. It's also a good thinking that every system is a sub system of another system, so at some point a crash is meaningful because it tells the higher level system that something went wrong and the higher level system should know that so that it can start to handle the error.

The problem space which I like to add is the scenario in which the higher level system is a human.

In a sense humans are like old code, they do odd things, are unable to specify their expectations and sometimes prioritize expectations that are utterly wrong. But we as smart programmers, who want to succeed with our program, can not be satisfied with "scientifically correct" code. We need to produce code that humans use. Therefore we can not always crash when entering unspecified states. That's not "cowboy coding", that's handling the imperfection of the higher level system. Sometimes the additional error handler is good, because the crash itself is worse than an undefined state.

davidw · on May 5, 2015

The idea is not to crash the entire program, but the part of it that has gotten itself into a bad state. Look at the Cowboy web server for instance: it doesn't just fall over in its entirety if a request handler barfs for some reason.

You'd want the UI to not crash, and show a nice error message:

http://joearms.github.io/2013/04/28/Fail-fast-noisely-and-po...

erikb · on May 6, 2015

Okay, but that's exception and exception handling not crashing. Crashing means the whole program dies. I can understand if the term "crash" is defined differently in Erlang world, but I don't understand why Erlang citizens don't know that the rest of the world calls that "exception" and that "exception handling" is implemented in many high level languages by default, doing exactly as the article explains you should be doing it. I mean, Erlang people also know other languages, right? I think there are more Erlang experts who know C++ than C++ experts who know Erlang.

davidw · on May 6, 2015

Erlang has exception handling too, try/catch kinds of constructs. It's different, though, than what Erlang does with its "let it crash" philosophy.

http://www.erlang.org/download/armstrong_thesis_2003.pdf

Specifically, the bits about "Let it crash".

brudgers · on May 5, 2015

A stacktrace passed to a Word User is exactly the problem. The error was handled at the lower level system which includes a stack rather than at the higher level composite system that involves people. If the low level error hadn't been handled locally, a more user friendly level of the system could have made a more user friendly level presentation of the information regarding the failure.

Because the kernel code cowboys ignored the UX city slickers, the system failure is in a sense, double. Not only does it stop letting the user type words into boxes, but then it also fails by presenting the user with a bunch of useless gobbledy-gook. It would have been better if they had asked the people working on the higher level abstractions, "Should we handle <this type> of error and if so how?" Or with wonderfully non-technical ambiguity, asking, "What should we do when we crash?"

If the error is handled locally, there's no chance that a higher level can handle it more gracefully. What causes a low level system programmer to freak out, may be nothing if the higher level system is redundant in regard to the low level system. A system shouldn't assume it knows what is catastrophic in the large. A message stating, "I have an error" is the place to start.

pjc50 · on May 5, 2015

Word recovered under the hood by losing the users data but never losing the GUI. The user still thinks it's bad, but he won't think it's unusable. To some degree he will blame himself for not saving the data himself.

You've never done user support, have you? Almost everyone would prefer "UI freaks out for a bit, restarts with no lost work" to "invisible loss of work".

Edit: I've just remembered while I was arranging my mortgage, the customer rep was using some sort of Windows native LOB application. It popped up a failure dialog box with some hex in it .. which she closed and carried on. Users don't care about quality unless and until it stops them doing what they're doing or loses work.

erikb · on May 5, 2015

I'm writing code that is used by the people next door in the same office. So while I haven't worked in user support in some regard it's about 95% of my work. So what I write is really experience directly from the users. Some users might be different though. Thinking about my mother/grandfather etc, they also hate crashes, though. "UI freaks out and then restarts" is definitely better than losing data, but the word "crash" for me means not to restart. If Word crashes it doesn't come back. You are left with your desktop and an error message if at all.

lostcolony · on May 5, 2015

In a word processor, you should be persisting the current state, so long as it appears good, in the background to a temp file anyway; a crash that leads to data loss should result in reloading from that temp file, either automatically or through asking the user if they'd like to. Just because you have the philosophy of 'let it crash' (which is per process, not the entire application, unless a process is unable to recover) doesn't change that.

Which is exactly what Word does. https://support.microsoft.com/en-us/kb/316951

erikb · on May 5, 2015

Okay I've read many comments like yours, apparently for some reason "crash" here obviously means "per process" not "per program" which I didn't catch somehow.

lostcolony · on May 5, 2015

No worries. Yeah, it's an Erlang specific approach. The app itself only crashes if processes thrash sufficiently to percolate all the way up (supervisors form hierarchies; if a process underneath a supervisor crashes the supervisor restarts it; if it crashes enough, the supervisor crashes; if that supervisor crashes enough, it triggers its own supervisor, etc, all the way up to the top level supervisor which defines the app, and if -that- one goes down the app comes down, as its viewed as an unrecoverable failure).

You still run the risk of losing state internal to the process if it's not persisted, and as a dev you have to deal with that, and ensure that when a process restarts it restarts into a good state, but it proves a really effective model.

erikb · on May 6, 2015

Btw. I now think the Erlang term "crash" might mean the same thing as, e.g. the Java/Python/C++ term "Exception". Do you know about that? Is it really as similar as I think?

Because it does exactly what the article says: You experience some unexpected system state, stop what you are doing, raise an Exception with an error type and error message and let a higher level element do something about it, which might include a restart of the subsystem if so desired.

lostcolony · on May 6, 2015

I do, but no, exceptions aren't equivalent.

Erlang has the idea of exceptions. You can explicitly throw them, as well as certain situations generating them. They can be caught, using the familiar try/catch paradigm (well, somewhat familiar; it's used a little differently, with a slightly different syntax, but basically the same purpose).

It also has the idea of an exit signal, which will kill a process regardless (unless that process is trapping exit signals, because you want to do something prior to exiting, or prevent it, or whathaveyou).

A crash is what happens when you don't catch the exception/exit. Same as in other languages. -But-, in Erlang, it only kills the lightweight process (the term 'green process' has been used, akin to a green thread, except no state is shared between them), rather than taking down the VM.

In Java/etc, when you have an exception, if you don't catch it, it percolates back up the stack to the top and takes down your application; the execution of -everything- is ended. You might have a watcher process that restarts the app, but you took out the whole app. In the event of an unexpected exception, then, you either killed the app, or have a catch(Exception) somewhere, which is widely viewed as an antipattern, and which you really don't know how to recover from.

In Erlang, when you have an exception, if you don't catch it, it merely kills the (green) process, which if not intended to be ephemeral, should be supervised, linked to any process it implicitly shares state with (linking causes a dying process to send out exit signals to the linked process, so they die and restart too) and have a well written initialization step that will read from any persisted 'good' state, and bring it back to that state. So you get back into a good state having only taken down the bit that actually went wrong, rather than everything. The other processes are unaffected; no memory was shared between them, so they can continue on their merry way. They might have been expecting a message from the process that died, but if so, they'll be waiting on a timeout, and should either retry sending it, or themselves die and restart (message passing makes no guarantee of delivery, so you have to deal with the possibility that messages won't arrive; this is really, really useful in distributed scenarios, as it means there is no difference in sending a message locally, and sending a message remotely).

In the Java case, you try your best not to let exceptions go uncaught, because the user experience is terrible if you do. But it's easy to miss things that can cause exceptions. And when you catch them, it may not be clear what to do with them anyway, so you end up writing a bunch of error handling code runs the risk of keeping you in a bad ambiguous state, that leads to hard to debug issues further down the line.

In the Erlang case, you treat everything as being able to die at any time, making sure your supervisor hierarchy is sensible, and making sure every process reinitializes properly in the event of its death. After that, you worry about catching only those exceptions you expect to happen and have a well defined way of dealing with (which is usually very few); any other causes the state to be dropped and that process to be restarted. You get a log of it, you can decide you want to explicitly handle that case, because recovery from it is well defined, or you can just shrug and move on because it truly was just a hiccup, and you don't need to address it; either way, the process is back in a good state, and the system is able to continue along with barely a blip.

EDIT: A few links to help explain it (as it's a different programming model than Java/C++/Python/etc, and understanding how the philosophy applies requires understanding the model) -

http://www.slideshare.net/dnene/actors-fault-tolerance-and-o...

http://stackoverflow.com/questions/3172542/are-erlang-otp-me...

Xophmeister · on May 5, 2015

In a big GUI-style app, one would expect the UI layer to be separate from numerous other subsystems, all of which would be managed by the main process. From a UX point-of-view, I agree that it's important to keep the UI up and responsive. However, if some other subsystem were to fail in an unexpected way, it should be allowed to; with the caveat that the "user" who experiences the error becomes the main process, rather than the human user, which can then deal with it appropriately (log the trace, restart the subsystem, etc.) As far as the human user is concerned -- providing a critical mass of subsystems stay up -- everything keeps chugging along.

erikb · on May 5, 2015

Totally agree with that one. If I have a program, the program keeps running, but a subsystem dies, for me that simply wasn't "crashing". I'm unclear about why most people understand that "crashing" does not mean the whole program dies. I have only learned the word in that context. If a subprocess/thread/function dies on me because of an error that's not a "crash" in my terms.

thaumaturgy · on May 5, 2015

I have trouble imagining very many situations where crashing is the right thing to do.

Program gets bad or unexpected input (parsers, tokenizers, renderers, data processing): parse or process as much valid input as possible, then set a helpful error condition and return.

Program encounters a resource constraint (malloc failed, can't get a lock, no socket available, permissions problem): set a helpful error condition and return.

Something really went bugnuts (kernel panic): crash, but as gracefully as possible.

Raising exceptions is an OK way to transmit errors up the call chain, although it's not my personal favorite, but allowing unhandled exceptions to find their way up the call chain doesn't seem like a great idea to me.

edit: I went back and re-re-read the post. His points here seem reasonable:

> Do I know how I’m supposed to handle an error here?

> If not, then should I handle it? (Thus going back to specification)

etc., but I don't understand the code example at the end; I'd think it would be good to have more precise information about what failed and where and why, rather than omitting that exception handling and letting a parent process or something else deal with it further up the chain where the cause is far more opaque.

lostcolony · on May 5, 2015

Not having that catchall will trigger a "badmatch", which will cause the process to crash. The Erlang VM will log it to the error log. This isn't a "throw/catch", where the caller will catch it and have to do something; the process this happens in -will- die (and the parent may timeout in waiting for a response).

But, it's an Erlang process; it should be supervised (or separately linked or monitored, but that's neither here nor there). If it is, it will be restarted in a known good state (or at least; should be. That's the guarantee the developer needs to work towards in an Erlang process).

So given that, the outcome is that you get a log of what happened (and it stands out), you wrote less code, and you return to a known good state.

The outcome of the catch all is you have to explicitly log what happened, it is very easy to miss, and your process/system is left running in a possibly ambiguous state (since either that message should have been handled, or it never should have been sent).

Now, that said, dealing with bad user data and sending them a useful error message -is- something that requires some work, some sussing out of what went wrong, yes.

Talking to external resources (as alluded to elsewhere in mentioning jlouis' Fuse library) -is- something you want to limit crashes on (since supervisors are built as a hierarchy, and just because a piece of hardware is temporarily unavailable, you shouldn't crash so often that the system comes down).

Malloc...you probably still want to crash on. Crashes will percolate up the supervisor hierarchy, until eventually hitting the highest level one, effectively dropping all memory and resetting to a good state. If that doesn't get you to a good state, there -is- no way to get there without manual intervention.

pjc50 · on May 5, 2015

Program encounters a resource constraint (malloc failed, can't get a lock, no socket available, permissions problem): set a helpful error condition and return

Return to what? If malloc has failed then it's quite likely that large areas of the program can no longer work. If you get a lock timeout then your program is deadlocked and cannot continue.

The distinction that matters is anticipated errors versus unanticipated errors, and the size of the middle chunk of "anticipated, but rare and we don't have time to care about". If you have an unanticipated error you're in the land of unknown unknowns. At that point you have to decide whether it's more important to try to preserve unsaved program state (which may be corrupt) versus the chances of further wrong action (corrupting persistent storage, bad output, physical or electrical damage in some systems).

This is where "continuous saving" and journaling-orientated persistent data structures come in handy. You can crash the app and recover its transient state from persistent state without losing anything.

stinos · on May 5, 2015

allowing unhandled exceptions to find their way up the call chain doesn't seem like a great idea to me

Depends on the situation. For instance: I usually do thorough checks on validity of arguments etc. If invalid/null ptr passed/etc (which then means there must a bug somewhere) some specific exception is thrown and this makes it all the way to the global unhandled exception handler which then take appropriate actions (dialog, log, kill application). None of the callers which are passing the invalid argument along know how to handle the error else the error wouldn't occur in the first place. I'm not sure there is a more appropriate way to handle this in eg C#?

polack · on May 5, 2015

Not having that catch all case in the example will make any unspecified calls to crash. That will log the state, received call and so on, so there is no needed to catch it for logging. The whole point of having a supervisor higher up the chain to handle these "unknown" errors is that you can never know what errors will occur, so having a generic way to reset the part that crashed to the last known good state is freaking awesome if you are building fault-tolerant systems.

arielb1 · on May 5, 2015

In many contexts I would prefer a program to not try to recover (and possibly cause a mess) on unexpected input. Of course if you have a big process it may be a good idea to limit the scope of a failure to a smaller level.

jimbobimbo · on May 5, 2015

Like OP mentioned, all that boils down to the spec. It's easy to imagine that Word will have the crash handling behavior specced (and it does - someone in the thread linked to the KB article about that) and it will be different from some other application. In case of web applications, for example, there could be a global customizable generic exception handler that will provide some user friendly message, while collecting loads of data that will help devs to diagnose the problem.

angersock · on May 5, 2015

Funny example.

So, Wings3D is actually written in Erlang. It's a quite nice and minimalist 3D editing program, and it behaves exactly how you describe: some operations may cause a crash, and Wings lives and your model dies.

This is not a large consolation.

danielsmmr · on May 5, 2015

Please do not write software, thanks.

jacquesm · on May 8, 2015

Please do not be a jerk in this forum.

snarfy · on May 5, 2015

"We all know the saying it’s better to ask for forgiveness than permission. And everyone knows that, but I think there is a corollary: If everyone is trying to prevent error, it screws things up. It’s better to fix problems than to prevent them. And the natural tendency for managers is to try and prevent error and overplan things."

— Ed Catmull, President of Pixar

https://signalvnoise.com/posts/2440-we-all-know-the-saying-i...

jakejake · on May 5, 2015

Please excuse the self promotion but if you like this topic I wrote a related post "Insidious Bugs or: How I Stopped Worrying and Learned to Love Exceptions" http://verysimple.com/2009/04/03/insidious-bugs/

It does seem like a lot of people are afraid to let a user see an error, but if you're not extremely careful then attempts to shield the user can mask bugs.

jimbobimbo · on May 5, 2015

Thank you for posting this! I'll be sure do share the link with my team: a lot of people are looking at exceptions as they're some kind of voodoo that needs to be handled explicitly in every nook and cranny of the code.

EGreg · on May 5, 2015

This is great advice for everything except debugging.

I would like to accumulate a "stack trace" of calls among actors to see what led to a particular exception, and log it.

lostcolony · on May 5, 2015

Slight problem there, at least in Erlang; if A sends a message to B that causes B to crash, A doesn't know about it. A doesn't in any way see the error (though it may be restarted if part of the same supervisor, and it may timeout and crash if waiting for a response). So handling that message/not handling that message doesn't affect debugging; the top of your stack trace will be in B, regardless (and you may have a separate one at A if it times out waiting for a response from B, or if it's been linked to B such that it dies when B dies). Errors do not propagate except where you've said they should; i.e., by your supervision tree, or by explicit linking.

EGreg · on May 5, 2015

No, B should log it along with the accumulated call stack, before crashing.

lostcolony · on May 5, 2015

In Erlang, B will log only what stack accumulated in B's process. Nothing from A's. That is -

  run() ->
    B = spawn(fun() -> receive_func() end),
    A = spawn(fun() -> send_func(B), io:format("A completed normally", []) end).

  send_func(Pid) -> Pid ! whatever.

  receive_func() -> receive _ -> throw(blah) end.

You will receive a call stack listing only an error in receive_func. You will not have the anonymous function in B (due to tail call optimization) in the call stack, and you will not have the send_func in the call stack, at all, because that process completed normally; there was no error involved with it (a fact you can see by the io format call, which will be invoked just fine). A sent its message, and then printed something, and that's it. B's error does not relate to A; A might have a separate error (such as if I were to add a receive clause to send_func, expecting a response, that times out after a while), but that error is unrelated to B, and will create an entirely separate error in the logs, with an entirely separate call stack.

jacquesm · on May 8, 2015

Sorry for posting that just before running out of the office. You are 100% correct. So the thing to do here is to isolate the message that caused the crash, that way you don't need to know what happened in 'A', you shouldn't have to know what happened in 'A'!

If you do have to know what happened in A then you are using messages where you should have used function calls, after all a message should be context free once it leaves the sender. This may require some re-thinking of the concept of messaging, it is wrong to see a message as some kind of RPC mechanism, a message is a self contained package sent to a recipient that contains everything the recipient needs in order to process it and/or elicit a reply. Any other state should be elsewhere and should not be required at all to identify the cause of the crash.

Getting this right can be quite tricky.