III. Embrace Failure

Expect things to go wrong and design for resilience

Following the definitions from the Reactive Manifesto, a failure is a condition that prevents a component from servicing requests, while an error is a normal condition that arises in your program (for example, from input validation) and is therefore signaled directly back to the calling component.
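The distinction can be made concrete with a minimal Python sketch. The names `ValidationError`, `ComponentFailure`, `parse_quantity`, and `load_inventory` are hypothetical and chosen only for illustration; the sketch assumes errors are returned as values and failures are raised as a dedicated exception type, which is one of several reasonable encodings.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ValidationError:
    """An error: a normal, expected outcome that is reported to the caller."""
    reason: str


class ComponentFailure(Exception):
    """A failure: the component cannot service requests at all."""


def parse_quantity(raw: str) -> "int | ValidationError":
    # Bad input is an error: signal it directly back to the caller as a value.
    if not raw.isdecimal():
        return ValidationError(f"not a non-negative integer: {raw!r}")
    return int(raw)


def load_inventory(storage_available: bool) -> dict:
    # An unreachable dependency is a failure: this component cannot do its job.
    if not storage_available:
        raise ComponentFailure("inventory storage unreachable")
    return {"widgets": 3}
```

A caller handles `ValidationError` as part of its normal control flow, whereas `ComponentFailure` is meant to reach whichever level has been made responsible for failure handling.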

Reactive applications treat failure as an expected condition that will eventually occur. Failure must therefore be explicitly represented and handled at some level: in the infrastructure, by a supervisor component, or within the component itself (by using internal redundancy). Requests should be answered whenever possible, even in the failure case, although component autonomy already ensures that the failure remains contained in as small an area of the application’s function as possible. Decoupling in space further allows the failure to be kept inside designated failure zones, while decoupling in time enables other components to reliably detect and handle failures even when they cannot be explicitly communicated.
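One way to picture detection without explicit communication is a reply timeout: if no answer arrives in time, the caller treats the silence itself as the failure signal. The following Python sketch assumes thread-based components and hypothetical names (`ask`, `healthy`, `crashed`); it illustrates the idea only and is not a recommended production design.

```python
import queue
import threading


def ask(component, request, timeout_s=0.2):
    """Send a request and wait for the reply; silence counts as failure."""
    reply_box = queue.Queue(maxsize=1)
    threading.Thread(
        target=component, args=(request, reply_box), daemon=True
    ).start()
    try:
        return ("ok", reply_box.get(timeout=timeout_s))
    except queue.Empty:
        # No explicit failure signal arrived; the timeout itself is the signal.
        return ("failed", None)


def healthy(request, reply_box):
    # Hypothetical component that answers normally.
    reply_box.put(request.upper())


def crashed(request, reply_box):
    # Hypothetical component that died before replying: nothing is ever sent.
    return
```

Because the caller never blocks indefinitely, it can still answer its own clients, for example with a fallback value, when `ask` reports `"failed"`.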

An explicitly represented failure condition also allows a component to purposefully provide degraded service instead of failing silently and completely. Where possible, this can also be used to implement self-healing capabilities, although this cannot be done in a generic fashion apart from the "let it crash" approach of killing and restarting the component, a strategy used successfully in implementations of the Actor Model such as Erlang, Akka, Elixir, and VLINGO.
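The restart strategy and the degraded reply can be sketched together. This is a simplified, single-threaded Python illustration with hypothetical `Worker` and `Supervisor` classes; it does not reproduce the supervision APIs of Erlang, Akka, Elixir, or VLINGO, and the restart limit is an assumption added for the example.

```python
class Worker:
    """Hypothetical component whose internal state can become corrupt."""

    def __init__(self):
        self.calls = 0

    def handle(self, request):
        self.calls += 1
        if request == "poison":
            raise RuntimeError("internal state corrupted")
        return f"handled {request}"


class Supervisor:
    """Restarts the worker on failure and answers with a degraded reply."""

    def __init__(self, factory, max_restarts=3):
        self.factory = factory
        self.worker = factory()
        self.restarts = 0
        self.max_restarts = max_restarts

    def handle(self, request):
        if self.worker is None:
            # Restart budget exhausted: report it instead of failing silently.
            return {"status": "unavailable", "result": None}
        try:
            return {"status": "ok", "result": self.worker.handle(request)}
        except Exception:
            # Let it crash: discard the broken instance rather than repair it.
            self.restarts += 1
            if self.restarts <= self.max_restarts:
                self.worker = self.factory()
            else:
                self.worker = None
            return {"status": "degraded", "result": None}
```

The caller always receives a reply with an explicit `status`, and the next request after a crash is served by a fresh worker with clean state.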

Failures may also be undetectable, so an application cannot always be sure that its results are correct. Even so, an undetected failure should not disrupt the application, which should continue to operate normally.

While these are powerful capabilities, employing them in a non-reactive context (such as within the non-distributed implementation of a single component) is usually more work than using traditional mechanisms like exceptions. The best way to handle failures also depends on the particular choice of programming language and paradigm, where those using exceptions for both failures and errors profit more from the explicit representation than those that, for example, use sum types (like Either) to return fallible results and abort the program upon failure.

Another use of explicitly represented failure is that it can be communicated as a value to other threads, processes, or over the network. This is used in platform and language-specific ways in various Reactive Programming techniques. For example:

  • The onError signal in Reactive Streams.

  • Throwing and catching exceptions in Observable streams using async/await.

  • Only communicating the occurrence but not the nature of a failure, to ensure full encapsulation, as in the Actor Model.
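As a small illustration of failure travelling between threads as a value, the following Python sketch uses the standard library's `concurrent.futures`. It is not an implementation of Reactive Streams or of the Actor Model; `fetch` and `collect` are hypothetical names used only for this example.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch(source):
    # Hypothetical task that fails for one particular source.
    if source == "down":
        raise ConnectionError("source unreachable")
    return f"data from {source}"


def collect(sources):
    """Run fetches on worker threads and gather each outcome as a value."""
    outcomes = {}
    with ThreadPoolExecutor(max_workers=2) as pool:
        futures = {name: pool.submit(fetch, name) for name in sources}
        for name, future in futures.items():
            # exception() blocks until the task is done; None means success.
            failure = future.exception()
            if failure is None:
                outcomes[name] = ("ok", future.result())
            else:
                outcomes[name] = ("failed", failure)
    return outcomes
```

The exception raised on the worker thread is not rethrown in the caller; it arrives as an ordinary object that can be inspected, logged, or forwarded. Replacing `failure` with `None` in the `"failed"` tuple would communicate only the occurrence of the failure, in the spirit of the last item above.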