What Now? Handling Errors In Large Systems

More options means more options.

Cloudflare’s in-depth postmortem for the November 18 outage sparked a lot of online conversation about error handling because of one line in the postmortem:

.unwrap()

If you’re not familiar with Rust, you need to know about Result, a type that can contain either a successful result or an error. unwrap basically says “if this is a successful result, return it; otherwise crash the program”¹. You can think of it like an assert.
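For readers who have not used Rust, here is a minimal sketch of the difference between unwrapping a Result and handling its error. The function names (`parse_port`, `port_or_panic`, `port_or_default`) are made up for illustration and have nothing to do with the code in the postmortem.

```rust
use std::num::ParseIntError;

/// Hypothetical helper: parse a port number from text.
/// Returns Ok(port) on success, or Err(..) describing why parsing failed.
fn parse_port(text: &str) -> Result<u16, ParseIntError> {
    text.trim().parse::<u16>()
}

/// "Succeed or die": unwrap returns the value on Ok and panics on Err.
fn port_or_panic(text: &str) -> u16 {
    parse_port(text).unwrap()
}

/// Handle the error explicitly instead, here by falling back to a default.
fn port_or_default(text: &str, default: u16) -> u16 {
    match parse_port(text) {
        Ok(port) => port,
        Err(_) => default,
    }
}
```

Calling `port_or_panic("not a port")` panics, while `port_or_default("not a port", 80)` returns 80. Which of the two is right depends on the caller and the system around it, not on this code alone.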

There is much debate about whether asserts are good in production², but most of it misses the point. Very simply, this is not a question about any one program. It is not a local property. Whether an assert is appropriate for a given component is a global property of the system, and of the way it handles data.

Let’s play a little error handling game. For each scenario below, decide whether crashing the process or server is appropriate (✅) or not (❌).

One of the ten web servers behind the load balancer suffers uncorrectable memory errors, and takes itself out of service.
One of the ten multi-threaded application servers behind the load balancer encounters a null pointer in the business logic while processing a client request.
A database replica receives a logical replication record from the primary that it does not know how to process.
A web server receives a global configuration file from the control plane that appears garbled.
A web server fails to write its log file due to a full disk.


There are three unifying principles behind my answers here.

Are the failures correlated? If the failure is local and highly likely to be uncorrelated between machines, crashing is the cleanest thing to do. Crashing has the advantage of reducing system complexity, because it removes the “working in degraded mode” state. On the other hand, if failures can be correlated (including by adversarial user behavior), it is best to design the system to reject the cause of the errors and continue.
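As a rough sketch of “reject the cause of the errors and continue”, take a correlated failure such as a garbled configuration pushed to every server at once. The `Config` fields, the `max_features=` text format and the `ConfigHolder` type are all assumptions made up for this example, not anything taken from a real system.

```rust
/// Hypothetical configuration; the single field is an illustrative assumption.
#[derive(Debug, Clone, PartialEq)]
struct Config {
    max_features: usize,
}

/// Hypothetical parser: expects text like "max_features=200".
fn parse_config(text: &str) -> Result<Config, String> {
    let value = text
        .trim()
        .strip_prefix("max_features=")
        .ok_or_else(|| format!("unrecognized config: {text:?}"))?;
    let max_features = value
        .parse::<usize>()
        .map_err(|e| format!("bad max_features: {e}"))?;
    Ok(Config { max_features })
}

/// Holds the most recent configuration that parsed successfully.
struct ConfigHolder {
    current: Config,
    rejected_updates: u64,
}

impl ConfigHolder {
    fn new(initial: Config) -> Self {
        ConfigHolder { current: initial, rejected_updates: 0 }
    }

    /// Apply an update if it parses. Otherwise keep serving with the
    /// current config and count the rejection so operators can alarm on it.
    fn apply_update(&mut self, text: &str) -> Result<(), String> {
        match parse_config(text) {
            Ok(config) => {
                self.current = config;
                Ok(())
            }
            Err(reason) => {
                self.rejected_updates += 1;
                Err(reason)
            }
        }
    }
}
```

The cost is a new mode to reason about: a server may be running a configuration older than the one the control plane believes it has. That is why the sketch counts rejections rather than dropping them silently.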

Can they be handled at a higher level? This is where you need to understand your architecture. Traditional web service architectures can handle low rates of errors at a higher level (for example, by using AWS Autoscaling to replace instances or containers that fail load balancer health checks), but cannot handle high rates of crashes (because they are limited in how quickly instances or containers can be replaced). Fine-grained architectures, from Lambda-style serverless all the way to Erlang’s approach, are designed to handle higher rates of errors, and in more cases it is better to continue rather than crash.

Is it possible to continue meaningfully? This is where you need to understand your business logic. In most cases with configuration, and in some cases with data, it is possible to continue with the last-known good version. This adds complexity by introducing a mode of behavior that runs with that older version, but that complexity may be worth the added flexibility. On the other hand, in a database that handles updates through operations (for example, x = x + 1) or conditional operations (if x == 1 then y = y + x), skipping some records and continuing may lead to arbitrary corruption of the state. In the latter case, the system must be designed (including its operational practices) to ensure that replicas receive only the records they understand. These types of invariants make the system less flexible, but they are needed to avoid state divergence.
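To make the divergence concrete, here is a toy replica that replays a log built from the two kinds of operations mentioned above. The `Record` and `State` types and the `replay` function are assumptions for illustration only; real logical replication is considerably more involved.

```rust
/// Hypothetical replication records, modelled on the two examples in the text.
#[derive(Debug, Clone, Copy)]
enum Record {
    /// x = x + 1
    IncrementX,
    /// if x == 1 then y = y + x
    AddXToYIfXIsOne,
}

#[derive(Debug, Default, Clone, Copy, PartialEq)]
struct State {
    x: i64,
    y: i64,
}

fn apply(state: &mut State, record: Record) {
    match record {
        Record::IncrementX => state.x += 1,
        Record::AddXToYIfXIsOne => {
            if state.x == 1 {
                state.y += state.x;
            }
        }
    }
}

/// Replay a log from an empty state, optionally skipping the record at
/// index `skip` (standing in for a record the replica could not process).
fn replay(log: &[Record], skip: Option<usize>) -> State {
    let mut state = State::default();
    for (index, record) in log.iter().enumerate() {
        if Some(index) == skip {
            continue;
        }
        apply(&mut state, *record);
    }
    state
}
```

With the log `[IncrementX, AddXToYIfXIsOne, IncrementX]`, a full replay ends at x = 2, y = 1. Skipping only the first record ends at x = 1, y = 0. One skipped record changed the outcome of a later conditional, so both values are now wrong and nothing on the replica signals it.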

The bottom line is that error handling is not a local property of a component. The correct way to handle errors is a global property of the system, and error handling should be built into the system from the beginning.

It’s hard to get this right, and this is where blast radius reduction techniques like cell-based architectures, independent regions, and shuffle sharding come in. Blast radius reduction means that if you get something wrong, you affect less than all of your traffic, ideally only a small percentage of it. Blast radius reduction is humility in the face of complexity.
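As a back-of-the-envelope illustration of why shuffle sharding shrinks the blast radius, the sketch below counts how many distinct shards exist when each customer is assigned a subset of k workers out of n. The worker counts are made-up numbers, and the uniform spread of customers across shards is a simplifying assumption.

```rust
/// Number of ways to choose `k` workers out of `n` (binomial coefficient).
fn combinations(n: u64, k: u64) -> u64 {
    if k > n {
        return 0;
    }
    let k = k.min(n - k);
    let mut result = 1u64;
    for i in 0..k {
        // After this step `result` equals C(n, i + 1); the division is exact.
        result = result * (n - i) / (i + 1);
    }
    result
}

/// Fraction of customers expected to share *every* worker with one bad
/// customer, assuming customers are spread uniformly over the shards.
fn full_overlap_fraction(distinct_shards: u64) -> f64 {
    1.0 / distinct_shards as f64
}
```

With 8 workers split into 4 fixed shards of 2, a customer who takes down their shard takes a quarter of all customers with them. With shuffle sharding over the same 8 workers there are `combinations(8, 2)` = 28 possible pairs, so only about 1 in 28 customers shares both workers with the bad one; others may lose one of their two workers but keep the other.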

Footnotes

¹ Yes, I know a panic isn’t necessarily a crash, but it’s close enough for our purposes. If you want to explain the difference to me, feel free to.
² And there was a lot of debate about whether Rust helped here. I think Rust does two things very well in this case: it makes the unwrap case explicit in the code (the programmer can see that this line has “succeed or die” behavior, entirely locally on this one line of code), and it prevents action-at-a-distance behavior (which silently continuing with a NULL could cause). What Rust doesn’t fully do here is make this explicit enough. Some suggested that unwrap should be called or_panic, which I like. Other people suggested that lints like clippy should be more explicit about requiring unwrap to come with some justification, which might be helpful in some code bases. Overall, I would prefer to write Rust rather than C here.
