Stop Forwarding Errors, Start Designing Them

It’s 3am. Production is down. You’re staring at a log line that says:

Error: serialization error: expected ',' or '}' at line 3, column 7

You know the JSON is broken, but you have no idea why, where, or who caused it. Was it the config loader? The user API? The webhook consumer?

The error has successfully bubbled up through 20 layers of your stack, preserving its original message perfectly, yet losing every bit of meaning along the way.

We have a name for this. We call it “error handling”. But really, it’s just error forwarding. We treat errors like hot potatoes: catch them, wrap them (maybe), and toss them up the stack as quickly as possible.

You add a println!, restart the service, and wait for the bug to reproduce. It’s going to be a long night.

As described in a detailed analysis of error handling in a large Rust project:

“There are tons of thought-provoking articles or libraries promoting their best practices, sparking an epic debate that never ends. We all started to notice that something was wrong with error handling practices, but it’s challenging to pinpoint the exact problems.”


What is wrong with current practices?

The std::error::Error trait: A great but flawed abstraction

Rust’s std::error::Error trait assumes that errors form a chain: each error has an optional source() pointing to the underlying cause. This works in most cases; most errors have no source or exactly one.

But as a standard library abstraction, it’s too opinionated. It explicitly excludes cases where sources form a tree: a validation error with multiple field failures, a timeout with partial results. These scenarios exist, and the standard trait provides no way to represent them.
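To make the chain assumption concrete, here is a minimal sketch using only the standard library. The types ConfigError and ParseError and the helper chain_messages are hypothetical names invented for this illustration. Note that source() can return at most one cause, so a second, sibling cause has nowhere to go.

```rust
use std::error::Error;
use std::fmt;

/// Hypothetical leaf error: it has no source.
#[derive(Debug)]
pub struct ParseError {
    pub line: usize,
}

impl fmt::Display for ParseError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "parse error at line {}", self.line)
    }
}

impl Error for ParseError {}

/// Hypothetical wrapper error: it has exactly one source.
#[derive(Debug)]
pub struct ConfigError {
    pub path: String,
    pub source: ParseError,
}

impl fmt::Display for ConfigError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "failed to load config {}", self.path)
    }
}

impl Error for ConfigError {
    // The trait allows a single optional cause; a list of causes cannot be returned.
    fn source(&self) -> Option<&(dyn Error + 'static)> {
        Some(&self.source)
    }
}

/// Walks the linear chain and collects each message, outermost first.
pub fn chain_messages(err: &(dyn Error + 'static)) -> Vec<String> {
    let mut messages = Vec::new();
    let mut current: Option<&(dyn Error + 'static)> = Some(err);
    while let Some(e) = current {
        messages.push(e.to_string());
        current = e.source();
    }
    messages
}
```

Calling chain_messages on a ConfigError yields its own message followed by the ParseError message. A validation error with several failed fields would have to flatten them into one message or pick a single one as its source.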

Backtrace: Expensive medicine for the wrong disease

Rust’s std::backtrace::Backtrace was meant to improve error observability. Backtraces are better than nothing, but they have serious limitations:

In async code, they are almost useless. Your backtrace will contain 49 stack frames, 12 of which are calls to GenFuture::poll(). The Async Working Group notes that suspended tasks are invisible to traditional stack traces.

They only show the origin, not the path. A backtrace tells you where the error was created, not the logical path it took through your application. It won’t tell you “This was the request handler for user X, calling service Y with parameters Z.”

Capturing backtraces is expensive. The standard library documentation admits: “Capturing backtraces can be quite an expensive runtime operation.”
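For reference, this is roughly what attaching a backtrace to an error looks like with the standard library. TracedError is a hypothetical type for illustration. The sketch assumes the usual behaviour of Backtrace::capture(), which only walks the stack when RUST_BACKTRACE or RUST_LIB_BACKTRACE enables it, while Backtrace::force_capture() always pays the cost.

```rust
use std::backtrace::{Backtrace, BacktraceStatus};

/// Hypothetical error type that records a backtrace when it is created.
#[derive(Debug)]
pub struct TracedError {
    pub message: String,
    pub backtrace: Backtrace,
}

impl TracedError {
    /// Captures a backtrace only if the environment variables enable it.
    pub fn new(message: impl Into<String>) -> Self {
        Self {
            message: message.into(),
            backtrace: Backtrace::capture(),
        }
    }

    /// Always attempts to walk the stack, regardless of the environment.
    pub fn new_forced(message: impl Into<String>) -> Self {
        Self {
            message: message.into(),
            backtrace: Backtrace::force_capture(),
        }
    }

    /// Creates the error without paying for a stack walk at all.
    pub fn without_trace(message: impl Into<String>) -> Self {
        Self {
            message: message.into(),
            backtrace: Backtrace::disabled(),
        }
    }

    /// True when frames were actually recorded.
    pub fn has_backtrace(&self) -> bool {
        self.backtrace.status() == BacktraceStatus::Captured
    }
}
```

Either way, the result describes the stack at the moment of construction. Nothing in it records which user, request, or task the code was working on.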

Provide/Request API: Overengineering in action

The Provider API (RFC 3192) and generic member access (RFC 2895) add dynamic, type-based data access to errors:

// Inside an `impl std::error::Error for ...` block (unstable, nightly only)
fn provide<'a>(&'a self, request: &mut Request<'a>) {
    request.provide_ref::<Backtrace>(&self.backtrace);
}

The unstable Provide/Request API represents the latest effort to make errors more flexible. The idea: errors can provide dynamically typed context (such as HTTP status codes or backtraces) that callers can request at runtime.

It sounds powerful. In practice, it introduces new problems:

Unpredictability: your error might provide an HTTP status code. Or it might not. You won’t know until runtime.

Complexity: the API is subtle enough that LLVM struggles to optimize multiple provide calls.

Sometimes, a simple structure with named fields is better than a clever abstraction.
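The contrast can be sketched on stable Rust. This is not the Provide/Request API itself. The Any-based lookup below is only a stand-in for type-keyed access, and ApiError, status_from_field, and status_from_extras are hypothetical names.

```rust
use std::any::Any;

/// Named field: the type tells every caller that a status may be present.
#[derive(Debug)]
pub struct ApiError {
    pub message: String,
    pub http_status: Option<u16>,
}

/// Reading the named field is checked by the compiler.
pub fn status_from_field(err: &ApiError) -> u16 {
    err.http_status.unwrap_or(500)
}

/// Type-keyed lookup: whether any `u16` was attached is only known at runtime.
pub fn status_from_extras(extras: &[Box<dyn Any>]) -> u16 {
    extras
        .iter()
        .find_map(|extra| extra.downcast_ref::<u16>().copied())
        .unwrap_or(500)
}
```

With the dynamic version, attaching a u32 instead of a u16 compiles fine and silently falls back to 500. With the struct, the same mistake is a type error.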

thiserror: Categorizing by origin, not by action

thiserror makes it easy to define error enums:

#[derive(Debug, thiserror::Error)]
pub enum DatabaseError {
    #[error("connection failed: {0}")]
    Connection(#[from] ConnectionError),

    #[error("query failed: {0}")]
    Query(#[from] QueryError),

    #[error("serialization failed: {0}")]
    Serde(#[from] serde_json::Error),
}

This looks reasonable. But notice how this common practice categorizes errors: by origin, not by what the caller can do about it.

When you get a DatabaseError::Query, what should you do? Retry? Report it to the user? Log it and continue? The error doesn’t tell you. It only tells you which dependency failed.

As one blogger rightly said: “This error type tells the caller not what problem you are solving but how you solve it.”

anyhow: So convenient you’ll forget to add context

anyhow takes the opposite approach: type erasure. Just use anyhow::Result everywhere and propagate with ?. No more enum variants, no more #[from] annotations.

The problem? It’s too convenient.

fn process_request(req: Request) -> anyhow::Result<Response> {
    let user = db.get_user(req.user_id)?;
    let data = fetch_external_api(user.api_key)?;
    let result = compute(data)?;
    Ok(result)
}

Every ? is a missed opportunity to add context. What was the user ID? Which API were we calling? Which computation failed? The error knows none of it.

The anyhow documentation encourages using .context() to add information. But .context() is optional; the type system does not require it. “I’ll add the context later” is the easiest lie to tell yourself. Later means never, until 3am when production is on fire.
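The difference between a bare ? and one with context can be shown with the standard library alone. The sketch below uses plain map_err rather than anyhow, and the function names and the server.port setting are made up for the example.

```rust
use std::num::ParseIntError;

/// Bare propagation: the caller learns only that some integer failed to parse.
pub fn parse_port(raw: &str) -> Result<u16, ParseIntError> {
    let port = raw.trim().parse::<u16>()?;
    Ok(port)
}

/// Same logic, but the error names the setting and the offending value.
pub fn parse_port_with_context(key: &str, raw: &str) -> Result<u16, String> {
    let port = raw
        .trim()
        .parse::<u16>()
        .map_err(|e| format!("invalid value {raw:?} for setting `{key}`: {e}"))?;
    Ok(port)
}
```

The first version reports only "invalid digit found in string". The second also says which setting and which value were involved, which is the part you actually need at 3am. Both compile equally well, which is exactly the problem described above.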


Problem: error handling without purpose

Consider this common pattern in Rust codebases:

#[derive(thiserror::Error, Debug)]
pub enum ServiceError {
    #[error("database error: {0}")]
    Database(#[from] sqlx::Error),

    #[error("http error: {0}")]
    Http(#[from] reqwest::Error),

    #[error("serialization error: {0}")]
    Serde(#[from] serde_json::Error),
}

This looks reasonable. But ask yourself:

  1. What can the caller do with ServiceError::Database? Can they retry? Should they show the raw SQL error to users? The error type does not help answer these questions.
  2. When debugging at 3am, does “serialization error: expected ',' or '}'” tell you which request, which field, or which code path led here?

This is a fundamental mismatch in how we think about error handling. We focus on propagating errors correctly, on lining up the types, on satisfying the compiler. But we forget that errors are messages: messages that will eventually be read either by a machine trying to recover or by a human trying to debug.

The “Library vs. Application” Myth

You’ve probably heard the conventional wisdom: “Use thiserror for libraries, anyhow for applications.”

It’s a nice, simple rule, but not a perfect one. As Luca Palmieri says: “This is not the right framing. You need to reason about intent.”

The real question isn’t whether you’re writing a library or an application. The real question is: What do you expect the caller to do with this error?

Two audiences, two needs

Let’s clarify who consumes errors and what they need:

Audience   Goal                 Needs
Machines   Automatic recovery   Flat structure, clear error types, predictable codes
Humans     Debugging            Rich context, call paths, business-level information

When the retry middleware receives an error, it doesn’t care about your beautifully nested error chains. All it needs to know is: is this worth retrying? A simple boolean or enum variant is sufficient.

When you’re debugging at 3am, you don’t need to know that somewhere deep in the stack there was an io::Error. You need to know: which file, which user, which request, what were we trying to do?

Most error handling designs optimize for neither audience. They optimize for the compiler.

For machines: flat, actionable, kind-based

When errors need to be handled programmatically, complexity is the enemy. Your retry logic does not want to traverse nested error chains looking for specific variants. It wants to ask: is_retryable()?

Here’s a pattern drawn from the error design of Apache OpenDAL:

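pub struct Error {
    kind: ErrorKind,
    status: ErrorStatus,
    context: Vec<(&'static str, String)>,
    // ... other fields elided in the source
}

pub enum ErrorKind {
    NotFound,
    RateLimited,
    // ... categorized by what the caller CAN DO
}

pub enum ErrorStatus {
    Permanent,  // Don't retry
    Temporary,  // Safe to retry
    Persistent, // Was retried, still failing
}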

This design enables clear decision making:

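// Caller can make informed decisions
match result {
    Err(e) if e.kind() == ErrorKind::RateLimited && e.is_temporary() => {
        sleep(Duration::from_secs(1)).await;
        // ... then retry (body elided in the source)
    }
    Err(e) if e.kind() == ErrorKind::NotFound => {
        // ... don't retry; handle the missing object (body elided in the source)
    }
    // ... remaining arms elided in the source
}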

The key design decisions:

ErrorKind is categorized by response, not origin. NotFound means “the object does not exist; do not retry.” RateLimited means “slow down and try again.” The caller doesn’t need to know whether it was an S3 404 or a file system ENOENT; they need to know what to do about it.

ErrorStatus is explicit. Instead of inferring retryability from error types, it is a first-class field. Services can mark errors as temporary when they know a retry might help.

One error type per library. Instead of scattering error enums across modules, a single flat struct keeps things simple. The context field provides all the specificity you need without a proliferation of types.

No more traversing error chains, no more guessing at error types. Just ask the error directly.
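A self-contained sketch of this shape, together with a retry loop that consumes it, might look as follows. It is modelled on the description above rather than copied from OpenDAL. OpError, the Unexpected variant, the builder methods, and retry are assumptions made for the example.

```rust
use std::fmt;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ErrorKind { NotFound, RateLimited, Unexpected }

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ErrorStatus { Permanent, Temporary, Persistent }

/// Hypothetical flat error: one struct, no nesting.
#[derive(Debug)]
pub struct OpError {
    kind: ErrorKind,
    status: ErrorStatus,
    message: String,
    context: Vec<(&'static str, String)>,
}

impl OpError {
    /// Errors are permanent unless the producer explicitly says otherwise.
    pub fn new(kind: ErrorKind, message: impl Into<String>) -> Self {
        Self { kind, status: ErrorStatus::Permanent, message: message.into(), context: Vec::new() }
    }
    pub fn temporary(mut self) -> Self { self.status = ErrorStatus::Temporary; self }
    pub fn persistent(mut self) -> Self { self.status = ErrorStatus::Persistent; self }
    pub fn with_context(mut self, key: &'static str, value: impl Into<String>) -> Self {
        self.context.push((key, value.into()));
        self
    }
    pub fn kind(&self) -> ErrorKind { self.kind }
    pub fn status(&self) -> ErrorStatus { self.status }
    pub fn is_temporary(&self) -> bool { self.status == ErrorStatus::Temporary }
}

impl fmt::Display for OpError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "{:?} ({:?}): {}", self.kind, self.status, self.message)?;
        for (key, value) in &self.context {
            write!(f, ", {key}: {value}")?;
        }
        Ok(())
    }
}

impl std::error::Error for OpError {}

/// Retries `op` while it reports a temporary error, up to `max_attempts` calls.
/// An error that is still temporary after the last attempt is marked persistent.
pub fn retry<T>(
    max_attempts: u32,
    mut op: impl FnMut(u32) -> Result<T, OpError>,
) -> Result<T, OpError> {
    let mut attempt = 1;
    loop {
        match op(attempt) {
            Ok(value) => return Ok(value),
            Err(e) if e.is_temporary() && attempt < max_attempts => attempt += 1,
            Err(e) if e.is_temporary() => return Err(e.persistent()),
            Err(e) => return Err(e),
        }
    }
}
```

The retry loop never inspects a source chain or downcasts anything. It reads one field, and the producer of the error decides what that field says.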

For humans: low-friction context capture

The biggest enemy of good error context isn’t capability; it’s friction. If adding context is annoying, developers won’t do it.

The exn library (294 lines of Rust, zero dependencies) demonstrates one approach: errors form a tree of frames, each automatically capturing its source location via #[track_caller]. Unlike linear error chains, trees can represent multiple causes, which is useful when parallel operations fail or validation produces multiple errors.

Here’s what we need:

Automatic location capture. Instead of expensive backtraces, use #[track_caller] to capture file, line, and column at zero cost. Every error frame should know where it was created.
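As a minimal illustration of the mechanism, a constructor marked #[track_caller] can record the call site through std::panic::Location. LocatedError is a hypothetical type, not part of any library mentioned here.

```rust
use std::fmt;
use std::panic::Location;

/// Hypothetical error that remembers where it was constructed.
#[derive(Debug)]
pub struct LocatedError {
    pub message: String,
    pub location: &'static Location<'static>,
}

impl LocatedError {
    /// `#[track_caller]` makes `Location::caller()` report the caller's
    /// file, line, and column instead of this function's own position.
    #[track_caller]
    pub fn new(message: impl Into<String>) -> Self {
        Self {
            message: message.into(),
            location: Location::caller(),
        }
    }
}

impl fmt::Display for LocatedError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        // `Location` displays as `file:line:column`.
        write!(f, "{}, at {}", self.message, self.location)
    }
}

impl std::error::Error for LocatedError {}
```

No stack walk happens at runtime. The compiler passes the call site as a hidden argument, so the location is a reference to static data.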

Ergonomic context addition. The API for adding context should be so natural that not adding it feels wrong:

fetch_user(user_id)
    .or_raise(|| AppError(format!("failed to fetch user {user_id}")))?;

Compare this to thiserror, where adding the same context requires defining a new variant and wrapping manually:

#[derive(thiserror::Error, Debug)]
pub enum AppError {
    #[error("failed to fetch user {user_id}: {source}")]
    FetchUser {
        user_id: String,
        source: DbError, // assumed name for the error type of db.query
    },
    // ... one variant per call site that needs context
}

fn fetch_user(user_id: &str) -> Result<User, AppError> {
    db.query(user_id).map_err(|e| AppError::FetchUser {
        user_id: user_id.to_string(),
        source: e,
    })
}

Enforce context at module boundaries. This is where exn differs most from anyhow. With anyhow, every error is erased to anyhow::Error, so you can always use ? and move on; the type system won’t stop you. Context methods exist, but nothing prevents you from ignoring them.

exn takes a different approach: Exn<E> preserves the outermost error type. If your function returns Result<T, Exn<ServiceError>>, you can’t directly ? a Result whose error is an Exn of some lower-level error type: the types do not match. The compiler forces you to call or_raise() and provide a ServiceError. This is exactly the moment when you should add context about what your module was trying to do.

// This won't compile: the type mismatch forces you to add context
pub fn fetch_user(user_id: &str) -> Result<User, Exn<ServiceError>> {
    let user = db.query(user_id)?; // Error: expected Exn<ServiceError>, found an Exn of the lower-level error type
    Ok(user)
}

// You must provide context at the boundary
pub fn fetch_user(user_id: &str) -> Result<User, Exn<ServiceError>> {
    let user = db.query(user_id)
        .or_raise(|| ServiceError(format!("failed to fetch user {user_id}")))?; // Now it compiles
    Ok(user)
}

The type system becomes your ally: it won’t let you be lazy at module boundaries.

In practice it looks like this:

pub async fn execute(&self, task: Task) -> Result<Output, ExecutorError> {
    let make_error = || ExecutorError(format!("failed to execute task {}", task.id));

    let user = self.fetch_user(task.user_id).await.or_raise(make_error)?;
    let result = self.process(user).or_raise(make_error)?;
    Ok(result)
}

Every ? has context. When it fails at 3am, instead of a cryptic serialization error, you see:

failed to execute task 7829, at src/executor.rs:45:12

|-> failed to fetch user "John Doe", at src/executor.rs:52:10

|-> connection refused, at src/client.rs:89:24

Now you know: it was task 7829, we were fetching user data, and the connection was refused. You can look up that task ID in your request logs and find everything you need.
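To show how little machinery such a report needs, here is a deliberately simplified, std-only sketch. It is not exn’s implementation: Raised, OrRaise, DbError, ServiceError, db_query, and fetch_user are all hypothetical. It also keeps a linear, string-rendered cause instead of a tree of typed frames. What it does share with the approach above is that the outer error type is preserved, so a bare ? across layers does not compile.

```rust
use std::fmt;
use std::panic::Location;

/// One layer of context: a typed error, where it was raised, and the
/// already-rendered lower-level cause.
#[derive(Debug)]
pub struct Raised<E> {
    pub error: E,
    pub location: &'static Location<'static>,
    pub cause: Option<String>,
}

impl<E: fmt::Display> fmt::Display for Raised<E> {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "{}, at {}", self.error, self.location)?;
        if let Some(cause) = &self.cause {
            write!(f, "\n|-> {cause}")?;
        }
        Ok(())
    }
}

/// Adds `or_raise` to any `Result` whose error can be displayed.
pub trait OrRaise<T> {
    #[track_caller]
    fn or_raise<E>(self, make: impl FnOnce() -> E) -> Result<T, Raised<E>>;
}

impl<T, C: fmt::Display> OrRaise<T> for Result<T, C> {
    #[track_caller]
    fn or_raise<E>(self, make: impl FnOnce() -> E) -> Result<T, Raised<E>> {
        // Read the location here, not inside a closure: `#[track_caller]`
        // does not propagate into closures.
        let location = Location::caller();
        match self {
            Ok(value) => Ok(value),
            Err(cause) => Err(Raised { error: make(), location, cause: Some(cause.to_string()) }),
        }
    }
}

#[derive(Debug)]
pub struct DbError(pub String);
#[derive(Debug)]
pub struct ServiceError(pub String);

impl fmt::Display for DbError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result { f.write_str(&self.0) }
}
impl fmt::Display for ServiceError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result { f.write_str(&self.0) }
}

/// Pretend database layer that always fails to connect.
pub fn db_query(user_id: u32) -> Result<String, Raised<DbError>> {
    Err::<String, _>("connection refused")
        .or_raise(|| DbError(format!("query for user {user_id} failed")))
}

/// Service layer. `db_query(user_id)?` would not compile here, because there
/// is no `From<Raised<DbError>>` for `Raised<ServiceError>`.
pub fn fetch_user(user_id: u32) -> Result<String, Raised<ServiceError>> {
    db_query(user_id).or_raise(|| ServiceError(format!("failed to fetch user {user_id}")))
}
```

Printing the error from fetch_user gives one line per layer, each with its own file:line:column, ending in the original cause. A fuller design would keep the cause as a typed frame so that it can still be inspected programmatically.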


Putting it together

In real systems, you often need both: machine-readable errors for automatic recovery, and human-readable context for debugging. The pattern: use a flat, kind-based error type (like Apache OpenDAL’s) for the structured data, and wrap it in a context-tracking mechanism for propagation.

// Machine-oriented: flat struct with status
pub struct StorageError {
    pub kind: ErrorKind,
    pub status: ErrorStatus,
    // ... other fields elided in the source
}

// Human-oriented: propagate with context at each layer
pub async fn save_document(doc: Document) -> Result<(), Exn<StorageError>> {
    let data = serialize(&doc)
        .or_raise(|| StorageError::permanent("serialization failed"))?;

    storage.write(&doc.path, data)
        .or_raise(|| StorageError::temporary("write failed"))?;

    Ok(())
}

At the boundary, walk the error tree to find the structured error:

// Extract a typed error from anywhere in the tree
// (`T: 'static` is required by `downcast_ref`)
fn find_error<T: 'static>(exn: &Exn<impl Error>) -> Option<&T> {
    fn walk<T: 'static>(frame: &Frame) -> Option<&T> {
        if let Some(e) = frame.as_any().downcast_ref::<T>() {
            return Some(e);
        }
        frame.children().iter().find_map(walk)
    }
    // ... start `walk` at the root frame of `exn` (call elided in the source)
}

match save_document(doc).await {
    // ... success arm elided in the source
    Err(report) => {
        // For humans: log the full context tree
        log::error!("{:?}", report);

        // For machines: find and handle the structured error
        if let Some(err) = find_error::<StorageError>(&report) {
            if err.status == ErrorStatus::Temporary {
                return queue_for_retry(report);
            }
            return Err(map_to_http_status(err.kind));
        }
        Err(StatusCode::INTERNAL_SERVER_ERROR)
    }
}

Yes, you still have to walk the tree. But unlike the Provide/Request API, you end up with a concrete type, StorageError: a documented struct with named fields that your IDE can autocomplete. No guesswork, no runtime surprises; just something you can reason about and maintain.


Conclusion

Next time you write a function, look at its Result return type.

Don’t think of it as “I might fail.” Think of it as “I may need to explain myself.”

If your error type can’t answer “Should I retry?”, you’ve failed the machine. If your error log can’t answer “Which user was this?”, you’ve failed the human.

Errors aren’t just failure modes to be propagated. They are communication: messages your system sends when things go wrong. And like any communication, they deserve to be designed.

Stop forwarding errors. Start designing them.

Last edited Jan 04
