Two days ago I was listening to the Oxide podcast on
futurelock, a
very complicated bug involving async code in Rust. I must admit that I
struggled to understand what was going on, partly because of the subject
matter, and partly because podcasts are my backdrop to household chores. At some point towards the end, though, someone phrased
things in a way that my pea brain could immediately understand.
In essence, I think futurelocks are a complex instance of a long-standing
problem with (what I will call for now) asynchronous code. You may notice that
I did not say a “well known” problem. Personally, I only realised this problem
exists a couple of years ago. For better or worse, subsequent discussions with
many other folk have convinced me that my lack of awareness is common.
Deadlocking finalisers
Helpfully, Oxide have a thoughtful writeup of the futurelock
problem. However, even though I
like to flatter myself that I’m a competent Rust programmer, I had to work hard
to understand precisely what’s going on.
Fortunately we can create a simplified version of the underlying problem
in Python:
import threading
mutex = threading.Lock()
class T:
    def __del__(self):
        print("acquiring")
        mutex.acquire()
        print("acquired")
        mutex.release()

t = T()
mutex.acquire()
t = None
mutex.release()
In essence, this code is modelling a classic programming need. A mutex
(colloquially a “lock”) guards a shared resource (e.g. a network socket, counter, etc.). When
the garbage collector determines that an object is no longer used, it
runs its finaliser i.e. its __del__ method. In this case the finaliser
acquires the mutex (i.e. locks it), allowing it to do something with the shared resource
the mutex is guarding, and then releases the mutex (i.e. unlocks it).
Unfortunately, when I run this code in CPython, it hangs at the terminal having
written just:
$ python3 t.py
acquiring
What’s going on? Why hasn’t it printed acquired?
Before I explain why it hangs, it might help if I show one “fix”: if I
remove the line t = None and rerun the program, it works as expected:
$ python3 t.py
acquiring
acquired
$
The problem, in essence, is that my original program uses mutexes in a language
implementation that runs finalisers interleaved with user code on the same
operating system thread. I was able to craft this
program to show what I wanted because I happen to know that CPython’s garbage
collector uses reference counting: the t = None line meant
that the T object created on line 10 was immediately recognised as unused,
and its finaliser executed.
The finaliser then tried to acquire the mutex (line 6). That mutex is shared
with code outside the object, and unfortunately the “main” part of the program
has already grabbed the lock (line 11). At this point we have a classic
deadlock: one part of the program holds a lock; another part will not
continue until it has managed to grab that lock; but the latter needs to
complete for the former to complete, and that is not possible. No matter how
long I wait, this program will never print acquired.
The reason I have ended up in this unfortunate situation is that the
finaliser is executed asynchronously in-between lines 12 (t=None) and 13
(mutex.release()) on the same thread as the “main” program. My experience is
that we’re so used to the general idea of finalisers that we don’t stop to
think how different their execution is to that of normal functions.
Most obviously, my program has no explicit calls to t.__del__(): another part
of CPython’s run-time called that method on my behalf. Less obviously,
CPython’s run-time chose when to call __del__. It happened to do so at the
point that I wrote t=None but it could have done so at any point from then
on, and still satisfied Python’s specification. Indeed, I can cause CPython to
call __del__ at different points in many ways. For example, if I’d written
t2=t just after t=T(), my program would not have hung.
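To see that concretely, here is a minimal sketch of that variant, reusing the class T from above (I have added an explicit t2 = None at the end, which is not in my original description, purely so the point of finalisation is obvious):

t = T()
t2 = t            # a second reference keeps the object alive
mutex.acquire()
t = None          # the reference count is still non-zero, so __del__ does not run
mutex.release()
t2 = None         # now the count hits zero: __del__ runs, and the mutex is free

This time the program prints acquiring and then acquired, because by the time the finaliser runs, nothing else holds the mutex.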
This is “non-local” reasoning at work. In general, I can’t look at a line of
Python code and statically know whether it will cause one or more
finalisers to run unless I also understand every other part of the system which
potentially relates to that line.
The history of this problem
I said earlier that this was a long-standing problem. This 2012 post from Andy
Wingo uses a
very similar example to the one above, so that’s 13 years ago. But – and I’ve
learnt this is inevitable whenever it comes to anything relating to programming
languages and memory – Hans Boehm got there much earlier, in 2002. His
wonderful paper Destructors,
Finalizers, and
Synchronization
is one of those reads where I can split my life into before/after.
In my case, I was pointed at this paper by Jacob
Hughes during the work for what ultimately became
“Alloy”, the garbage collector for Rust that we wrote up as Garbage Collection
for Rust: The Finalizer
Frontier.
Boehm’s insights in his paper had a profound effect on the design goals we
set for Alloy: the pun in the title of our paper is a fairly direct result
of the challenges that we realised Boehm had implicitly set us in his paper.
Boehm’s paper dates from a period when Java, in particular, was
experiencing all sorts of problems caused by the widespread use of
finalisers. Amongst them is the example I wrote in Python above.
Boehm explicitly, and clearly, explains exactly the problem that I
happened to show in Python above: finalisers that try to acquire mutexes
can cause surprising deadlocks. As Boehm points out, this isn’t an odd thing to want to do:
There is usually not much of a point in writing a finalizer that touches only
the object being finalized, since such object updates wouldn’t normally be
observable. Thus useful finalizers must touch global shared state.
In other words: finalisers often need to use mutexes to update other objects.
The situation in CPython isn’t too bad in this regard. Reference counting
gives me a fighting chance of working out in advance when finalisers will be
called. If I get it wrong then, assuming my program is deterministic, I will
get deadlocks at the same point in execution on every run. That makes debugging
annoying but plausible.
In non-reference-counting implementations, such as those for Java, things
are much worse: finalisers can be called at any point; and even in seemingly
deterministic programs, they will tend to be called at different points in
different executions. Debugging such cases must have been an absolute
nightmare!
The solution for garbage collectors
Fortunately the solution is quite simple, and is most clearly articulated
on a slide accompanying Boehm’s paper:
Garbage collectors should run finalizers from a separate thread
To truly understand why this is the solution, we need to define some new terminology. Until now I have been
using “asynchronous” in its colloquial sense of “allow one operating
system thread to interleave two different portions of code”. Technical folk
like to call this concurrent execution. In contrast, parallel execution is
when two portions of code are run on two operating system threads.
Unfortunately, in day-to-day use these two terms are used interchangeably, leading to
endless confusion.
In our situation, I find it easier to think of cooperative
multitasking vs.
preemptive
multitasking.
I used to use a GUI-based operating system called RISC
OS which used cooperative multitasking.
A program would do some work and voluntarily give control back to
the operating system, which would then hand control over to another program.
If a program did not give control back (i.e. did not “cooperate”),
either due to a bug or because the work it was doing was taking longer than
the author had anticipated, the entire system would appear to freeze. This
happened frustratingly often, particularly for new versions of software.
Modern operating systems instead switch control between different programs when
the operating system chooses (i.e. preemptively): one program cannot hog
execution and stop others running.
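To make the distinction concrete, here is a toy cooperative scheduler in Python (a sketch of my own, with invented names – nothing to do with RISC OS or CPython internals): tasks are generators which cooperate by yielding, and a single task which never yields freezes the whole system.

def polite():
    for i in range(3):
        print("polite", i)
        yield              # cooperate: hand control back to the scheduler

def hog():
    while True:            # never yields...
        pass
    yield                  # ...this line only exists to make hog a generator

tasks = [polite(), hog()]
while tasks:
    task = tasks.pop(0)
    try:
        next(task)         # run the task until its next yield
        tasks.append(task)
    except StopIteration:
        pass               # the task finished; drop it

Run this and you see polite 0 and then nothing more: hog has hogged the only thread, which is exactly the RISC OS failure mode.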
Garbage collectors like CPython’s run finalisers in a “cooperative” way: they
implicitly hope that the finaliser executes quickly, or at least does not get stuck
forever. In our running example, we saw a case where a finaliser would not cooperate: it
wanted to acquire a lock that it could never get, and thus was stuck forever.
It might be tempting to look at our example and say that it’s the programmer’s
fault for writing a non-cooperative finaliser. The problem is that there is no
way to write a cooperative finaliser that achieves the desired outcome. I might
try waiting for other code to let go of the mutex with something like:
import time

class T:
    def __del__(self):
        while not mutex.acquire(blocking=False):
            time.sleep(0.1)
        print("acquired")
but this will still hang forever, pointlessly polling ten times a second! The
finaliser will only be called once: it can’t temporarily let the “main” code
run and release the mutex, despite my vague attempt above.
We now have, I hope, the terminology to understand the solution Boehm is emphasising:
Garbage collectors should run finalizers from a separate thread
As we’ve seen above, finalisers cannot be guaranteed to be cooperative in and
of themselves. The solution is then simple: running them on another thread
doesn’t require them to be cooperative! In other words, if the __del__ method
from our original example was run on another thread, then the program could not
have deadlocked: one thread would have waited until the other had released the
mutex, then continued.
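As a rough sketch of what this means for our running example (hand-rolled for illustration: a real collector would use a dedicated finalisation thread rather than spawning a fresh one per object), imagine __del__ merely handing its work to another thread:

import threading

mutex = threading.Lock()

def finalise():
    print("acquiring")
    mutex.acquire()
    print("acquired")
    mutex.release()

class T:
    def __del__(self):
        # Hand the blocking work to another thread: that thread can wait for
        # the mutex without stopping the main thread from making progress.
        threading.Thread(target=finalise).start()

t = T()
mutex.acquire()
t = None          # __del__ runs here, but only spawns a thread
mutex.release()   # ...after which the finaliser thread acquires the mutex

This version prints acquiring and then acquired: the main thread releases the mutex, the finaliser thread grabs it, and neither blocks the other.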
This is the approach that modern JVMs take. It is also the approach we realised we
were duty-bound to take in Alloy: finalisers are always run on a separate
thread. This turned out to have huge implications which we did our level best
to address. If you read the paper, you’ll realise that a huge chunk of it is
devoted to making it possible to safely run as many Rust destructors – most of
which were not written in anticipation of multi-threading – as possible on
other threads.
From finalisers to futurelocks
At this point I can imagine many people feeling frustrated that I have
dumbed down the futurelock problem that motivated this post. It’s
clearly more complex than the simple finaliser deadlock I’ve used as a running
example. Indeed, a good-faith reader might reasonably question whether the two
things share anything in common other than deadlocking.
My belief is that they share something fundamental in common: cooperative
multitasking and the interleaving execution of arbitrary code
on the same thread. As soon as you have those two ingredients, it seems to me
that you have the potential for deadlocks.
In my view, async runtimes often (not always, but often) share an important
flaw with cooperative multitasking: they hope that the next future / promise
cooperates and returns control. If it cannot return control, the system will be
stuck.
In the futurelock example, the flaw is that the select! macro
commits to running one future when another future is the one that can unlock
the mutex. Thus select! implicitly hopes that the future it is committed to
is the one that allows the system to make progress. That hope is sometimes dashed.
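Here is a rough analogue in Python’s asyncio (not Tokio: I drive the coroutine by hand with send() to imitate a future that is polled once and then abandoned, and the names are my own invention):

import asyncio

async def main():
    lock = asyncio.Lock()

    async def locker():
        await lock.acquire()    # the lock is free, so this completes immediately
        await asyncio.sleep(0)  # suspends, still holding the lock
        lock.release()

    # Poll 'locker' exactly once, as a select!-style construct might, then
    # abandon it: it now holds the lock but will never be resumed.
    coro = locker()
    coro.send(None)             # runs up to the sleep

    await lock.acquire()        # deadlock: the lock's owner is never polled again

asyncio.run(main())

This program hangs forever, just as the finaliser example did: the lock’s owner still exists, but nothing will ever run it again.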
We can thus see some interesting similarities and differences between finaliser
deadlocks and futurelock deadlocks.
Both are based on the misplaced hope that code will always cooperate —
sometimes it is impossible for code to cooperate, even if it wants to. However,
if we could guarantee what code will execute at any given point, we could at
least assure ourselves that only truly cooperative code will be executed. Alas,
one does not know which future / promise / finaliser will be executed or
resumed: it could have come from anywhere, at any point in the past.
However, one reason I suspect that async programmers run into this issue less
frequently is that while finalisers can run at any arbitrary point, async
schedulers only run code at await points or when a future has completed. In other
words, the possible interleavings are many fewer in the async case. However,
“many fewer” is very different to “none”. Could one create a variant of the
select! macro that fixes the particular problem? Yes: but one could always
create other seemingly valid async code that leads to the same problem.
Solutions?
Is there a way to solve futurelocks short of causing the system to exit if a deadlock
is detected at run-time? I must admit that I am pessimistic.
Why don’t we just ban people from using mutexes in async functions? This is
clearly possible – but async system code often needs to utilise multiple
threads for the same performance reasons as non-async code. If you need that,
you’re almost certainly going to want to use mutexes to safely share state
amongst futures that might end up executing on different threads.
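For example, here is the sort of thing (sketched in Python, standing in for the Rust case) that makes mutexes hard to give up: async code farming work out to a thread pool, with a lock guarding the state the workers share.

import asyncio
import threading

counter = 0
counter_lock = threading.Lock()

def bump():
    # Runs on a worker thread: without the lock, updates to the shared
    # counter could race with one another.
    global counter
    with counter_lock:
        counter += 1

async def main():
    loop = asyncio.get_running_loop()
    # Each of these futures may execute on a different worker thread.
    await asyncio.gather(*(loop.run_in_executor(None, bump) for _ in range(100)))
    print(counter)  # reliably 100, thanks to the mutex

asyncio.run(main())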
Can we create a static analysis that rules out the bad cases? The non-local
nature of the problem means that type systems definitely won’t be enough to
completely solve the problem, though we might be able to make them rule out some
cases. Perhaps a more sophisticated, bespoke, static analysis could do what we
want, but I suspect that it will be hard to create an analysis which accepts
enough good cases and rejects enough bad cases to be worthwhile.
Perhaps we should just run all potentially deadlocking futures in another
thread? That sounds just like the solution to the problem of deadlocking
finalisers! However, the only way this would work is if every potentially
deadlocking future / promise was executed on a thread that did not currently
hold a mutex. In general, we won’t be able to tell if a particular future /
promise might have the potential to deadlock. That would mean having to
put many, perhaps most, futures on a thread where we know they can’t interfere
with any other half-executed future. For deeply nested async functions, this
could require the use of an unbounded number of threads. I suspect the
resulting performance would be terrible.
Summary
As this post might have suggested, I am less optimistic about
the long term desirability of async code than many other people. Async does
solve some problems extremely well — but it introduces new ones that I, and I
believe many others, find harder to reason about. This post has touched on one,
but there are others.
Fundamentally, one way the analogy in this post breaks down is that scheduling
finalisers has a simple, correct, fairly efficient solution: put them on their
own thread. I do not see that there is, or can be, a practical equivalent for the async
case, where scheduling is necessarily more complex. I fear that futurelocks will
always be a problem.
Personally, I love the fact that (at least on non-embedded devices), operating
system threads are cheap to create and run — and they give true parallelism
to boot! Rust also makes writing reliable multi-threaded code vastly easier
than in any other language I have used. When problems do occur – which has so far been rare – I find them relatively easy to debug.
Threads aren’t a complete solution, though. In particular, they’re a very
heavyweight way of solving problems like “which of my many open network sockets
has data for me to read?” For those cases, I manually use poll or its
equivalents. I’m not a fan of the API – the number of ways to detect when a
file has been fully read with poll is mind-boggling and not consistent
across platforms – but at
least problems with it remain localised to the code calling poll.
However, I appreciate that such opinions put me in a tiny minority. Despite
that, I hope that my post has shown you that there is a more general problem
which is worth knowing about. I also encourage you to listen to the Oxide
podcast, which is one of my favourite technical podcasts! Without a poke from
that, I wouldn’t have thought to write this post at all!