This post is about optimizing an extremely simple AST-walking interpreter for Zef, a dynamic language I created for fun, to the point where it is competitive with the likes of Lua, QuickJS, and CPython.
Why?
Most of what gets written about making language implementations fast focuses on the work you’d do when you already have a stable foundation, like writing yet another JIT (just-in-time) compiler or fine-tuning an already pretty good garbage collector. I’ve written a lot of posts about crazy optimizations in a mature JS runtime. This post is different. It’s about the case where you’re starting from scratch, you’re nowhere near writing a JIT, and your GC isn’t your top problem.
The techniques in this post are easy to understand – there’s no SSA, no GC, no bytecodes, no machine code – yet they achieve a massive 16x speed-up (67x if you include the incomplete port to Yolo-C++) and bring my tiny interpreter into the ballpark of QuickJS, CPython, and Lua.
The techniques I’ll focus on in this post are:
- Value representation.
- Inline caching.
- Object model.
- Watchpoints.
- Grinding through common sense optimizations.
To evaluate my progress, I created a benchmark suite called ScriptBench1. This has ports of classic language benchmarks to Zef:
- Richards (OS scheduler)
- DeltaBlue (constraint solver)
- N-Body (physics simulation)
- Splay (binary tree test)
These benchmarks are also available in a wide variety of other languages. I found existing ports of these benchmarks to JavaScript, Python, and Lua. For Splay, there weren’t existing Python and Lua ports, so I used Claude to port them.
All experiments run on Ubuntu 22.04.5 running on an Intel Core Ultra 5 135U with 32GB RAM and Fil-C++ version 0.677. Lua 5.4.7 is compiled with GCC 11.4.0. QuickJS-ng 0.14.0 is the binary from QuickJS’s GitHub releases page. CPython 3.10 is just what came with Ubuntu.
All experiments use the average of 30 randomly interleaved runs.
To be clear: for most of this post, I’ll be comparing my interpreter compiled with Fil-C++ to other folks’ interpreters compiled with Yolo-C compilers.
This post starts with a high-level description of the original AST-walking, hashtable-heavy Zef interpreter, followed by a section for each optimization that I landed on my journey to a 16.6x speed-up.
The original Zef interpreter was written with almost no regard for performance. Only two performance-aware choices were made:
- The value representation is a 64-bit tagged value that may hold a double, a 32-bit integer, or an Object*. Doubles are represented by offsetting them by 0x1000000000000 (a technique I learned from JavaScriptCore; the literature has taken to calling this NuN tagging). Integers and pointers are represented natively, and I’m relying on the fact that no pointer will have a value below 0x100000000 (a dangerous choice, but one that you could force to be true; note that I could have represented integers by giving them a high bit tag of 0xffff000000000000 if I was worried about this). This makes it easy to have fast paths for operations on numbers (because you can detect if you have a number, and what kind, with a bit test). Even more importantly, this avoids heap allocations for numbers. If you’re building an interpreter from scratch, it’s good to start by making good choices about the fundamental value representation, since it’s super hard to change later! 32-bit or 64-bit tagged values are a standard place to start if you’re implementing a dynamically typed language.
- I used some kind of C++. It’s important to pick a language that allows me to do all of the optimizations that language implementations eventually grow to have, and C++ is such a language. Notably, I would not pick something like Java, since there’s a ceiling to how many low level optimizations you can do. I would also not pick Rust, since a garbage collected language requires a heap representation that has global mutable state and cyclic references (though you could use Rust for some parts of the interpreter, if you were happy with being multilingual; or you could use Rust if you were happy with lots of unsafe code).
I also made tons of expedient choices that were wrong from a performance engineering standpoint:
- I used Fil-C++. This did allow me to move very quickly – for example, I get a garbage collector for free. Also, it meant that I spent zero time debugging memory safety issues (Fil-C++ reports memory safety violations with a pretty stack trace and lots of diagnostics) or undefined behavior (Fil-C++ does not have undefined behavior). Fil-C++ costs about 4x performance typically, so I’m starting with that 4x handicap, on top of all of the other suboptimal choices.
- Recursive AST walking interpreter. The interpreter is implemented as a virtual Node::evaluate method that gets overridden in a bunch of places.
- Strings everywhere. For example, the Get AST node holds a std::string to describe the name of the variable that it’s getting, and that string is used each time a variable is accessed.
- Hashtables everywhere. When that Get executes, the string is used as a key to a std::unordered_map, which contains the variable value.
- Chains of recursive calls to crawl the scope chain. Zef allows almost all constructs to be nested, and nesting leads to closures; for example, class A nested in function F nested in class B nested in function G means that member functions of class A can see A’s fields, F’s locals, B’s fields, and G’s locals. The original interpreter achieved this by recursing in C++ over functions that can query different scope objects.
That said, those poor choices allowed me to implement an interpreter for a fairly sophisticated language with very little code. The largest module by far is the parser. Everything else is simple and crisp.
This interpreter was 35x slower than CPython 3.10, 80x slower than Lua 5.4.7, and 23x slower than QuickJS-ng 0.14.0. Let’s see how far we can get by implementing a bunch of optimizations!
The first optimization is to have the parser generate distinct AST nodes for each operator as opposed to using the DotCall node with the name of the operator.
In Zef, this:
a + b
Is identical to this:
a.add(b)
So, the original interpreter would parse a + b to DotCall(a, "add") with b as an argument. That led to slow execution, since every single math operation involved a string lookup of the operator’s method name:
With this optimization, we have the parser create Binary<> and Unary<> nodes. With the help of some template and lambda magic, these nodes have separate virtual overrides for Node::evaluate per operator. These call directly into the corresponding Value fast paths for those operators. Hence, doing a + b now results in a call to Binary, which then calls Value::add.
This change is a 17.5% speed-up. At this point, Zef is 30x slower than CPython 3.10, 67x slower than Lua 5.4.7, and 19x slower than QuickJS-ng 0.14.0.
In the previous optimization, we made operators fast by avoiding string-comparison-based dispatch. But that change didn’t affect all operators! The RMW (read-modify-write) forms of those operators, like:
a += b
still used string-based dispatch. So, the second optimization is to have the parser generate distinct nodes for each of the RMW cases. What’s happening here is that the parser requests that LValue nodes replace themselves with an RMW via the makeRMW virtual call:
- Get – corresponds to getting a variable, i.e. just id
- Dot – corresponds to expr.id
- Subscript – corresponds to subscripting, i.e. expr[index]
Each of these virtual calls uses the SPECIALIZE_NEW_RMW macro to create template-specialized forms of:
Note that while the rest of the operator specialization (from change #1) uses lambdas to dispatch to the appropriate operator function on Value, for RMWs we use an enumeration. This is a practical choice given the number of places we have to thread the enum through to handle the fact that we may arrive at an RMW three different ways (get, dot, and subscript). All of this magic then bottoms out in the Value::callRMW<> template function, which dispatches the actual RMW operator call.
This change is a 3.7% speed-up. At this point, Zef is 29x slower than CPython 3.10, 65x slower than Lua 5.4.7, and 18.5x slower than QuickJS-ng 0.14.0. We’re now 1.22x faster than where we started.
The Value fast paths have a small problem: they use isInt(), which uses isIntSlow(), which does a virtual call to Object::isInt() to check if we’re really dealing with an int.
This is happening because the Zef value representation in the original interpreter had four distinct cases:
- A tagged int32 value.
- A tagged double value.
- An IntObject for int64’s that cannot be represented as int32’s.
- All other objects.
In the IntObject case, Value still drove the dispatch for all integer methods, since that allowed the interpreter to have just one implementation of all math operators (and that implementation was always in Value).
This simple optimization causes Value fast paths to only consider int32 and double, and puts all IntObject handling in IntObject itself. Additionally, this change avoids the isInt() call on every method dispatch.
This is a 1% speed-up. At this point, Zef is 29x slower than CPython 3.10, 65x slower than Lua 5.4.7, and 18x slower than QuickJS-ng 0.14.0. We’re now 1.23x faster than where we started.
The original Zef interpreter uses std::string everywhere. Particularly brutal cases:
This is unfortunate because it means that all of these lookups don’t just involve hashtables – they involve hashtables keyed by those strings! So we’re hashing and comparing strings all the time when executing Zef.
This next optimization uses pointers to hash-consed Symbol objects instead of strings for all of those lookups. This is a large change in terms of files impacted, but it’s really quite simple:
This is a 18% speed-up. At this point, Zef is 24x slower than CPython 3.10, 54x slower than Lua 5.4.7, and 15x slower than QuickJS-ng 0.14.0. We’re now 1.46x faster than where we started.
This change delivers a significant win by allowing inlining of important functions.
Almost all of the action in this change is the introduction of the new valueinlines.h header. This is a separate header from value.h, since it relies on headers that themselves need to include value.h.
This is a 2.8% speed-up. At this point, Zef is 24x slower than CPython 3.10, 53x slower than Lua 5.4.7, and 15x slower than QuickJS-ng 0.14.0. We’re now 1.5x faster than where we started.
Sometimes the only way to make your language implementation better is to land a massive patch. Don’t let anyone tell you that good engineering happens in small, easy to digest changes. That’s not always the case! It’s certainly not the case if you want to have a fast implementation of a dynamic language!
This is a massive change that redoes how Object, ClassObject, and Context work so that objects are cheaper to allocate and accesses can avoid hashtable lookups. This change combines three changes into one:
Object model
Previously, each lexical scope allocated a Context object, and each Context object contained a hashtable of fields – i.e. the variables in that scope. Objects were even worse: each object was a hashtable that mapped the classes that the object was an instance of to Context objects. This was necessary because if you have an instance of Bar that descends from Foo, then Bar and Foo could both close over different scopes and they could share the same names for distinct fields (since fields are private by default in Zef). Clearly this is super inefficient! This change introduces the idea of Storage, which holds data at Offsets determined by some Context. So, Contexts still exist, but they are created ahead of time as part of the AST resolve pass; when objects or scopes are created, we just allocate a Storage according to the size computed by the corresponding Context.
Inline caches
This is a classic technique that forms the foundation of modern high performance dynamic language implementations. But while this technique is classically discussed in the context of JIT compilers, in this change we’ll use it in an interpreter. The idea of inline caches is that given a location in code that does expr.name, we remember the last type that expr dynamically had and the last offset that name resolved to. In this change, the remembering is done by placement-constructing a specialized AST node on top of the generic one. There are five parts to this:
- The CacheRecipe object is used to track what a particular access did, and whether it’s cacheable.
- There are calls to CacheRecipe sprinkled throughout Context, ClassObject, and Package.
- AST evaluation functions like Dot::evaluate pass the CacheRecipe they got from whatever polymorphic operation they performed to constructCache<>, along with this.
- constructCache compiles a new AST node specialization according to the CacheRecipe. Yes, compiles: constructcache.h has a template machine that can produce many different kinds of specialized AST nodes. For example, it might specialize to a direct load on the Storage that the AST node was passed (like in the case of a local variable access). Or, it might emit a class check (does this object still have the class I last saw?) followed by a direct function call to whatever function was last seen. These may be composed with chain steps and watchpoints if the access being cached involved walking the scope chain.
- Each AST node that is subject to caching has a cached variant that first attempts a fast call into a cache object, whose type is determined by constructCache<>.
Watchpoints
Say that we have a class Foo inside a lexical scope that has a variable x, and one of Foo’s methods wants to access x. And, let’s say that there are no functions or variables called x inside Foo. We should be able to access x without any checks, right? Well, not quite – someone could subclass Foo and add a getter called x, in which case that access should resolve to the getter, not the outer x. The way that inline caches handle this is by setting Watchpoints within the runtime. In this example, it’s the “was the name overridden” watchpoint.
Putting it together
Each of these three features is large. I chose to implement all of them at once because:
- A new object model would not be meaningfully better unless it allowed inline caching to work well. So, I co-developed the object model and inline caches.
- Inline caches wouldn’t provide meaningful benefit unless I also had watchpoints, because so many cacheable conditions require watchpoints.
- The new object model and watchpoints have to work great together.
I started this change by writing a dumb version of CacheRecipe along with what ended up being the mostly final version of Storage and Offsets.
Some of the hardest work involved replacing the old style of intrinsic classes with a new style. Take arrays as an example. Previously, ArrayObject::tryCallMethod implemented all ArrayObject methods by simply intercepting the virtual Object::tryCallMethod call. But in the new object model, Object has no vtable and no virtual methods; instead Object::tryCallMethod forwards to object->classObject()->tryCallMethod(object, ...). So, for Array to have methods, we need to create a class for Array that has those methods. Hence, this change shifts a lot of intrinsic functionality from being spread throughout the implementation to being focused inside makerootcontext.cpp. This is a good outcome, because it means that all of the inline caching machinery just works for native/intrinsic functions on objects!
This massive change has a massive win: 4.55x faster! At this point, Zef is 5.2x slower than CPython 3.10, 11.7x slower than Lua 5.4.7, and 3.3x slower than QuickJS-ng 0.14.0. In other words, Zef’s margin of loss against those other interpreters is right around what Fil-C costs (Zef is compiled with Fil-C++, while those other interpreters are compiled with Yolo-C).
We’re now 6.8x faster than where we started.
Before this change, the Zef interpreter would pass arguments to functions using const std::optional<std::vector<Value>>&. The optional was needed because in some corner cases we have to distinguish between:
o.getter
and:
o.function()
In most cases, in Zef, these two things are the same: they are a function call. Here’s an exception:
o.NestedClass
vs:
o.NestedClass()
The first case gets the NestedClass object, while the second case instantiates it.
Therefore, we need to tell if we’re passing an empty arguments array because this is a function call with zero arguments, or an empty arguments array because this was a getter-like call.
In any case, this is wildly inefficient because it means that the caller is allocating a vector and then the callee is allocating an arguments scope that is a copy of that vector.
This change introduces the Arguments type, which is shaped exactly like the arguments scope that the callee would have allocated. So, now we have the caller allocate these directly. This more than halves the number of allocations needed to make a call:
A lot of this change is just changing function signatures to take Arguments* instead of the optional vectors.
This is a 1.33x speed-up. At this point, Zef is 3.9x slower than CPython 3.10, 8.8x slower than Lua 5.4.7, and 2.5x slower than QuickJS-ng 0.14.0. We’re now 9.05x faster than where we started.
Like Ruby and many other object oriented languages, Zef has private instance fields by default. They are private in the sense that only that instance can see them. Take this code:
class Foo {
my f
fn (inF) f = inF
}
This is a class Foo that takes a value for f in its constructor, and stores it to a local variable scoped just to instances. For example, this wouldn’t work:
class Foo {
my f
fn (inF) f = inF
fn nope(o) o.f
}
println(Foo(42).nope(Foo(666)))
The o.f expression in nope cannot access o’s f even though o is of the same type. This is just an outcome of the fact that fields work by appearing in the scope chain of class members. When we do something like o.f, we’re asking to call a method called f. Hence, we get lots of code like:
class Foo {
my f
fn (inF) f = inF
fn f f # method called f that returns local variable f
}
Or, more tersely:
class Foo {
readable f # shorthand for `my f` and `fn f f`
fn (inF) f = inF
}
Hence, lots of method calls end up being calls to getters. It’s super wasteful to have all of those calls evaluate the AST of the getter along with everything this entails!
So the next change is to specialize getters.
The heart of this change is in UserFunction, which uses the new Node::inferGetter method to infer whether the body of the function is just a getter. The important bits of this are:
This is a 5.6% speed-up. At this point, Zef is 3.7x slower than CPython 3.10, 8.3x slower than Lua 5.4.7, and 2.4x slower than QuickJS-ng 0.14.0. We’re now 9.55x faster than where we started.
Now let’s do the same thing to setters!
The inference is a bit more complex this time, since we have to pattern match:
fn set_fieldName(newValue) fieldName = newValue
This means:
This is a 3.4% speed-up. At this point, Zef is 3.6x slower than CPython 3.10, 8x slower than Lua 5.4.7, and 2.3x slower than QuickJS-ng 0.14.0. We’re now 9.87x faster than where we started.
This is a one-line change to inline an important function.
It’s a 3.2% speed-up. At this point, Zef is 3.5x slower than CPython 3.10, 7.8x slower than Lua 5.4.7, and 2.2x slower than QuickJS-ng 0.14.0. We’re now 10.2x faster than where we started.
Before this change, if the inline cache for a method call missed, then we’d have to descend through ClassObject::tryCallMethod and ClassObject::TryCallMethodDirect, which are quite large and complex.
Also, that’s O(hierarchy depth), or more precisely, two hashtable lookups per level in the hierarchy:
This next change introduces a global hashtable, keyed by receiver class and symbol, that directly gives the callee in one lookup.
In classobject.h, we perform a lookup in this global table before descending into the full tryCallMethodSlow.
In classobject.cpp, we record successful lookups in the global table.
The global hashtable is itself quite simple.
This is a 15% speed-up. At this point, Zef is 3x slower than CPython 3.10, 6.8x slower than Lua 5.4.7, and 1.9x slower than QuickJS-ng 0.14.0. We’re now 11.8x faster than where we started.
In Fil-C++, the std::optional needs to be heap allocated because of a compiler pathology involving unions.
Normally, LLVM plays fast and loose with the types of memory accesses used for unions, but this breaks invisicaps: pointers in unions will sometimes – quite unpredictably to the programmer – lose their capabilities. This leads to Fil-C panics saying that you’re dereferencing an object with a null capability even though the programmer did nothing wrong. To mitigate this, the Fil-C++ compiler inserts intrinsics that force LLVM to be conservative about handling local variables of union type. Later, the FilPizlonator pass performs its own escape analysis to try to allow union typed locals to be register allocated – however, this analysis is nowhere near as complete as the usual LLVM SROA analysis.
The result: passing around classes like std::optional that have a union in them often leads to memory allocations in Fil-C++!
So, this change avoids a code path that leads to std::optional on a hot path.
This is a 1.7% speed-up. At this point, Zef is 3x slower than CPython 3.10, 6.65x slower than Lua 5.4.7, and 1.9x slower than QuickJS-ng 0.14.0. We’re now 12x faster than where we started.
All of the built-in functions in Zef take one or two arguments, and the native implementations have no need for an Arguments object to be allocated to hold those; they could just as easily be taking those arguments directly.
Setters all take one argument, and if we infer the setter, then the specialized setter implementation has no need for an Arguments object to be allocated; again, that code could just take the value argument directly.
This next change introduces specialized arguments types called ZeroArguments, OneArgument, and TwoArguments to allow the caller to avoid allocating the Arguments object in cases where the callee doesn’t need it.
At the heart of this change are the new specialized arguments types. Note that we need ZeroArguments to distinguish from passing a (Arguments*)nullptr. We already used (Arguments*)nullptr to mean “getter call”, and we leave that logic alone, so now ZeroArguments means “function call with no arguments”.
A lot of this change is about templatizing functions that take arguments and then explicitly instantiating them for ZeroArguments, OneArgument, TwoArguments, and Arguments*. A lot of code was already using Value::getArg as a helper to extract arguments, so this change just adds overloads for the argument specializations. Consequently, changes to native code that uses arguments are quite straight-forward.
This is a 3.8% speed-up. At this point, Zef is 2.9x slower than CPython 3.10, 6.4x slower than Lua 5.4.7, and 1.8x slower than QuickJS-ng 0.14.0. We’re now 12.4x faster than where we started.
This next change gets a big speed-up from working around another Fil-C pathology.
Before this change, Value out-of-line slow paths were member functions of Value, meaning that they took an implicit const Value* argument. This means that the caller must stack-allocate a Value.
In Fil-C++, all stack allocation is a heap allocation. Hence, code that called slow paths would allocate Value in the heap!
This changes those methods to be static and take Value by value, obviating the need for any allocation.
This is a 10% speed-up. At this point, Zef is 2.6x slower than CPython 3.10, 5.8x slower than Lua 5.4.7, and 1.65x slower than QuickJS-ng 0.14.0. We’re now 13.6x faster than where we started.
This change removes some duplicated code.
I was hoping it would be a speed-up because less machine code is better – especially in a template function that gets specialized by constructCache<>.
But it wasn’t: this change has no effect on performance.
Inline caches are good at routing calls to exactly the function you want, but they only work for objects. For non-objects, we’re relying on the fact that Binary<>, Unary<>, and Value::callRMW<> take us to fast paths that check if the receiver is an int or double.
But this only works for operators that are recognized by the parser. It doesn’t work if we do value.sqrt, for example.
So this next change teaches Dot to specialize for value.sqrt.
This is a 1.6% speed-up. At this point, Zef is 2.6x slower than CPython 3.10, 5.75x slower than Lua 5.4.7, and 1.6x slower than QuickJS-ng 0.14.0. We’re now 13.8x faster than where we started.
This is almost exactly like the previous optimization, but for toString.
There’s a bit of extra logic in this change to reduce the number of allocations that happen when converting an int to a string.
This is a 2.7% speed-up. At this point, Zef is 2.5x slower than CPython 3.10, 5.6x slower than Lua 5.4.7, and 1.6x slower than QuickJS-ng 0.14.0. We’re now 14.2x faster than where we started.
Say that we do:
my whatever = [1, 2, 3]
This has to allocate a new array, since arrays are aliasable and mutable in Zef. But worse, prior to this change, this would recurse down the AST every single time to evaluate 1, 2, and 3.
This change specializes the ArrayLiteral node for the case where it’s allocating a constant array.
This is an 8.1% speed-up. At this point, Zef is 2.3x slower than CPython 3.10, 5.2x slower than Lua 5.4.7, and 1.5x slower than QuickJS-ng 0.14.0. We’re now 15.35x faster than where we started.
Previously, we got a big speed-up from not passing Value by reference. This is the same optimization, just for the callOperator slow path.
This is a 6.5% speed-up. At this point, Zef is 2.2x slower than CPython 3.10, 4.9x slower than Lua 5.4.7, and 1.4x slower than QuickJS-ng 0.14.0. We’re now 16.3x faster than where we started.
This change disables RTTI and libc++ hardening, since they’re superfluous in Fil-C++.
There aren’t any C++ changes here; it’s just a build system configuration change.
This is a 1.8% speed-up. At this point, Zef is 2.1x slower than CPython 3.10, 4.8x slower than Lua 5.4.7, and 1.35x slower than QuickJS-ng 0.14.0. We’re now 16.6x faster than where we started.
This last optimization turns off assertions by default.
Previously, the code used the Fil-C-specific ZASSERT macro, which means “assert all the time”. Now the code uses the internal ASSERT macro, which means “only assert if `ASSERTS_ENABLED` is set”.
This change also includes some other changes to make the code build in Yolo-C++.
I was hoping that this would be a speed-up, but it wasn’t.
Finally, I tried compiling the code with Yolo-C++. This yielded a 4x speed-up, though it’s both unsound and suboptimal:
- It’s unsound because we’re turning what used to be calls into the Fil-C++ GC into calls to calloc, which means that memory is never freed. This interpreter will run out of memory for sufficiently long-running workloads. It happens to not run out of memory on ScriptBench1 because those tests run for only a short time.
- It’s suboptimal because real GC allocators are faster than the calloc that comes with glibc 2.35.
So, if I did add a real GC to Zef as part of the Yolo-C++ port, I’d probably get more than a 4x speed-up.
I used GCC 11.4.0 for this experiment.
At this point, Zef is 1.9x faster than CPython 3.10, 1.2x slower than Lua 5.4.7, and 3x faster than QuickJS-ng 0.14.0. We’re 67x faster than where we started!
Here’s raw data, showing the execution times in seconds of all of the benchmarks, as well as the geomean of those benchmarks, on each interpreter.