Why German Strings are Everywhere

Strings are conceptually very simple: it’s essentially just a sequence of characters, right? Why, then, does each programming language have its own slightly different string implementation? It turns out there’s a lot more to a string than “just a sequence of characters”,

We are no different and have created our own custom string type that is highly optimized for data processing. Although we didn’t expect it when we first wrote about it in our inaugural Umbra research paper, many new systems adopted our format. They are now implemented in DuckDB, Apache Arrow, Pollers, and Facebook Velox.

In this blog post, we would like to tell you more about the advantages of our so-called “German strings” and the tradeoffs we have made.

But first, let’s take a step back and look at how strings are typically implemented.

How does C do this?

In C, strings are just a sequence of bytes with ambiguous promises. \0 The byte will terminate the string at some point.

cstring

This is a very simple model conceptually, but very cumbersome in practice:

  • What if your string doesn’t end? If you’re not careful, you can read beyond the intended end of the string, which is a huge security problem!
  • Simply calculating the length of the string forces you to iterate over the whole thing.
  • What if we want to extend the string? We have to allocate new memory, move it there and free the old space ourselves.

How does C++ do this

C++ exposes strings very well through its standard library. The C++ standard doesn’t enforce any specific implementation, but here’s how libc++ Is it:

stdstring

Each String object stores its size (already better than C!), a pointer to the actual payload, and the capacity of the buffer in which the data is stored. As long as the resulting string is still smaller than the buffer capacity, you are free to add to the string “for free” and the string will take care of allocating a larger buffer and freeing the old one if it grows too much:
std::stringAre variable,

This string implementation also allows very important “short string optimizations”: a short enough string can be stored “in place”, that is, we set a specific bit in it. capacity fields and balances capacity as well as size And ptr Become the string itself. This way we avoid allocating a buffer and pointer dereference every time we access the string. By the way, an optimization that is impossible in Rust 😉,

If you’re interested in a more in-depth overview, take a look at Raymond Chen’s very detailed blog post about various std::string execution

Can we do better?

C++, especially with short string optimizations, already performs quite well. However, if you know your use case, it turns out you can do much better.,

When building CedarDB, we made the following observations:

Most strings are short

Despite being able to store arbitrary amounts of text, most store fairly small and predictable data in their strings (as Vogelsgesang et al report in their brilliant “Get Real” paper).

Examples of such short strings are:

  • ISO country code (USA, DEU, GBR), 3 letters
  • IATA airport code (LHR, MUC), 3 letters
  • enms (MALE/FEMALE/NONBINARY, YES/NO/MAYBE), usually less than 10 characters
  • isbn (0465062881), 10 or 13 digits

We definitely want to optimize for such short strings wherever possible.

Wires are not usually changed that often

Most data is written only once, but read multiple times. libc++ If one wants to extend the string then the approach of reserving 64 bits per string just for storing capacity seems wasteful when the size of the string does not change often.

Also, if we don’t use expensive locking techniques or don’t think very carefully about our application, simultaneously accessing and modifying a string can lead to data races.

For these two reasons, we would like adamant Strings whenever we can avoid it.

We usually only look at a small subset of the string

Take a look at the following for the SQL query:

select * from messages where starts_with(content, 'http');

We only want to see the first four characters of each string. It seems kind of wasteful to always dereference a pointer just to compare the first four characters.

Note that short string optimization libc++ Here we cannot save: if the string is not short, we have to follow the pointer, even if we only care about the prefix.

Let’s look at another question:

select sum(p.amount)
from purchases p, books b
where p.ISBN = b.ISBN and b.title = 'Gödel, Escher, Bach: An Eternal Golden Braid';

Here, we need to compare all the ISBNs to each other and all the book titles with a string constant. Although we need to compare the entire string to make sure we have a match, most books will probably not be called “Gödel, Escher, Bach: An Eternal Golden Braid” (dear non-German reader, how many strings do you know starting with “Go”?). If a character at the beginning of the string is already different, we can discard it without checking the rest of the string.

german strings

To solve these problems, CedarDB’s research predecessor, Umbra, invented what Andy Pavlow now lovingly (we assume ;)) calls “German-style strings”.

Let’s Learn: Anatomy of the German String

The first major change is that each string is represented by a single 128-bit structApart from the obvious advantage of saving one third of overhead std::string by knocking down capacity fields, it also allows us to pass strings through two registers in a function call instead of placing them on the stack.

Do you want to know how this affects the function call? We cover the benefits of the 16B layout for conference calling in our in-depth study.

it struct Contains one of the following two representations:

short string representation

Here is the memory layout for short strings:

shortstring

As long as the string to be stored is 12 characters or less, we store the content directly in the same location.

Accessing the content, or just a prefix, is easy: just start reading straight away len Field, no pointer dereference needed!

long string representation

Strings longer than 12 characters require some additional consideration:

longstring

Like C++ strings, we also store len A pointer to fields and data, but with some changes:

Length

To squeeze the entire string into 128 bits, we truncate the length field to 32 bits. This limits the string size to 4 GiB, a compromise we were willing to make for our use case.

Prefix

Right after the length we also store a four character prefix. This greatly speeds up tasks like prefix equality/dissimilarity comparisons, lexicographic sorting, or prefix comparisons, because we save a pointer dereference.

indicator

The pointer points to a memory area of ​​exactly the size of the string. no buffer with some left capacity Here!

The obvious advantage is that we save 64 bits for capacity area and can tightly pack the payload of different strings without gaps.

Since the data we are pointing to is now immutable, we can even read it without acquiring a read lock, because the data is guaranteed to never change as long as the string lives.

We also need to be aware of the downside: appending data to a string is now a relatively expensive operation, because a new buffer must be allocated and the payload copied. However, in our use case this is not a big problem, since database systems rarely update data anyway.

storage class

While developing our string concept, we noticed that developers have different requirements for the lifetime of strings depending on where they use them. We call these “storage classes”, which you can specify when you create a string. can be a string persistent, transientOr temporaryTo encode this storage class, we steal two bits from the pointer,

Let’s start with the cases you might already know from your favorite programming language:

  • temporary Strings behave like their C++ counterparts: when constructing a temporary string, the string constructor itself allocates some memory, stores the payload, and sets the pointer correctly. When the string goes out of scope, the memory is freed, RAII style.
  • persistent Strings behave like string constants: they remain valid forever. All short strings are always contiguous strings, because you can only pass them through the stack or CPU registers. Long strings can also be persisted, for example, when referencing C++ string literals. When your program starts the memory holding the string literal data is automatically statically allocated and is never deleted, so referencing the string data is always valid.

But there is a third pattern: the data we need to store in database systems is often larger than RAM, and so some of it is diverted to disk. With traditional strings, if we load a page containing a string from disk, we have to

  • First load the page into memory,
  • And then initialize a new string that internally copies the data to the new initial memory.

This process copies the string data twice. This is useless, because many times we want to access the string only once. For example, consider the following query:

select * from books where starts_with(title,'Tutorial')

If we filter all books for matches, most strings will not qualify. We will never need to show them to the person issuing the query, so why copy them if we don’t need to access them later?

We want a string that is very cheap to construct and points to an area of ​​memory currently Valid, but may become invalid later without control over the string.

right here transient Strings come. They point to data that is currently valid, but may become invalid later, for example, when we swap the page to disk after releasing the lock on the page on which the payload is stored.

There is virtually no overhead in creating them: they simply point to an externally managed memory location. No memory allocation or data copying required during creation! When you access a transient string, the string itself will not know whether the data it points to is still valid, so as a programmer you need to make sure that every transient string you use is actually still valid. So if you need to access it later, you have to copy it to memory you control. If you don’t need it later, we’ve just accessed one string without doing any expensive initialization!

distinguish between representations

How do we know whether we are dealing with a short string or a long string? It’s actually quite simple! If its size is 12 characters or less, it is a short string. Since our strings are immutable, there is no special case where a longer string becomes shorter or vice versa.

If we only want to access the prefix, we don’t even need to check whether the string is long or short. In either case, bits 32-63 will be the first four characters.

conclusion

German strings have many advantages: you get great performance due to space savings and less allocation and data movement. Since the data is always treated as immutable, it is also very easy to parallelize your string handling code. Thanks to its concept of storage classes, you can tightly manage the lifetime of your strings, trading performance for ease of use when necessary.

Of course, nothing comes without challenges: German strings require you to think more deeply about your application: What is the lifetime of my string? Can I avoid this? transient String, or do I have to copy it? Will my strings be updated frequently? Am I ok with immutable strings?

If you’re willing to ask yourself these questions, you can benefit a lot from German strings, even if you’re not creating a database.

Do you want to harness the power of German strings? Don’t be shy and join our waiting list to be among the first to get access to CedarDB!



<a href

Leave a Comment