Compressed Filesystems à la Language Models

Every systems engineer has been eager to write a file system at some point in their journey. It seems daunting at first – and writing a battle-tested file system is indeed difficult – but the minimum surface area for a “working” FS is surprisingly small, simple, and easy to delegate to coding agents.

In fact, one of my smoke tests for new coding models is to see how good a file system they can write in one shot! At some point, I had so many file systems lying around – and coding models that were getting pretty good – that I started wondering whether the models were actually intelligent enough to model the file system engine itself.

A file system is the perfect black-box API to model with quirky backends (see “Harder Drives”), and apart from the joy of training LLMs for fun, there were some deeper truths about language models that I wanted to explore.

So I started training on a file system. Building on top of one of my discarded FUSEs, after a few more rounds of iteration I reused it as a loopback against the host with additional logging. I needed two things to generate reference fine-tuning data – the first being the logging loopback itself:

class LoggingLoopbackFS(LoggingMixIn, Operations):
    """
    A loopback FUSE filesystem that logs all operations for training data.
    
    This implementation delegates all filesystem operations to a real directory
    on the host filesystem, ensuring perfect semantic correctness while logging
    every operation for LLM training data.
    """

I then wrote a file system interaction simulator, sampling various operations against the sandboxed LoggingLoopbackFS to generate FUSE prompt/completion pairs. To keep things simple, I captured only the minimal set of operations required for R/W-ish capability (no open, xattrs, fsync, etc.).
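As a sketch, the simulator boils down to sampling (operation, arguments) pairs and replaying them against the mount. The operation set below mirrors the minimal one described above, while the path scheme and episode length are made up for illustration:

```python
import random

# Minimal R/W-ish operation set (no open, xattrs, fsync, etc.).
OPS = ["getattr", "readdir", "read", "write", "unlink", "chmod", "truncate"]

def sample_episode(rng: random.Random, n_ops: int = 5) -> list[tuple[str, str]]:
    episode = []
    for _ in range(n_ops):
        op = rng.choice(OPS)
        # Hypothetical path scheme; real paths would come from the sandbox state.
        path = f"/dir{rng.randint(0, 9)}/file{rng.randint(0, 99)}.txt"
        episode.append((op, path))
    return episode
```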

Along with the FUSE operation, I captured the full file system state at every turn. I experimented with different formats, including an ASCII-art representation, but ultimately settled on XML because it enforced clear delimiters in the prompt and had canonical parsers available.
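One nice consequence of “canonical parsers available”: a model completion can be validated mechanically. A sketch with Python’s stdlib parser, using illustrative tag names (`dir`/`file`) rather than the exact schema from the experiment:

```python
import xml.etree.ElementTree as ET

def extract_files(xml_text: str) -> dict[str, str]:
    # Round-trip the completion through a stock XML parser; malformed
    # output raises ET.ParseError instead of silently corrupting state.
    root = ET.fromstring(xml_text)
    return {f.get("name"): (f.text or "").strip() for f in root.iter("file")}

sample = """<dir name="/" mode="755">
  <file name="hello.txt" mode="644" size="6">hello
</file>
</dir>"""
```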

Given a FUSE operation + the XML file system tree, the model learned two forms of completion:

  • reads (getattr, readdir, read): return the requested content/metadata as per the operation
  • writes (unlink, chmod, truncate, write): output the full file system tree state after the modification

Example prompt (read):


read('/usr14/log767.rs', size=4096, offset=0, fh=4)
---

<dir path="/" name="/" mode="755" owner="root" group="root" mtime="2025-01-01T00:00:00">
  <dir path="usr14" name="usr14" mode="755" owner="root" group="root" mtime="2025-01-01T00:00:00">
    <file path="log767.rs" name="log767.rs" mode="644" owner="root" group="root" mtime="2025-01-01T00:00:01" size="276">
fn main() {
    match process(7) {
        Ok(result) => println!("Result: {}", result),
        Err(e) => eprintln!("Error: {}", e),
    }
}
    </file>
    <file path="…" name="…" mode="755" owner="root" group="root" mtime="2025-01-01T00:00:01" size="268">
#!/bin/bash
echo "temp912" || exit 1
    </file>
  </dir>
</dir>

Completion:

fn main() {
    match process(7) {
        Ok(result) => println!("Result: {}", result),
        Err(e) => eprintln!("Error: {}", e),
    }
}

Fine Tuning #

Once I had clean, representative, and diverse file system simulation data, it was pretty straightforward to actually run SFT on the model. Over a few iteration cycles spread across free time, I ended up with ~98% accuracy on a held-out evaluation after 8 epochs of SFT on the N=15,000 dataset with Qwen3-4B.

Most of my time here was spent cleaning the generated data and making sure each FUSE operation was adequately represented + enough “complex” trees were generated to learn from.

At this point, I wrote… possibly the smallest file system I’ve ever seen… to wire my model into the real world. Each FUSE operation was a passthrough to the LLM, for example:

class LLMFuse(LoggingMixIn, Operations):
    ...
    def chmod(self, path, mode):
        """Change file permissions."""
        response = self._query_llm_for_operation('chmod', path, mode=oct(mode))
        if not self._handle_llm_response(response):
            raise FuseOSError(ENOENT)
        return 0
    ...

Good! Now I had a mountable FUSE that was completely “implemented” by a language model. As you can see below, I was able to ls around, echo into files, and cat them back out.

Moving around a Docker container with mounted llmfuse.

Perhaps the most glaring inefficiency in this setup is the sheer verbosity of the XML-based representation. I was spending many bytes to represent attributes and tree structure that could have been encoded far more efficiently (~O(bits)) in a standard C structure.
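To make the verbosity concrete, here is a back-of-the-envelope sketch: the same inode metadata that costs dozens of XML bytes fits in a fixed-width binary record (the field layout here is illustrative, not a real on-disk format).

```python
import struct

# mode (u16), uid (u32), gid (u32), size (u64), mtime (u32): 22 bytes packed.
record = struct.pack("<HIIQI", 0o644, 0, 0, 14, 1735689600)

xml_attrs = b'mode="644" owner="root" group="root" size="14" mtime="2025-01-01T00:00:00"'
print(len(record), len(xml_attrs))  # 22 vs 74 bytes for the same information
```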

However, as I fine-tuned on the XML file system tree representation, I was baking this same structure into the weights and probability distributions of my Qwen fork! If only there were a way to leverage this to compress state…

Two sides of the same coin #

As it turns out, compression and AI are closely related. Using an LLM to compress text losslessly is not an obvious application, and the connection is not entirely intuitive. However, one researcher (Marcus Hutter) claimed back in 2006 that the two are equivalent (and actually bet 500K on this claim!).

So far, Hutter appears to be absolutely right. His enwik8 and enwik9 benchmark datasets are, today, best compressed by a 169M-parameter LLM (trained by none other than Fabrice Bellard in 2023).

At first glance it’s a little confusing. Surely LLM generation is not reversible? What kind of voodoo magic is going on here?

Arithmetic Coding #

The algorithm that enables reversible compression using an LLM is called “arithmetic coding” and builds on a 1948 result by Claude Shannon.

DeepMind researchers (including Hutter himself) have explained the math in detail, so I’ll direct the most curious readers there; but for a basic understanding of what’s going on, forget everything you know about working with LLMs today. There are no prompts involved!

(Figure: arithmetic coding – subdividing the unit interval according to the model’s token probabilities.)

Let us assume that the following holds for some predictive model \(M\) and the string “Lorem Ipsum Dolor”:

  • P(“Lorem” as the first word) = 0.57.
  • P(“Ipsum” | “Lorem”) = 0.67 (combined 0.38).
  • P(“Dolor” | “Lorem Ipsum”) = 0.5 (combined 0.19).

And so on and so forth, until you reach the end of the string you want to compress, ending up with some “final interval width” \(P(m)\) within the interval \([0,1]\) that represents your string.

Let’s say in our example it turns out to be 0.012. We can represent a number inside this interval in roughly \(- \log_{2}{P(m)} = 6.4\) bits, which is our final compressed size.
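The interval narrowing can be checked with a few lines of arithmetic, using the toy probabilities from the Lorem/Ipsum/Dolor example. These three steps alone leave a ~0.19-wide interval (about 2.4 bits); the 0.012 / 6.4-bit figure assumes the string continues further:

```python
import math

width = 1.0
for p in (0.57, 0.67, 0.5):  # P(Lorem), P(Ipsum | Lorem), P(Dolor | Lorem Ipsum)
    width *= p               # each token multiplies the interval width by its probability

bits = -math.log2(width)
print(round(width, 3), round(bits, 2))  # 0.191 2.39
```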

There are some great things about this algorithm:

  • Any number within the final interval uniquely determines the original string: an arithmetic decoder equipped with the same probabilistic model simply retraces the encoder’s subdivisions (see the line through the probability distribution above).
  • The inverse-log relationship between predictive power \(P(m)\) and compressed size pushes the burden of the “hard compression problem” onto deep learning machinery that can encode high-dimensional text patterns in model weights, providing far better compression ratios than deterministic algorithms.
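To demystify the “retracing” point, here is a toy arithmetic coder over a fixed three-symbol distribution (exact fractions sidestep floating-point issues). A real LLM-based coder queries the model for a fresh conditional distribution at every step instead of this static table, but the interval mechanics are the same:

```python
from fractions import Fraction

# Fixed toy distribution; an LLM coder would recompute this per token.
PROBS = {"a": Fraction(1, 2), "b": Fraction(1, 4), "c": Fraction(1, 4)}

def _cum(symbol):
    # Cumulative sub-interval [lo, hi) assigned to `symbol` within [0, 1).
    lo = Fraction(0)
    for s, p in PROBS.items():
        if s == symbol:
            return lo, lo + p
        lo += p
    raise KeyError(symbol)

def encode(message):
    lo, hi = Fraction(0), Fraction(1)
    for sym in message:
        c_lo, c_hi = _cum(sym)
        width = hi - lo
        lo, hi = lo + width * c_lo, lo + width * c_hi
    return (lo + hi) / 2  # any number inside [lo, hi) identifies the message

def decode(x, length):
    lo, hi = Fraction(0), Fraction(1)
    out = []
    for _ in range(length):
        for sym in PROBS:           # retrace: find which sub-interval holds x
            c_lo, c_hi = _cum(sym)
            s_lo = lo + (hi - lo) * c_lo
            s_hi = lo + (hi - lo) * c_hi
            if s_lo <= x < s_hi:
                out.append(sym)
                lo, hi = s_lo, s_hi
                break
    return "".join(out)
```

Note how decoding needs the message length (or an end-of-stream symbol) and the exact same model the encoder used – which is why the fine-tuned weights are, in effect, part of the archive.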

Looks neat! But is the compression really that good? Comparing arithmetic coding with Qwen3-4B against gzip on lipsum.txt, we already see quite dramatic results:

Method      size (bytes)   compression
plain       446            –
gzip        298            ~33% smaller
llmencode   13             ~97% smaller

(Note: llmencode is my implementation of LLM-based arithmetic coding.)

22x better compression than gzip is pretty ridiculous! There is a caveat here: lipsum.txt is heavily represented in the training data. But a 5-20x efficiency gain holds for roughly any text data that looks like it has been on the Internet.
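For reference, the gzip side of such a comparison is a one-liner; lipsum.txt itself isn’t reproduced here, so the stand-in string below only demonstrates the method, not the exact 446/298-byte numbers:

```python
import gzip

text = ("Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do "
        "eiusmod tempor incididunt ut labore et dolore magna aliqua. ") * 4
raw = text.encode()
compressed = gzip.compress(raw)
# gzip exploits literal repetition; an LLM coder exploits predictability.
print(len(raw), len(compressed))
```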

self-compression #

Now, let’s go back to our file system. The XML overhead we were worried about can now be “compressed” by the fine-tuned model. Using the same toy file system from the Docker container demo above:


   path="/" name="/" mode="755" owner="root" group="root" mtime="2025-01-01T00:00:00">
     path="testdir" name="testdir" mode="755" owner="root" group="root" mtime="2025-01-01T00:00:00" />
     path="testfile.txt" name="testfile.txt" mode="644" owner="root" group="root" mtime="2025-01-01T00:00:01" size="14">
      hello llmfuse

    
  

Sample                original (bytes)   compressed (bytes)   ratio
Base Qwen3-4B         394                38                   10.4x
Fine-tuned Qwen3-4B   394                21                   18.8x

The fine-tuned model achieves 44.7% better compression on the XML file system tree – the same format it was trained to predict. This is the “self-compression” effect: by baking the XML structure into the model weights during fine-tuning, the arithmetic coder can represent that structure in fewer bits.

Self-compression is not a new concept in file systems. For example, squashfs – a tool for creating R/O compressed file systems – has existed since 2002. Squashfs compresses files, inodes, and directories together, not unlike what we’re doing here!

Under the hood, squashfs just wraps gzip/zstd/your favorite compression algorithm. So for plain-text data, squashfs falls well short of llmfuse’s compression figures:

Method                 compressed size   notes
squashfs (gzip)        171 bytes         gzip-compressed file contents, inodes, directory tables
llmfuse (fine-tuned)   21 bytes          arithmetic-coded XML state

For the same file system tree (a directory, plus a 14-byte text file), llmfuse achieves ~8x better compression compared to squashfs (see methodology in the appendix).

The difference comes down to llmencode simply being much better than gzip on text data + XML structure – especially when the model was fine-tuned on that exact structure.

What started as a little experiment evolved into a full-blown nerd snipe and intellectual adventure, primarily to get my hands dirty with training and inference. Thanks for making it this far!

I fully admit that this is a “toy” experiment under a very specific setup; that said, the numbers above are quite compelling, and the question I keep asking myself as I write this is: does it have any real-world potential?

Of course, in the short term, there are a lot of caveats: you need an LLM, probably a GPU, all your data must fit in the context window (which we know scales poorly), and it only works on text data.

Still, one wonders whether the same engines that came to dominate all “text generation” could also be used to compress their own data. Perhaps in some distant future where running LLMs at the edge makes sense, or for specific kinds of workloads where data is rarely read.

Overall, I’m grateful to Peyton at Modal for the compute credits. Running such an unorthodox experiment would not have been possible without full control over the training and inference code, and would have been extremely tedious without the simplicity of running ML infra on Modal! It’s really amazing to be able to modal deploy and get your personal inference endpoint, or just modal run to prototype some code on the cloud.

source code #

All the source code for this experiment – llmfuse and llmencode – is open-source under MIT.

llmencode ships with a CLI utility that you can run locally. Inference is slow on a 4B model, but entirely feasible on consumer hardware. Before moving to Modal, I prototyped most of this code on a 2021 MacBook Pro.

A fun experiment/party trick: to gauge how “common” a certain string is in the training data, look at its llmencode compression ratio!

SquashFS comparison method #

The raw .sqsh file is 4096 bytes due to block-alignment padding. To find the actual compressed size, I used xxd to inspect the binary and found the last non-zero byte at offset 266 (267 bytes total). Subtracting the fixed 96-byte superblock header gives 171 bytes of actual gzip-compressed content – everything needed to rebuild the file system.
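The same measurement can be scripted instead of eyeballed in xxd; a sketch (the file name in the usage comment is illustrative):

```python
def trailing_content_size(data: bytes) -> int:
    # Offset just past the last non-zero byte, i.e. the size of the image
    # with block-alignment zero padding stripped.
    for i in range(len(data) - 1, -1, -1):
        if data[i] != 0:
            return i + 1
    return 0

# Usage sketch against a hypothetical image:
# size = trailing_content_size(open("fs.sqsh", "rb").read()) - 96  # minus superblock
```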

Compression as a metric #

It is equally interesting to think about compression as a metric. The angle I’ve been considering is doing some kind of RL against the arithmetic-coding compression numbers.

Is that just equivalent to the pre-training objective (due to the prediction-compression equivalence)? Or does the “sequence-level” objective add something more… interesting to the mix? If you have any ideas, please get in touch!


