Context Compression Finally Works In Production: New Research Cuts LLM Input 16x Without The Accuracy Hit

Context windows are becoming a computational bottleneck. The longer an agent runs, the more tokens it accumulates from retrieved documents, reasoning traces, and conversation history, and the more memory and compute that growing context demands. Most existing solutions either reduce model accuracy, require the full context to be loaded before compression begins, or produce memory savings that do not translate into real speedups on standard serving infrastructure.

A research team from NYU, Columbia, Princeton, the University of Maryland, Harvard and Lawrence Livermore National Laboratory published a paper this week that proposes a new solution. The researchers introduce latent context language models, or LCLMs, a family of encoder-decoder models that compress the input context before it reaches the decoder. The models are open source on Hugging Face.

In contrast to KV cache compression methods (the dominant approach in the field, which still materializes the full KV cache before evicting entries), LCLM compresses the input token sequence before prefill, so higher compression ratios directly reduce decoder-side computation and memory. The paper reports that LCLM at 16x compression produced output 8.8x faster than the KV cache baseline on the RULER long-context benchmark.

"These ballooning contexts occupy memory and computation, and they are becoming a computational bottleneck for LLMs," Micah Goldblum, co-lead adviser on the project and a researcher at Columbia University, told VentureBeat. "Our goal was to train end-to-end language models that could handle very long contexts efficiently and accurately. If you could create a language model like this, everything would become cheaper and faster."

What can LCLM do?

LCLM lets models process longer contexts than would otherwise be practical, at a fraction of the memory and computation cost, and without the accuracy degradation that makes most compression methods a poor tradeoff in production.

At 4x compression, the paper reports an accuracy of 91.76% on the RULER benchmark, compared to 94.41% without any compression. That is a drop of less than 3 points for cutting the context down to a quarter of its original size. At 16x compression, where 93.75% of input tokens are removed, accuracy dropped to 75.06%. Every KV cache method tested at the same compression ratio scored lower.
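The arithmetic behind these figures is easy to check. The short Python sketch below uses only the numbers quoted above; the function name is illustrative and does not come from the paper.

```python
def fraction_removed(ratio: float) -> float:
    """Share of input tokens dropped when a context is compressed `ratio`-fold."""
    return 1.0 - 1.0 / ratio

# RULER accuracy figures as quoted in the article.
baseline, at_4x, at_16x = 94.41, 91.76, 75.06

for ratio, acc in [(4, at_4x), (16, at_16x)]:
    removed = fraction_removed(ratio) * 100
    drop = baseline - acc
    print(f"{ratio}x: {removed:.2f}% of tokens removed, {drop:.2f} points lost")
```

A 4x ratio removes 75% of the tokens for a 2.65-point drop, while 16x removes 93.75% for a 19.35-point drop, so the accuracy cost grows much faster than the token savings at the high end.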

The benefits persist even on short inputs. On GSM8K math word problems, where the full prompt is compressed rather than just the retrieved documents, LCLM outperformed every other method tested regardless of compression ratio.

How was it built?

The architecture combines a 0.6B-parameter encoder with a 4B-parameter decoder. The encoder compresses blocks of input tokens into shorter sequences of latent embeddings, and the decoder processes those embeddings in place of the original tokens. Training ran for more than 350 billion tokens.
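To make the data flow concrete, here is a toy Python sketch of block-wise compression. It is not the paper's encoder: mean pooling stands in for the learned 0.6B model, and the block size, embedding width, and latents-per-block values are all hypothetical choices made only so the shapes can be followed.

```python
from typing import List

Vector = List[float]

def compress_block(block: List[Vector], n_latents: int) -> List[Vector]:
    """Placeholder for the learned encoder: mean-pool a block into n_latents vectors."""
    chunk = len(block) // n_latents
    dim = len(block[0])
    latents = []
    for i in range(n_latents):
        part = block[i * chunk:(i + 1) * chunk]
        latents.append([sum(v[d] for v in part) / len(part) for d in range(dim)])
    return latents

def compress_context(tokens: List[Vector], block_size: int, n_latents: int) -> List[Vector]:
    """Split the context into fixed-size blocks and replace each with its latents."""
    if len(tokens) % block_size or block_size % n_latents:
        raise ValueError("toy sketch expects sizes that divide evenly")
    out: List[Vector] = []
    for start in range(0, len(tokens), block_size):
        out.extend(compress_block(tokens[start:start + block_size], n_latents))
    return out

# Hypothetical sizes: 64 token embeddings of width 4, blocks of 16, 1 latent per block.
tokens = [[float(i)] * 4 for i in range(64)]
latents = compress_context(tokens, block_size=16, n_latents=1)
print(len(tokens), "token positions ->", len(latents), "latent positions")
```

The point of the sketch is the sequence length: the decoder would run prefill over the latent positions rather than the token positions, which is why the savings show up before the KV cache is ever built.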

The training recipe blends three ingredients:

Continued pre-training data in which compressed and uncompressed spans are concatenated
Supervised fine-tuning data covering reasoning and long-context tasks
An auxiliary reconstruction objective that pushes the encoder to preserve fine details

The combination addresses a tradeoff that previously limited context compression, where preserving reconstruction accuracy came at the expense of general task performance.

The configuration came out of an architecture search. The paper found that scaling the decoder pays off more than scaling the encoder.

Where it fits into an agentic stack

LCLM is not an abstract research concept. It is designed to work with existing stacks. "You can easily swap the LCLM for any existing LLM," Goldblum said. "Whenever you retrieve data like documents and want to dump it into the context of your model, first run those documents through LCLM’s compressor."
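The placement Goldblum describes can be sketched in a few lines of Python. Everything here is hypothetical: `compress`, `decode`, and the stub implementations are illustrative stand-ins, not the API of the released models, and a real integration would call whatever interface the published code exposes.

```python
from typing import Callable, List, Sequence

# Hypothetical types: "latents" are modeled as plain lists of floats.
Latents = List[float]
Compressor = Callable[[str], Latents]
Decoder = Callable[[str, Sequence[Latents]], str]

def answer(query: str, retrieved_docs: Sequence[str],
           compress: Compressor, decode: Decoder) -> str:
    """Retrieve -> compress -> decode: documents are compressed before the
    decoder sees them, instead of being pasted into the prompt as raw text."""
    compressed = [compress(doc) for doc in retrieved_docs]
    return decode(query, compressed)

# Stubs so the sketch runs end to end.
def fake_compress(doc: str) -> Latents:
    words = doc.split()
    return [float(len(w)) for w in words[::16]]  # keep one value per 16 words

def fake_decode(query: str, context: Sequence[Latents]) -> str:
    positions = sum(len(c) for c in context)
    return f"answered {query!r} from {positions} latent positions"

docs = ["alpha " * 160, "beta " * 320]
print(answer("what changed?", docs, fake_compress, fake_decode))
```

The retrieval step and the query path are untouched; only the hand-off between retriever and model changes, which is what makes the approach a candidate for dropping into an existing stack.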

He said that in the paper, the researchers demonstrated how to build agents that selectively decompress useful text.

"Think of it like a human skimming content before zooming in on relevant details," Goldblum said.
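A rough Python sketch of that skim-then-zoom pattern follows. It is an illustration under loud assumptions: the keyword-overlap score is a naive placeholder for whatever relevance signal a real agent would derive, and the `<compressed:...>` strings merely mark where latent embeddings would sit. None of the names come from the paper.

```python
from typing import Dict, List

def skim_then_zoom(query: str, docs: Dict[str, str], top_k: int = 1) -> List[str]:
    """Build a context in which only the top_k most relevant documents appear
    in full; every other document stays in (placeholder) compressed form."""
    terms = set(query.lower().split())

    def score(text: str) -> int:
        return len(terms & set(text.lower().split()))

    ranked = sorted(docs, key=lambda name: score(docs[name]), reverse=True)
    expand = set(ranked[:top_k])
    return [docs[name] if name in expand else f"<compressed:{name}>"
            for name in docs]

docs = {
    "billing": "invoice totals and refund policy for annual plans",
    "auth": "token refresh flow and session expiry rules",
    "deploy": "rollout checklist for the staging cluster",
}
context = skim_then_zoom("how does token refresh work", docs)
print(context)
```

Only the document that matches the query is expanded, so the decoder spends full-resolution attention on a small slice of the corpus while still holding a compact view of the rest.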

Goldblum also cautioned that teams integrating the approach into existing agentic pipelines will need to tune their RAG systems accordingly.

"We have also not worked on online compression of reasoning traces," he said. "Sometimes the simple approach of compressing the trace while generating it may work, but this remains to be determined."

What does this mean for enterprises

Context windows are growing faster than inference infrastructure, and enterprises are already spending to get it right. VB Pulse Q1 2026 survey data from organizations with more than 100 employees shows that intent to adopt hybrid retrieval increased from 10.3% in January to 33.3% in March. Retrieval optimization overtook evaluation as the top investment priority by March, reaching 28.9% of qualified respondents.

Three things matter for teams evaluating the approach for production:

Inference cost scales with context length. At 1 million tokens, uncompressed inference with standard KV cache methods runs out of memory on a single H200 GPU. The paper reports that at 16x compression LCLM remains within the memory limit at that context length.
RAG pipeline integration requires tuning. Teams with existing RAG pipelines will need to validate compression behavior against their retrieval quality metrics before deploying at scale.
Reasoning trace compression is unresolved. For agents running long reasoning chains, context growth from traces is a different problem from document retrieval. Goldblum acknowledged the gap directly: the naive approach of periodic trace compression may work but has not been tested.
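The first point follows from how a transformer KV cache grows. The Python sketch below uses the standard size formula (one key and one value vector per layer, per KV head, per position); the layer count, head count, and head dimension are hypothetical and are not the LCLM decoder's configuration. It also ignores model weights, activations, and the encoder's own cost.

```python
def kv_cache_bytes(seq_len: int, n_layers: int, n_kv_heads: int,
                   head_dim: int, bytes_per_value: int = 2) -> int:
    """KV cache size in bytes; the leading 2 counts keys and values,
    and bytes_per_value=2 assumes 16-bit storage."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value

# Hypothetical decoder shape, NOT the LCLM configuration.
cfg = dict(n_layers=36, n_kv_heads=8, head_dim=128)

full = kv_cache_bytes(1_000_000, **cfg)
compressed = kv_cache_bytes(1_000_000 // 16, **cfg)

print(f"uncompressed 1M-token cache: {full / 1e9:.1f} GB")
print(f"16x-compressed cache:        {compressed / 1e9:.1f} GB")
```

Because the cache is linear in sequence length, shrinking the sequence before prefill shrinks the cache by the same factor, whereas methods that evict entries afterward still have to hold the full cache at its peak.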

The models are available at huggingface.co/latent-context and the code at github.com/LeonLixyz/LCLM.

"The great thing about our architectures is that they give your models access to very large contexts, but they also unlock multiscale approaches where your model can skim large amounts of text or code very quickly and then zoom in and only fully read a small portion of the most useful text," Goldblum said.
