
Processing 200,000 tokens through a large language model is expensive and slow: the longer the context, the faster costs climb. Researchers at Tsinghua University and Z.ai have developed a technique called IndexCache that cuts up to 75% of redundant computation in sparse-attention models, delivering 1.82x faster time-to-first-token and 1.48x higher generation throughput at that context length.
The technique applies to models built on the DeepSeek Sparse Attention (DSA) architecture, including the latest DeepSeek and GLM families. It can help enterprises deliver faster user experiences from production-scale, long-context models, a capability already demonstrated in initial tests on a 744-billion-parameter GLM-5 model.
The DSA barrier
Large language models rely on self-attention, a mechanism in which the model calculates the relationship between each token and every previous token in its context to predict the next token.
However, self-attention has a serious limitation: its computational complexity scales quadratically with sequence length. For applications requiring extended context windows (for example, large document processing, multi-step agentic workflows, or long chain-of-thought reasoning), this quadratic scaling leads to sluggish inference speeds and significant computation and memory costs.
Sparse attention provides a theoretical solution to this scaling problem. Instead of computing the correlation between each token and all preceding tokens, sparse attention lets each query attend only to the most relevant subset of tokens.
DeepSeek Sparse Attention (DSA) is a highly efficient implementation of this concept, first introduced in DeepSeek-V3.2. DSA adds a lightweight "lightning indexer" module at each layer of the model to determine which tokens matter most. The indexer scores all preceding tokens and selects a small subset for the main attention mechanism to process. By doing so, DSA reduces the heavy core attention computation from quadratic to linear, dramatically speeding up the model while preserving output quality.
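The score-then-select step can be illustrated with a minimal sketch. This is a toy dot-product scorer, not the real learned indexer; the function name and shapes are illustrative assumptions:

```python
import numpy as np

def lightning_indexer_select(query, keys, k):
    """Toy sketch of a DSA-style indexer: give every preceding token one
    cheap score against the current query, then keep only the top-k
    positions for the main (expensive) attention pass.
    Illustrative only; the real indexer is a learned lightweight module."""
    scores = keys @ query                 # one cheap score per past token
    top_k = np.argsort(scores)[-k:]       # positions of the k highest scores
    return np.sort(top_k)                 # sorted token positions to attend to

# Example: 100 preceding tokens, keep only 16 for full attention.
rng = np.random.default_rng(0)
query = rng.standard_normal(8)
keys = rng.standard_normal((100, 8))
selected = lightning_indexer_select(query, keys, k=16)
```

The main attention mechanism then runs only over `selected`, which is what turns the quadratic core computation into a linear one.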
But the researchers identified a lingering flaw: the DSA indexer itself still operates at quadratic complexity at every single layer. Even though the indexer is computationally cheaper than the main attention mechanism, as the context length grows, the time the model spends running these indexers skyrockets. This is especially costly during the "prefill" stage, when the prompt is processed for the first time.
Caching attention with IndexCache
To solve the indexer bottleneck, the research team uncovered a key property of how DSA models process data: the subset of important tokens selected by the indexer remains remarkably stable as data moves through consecutive Transformer layers. Empirical tests on a DSA model showed that adjacent layers share between 70% and 100% of their selected tokens.
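That overlap statistic is straightforward to compute. A minimal sketch (the function name is my own; the paper's exact measurement protocol may differ):

```python
def index_overlap(sel_a, sel_b):
    """Fraction of layer A's selected token positions that layer B
    also selected. A value near 1.0 means the two adjacent layers
    largely agree on which tokens matter."""
    a, b = set(sel_a), set(sel_b)
    return len(a & b) / len(a)

# Example: two adjacent layers agreeing on 3 of 4 selected positions.
overlap = index_overlap([1, 2, 3, 4], [2, 3, 4, 5])   # 0.75
```

Observed values of 0.7 to 1.0 for this quantity across adjacent layers are what justify reusing one layer's indices for its neighbors.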
To take advantage of this cross-layer redundancy, the researchers developed IndexCache. The technique divides the model's layers into two categories. A small number of full (F) layers keep their own indexers, actively scoring tokens and caching the most important ones. The remaining layers become shared (S) layers: they skip indexing entirely and reuse the cached indices from the nearest preceding F layer.
During inference, the model simply checks each layer's type. At an F layer, it computes and caches fresh indices; at an S layer, it skips the computation and reuses the cached ones.
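The dispatch logic reduces to a few lines. A toy sketch with a dot-product stand-in for the real indexer (function names and the role-list representation are illustrative assumptions, not the paper's code):

```python
import numpy as np

def run_indexer(keys, query, k):
    """Toy indexer: top-k token positions by dot-product score."""
    return np.sort(np.argsort(keys @ query)[-k:])

def forward_with_index_cache(layer_roles, keys, query, k):
    """Sketch of IndexCache dispatch: F layers compute fresh indices,
    S layers reuse the cache from the nearest preceding F layer.
    Returns the index set each layer actually attended over."""
    cached = None
    used_per_layer = []
    for role in layer_roles:
        if role == "F":
            cached = run_indexer(keys, query, k)   # fresh scoring pass
        used_per_layer.append(cached)              # S layers just reuse
    return used_per_layer

# Example: 4 layers, only layers 0 and 3 pay for indexing.
rng = np.random.default_rng(1)
keys = rng.standard_normal((32, 4))
query = rng.standard_normal(4)
used = forward_with_index_cache(["F", "S", "S", "F"], keys, query, k=8)
```

With 75% of layers marked S, three out of every four quadratic indexer passes disappear, which is where the reported speedups come from.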
A wide range of optimization techniques attempt to overcome the attention barrier, most notably compression of the KV cache, where computed attention keys and values are stored. But rather than shrinking the memory footprint the way standard KV cache compression does, IndexCache attacks the computation bottleneck.
“Index cache is not a traditional KV cache compression or sharing technique,” paper co-author Yushi Bai told VentureBeat. “It eliminates this redundancy by reusing indices across layers, thereby reducing computation rather than just memory footprint. It complements existing approaches and can be combined with them.”
The researchers developed two deployment approaches for IndexCache. (It’s worth noting that IndexCache only applies to models that use the DSA architecture, such as the latest DeepSeek models and the latest family of GLM models.)
For developers working with off-the-shelf DSA models where retraining is impossible or too expensive, they created a training-free method relying on a “greedy layer selection” algorithm. By running a small calibration dataset through the model, this algorithm automatically determines the optimal location of the F and S layers without any weight updates. Empirical evidence shows that the greedy algorithm can safely remove 75% of the indexers while matching the downstream performance of the original model.
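The greedy search can be sketched in a few lines. This is a hedged toy version: the `evaluate` callback stands in for running the calibration dataset through the model and scoring quality, and keeping layer 0 as F is my own simplifying assumption (so S layers always have a cache to read):

```python
def greedy_layer_selection(num_layers, budget, evaluate):
    """Greedily demote F layers to S until only `budget` indexers remain,
    each step removing the indexer whose loss hurts calibration
    quality the least. `evaluate(roles)` returns a quality score for a
    candidate F/S assignment (higher is better)."""
    roles = ["F"] * num_layers
    while roles.count("F") > budget:
        best_i, best_score = None, float("-inf")
        for i in range(1, num_layers):       # layer 0 stays F so S layers
            if roles[i] != "F":              # always have a cached index
                continue
            trial = roles[:]
            trial[i] = "S"
            score = evaluate(trial)          # proxy for a calibration run
            if score > best_score:
                best_i, best_score = i, score
        roles[best_i] = "S"
    return roles

# Toy quality proxy for demonstration: prefer dropping later indexers first.
toy_evaluate = lambda roles: -sum(i for i, r in enumerate(roles) if r == "F")
roles = greedy_layer_selection(num_layers=8, budget=2, evaluate=toy_evaluate)
```

In the real setting, `evaluate` is the expensive part: each candidate assignment is scored against the calibration set, which is why the choice of calibration data matters (as the authors note later).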
For teams that pre-train or fine-tune their own foundation models, the researchers propose a training-aware version that optimizes network parameters to natively support cross-layer sharing. This approach introduces a "multi-layer distillation loss" during training, which forces each retained indexer to learn to select a consensus subset of tokens that stays highly relevant for all the subsequent layers it serves.
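One plausible form of such a loss, sketched here as an assumption (the paper defines the exact formulation), is to penalize divergence between the F layer's indexer distribution and the attention-score distribution of every layer it will serve:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def multi_layer_distillation_loss(indexer_scores, served_layer_scores):
    """Hedged sketch of a multi-layer distillation loss: average KL
    divergence KL(q || p) between each served layer's attention score
    distribution q and the F layer's indexer distribution p. Minimizing
    it pushes the one retained indexer toward a consensus token ranking."""
    p = softmax(indexer_scores)
    losses = []
    for target in served_layer_scores:
        q = softmax(target)
        losses.append(np.sum(q * (np.log(q) - np.log(p))))
    return float(np.mean(losses))

# Example: an indexer that already matches its served layers has zero loss.
s = np.array([0.5, 1.0, -0.3])
zero_loss = multi_layer_distillation_loss(s, [s, s])
```

The design intuition: if one indexer's scores must approximate several layers at once, its top-k selection becomes safe to reuse across all of them.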
Real-world speed on production models
To test the impact of IndexCache, the researchers applied it to the 30-billion-parameter GLM-4.7 Flash model and compared it to the standard baseline.
At a 200K context length, removing 75% of the indexers reduced prefill latency from 19.5 seconds to just 10.7 seconds, a 1.82x speedup. The researchers say even larger speedups are expected at longer context lengths.
During the decoding phase, where the model generates its response, IndexCache increased per-request throughput from 58 tokens per second to 86 tokens per second at the 200K context mark, achieving a 1.48x speedup. When the server’s memory is completely saturated with requests, the total decode throughput increases by 51%.
For enterprise teams, these efficiency gains translate directly into cost savings. “In terms of ROI, index caches provide consistent benefits across all scenarios, but the benefits are most noticeable in long-context workloads such as RAG, document analysis, and agentic pipelines,” Bai said. “In these cases, we see at least a 20% reduction in deployment costs and a similar improvement in user-perceived latency.” He said that for tasks with very little context, the gain is about 5%.
Remarkably, these efficiency gains did not compromise reasoning ability. Using the training-free approach to eliminate 75% of the indexers, the 30B model matched the original baseline's average score on the long-context benchmark, scoring 49.9 against the original's 50.2. On the highly complex AIME 2025 math reasoning benchmark, the optimized model actually outperformed the baseline, scoring 92.6 compared to 91.0.
The team also ran preliminary experiments on a production-scale 744-billion-parameter GLM-5 model. They found that eliminating 75% of its indexers with the training-free approach yielded at least a 1.3x speedup on contexts over 100K tokens, while the model maintained nearly identical average quality on long-context tasks.
Putting IndexCache into production
For development teams looking to implement a training-free approach today, the process is straightforward but requires careful setup. While the greedy search algorithm automatically finds the optimal layer configuration, the quality of that configuration depends on the data it processes.
“We recommend using domain-specific data as a calibration set so that the discovered layer-sharing patterns align with real workloads,” Bai said.
Once calibrated, the optimization is readily deployable in production environments; open-source patches for major serving engines are already available on GitHub. "Integration is relatively simple – developers can apply the patch to an existing inference stack, such as vLLM or SGLang, and enable the index cache with minimal configuration changes," Bai said.
While IndexCache provides an immediate solution to today’s computation bottlenecks, its underlying philosophy points to a broader shift in how the AI industry approaches model design.
“Future foundation models will be designed with downstream inference constraints in mind from the beginning,” Bai concluded. “This means designs that are not only scalable in terms of model size, but also optimized for real-world throughput and latency, rather than treating these as post-hoc concerns.”