Nvidia’s new technique cuts LLM reasoning costs by 8x without losing accuracy

Sparse attention

Researchers at Nvidia have developed a technique that can cut the memory cost of large language model reasoning by up to eight times. The technique, called Dynamic Memory Sparsification (DMS), compresses the key-value (KV) cache, the temporary memory that LLMs generate and store as they process prompts and reason through problems and documents.

Although researchers have previously proposed various methods for compressing this cache, most struggle to do so without degrading the intelligence of the model. Nvidia’s approach manages to discard most of the cache while maintaining (and in some cases improving) the model’s reasoning capabilities.

Experiments show that DMS lets "thinking" LLMs explore more solutions without the usual penalty in speed or memory cost.

The problem with reasoning

LLMs improve their performance on complex tasks by generating "chain-of-thought" tokens, essentially writing out their reasoning steps before arriving at a final answer. Inference-time scaling techniques take advantage of this by giving the model a larger budget to generate these thinking tokens or to explore many possible reasoning paths in parallel.

However, this improved reasoning comes at a significant computational cost. As the model generates more tokens, it builds up a growing KV cache.

For real-world applications, the KV cache is a major bottleneck. As the reasoning chain grows, the cache grows linearly with it, consuming a large amount of GPU memory. This forces the hardware to spend more time reading data from memory than actually computing, which slows generation and increases latency. It also limits the number of users a system can serve simultaneously, since running out of VRAM causes the system to crash or slow to a crawl.
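To see why, consider a rough back-of-the-envelope estimate of KV cache size. The model dimensions below are illustrative assumptions for a generic 8B-class transformer, not figures from Nvidia’s paper:

```python
# Back-of-the-envelope KV cache size for a hypothetical transformer.
# All model dimensions below are illustrative assumptions, not DMS paper figures.

def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, batch_size, bytes_per_elem=2):
    """Two tensors (K and V) per layer, stored for every token of every sequence."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * batch_size * bytes_per_elem

# Example: an 8B-class model (32 layers, 8 KV heads of dimension 128, fp16)
# serving 64 concurrent users, each with a 32k-token reasoning trace.
size = kv_cache_bytes(num_layers=32, num_kv_heads=8, head_dim=128,
                      seq_len=32_768, batch_size=64)
print(f"KV cache: {size / 1e9:.0f} GB")  # ~275 GB, far beyond a single GPU's VRAM
```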

Nvidia researchers see this not just as a technical hurdle, but as a fundamental economic constraint for enterprises.

"The question is not just about hardware quantity; It’s about whether your infrastructure is processing 100 reasoning threads or 800 threads at the same cost," Piotr Nawrot, senior deep learning engineer at Nvidia, told VentureBeat.

Previous attempts to solve this focused on heuristic-based approaches. These methods apply rigid rules, such as a "sliding window" that caches only the most recent tokens and deletes the rest. Although this reduces memory usage, it often forces the model to discard important information needed to solve the problem, reducing the accuracy of the output.
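For intuition, a sliding-window eviction rule can be written in a few lines. This is an illustrative sketch of the heuristic, not code from DMS or any particular inference engine:

```python
from collections import deque

class SlidingWindowKVCache:
    """Illustrative heuristic: keep only the most recent `window` tokens' K/V pairs."""

    def __init__(self, window: int):
        self.window = window
        self.entries = deque()  # each entry: (token_id, key_vector, value_vector)

    def append(self, token_id, key, value):
        self.entries.append((token_id, key, value))
        # Rigid rule: once the window is full, the oldest token is evicted,
        # regardless of how important it might be for later reasoning steps.
        if len(self.entries) > self.window:
            self.entries.popleft()
```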

"Standard eviction methods attempt to select old and unused tokens for eviction using heuristics," The researchers said. "They simplify the problem, hoping that if they guess the internal mechanics of the model, the answer will be correct."

Other solutions use paging to offload unused parts of the KV cache to slower memory, but the constant swapping of data introduces latency overhead that makes real-time applications sluggish.

Dynamic memory sparsification

DMS takes a different approach: "retrofitting" an existing LLM so it manages its own memory intelligently. Instead of enforcing a fixed rule for what to remove, DMS trains the model to identify which tokens are essential for future reasoning and which are disposable.

"It does not merely estimate importance; It learns a policy that explicitly preserves the final output distribution of the model," Navrot said.

This process transforms a standard, pre-trained LLM such as Llama 3 or Qwen3 into a self-compressing model. Importantly, it does not require training the model from scratch, which would be prohibitively expensive. Instead, DMS reuses existing neurons within the model’s attention layers to output a "keep" or "evict" decision for each token.
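Conceptually, the learned policy boils down to a per-token keep-or-evict gate. The sketch below assumes the gate is a sigmoid over a scalar read off a reused attention-layer activation; the names, shapes, and thresholding are illustrative, not the paper’s exact parameterization:

```python
import torch

def eviction_decisions(attn_hidden: torch.Tensor, threshold: float = 0.5) -> torch.Tensor:
    """
    Illustrative keep/evict gate for a learned KV cache policy.

    attn_hidden: [batch, seq_len] scalar read off a reused neuron in the attention
                 layer (an assumption for illustration; DMS's exact parameterization
                 is described in Nvidia's paper).
    Returns a boolean mask: True = keep the token's K/V entry,
                            False = mark it for (delayed) eviction.
    """
    # A sigmoid turns the raw activation into an eviction probability. During
    # training, this gate can be relaxed (e.g. a stochastic sigmoid) so the policy
    # is differentiable, then simply thresholded at inference time.
    evict_prob = torch.sigmoid(attn_hidden)
    return evict_prob < threshold
```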

For teams concerned about the complexity of retrofitting, the researchers noted that the process is designed to be lightweight. "To improve the efficiency of this process, the model’s weights can be frozen, making the process similar to low-rank adaptation (LoRA)," Nawrot said. This means a standard enterprise model like Qwen3-8B "can be retrofitted with DMS on a DGX H100 in just a few hours."
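In training-code terms, such a retrofit would look roughly like the sketch below: freeze the pre-trained weights and optimize only the small piece that produces the keep/evict signal. The module name and wiring are hypothetical placeholders (and, for simplicity, the sketch adds a tiny gate rather than reusing existing neurons as DMS does):

```python
import torch
from torch import nn

# Hypothetical retrofit setup, in the spirit of the LoRA-like approach described
# above. "eviction_gate" and its wiring are placeholders, not DMS's implementation.

class AttentionWithEvictionGate(nn.Module):
    def __init__(self, hidden_dim: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(hidden_dim, num_heads=8, batch_first=True)
        self.eviction_gate = nn.Linear(hidden_dim, 1)  # the only trainable part

    def forward(self, x):
        out, _ = self.attn(x, x, x)
        evict_logits = self.eviction_gate(x).squeeze(-1)  # one score per token
        return out, evict_logits

layer = AttentionWithEvictionGate(hidden_dim=512)

# Freeze everything except the eviction gate.
for name, param in layer.named_parameters():
    param.requires_grad = name.startswith("eviction_gate")

trainable = [p for p in layer.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable, lr=1e-4)
# ...followed by a short distillation-style fine-tune (roughly 1,000 steps, per the
# article) so the compressed model's outputs match the original model's.
```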

One of the most important parts of DMS is a mechanism called "delayed eviction." In standard sparsification, if a token is deemed unimportant, it is removed immediately. This is risky because the model may still need a short window to fold that token’s context into its current state.

DMS mitigates this by marking a token for eviction but keeping it accessible for a short period (for example, a few hundred steps). This delay allows the model to extract any necessary information remaining in the token and merge it into the current context before the token is erased from the KV cache.

“The ‘delayed eviction’ mechanism is important because not all tokens are simply ‘important’ (keep forever) or ‘useless’ (delete immediately). Many fall in between – they hold some information, but not enough to justify occupying a full slot in memory,” Nawrot said. “This is where the redundancy lies. By keeping these tokens in a local window for a short period before eviction, we allow the model to attend to them and redistribute their information to future tokens.”
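A minimal sketch of delayed eviction is shown below: a token flagged for removal stays readable for a fixed number of decoding steps before its entry is actually dropped. The data structures and window size are illustrative assumptions, not DMS’s actual implementation:

```python
class DelayedEvictionCache:
    """Illustrative delayed-eviction KV cache: flagged tokens stay readable
    for `delay` decoding steps before being dropped."""

    def __init__(self, delay: int = 256):
        self.delay = delay
        self.entries = {}  # position -> (key, value)
        self.pending = {}  # position -> step at which the token was flagged

    def append(self, step: int, pos: int, key, value, evict: bool):
        self.entries[pos] = (key, value)
        if evict:
            # Flag for eviction, but keep the entry attendable for now so the
            # model can redistribute its information to newer tokens.
            self.pending[pos] = step
        self._flush(step)

    def _flush(self, step: int):
        expired = [p for p, s in self.pending.items() if step - s >= self.delay]
        for p in expired:
            self.entries.pop(p, None)  # now the K/V entry is truly freed
            self.pending.pop(p)

    def visible_entries(self):
        """K/V pairs attention can still read, including flagged-but-delayed ones."""
        return self.entries
```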

The researchers found that this retrofitting process is highly efficient. A pre-trained LLM can be equipped with DMS in just 1,000 training steps, a small fraction of the compute required for the original training. The resulting models use standard kernels and slot directly into existing high-performance inference stacks without custom hardware or complex software rewrites.

DMS in action

To validate the technique, the researchers applied DMS to multiple reasoning models, including the Qwen R1 series (distilled from DeepSeek-R1) and Llama 3.2, and tested them on tough benchmarks such as AIME 24 (math), GPQA Diamond (science), and LiveCodeBench (coding).

The results show that DMS effectively pushes out the Pareto frontier, the optimal trade-off between cost and performance. On the AIME 24 math benchmark, a Qwen R1 32B model equipped with DMS scored 12.0 points higher than the standard model when constrained to the same memory bandwidth budget. By compressing the cache, the model could afford to "think" deeper and broader than the standard model for the same memory and compute budget.

Perhaps most surprising, DMS challenged the common wisdom that compression hurts long-context understanding. In the "needle in a haystack" test, which measures a model’s ability to find a specific piece of information hidden in a large document, the DMS variant actually performed better than the standard model. By actively managing its memory rather than passively accumulating noise, the model maintained a cleaner, more useful context.

For enterprise infrastructure, these efficiency gains translate directly into throughput and hardware savings. Because the KV cache is much smaller, the GPU spends less time fetching data, reducing latency for users. In tests with the Qwen3-8B model, DMS matched the accuracy of the vanilla model while delivering five times the throughput. This means a server can handle five times as many client queries per second with no degradation in quality.

The future of memory

Nvidia has released DMS as part of its KVPress library. Asked how an enterprise can get started with DMS, Nawrot emphasized that the barrier to entry is low. "The ‘minimum viable infrastructure’ is a standard Hugging Face pipeline – no custom CUDA kernels required," Nawrot said, noting that the code is fully compatible with standard FlashAttention.
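In practice, that pattern looks roughly like the sketch below, based on KVPress’s published usage of a Hugging Face pipeline plus a "press" object that compresses the KV cache. The press class shown is one of the library’s existing compression methods used as a stand-in; whether DMS is exposed under a dedicated press name, and which models are supported, should be verified against the current KVPress documentation:

```python
# Sketch of the KVPress usage pattern referenced above: a standard Hugging Face
# pipeline plus a "press" that compresses the KV cache. ExpectedAttentionPress is
# used here as a stand-in method; check the KVPress repository for the press that
# corresponds to DMS and for the list of supported models.
from transformers import pipeline
from kvpress import ExpectedAttentionPress

pipe = pipeline(
    "kv-press-text-generation",
    model="meta-llama/Llama-3.1-8B-Instruct",  # illustrative model choice
    device="cuda",
    torch_dtype="auto",
)

context = "..."   # a long document or reasoning trace
question = "..."  # the query to answer against that context

press = ExpectedAttentionPress(compression_ratio=0.5)  # keep roughly half the cache
answer = pipe(context, question=question, press=press)["answer"]
print(answer)
```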

Looking ahead, the team sees DMS as part of a larger shift in which memory management becomes a distinct, intelligent layer of the AI stack. Nawrot also confirmed that DMS is "fully compatible" with newer architectures such as Multi-Head Latent Attention (MLA), used in DeepSeek’s models, suggesting that combining these approaches could yield even greater efficiency gains.

As enterprises move from simple chatbots to complex agentic systems that require extended reasoning, the cost of inference is becoming a primary concern. Techniques like DMS offer a way to keep scaling these capabilities without letting costs balloon.

"We’ve barely scratched the surface of what’s possible," Navrot said, "And we expect estimation-time scaling to evolve further."


