Researchers baked 3x inference speedups directly into LLM weights — without speculative decoding

As agentic AI workflows drive up the cost and latency of long reasoning chains, a team from the University of Maryland, Lawrence Livermore National Laboratory, Columbia University, and Together AI has found a way to bake a direct 3x throughput gain into a model's weights.

Unlike speculative decoding, which requires a separate draft model, this approach needs no additional infrastructure – only a special token added to the model's existing architecture.

Limitations of next-token prediction

Next-token prediction – generating one token of text per forward pass – creates a throughput ceiling that becomes very expensive when the model needs to produce thousands of tokens. This bottleneck is particularly problematic in reasoning models, which often generate thousands of chain-of-thought tokens before delivering the final response, making the user experience slow and expensive.

Multi-token prediction (MTP) provides an alternative training paradigm that allows a language model to generate multiple tokens simultaneously in a single forward pass. For example, the model can be trained to predict a block of several future tokens rather than just the immediately next one.
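The arithmetic behind the speedup is straightforward. As a rough sketch (not the paper's implementation), the number of forward passes needed to emit a fixed number of tokens shrinks by the block size:

```python
# Toy comparison of forward-pass counts under next-token prediction (NTP)
# versus multi-token prediction (MTP) with a block of `block_size` tokens.
import math

def ntp_passes(n_tokens: int) -> int:
    # NTP needs one forward pass per generated token.
    return n_tokens

def mtp_passes(n_tokens: int, block_size: int) -> int:
    # MTP emits up to `block_size` tokens per forward pass.
    return math.ceil(n_tokens / block_size)

print(ntp_passes(1000))     # 1000 sequential passes
print(mtp_passes(1000, 4))  # 250 passes -> up to 4x fewer
```

In practice the realized speedup is lower than the block size, since (as described below) not every predicted token in a block can be safely accepted.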

John Kirchenbauer, a doctoral candidate in computer science at the University of Maryland and co-author of the paper, told VentureBeat that as the field moves toward agentic workflows, the focus is shifting from overall throughput to single-user speed. "Today, as ultra-long thinking traces have become the norm and agentic outer loops are driving those costs even higher, latency is becoming as important a dimension of overall service efficiency as gross tokens per second per hardware unit (TPS/GPU)," Kirchenbauer said. While standard batched next-token prediction is already optimal for overall throughput, he said, the new approach "tr[ies] to saturate the GPU with queries from only one user to reduce latency for that single user."

Other methods exist, but they have their own shortcomings. "It is worth noting that speculative decoding and diffusion LLMs, as efficiency-focused alternatives to next-token prediction (NTP), are both latency-focused acceleration techniques," Kirchenbauer said. But speculative decoding requires deploying and managing a separate "draft" model, and spends more absolute compute on drafting and verification. MTP, on the other hand, "takes advantage of a similar type of tradeoff, is much easier to serve, and is scientifically interesting in its own right."

However, current MTP paradigms have limitations. The standard objective for training a language model for MTP compares its predictions against ground-truth text from the dataset. The danger is that this standard training teaches the model to independently predict the probability of a token appearing at a specific position, rather than modeling the joint relationship between sequences of tokens.

If a model attempts to predict multiple tokens simultaneously with this standard method, two major problems arise. The first is grammatical mismatch. For example, if a model predicts two words following a prefix like "The zookeeper fed the," sampling each position independently can produce a mismatched phrase such as "panda meat" or "lion bamboo" instead of "panda bamboo" or "lion meat."

The second issue is degenerate repetition. Since distant text is inherently unpredictable, a model trained to predict, say, 100 tokens into the future against a static dataset learns to simply predict "the" for those far positions, since it is the most common word in English. The result is degenerate output like "…the the the…" at distant future positions.
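The mismatch problem can be seen with a few hypothetical numbers (illustrative only, not from the paper). The true joint distribution allows only coherent pairs, but a per-token objective learns each position's marginal, and sampling from the marginals independently assigns real probability to incoherent phrases:

```python
# Hypothetical distributions illustrating the independence failure.
from itertools import product

# Ground-truth joint distribution over two-token continuations of
# "The zookeeper fed the ..." -- only coherent pairs have mass.
joint = {("panda", "bamboo"): 0.5, ("lion", "meat"): 0.5}

# Per-position marginals, which is all an independent per-token
# objective can learn.
animals = {"panda": 0.5, "lion": 0.5}
foods = {"bamboo": 0.5, "meat": 0.5}

# Independent sampling multiplies the marginals, putting 25% of the
# mass on each incoherent pair the joint distribution never produces.
independent = {(a, f): animals[a] * foods[f] for a, f in product(animals, foods)}

print(independent[("panda", "meat")])   # 0.25
print(joint.get(("panda", "meat"), 0))  # 0
```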

Multi-token prediction through self-distillation

To solve the problems of generating multiple tokens, the researchers propose a novel training technique built on a student-teacher scheme. A student model – the model learning to predict multiple tokens – generates a multi-token block. A teacher model, a strong standard next-token prediction language model, evaluates that block. The teacher acts as a critic, scoring how probable and coherent the student's proposed sequence is. If the student proposes a mismatched phrase such as "lion bamboo," the teacher assigns it a high loss, teaching the student to avoid that construction.

This paradigm is inspired by on-policy reinforcement learning because the student model is not simply memorizing static text. It generates a complete rollout (a sequence of actions, in RL terms) in parallel in a single forward pass, and the teacher's score serves as the reward signal. Unlike static supervised methods where training pairs are fixed in advance, here the feedback is dynamic, generated from the student's own outputs in real time. The strong teacher also verifies the coherence of the tokens, which prevents the student from learning degenerate outputs such as repeated words.
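A minimal sketch of this objective, under the assumption that the teacher scores a proposed block by its joint log-probability under ordinary next-token prediction. The functions below are stand-ins for illustration, not the paper's actual API:

```python
# Sketch of a teacher-scored distillation loss. `teacher_logprob` stands in
# for a full next-token prediction model scoring the student's block.
import math

def teacher_logprob(prefix: list, block: list) -> float:
    # Stand-in teacher: coherent continuations get high probability,
    # incoherent ones get very low probability.
    coherent = {("panda", "bamboo"), ("lion", "meat")}
    p = 0.45 if tuple(block) in coherent else 0.01
    return math.log(p)

def distillation_loss(prefix: list, student_block: list) -> float:
    # The student is penalized in proportion to how improbable the
    # teacher finds its proposed block.
    return -teacher_logprob(prefix, student_block)

prefix = ["The", "zookeeper", "fed", "the"]
print(distillation_loss(prefix, ["lion", "meat"]))    # low loss: coherent
print(distillation_loss(prefix, ["lion", "bamboo"]))  # high loss: mismatched
```

Because the loss is computed on the student's own rollouts rather than on fixed dataset pairs, the feedback stays on-policy, mirroring the RL framing above.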

For developers, the beauty of this approach lies in its simplicity. "There is really no modification to the architecture other than adding a special token," Kirchenbauer said. By co-opting an unused slot in the model's existing embedding matrix to act as a mask token, the technique converts sequential operations into parallel ones. "Any standard next token prediction language model can be adapted this way… the internal implementation – MoE, windowed attention, SSM layers, etc. – is left untouched and there are no barriers to adaptation."

For engineering teams, this means optimizations can be applied to models already in production without rebuilding pipelines.
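The mask-token mechanism can be sketched in a few lines. Here `toy_forward` is a deterministic stand-in for a transformer forward pass that returns one prediction per input position; the real method instead reads the model's logits at the appended mask positions:

```python
# Toy sketch of how appended mask tokens turn k sequential decoding steps
# into one parallel forward pass. The "model" here is a stand-in.
MASK = 0  # a co-opted, otherwise-unused vocabulary id

def toy_forward(input_ids):
    # Stand-in transformer: one prediction per position. Each mask slot
    # "predicts" the next integer after the last real token, just so the
    # parallelism is visible; a real model would return logits here.
    last_real = max(t for t in input_ids if t != MASK)
    out, masks_seen = [], 0
    for t in input_ids:
        if t == MASK:
            masks_seen += 1
            out.append(last_real + masks_seen)
        else:
            out.append(t + 1)
    return out

def generate_block(prompt_ids, k, forward=toy_forward):
    # No architecture change: just append k mask tokens to the prompt...
    padded = prompt_ids + [MASK] * k
    preds = forward(padded)           # ...run a SINGLE forward pass...
    return preds[len(prompt_ids):]    # ...and read off k predictions at once.

print(generate_block([5, 6, 7], 3))  # [8, 9, 10] in one pass, not three
```

This is why the internals of the base model are irrelevant to the technique: the masks are ordinary input tokens as far as the architecture is concerned.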

Generating multiple tokens at once can affect the accuracy of the response at inference time. To maximize generation speed without compromising output quality, the authors introduce an adaptive decoding strategy called ConfAdapt.

ConfAdapt applies a confidence threshold, such as 90%, at each step. The model generates a block of tokens but keeps only those that meet or exceed this high-confidence bar. When the upcoming text is highly predictable or structured, the model's confidence is high, so it accepts and emits a large portion of the block at once, saving significant compute on easy tokens. It then reserves its expensive single-token passes for the difficult tokens that require more computational effort.
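Our reading of this acceptance rule, sketched below: keep the leading run of block tokens whose confidence clears the threshold, and fall back to a careful single-token pass at the first uncertain position (the names and the prefix-only rule are our assumptions, not the paper's code):

```python
# Sketch of ConfAdapt-style block acceptance. Only a confident prefix is
# kept, since tokens after the first rejection were conditioned on it.
def accept_prefix(block, confidences, threshold=0.9):
    kept = []
    for token, conf in zip(block, confidences):
        if conf < threshold:
            break  # stop: regenerate from this position with a normal pass
        kept.append(token)
    return kept

# Predictable span: all four tokens clear the bar and are emitted at once.
print(accept_prefix(["the", "capital", "of", "France"],
                    [0.99, 0.97, 0.95, 0.93]))
# Hard token mid-block: only the confident prefix survives.
print(accept_prefix(["is", "Paris", ",", "which"],
                    [0.98, 0.62, 0.90, 0.90]))
```

Raising the threshold trades speed for quality, which matches the speed/accuracy sweet spot reported in the experiments below.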

Testing multi-token prediction

To see how the training paradigm performs in practice, the researchers applied their method to popular open-weight instruction-tuned models. They tested Llama-3.1-8B-Magpie, a robust general-purpose model, and the smaller, more efficient Qwen3-4B-Instruct-2507, which is often chosen for cost-sensitive enterprise deployments. Both models were tuned on MetaMathQA, a dataset of synthetic grade-school math problems that rely heavily on reasoning traces.

The experiments revealed a clear sweet spot between speed and accuracy. Using the ConfAdapt strategy, the Llama-3.1-8B model achieved a 3x speedup with less than a 3% drop in accuracy on the math benchmark. The Qwen3-4B model achieved a similar 3x speedup with a slightly larger 7% drop. More aggressive settings can reach up to a 5x speedup, though with a greater accuracy penalty.

How this translates into real-world gains depends on how predictable the task is. "Since the ConfAdapt approach naturally adapts its acceleration to the underlying entropy of the domain, when the model 'knows' exactly what comes next it can emit it in a single pass," Kirchenbauer said. Speedups ramp up substantially on predictable tasks, he added, while uncertain outputs require more passes.

The speedup also transferred to domains that were not included in the multi-token prediction training phase. It held not only on tasks close to the training data, such as mathematics and logic, but also on open-ended tasks such as creative writing and summarization.

Despite this transfer, enterprises deploying these models for specific tasks should not rely on it alone. "Our recommendation would be to tune/optimize the model for MTP using samples from particular industrial domains," Kirchenbauer said. "Best performance is likely to be achieved if MTP optimization is performed using signals from the deployment domain."

Service compatibility and the way forward

The research team has released their trained models on Hugging Face and will soon release the code for their MTP framework. Infrastructure teams integrating these models into vLLM or SGLang will need to account for changes in how batching and KV caching are handled – but this is a one-time engineering investment, not an ongoing burden. Kirchenbauer sees "no obvious barriers to integration" and confirmed that the team is "working with some systems experts to identify the shortest path to integration."

Kirchenbauer’s advice for teams looking to test the released models: start with toy prompts, like counting or repeating a phrase, to see the benefits of ConfAdapt in action, then tune the model using samples from your specific deployment domain for best results. "Overall we expect that a production-ready implementation of our approach can simplify the lifecycle of building and deploying low-latency agent models," Kirchenbauer concluded. "While existing acceleration techniques for NTP models focus almost entirely on the inference harness, our approach moves some of that complexity into the model itself, making it largely complementary to existing work."



