EAGLE 3.1: Advancing Speculative Decoding Through Collaboration Between the EAGLE Team, vLLM, and TorchSpec

The EAGLE series – including EAGLE 1, EAGLE 2, and EAGLE 3 – has become one of the most widely adopted and practically deployed families of speculative decoding algorithms in both research and production systems.

Today, the EAGLE team, the vLLM team, and the TorchSpec team are excited to jointly introduce EAGLE 3.1, a major step toward more robust, efficient, and deployable speculative decoding.

EAGLE 3.1 Innovations

While speculative decoding performs well in controlled settings, its performance often degrades under different chat templates, long-context inputs, or out-of-distribution system prompts.

The EAGLE team traced this fragility to a phenomenon we call attention drift: as speculation depth increases, the drafter gradually shifts attention away from the sink token and toward its own generated tokens.
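One way to make attention drift concrete is to track how much attention mass a draft step places on the sink token versus on drafter-generated positions. The sketch below is a toy diagnostic, not the EAGLE team's measurement code: the function name `attention_mass`, the assumption that the sink sits at position 0, and the two attention rows are all made up for illustration.

```python
def attention_mass(weights, sink_index, draft_start):
    """Split one attention row into sink, context, and drafted-token mass.

    weights: attention probabilities over key positions (should sum to ~1).
    sink_index: position of the sink token (assumed to precede draft_start).
    draft_start: first key position holding a drafter-generated token.
    """
    total = sum(weights)
    sink = weights[sink_index] / total
    drafted = sum(weights[draft_start:]) / total
    return {"sink": sink, "drafted": drafted, "context": 1.0 - sink - drafted}


# Made-up attention rows at two speculation depths (illustration only).
depth_1 = [0.70, 0.10, 0.10, 0.10]                # keys: sink, ctx, ctx, draft_1
depth_3 = [0.30, 0.05, 0.05, 0.20, 0.20, 0.20]    # keys: sink, ctx, ctx, draft_1..3

for name, row in (("depth 1", depth_1), ("depth 3", depth_3)):
    m = attention_mass(row, sink_index=0, draft_start=3)
    print(f"{name}: sink={m['sink']:.2f} drafted={m['drafted']:.2f}")
```

A falling sink share paired with a rising drafted share across depths is the pattern described above; real attention rows would come from the drafter itself rather than hand-written lists.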

We identified two underlying issues. First, the fused input representation becomes increasingly unbalanced as higher-layer hidden states dominate the drafter input. Second, hidden-state magnitudes grow across speculation steps because of the unnormalized residual path. Together, these effects make the drafter progressively less stable as speculation depth increases.

Figure 1: EAGLE 3 vs EAGLE 3.1 architecture comparison. EAGLE 3.1 adds FC normalization after each target hidden state and feeds the post-normalized hidden state to the next decoding stage.

To address these issues, EAGLE 3.1 introduces two major architectural improvements:

  • FC normalization after each target hidden state and before the FC layer
  • Feeding post-norm hidden states to the next decoding stage
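The effect of the first change can be sketched with a toy calculation. The post does not specify which normalization "FC normalization" uses, so the code below assumes a plain RMS norm with no learned scale. The three hidden-state vectors and the `energy_share` helper are hypothetical and exist only to show how per-state normalization rebalances the concatenated input before it reaches the FC layer.

```python
import math


def rms_norm(h, eps=1e-6):
    """RMS-normalize a vector (assumption: no learned scale)."""
    rms = math.sqrt(sum(x * x for x in h) / len(h) + eps)
    return [x / rms for x in h]


def energy_share(states):
    """Fraction of the concatenated vector's squared norm owned by each state."""
    energies = [sum(x * x for x in h) for h in states]
    total = sum(energies)
    return [e / total for e in energies]


# Made-up hidden states standing in for a low, middle, and high target layer.
low = [0.5, -0.5, 0.5, -0.5]
mid = [2.0, -2.0, 2.0, -2.0]
high = [8.0, -8.0, 8.0, -8.0]

raw = energy_share([low, mid, high])
normed = energy_share([rms_norm(h) for h in (low, mid, high)])
print("raw   :", [round(s, 3) for s in raw])
print("normed:", [round(s, 3) for s in normed])
```

In this toy setup the high-layer state owns most of the raw concatenated energy, while after per-state normalization each state contributes roughly one third.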

Intuitively, the post-norm design makes each speculation step behave like a recursive application of the drafter across decoding steps, rather than like stacking additional layers on top of the target model.
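A minimal sketch of why feeding back the post-norm hidden state keeps the recursion stable is shown below. It is a toy model under stated assumptions, not the EAGLE 3.1 drafter: `toy_drafter_step` is a hypothetical stand-in whose residual branch is simply half of its input, and the normalization is again assumed to be a plain RMS norm.

```python
import math


def rms_norm(h, eps=1e-6):
    """RMS-normalize a vector (assumption: no learned scale)."""
    rms = math.sqrt(sum(x * x for x in h) / len(h) + eps)
    return [x / rms for x in h]


def toy_drafter_step(h):
    """Hypothetical drafter pass: a residual branch added onto its input."""
    branch = [0.5 * x for x in h]  # stand-in for the sub-layer output
    return [x + b for x, b in zip(h, branch)]


def l2(h):
    return math.sqrt(sum(x * x for x in h))


def input_norms(h0, depth, post_norm):
    """L2 norm of the hidden state fed into each speculation step."""
    h, norms = h0, []
    for _ in range(depth):
        norms.append(l2(h))
        out = toy_drafter_step(h)
        # Post-norm feedback: the next step sees the normalized state.
        h = rms_norm(out) if post_norm else out
    return norms


h0 = rms_norm([1.0, -2.0, 0.5, 3.0])
unnormalized = input_norms(h0, depth=4, post_norm=False)
post_normed = input_norms(h0, depth=4, post_norm=True)
print("unnormalized feedback:", [round(n, 2) for n in unnormalized])
print("post-norm feedback   :", [round(n, 2) for n in post_normed])
```

With unnormalized feedback the input norm in this toy grows by a factor of 1.5 per step, whereas with post-norm feedback every step receives an input of the same scale, which is what lets the same drafter be applied recursively.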

These changes significantly improve robustness across deployment scenarios. Compared to EAGLE 3, EAGLE 3.1 shows:

  • Improved extrapolation from training time to inference time
  • Stronger robustness on long-context inputs
  • Higher resilience to chat template and system prompt variations
  • More stable acceptance lengths across diverse serving environments

On long-context workloads, EAGLE 3.1 achieves a 2× longer acceptance length than EAGLE 3.

EAGLE 3.1 Training with TorchSpec

TorchSpec now provides efficient training support for EAGLE 3.1 and future speculative decoding algorithms.

By reducing training overhead and simplifying experiment workflows, TorchSpec helps accelerate iteration and exploration for next-generation speculative decoding research and deployment.

Based on TorchSpec and vLLM, we have also trained and open-sourced an EAGLE 3.1 draft model for Kimi K2.6:

https://huggingface.co/lightseekorg/kimi-k2.6-eagle3.1-mla

The model serves as an example of deploying EAGLE 3.1 on a real-world model, with TorchSpec training and vLLM serving support.

EAGLE 3.1 Integration with vLLM

EAGLE 3.1 comes to vLLM as a configuration-driven extension of the existing EAGLE 3 implementation.

Integration includes:

  • FC normalization support
  • Post-norm hidden-state feedback
  • Removal of hardcoded assumptions around hidden states

Additionally, backward compatibility with existing EAGLE 3 checkpoints is fully preserved. As a result, EAGLE 3.1 draft models can be plugged in directly via the same speculative-decoding code path, for example:

vllm serve nvidia/Kimi-K2.6-NVFP4 \
  --trust-remote-code \
  --tensor-parallel-size 4 \
  --tool-call-parser kimi_k2 \
  --enable-auto-tool-choice \
  --reasoning-parser kimi_k2 \
  --attention-backend tokenspeed_mla \
  --speculative-config '{"model":"lightseekorg/kimi-k2.6-eagle3.1-mla","method":"eagle3","num_speculative_tokens":3}' \
  --language-model-only

This makes draft-model upgrades in production vLLM smooth and easy.
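When the command is generated by a launcher script, building the `--speculative-config` value with a JSON serializer avoids hand-quoting mistakes. The sketch below is one hedged way to do that. The helper name `build_serve_command` is hypothetical, it uses only flags and values that appear in the command above, and it omits the parser and attention-backend flags for brevity.

```python
import json
import shlex


def build_serve_command(target, draft, num_speculative_tokens, tp_size):
    """Assemble a `vllm serve` argv list with an EAGLE draft model attached."""
    spec = {
        "model": draft,
        "method": "eagle3",  # same method value as in the command above
        "num_speculative_tokens": num_speculative_tokens,
    }
    return [
        "vllm", "serve", target,
        "--tensor-parallel-size", str(tp_size),
        "--speculative-config", json.dumps(spec, separators=(",", ":")),
    ]


cmd = build_serve_command(
    target="nvidia/Kimi-K2.6-NVFP4",
    draft="lightseekorg/kimi-k2.6-eagle3.1-mla",
    num_speculative_tokens=3,
    tp_size=4,
)
# shlex.join quotes the JSON argument so the line can be pasted into a shell.
print(shlex.join(cmd))
```

Swapping draft models then amounts to changing the `draft` argument, which mirrors the configuration-driven design described in this section.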

Support has already been merged into the main branch of vLLM and will be available through vLLM nightly releases as well as the upcoming v0.22.0 release.

As an initial data point, we benchmarked the Kimi K2.6 EAGLE 3.1 draft model on Kimi-K2.6-NVFP4 with vLLM (TP=4, GB200, non-disaggregated) on the speed-bench coding dataset. EAGLE 3.1 delivers 2.03× higher per-user output throughput at concurrency 1, and the speedup remains meaningful as concurrency scales (1.71× at c=4, 1.66× at c=16).
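The reported speedups can be restated in a few equivalent ways. The short script below uses only the three figures quoted above. It relies on the assumption that per-user output throughput is the inverse of time per output token, so a speedup of s corresponds to a per-token time reduction of 1 - 1/s.

```python
# Speedups reported in the text, keyed by concurrency.
speedups = {1: 2.03, 4: 1.71, 16: 1.66}

baseline = speedups[1]
summary = {}
for concurrency, s in speedups.items():
    summary[concurrency] = {
        "throughput_gain_pct": (s - 1.0) * 100.0,        # extra tokens/s per user
        "time_per_token_cut_pct": (1.0 - 1.0 / s) * 100.0,
        "retained_vs_c1_pct": s / baseline * 100.0,       # share of the c=1 speedup kept
    }

for concurrency, row in summary.items():
    print(
        f"c={concurrency:>2}: "
        f"+{row['throughput_gain_pct']:.0f}% throughput, "
        f"-{row['time_per_token_cut_pct']:.1f}% time per token, "
        f"{row['retained_vs_c1_pct']:.1f}% of the c=1 speedup"
    )
```

Read this way, the 2.03× figure corresponds to roughly half the time per output token, and a bit over 80% of that speedup is retained at c=16.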

Figure 2: Per-user output throughput (TPS) on Kimi-K2.6-NVFP4 with vLLM, TP=4, GB200, on speed-bench coding. EAGLE 3.1-MLA vs no-spec baseline.

Open-source collaboration across the ecosystem

This collaboration between the EAGLE team, the vLLM team, and the TorchSpec team is a strong example of open-source collaboration spanning algorithm research, system optimization, and training infrastructure.

The EAGLE team is advancing speculative decoding algorithms, vLLM helps bring these innovations to large-scale production inference systems, and TorchSpec enables efficient training and rapid experimentation for future speculative decoding algorithms.

Together, we hope to continue to raise the overall baseline for speculative decoding and further improve token efficiency across the broader LLM ecosystem.


