Researchers automated LLM reasoning strategy design and cut token usage by 69.5%

Test-time scaling (TTS) has emerged as a proven method to improve the performance of large language models in real-world applications by giving them additional compute at inference time. However, TTS strategies have historically been handcrafted, relying heavily on human intuition to set the rules that govern the model's reasoning.

To overcome this obstacle, researchers at Meta, Google, and several universities have introduced AutoTTS, a framework that automatically discovers optimal TTS strategies. This automated approach allows enterprise organizations to dynamically optimize compute allocation without manually tuning heuristics.

By implementing the strategies discovered by AutoTTS, organizations can directly reduce the token usage and operational costs of deploying advanced reasoning models in production environments. In experimental tests, AutoTTS managed the inference budget efficiently, reducing token consumption by 69.5% without compromising accuracy.

The manual bottleneck in test-time scaling

Test-time scaling enhances LLMs by providing additional computation when they generate answers. This additional computation allows the model to generate multiple reasoning paths or evaluate its intermediate steps before arriving at the final response.

The primary challenge in designing TTS strategies is determining how to allocate this additional computation optimally. Historically, researchers have designed these strategies manually, relying on guesswork to craft rigid heuristics. Engineers must devise rules and thresholds for when a model should branch into new reasoning paths, delve deeper into an existing path, prune an unpromising branch, or stop reasoning altogether.

Because this manual tuning process is constrained by human intuition, a large number of possible approaches remain unexplored. This often results in a sub-optimal trade-off between model accuracy and computing cost.

Current TTS algorithms can be mapped onto a width-depth control space, where "width" is the number of reasoning branches explored and "depth" is how far each branch is developed. Self-consistency (SC) samples a fixed number of trajectories and takes the majority answer by vote. Adaptive-consistency (ASC) saves computation by stopping early once a confidence threshold is reached. Parallel-probing takes a more granular approach, pruning ineffective branches while deepening the remaining ones. All three are built by hand, and AutoTTS is designed to break that barrier.

While some of the more advanced methods employ richer structures such as tree search or external verifiers, they all share one key characteristic: they are carefully hand-crafted. This manual approach limits the scope of strategy search, leaving a large portion of the potential resource-allocation space untouched.
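To make the difference between the first two baselines concrete, here is a minimal sketch in Python. It is an illustration only: the function names are hypothetical, and the vote-share threshold used for early stopping is a simplifying assumption that stands in for whatever confidence criterion a real adaptive-consistency implementation applies.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def self_consistency(answers: List[str]) -> str:
    """Majority vote over a fixed set of sampled final answers."""
    return Counter(answers).most_common(1)[0][0]


def adaptive_stop(
    answer_stream: Iterable[str],
    min_samples: int = 4,
    threshold: float = 0.75,
) -> Tuple[str, int]:
    """Consume answers one at a time and stop early once the leading
    answer's vote share reaches `threshold` (assumed, simplified rule).

    Returns the chosen answer and the number of samples actually used.
    """
    counts: Counter = Counter()
    used = 0
    for answer in answer_stream:
        counts[answer] += 1
        used += 1
        leader, votes = counts.most_common(1)[0]
        if used >= min_samples and votes / used >= threshold:
            return leader, used
    if not counts:
        raise ValueError("answer_stream was empty")
    return counts.most_common(1)[0][0], used


samples = ["42", "42", "17", "42", "42", "42", "9", "42"]
fixed_answer = self_consistency(samples)          # uses all 8 samples
early_answer, used = adaptive_stop(iter(samples))  # may stop before 8
```

Both functions return the same answer on this toy input, but the early-stopping variant reads fewer samples, which is where the compute saving comes from.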

Automating Strategy Search with AutoTTS

AutoTTS reframes the way you optimize test-time scaling. Instead of treating strategy design as a human task, AutoTTS views it as an algorithmic search problem in a controlled environment.

This framework redefines the roles of both the human engineer and the AI model. Instead of hand-crafting specific rules for when an LLM should branch, prune, or stop reasoning, the engineer's role shifts to building the search environment. The human defines the boundaries, including the control space of states and actions, optimization objectives that balance accuracy against cost, and specific feedback mechanisms.

An explorer LLM, such as Claude Code, formulates the strategies. This explorer acts as an autonomous agent that iteratively proposes TTS "controllers". These controllers are code-defined policies or algorithms that determine how the AI allocates its computational budget during inference. The explorer tests and refines these controllers based on feedback until it finds an optimal resource-allocation policy.

To make this automated discovery computationally economical, AutoTTS relies on an "offline replay environment". If the explorer LLM had to call the base reasoning model to generate new tokens every time it wanted to test a new strategy, the computation cost would be very high. Instead, it relies on thousands of reasoning trajectories collected in advance from the base LLM. These trajectories include "probe signals", which are intermediate answers that help the controller evaluate progress in different reasoning branches.

During the discovery loop, the explorer agent proposes a controller and evaluates it against this offline data. The agent inspects the execution traces of the proposed controller, which show how it allocated compute over time. By analyzing these traces, the agent can diagnose specific failure modes, such as a controller pruning branches too aggressively in a particular scenario. This gives it far more to work with than the end result alone. The agent then iteratively rewrites the controller's code to improve the accuracy-cost trade-off.
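The replay idea can be sketched in a few lines of Python. Everything here is an assumption made for illustration (the `Trajectory` and `Problem` types, the `replay` function, and the per-probe token accounting are hypothetical and are not taken from the AutoTTS code): stored trajectories carry their probe answers, and a candidate controller is scored on accuracy and token spend without generating a single new token.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple


@dataclass(frozen=True)
class Trajectory:
    probes: Sequence[str]   # intermediate answer recorded at each probe step
    tokens_per_probe: int   # tokens it cost to advance one probe step


@dataclass(frozen=True)
class Problem:
    trajectories: Sequence[Trajectory]
    gold: str               # reference answer


# A controller reads stored trajectories and returns (answer, tokens_spent).
Controller = Callable[[Sequence[Trajectory]], Tuple[str, int]]


def full_vote(trajs: Sequence[Trajectory]) -> Tuple[str, int]:
    """Baseline: read every trajectory to the end, then majority-vote."""
    finals = [t.probes[-1] for t in trajs]
    tokens = sum(len(t.probes) * t.tokens_per_probe for t in trajs)
    return Counter(finals).most_common(1)[0][0], tokens


def first_k_vote(k: int) -> Controller:
    """Cheaper candidate: only read the first k trajectories."""
    def controller(trajs: Sequence[Trajectory]) -> Tuple[str, int]:
        return full_vote(trajs[:k])
    return controller


def replay(controller: Controller, problems: List[Problem]) -> Tuple[float, int]:
    """Score a controller on stored data: (accuracy, total tokens)."""
    correct, tokens = 0, 0
    for problem in problems:
        answer, spent = controller(problem.trajectories)
        correct += int(answer == problem.gold)
        tokens += spent
    return correct / len(problems), tokens


problems = [
    Problem(
        trajectories=[
            Trajectory(["1", "7", "7"], 100),
            Trajectory(["7", "7", "7"], 100),
            Trajectory(["3", "3", "3"], 100),
            Trajectory(["7", "7", "7"], 100),
        ],
        gold="7",
    )
]
baseline = replay(full_vote, problems)
candidate = replay(first_k_vote(2), problems)
```

Because both controllers are evaluated against the same stored data, the comparison is cheap and repeatable, which is what lets a search agent try many candidates.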

Inside the AI-designed controller

Because the search agent is not constrained by human intuition, it can discover tightly coordinated, complex rules that a human engineer would likely never code by hand. An optimal controller discovered by AutoTTS, named the Confidence Momentum Controller, takes advantage of several non-obvious mechanisms to manage computation:

  • Trend-based stopping: Hand-crafted strategies often instruct the model to stop reasoning once it reaches a certain instantaneous confidence threshold. The AutoTTS agent found that instantaneous confidence can be misleading because of temporary spikes. Instead, the controller tracks an exponential moving average (EMA) of confidence and stops only when the overall confidence level is high and the trend is not actively decreasing.

  • Coupled width-depth control: Manually designed algorithms usually treat "widening" into new reasoning paths and "deepening" current paths as separate decisions. AutoTTS discovered a closed feedback loop in which the two actions are linked. If the confidence of the current branches stalls or regresses, the controller automatically starts spawning new branches.

  • Alignment-aware depth allocation: Instead of giving an equal computation budget to all active reasoning branches, the controller dynamically identifies which branches agree with the current dominant answer. It then gives those branches priority with a "burst" of additional compute. This focuses the computational budget on the emerging consensus so that it can be quickly verified as correct or incorrect.
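The three mechanisms above can be pictured with a small Python sketch. This is not the released Confidence Momentum Controller: the `decide` function, its parameters (`alpha`, `stop_level`, `stall_eps`, `min_steps`), and the use of vote share as the confidence signal are all assumptions chosen to show how an EMA trend, a widen-on-stall rule, and an agreement filter could fit together.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class ControllerState:
    ema: float = 0.0        # smoothed confidence
    prev_ema: float = 0.0   # smoothed confidence one step earlier
    steps: int = 0


def decide(
    state: ControllerState,
    branch_answers: List[str],
    alpha: float = 0.5,       # EMA smoothing factor (assumed value)
    stop_level: float = 0.8,  # confidence level required to stop (assumed)
    stall_eps: float = 0.01,  # trend at or below this counts as stalled
    min_steps: int = 2,       # never stop on the very first observation
) -> Tuple[str, str, List[int]]:
    """Return (action, leading_answer, indices_of_aligned_branches)."""
    leader, votes = Counter(branch_answers).most_common(1)[0]
    confidence = votes / len(branch_answers)  # vote share as confidence proxy

    if state.steps == 0:
        state.ema = state.prev_ema = confidence
    else:
        state.prev_ema = state.ema
        state.ema = alpha * confidence + (1 - alpha) * state.ema
    state.steps += 1
    trend = state.ema - state.prev_ema

    # Alignment-aware depth: branches agreeing with the leader get priority.
    aligned = [i for i, a in enumerate(branch_answers) if a == leader]

    # Trend-based stopping: high smoothed confidence and a non-falling trend.
    if state.steps >= min_steps and state.ema >= stop_level and trend >= 0:
        return "stop", leader, aligned
    # Coupled width-depth: stalled or regressing confidence triggers widening.
    if trend <= stall_eps:
        return "widen", leader, aligned
    return "deepen", leader, aligned


state = ControllerState()
observations = [
    ["7", "3", "7", "5"],
    ["7", "7", "7", "5", "7"],
    ["7", "7", "7", "7", "7"],
]
actions = [decide(state, obs)[0] for obs in observations]
```

In this toy run the controller widens while confidence is flat, deepens once the smoothed confidence starts rising, and stops only after the average itself is high, rather than reacting to a single confident reading.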

Cost savings and accuracy gains in real-world benchmarks

To test whether AI can autonomously discover a better test-time scaling strategy, the researchers set up a rigorous evaluation framework. The main experiments were conducted on Qwen3 models ranging from 0.6B to 8B parameters. The researchers also tested the system's ability to generalize on a distilled 8B version of the DeepSeek-R1 model.

The explorer agent was first tasked with searching for an optimal strategy using the AIME24 mathematical reasoning benchmark. The discovered strategy was then tested on two held-out mathematics benchmarks, AIME25 and HMMT25, as well as the graduate-level general reasoning benchmark GPQA-Diamond.

The controller discovered by AutoTTS was pitted against four manually designed test-time scaling algorithms. These baselines were self-consistency with 64 parallel reasoning paths (SC@64), adaptive-consistency (ASC), parallel-probing, and early-stopping self-consistency (ESC). ESC is a hybrid approach that generates trajectories in parallel and stops early when an answer appears stable.

When set to a balanced, cost-aware mode, the AutoTTS-discovered controller reduced total token consumption by approximately 69.5% compared to SC@64 while maintaining the same average accuracy across the four Qwen models. When the inference budget was increased, AutoTTS pushed peak accuracy beyond all hand-crafted baselines in five of the eight test cases.

This efficiency carried over to other tasks. On the GPQA-Diamond benchmark, the balanced AutoTTS variant reduced the inference token cost from 510K tokens to only 151K tokens while slightly improving overall accuracy. On the DeepSeek model, AutoTTS achieved the highest overall accuracy on the HMMT25 benchmark while almost halving token spend.
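For readers who want to put the GPQA-Diamond figures on the same footing as the headline percentage, the relative saving can be worked out directly from the two token counts quoted above. The helper name below is hypothetical.

```python
def token_reduction(before: int, after: int) -> float:
    """Fraction of tokens saved when going from `before` to `after`."""
    if before <= 0:
        raise ValueError("before must be positive")
    return 1 - after / before


# Token counts quoted in the article for GPQA-Diamond.
gpqa_saving = token_reduction(510_000, 151_000)
print(f"GPQA-Diamond token reduction: {gpqa_saving:.1%}")
```

The result is roughly 70%, in line with the 69.5% average reduction reported on the Qwen models.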

For practitioners building enterprise AI applications, these experiments highlight two key operational benefits:

  • Peak performance enhancement: AutoTTS doesn't just save money on token consumption. It actively raises the peak attainable performance of the base model. The AI-designed controller is remarkably good at quickly detecting noisy or unproductive reasoning branches and continually redirecting its computation budget toward the branches that produce the most useful signals.

  • Cost-effective custom development: Because the framework relies on the offline replay environment, the entire search process cost only $39.90 and took 160 minutes. For enterprise teams, this means that reasoning strategies customized for proprietary models and internal tasks are now within reach, without a dedicated research budget.

Both the AutoTTS framework and the Confidence Momentum Controller are available on GitHub; CMC can be used as a drop-in replacement for other TTS controllers.


