
In a new paper, researchers at Google and UC Santa Barbara studying tool use in large language model (LLM) agents have developed a framework that helps agents make more efficient use of tools and tool-call budgets. The researchers introduce two new techniques: a simple "budget tracker" and a more comprehensive framework called Budget-Aware Test-time Scaling (BATS). These techniques make agents explicitly aware of their remaining reasoning and tool-use allowances.
As AI agents rely on tool calls to do work in the real world, test-time scaling has become less about smart models and more about controlling costs and latency.
For enterprise leaders and developers, budget-conscious scaling techniques provide a practical path to deploying effective AI agents without facing unexpected costs or diminishing returns on compute expenditure.
The challenge of scaling tool use
Traditional test-time scaling focuses on increasing the model's internal "thinking" tokens. However, for agentic tasks such as web browsing, the number of tool calls directly determines the depth and breadth of exploration.
This introduces significant operational overhead for businesses. "Tool calls such as webpage browsing result in more token consumption, increased context length and additional time latency," Zifeng Wang and Tengxiao Liu, co-authors of the paper, told VentureBeat. "The tool calls themselves introduce additional API costs."
The researchers found that simply providing agents with more test-time resources does not guarantee better performance. "In a deep research task, if the agent has no understanding of the budget, it often works blindly," Wang and Liu explained. "It finds a somewhat related lead, then spends 10 or 20 tool calls digging into it, only to realize that the whole path was a dead end."
Optimize Resources with Budget Tracker
To evaluate how they could optimize tool-usage budgets, the researchers first tried a lightweight approach called "Budget Tracker." This module acts as a plug-in that gives the agent a continuous indication of remaining resources, enabling budget-aware tool use.
As the team writes in the paper, "Providing explicit budget cues enables the model to internalize resource constraints and adapt its strategy without the need for additional training."
Budget Tracker operates entirely at the prompt level. (The paper provides the full prompts used for the budget tracker, making it easy to implement.)
In Google's implementation, the tracker provides a brief policy prompt describing the budget setup and guidance on tool use. At each step of the reasoning process, the budget tracker makes the agent explicitly aware of its resource consumption and remaining budget, enabling it to plan subsequent reasoning steps based on the updated resource state.
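The article doesn't reproduce the paper's prompts, but the core mechanism, injecting an updated budget summary into the agent's context at every step, can be sketched as follows. The class name, prompt wording, and budget figures here are illustrative assumptions, not taken from the paper:

```python
# Illustrative sketch of a prompt-level budget tracker. Names, prompt
# wording, and budget numbers are hypothetical, not from the paper.

class BudgetTracker:
    """Tracks tool-call consumption and renders a budget cue for the prompt."""

    def __init__(self, search_budget: int, browse_budget: int):
        self.limits = {"search": search_budget, "browse": browse_budget}
        self.used = {"search": 0, "browse": 0}

    def record(self, tool: str) -> None:
        """Record one call to `tool` (e.g. 'search' or 'browse')."""
        self.used[tool] += 1

    def remaining(self, tool: str) -> int:
        return self.limits[tool] - self.used[tool]

    def render_cue(self) -> str:
        """Budget state to be injected into the agent's context at every step."""
        lines = ["Budget status:"]
        for tool in self.limits:
            lines.append(
                f"- {tool}: {self.used[tool]} used, "
                f"{self.remaining(tool)} of {self.limits[tool]} remaining"
            )
        lines.append("Plan your next steps with the remaining budget in mind.")
        return "\n".join(lines)


tracker = BudgetTracker(search_budget=10, browse_budget=20)
tracker.record("search")
print(tracker.render_cue())
```

In a ReAct-style loop, the string returned by `render_cue()` would be appended to the agent's context before each reasoning step, so the model always conditions on its current resource state.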
To test this, the researchers experimented with two paradigms: sequential scaling, where the model iteratively refines its outputs, and parallel scaling, where multiple independent runs are conducted and aggregated. They ran experiments on search agents equipped with search and browse tools following a ReAct-style loop. ReAct (reasoning + acting) is a popular method where the model alternates between internal thinking and external actions. To explore the true cost-performance scaling trend, they developed an integrated cost metric that jointly accounts for the costs of both internal token consumption and external tool interactions.
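The article doesn't spell out the paper's exact cost formula; one natural way to combine internal and external costs into a single number is a weighted sum of token and per-call prices, as in this sketch. All prices below are placeholder assumptions, not figures from the paper:

```python
# Illustrative integrated cost metric: a weighted sum of token costs and
# per-call tool costs. All prices are placeholder assumptions.

def integrated_cost(
    input_tokens: int,
    output_tokens: int,
    search_calls: int,
    browse_calls: int,
    price_per_input_token: float = 1.25e-6,   # assumed $/token
    price_per_output_token: float = 10e-6,    # assumed $/token
    price_per_search: float = 0.005,          # assumed $/call
    price_per_browse: float = 0.001,          # assumed $/call
) -> float:
    """Total dollar cost of one agent run: internal tokens plus external tools."""
    token_cost = (input_tokens * price_per_input_token
                  + output_tokens * price_per_output_token)
    tool_cost = (search_calls * price_per_search
                 + browse_calls * price_per_browse)
    return token_cost + tool_cost
```

A metric like this lets the researchers plot accuracy against a single dollar axis, rather than tracking tokens and tool calls separately.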
They tested Budget Tracker on three information-seeking QA datasets requiring external search, including BrowseComp and HLE-Search, using models such as Gemini 2.5 Pro, Gemini 2.5 Flash, and Claude Sonnet 4. Experiments show that this simple plug-in improves performance across various budget constraints.
"Adding Budget Tracker achieves comparable accuracy using 40.4% fewer search calls, 19.9% fewer browse calls, and a 31.3% reduction in total costs," the authors told VentureBeat. Moreover, as the budget grew, performance with the budget tracker continued to scale, while plain ReAct stagnated after a certain threshold.
BATS: A Comprehensive Framework for Budget-Conscious Scaling
To further improve tool-use resource optimization, researchers introduced Budget Aware Test-Time Scaling (BATS), a framework designed to maximize agent performance under any budget. BATS maintains a continuous signal of remaining resources and uses this information to dynamically adapt the agent’s behavior as it formulates its response.
BATS uses several modules to orchestrate agent activities. A planning module adjusts the exploration effort to match the current budget, while a verification module decides whether to "dig deeper" into a promising lead or "pivot" to alternative routes depending on resource availability.
Given the information-seeking question and the tool-calling budget, BATS begins by using the planning module to prepare a structured action plan and decide which tools to invoke. As tools are called, their responses are appended to the reasoning sequence to provide context with new evidence. When the agent proposes a candidate answer, the verification module checks it and decides whether to continue the current sequence or start a new attempt with the remaining budget.
The iterative process ends when the budget is exhausted, at which point an LLM-as-judge selects the best answer among all verified candidates. Throughout execution, the budget tracker updates both the resource usage and the remaining budget at each iteration.
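The control flow described above, plan, act, verify, then continue or restart, with an LLM-as-judge at budget exhaustion, can be sketched as a skeleton loop. The `plan`, `act`, `verify`, and `judge` callables stand in for LLM calls; only the budget-aware control flow is shown, and the structure is a reconstruction from the article's description rather than the paper's code:

```python
# Skeleton of a BATS-style control loop, reconstructed from the article's
# description. The plan/act/verify/judge callables are stand-ins for LLM
# calls; only the budget-aware control flow is illustrated.

class Budget:
    def __init__(self, total_calls: int):
        self.remaining = total_calls

    def spend(self, n: int = 1) -> None:
        self.remaining -= n


def bats_loop(question, budget, plan, act, verify, judge):
    verified = []
    context = plan(question, budget.remaining)           # structured action plan
    while budget.remaining > 0:
        step = act(question, context, budget.remaining)  # one tool call
        budget.spend()
        context = context + [step]                       # append new evidence
        answer = step.get("answer")
        if answer is not None:
            if verify(question, context, answer):
                verified.append(answer)
            # Whether the attempt produced a verified answer or hit a dead
            # end, replan a fresh attempt with the remaining budget.
            context = plan(question, budget.remaining)
    # Budget exhausted: an LLM-as-judge selects among verified answers.
    return judge(question, verified)


# Toy demo with stub callables (no real LLM or tools involved).
result = bats_loop(
    "example question",
    Budget(total_calls=3),
    plan=lambda q, rem: [],
    act=lambda q, ctx, rem: {"answer": f"candidate-{rem}"},
    verify=lambda q, ctx, ans: True,
    judge=lambda q, answers: answers[0] if answers else None,
)
print(result)
```

The key design choice this sketch captures is that every decision point receives the remaining budget as an argument, so both planning and verification can trade exploration depth against what is left to spend.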
The researchers tested BATS on the BrowseComp, BrowseComp-ZH, and HLE-Search benchmarks against baselines including standard ReAct and various training-based agents. Their experiments show that BATS achieves higher performance while using fewer tool calls and incurring lower overall costs than competing methods. Using Gemini 2.5 Pro as the backbone, BATS achieved 24.6% accuracy on BrowseComp compared to 12.6% for standard ReAct, and 27.0% accuracy on HLE-Search compared to 20.5% for ReAct.
BATS not only improves performance under budget constraints, but also delivers better cost-performance trade-offs. For example, on the BrowseComp dataset, BATS achieved higher accuracy at a cost of about 23 cents than the parallel scaling baseline reached at more than 50 cents.
According to the authors, this efficiency makes previously expensive workflows viable. "This unlocks a range of long-horizon, data-intensive enterprise applications… such as complex codebase maintenance, due diligence checks, competitive landscape research, compliance audits and multi-step document analysis," Wang and Liu said.
As enterprises look to deploy agents that manage their own resources, the ability to balance accuracy with cost will become a critical design requirement.
"We believe that the relationship between reasoning and economics will become inseparable," Wang and Liu said. "In the future, [models] must reason about value."