How Google’s 'internal RL' could unlock long-horizon AI agents

LLM reasoning
Researchers at Google have developed a technique that makes it easier for AI models to learn complex reasoning tasks that typically cause LLMs to hallucinate or break down. Instead of training LLMs through next-token prediction alone, their method, called internal reinforcement learning (internal RL), steers the model's internal activations toward developing high-level, step-by-step solutions to the input problem.

Ultimately, this could provide a scalable path to creating autonomous agents that can handle complex logic and real-world robotics without the need for constant, manual guidance.

Limitations of next-token prediction

Reinforcement learning post-training plays an important role in LLMs, especially for complex reasoning tasks that require long-horizon planning. The problem, however, lies in the architecture of these models. LLMs are autoregressive, meaning they generate one token at a time. When these models discover new strategies during training, they do so by making small, random changes to the next single token or action. This exposes a deeper limitation: next-token prediction forces the model to search for solutions at the wrong level of abstraction, making long-horizon reasoning inefficient even if the model "knows" what to do.

This token-by-token approach works well for basic language modeling but fails in long-horizon tasks where rewards are sparse. If the model relies on purely random token-level sampling, the chance of stumbling onto the correct multi-step solution is vanishingly small: "on the order of one in a million," according to the researchers.
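To build intuition for why sparse rewards defeat token-level exploration, consider a back-of-the-envelope calculation (the numbers below are illustrative, not from the paper): even a modest branching factor per step makes the odds of randomly sampling one correct multi-step sequence collapse.

```python
# Illustrative math only; the branching factor and step count here
# are hypothetical, not figures from the Google paper.

def random_success_prob(choices_per_step: int, num_steps: int) -> float:
    """Probability that uniform random sampling picks the single
    correct option at every step of a multi-step task."""
    return (1.0 / choices_per_step) ** num_steps

# Even with only 2 plausible continuations per step, a 20-step task
# leaves roughly a one-in-a-million chance of random success.
p = random_success_prob(2, 20)
print(p)  # 1 / 2**20, about 9.5e-07
```

This is why the article's "one in a million" framing is not hyperbole: the search space grows exponentially in the number of low-level steps.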

The issue is not just that models get lost; it is that they get lost at the wrong level. In comments provided to VentureBeat, paper co-author Yannick Schimpf explained that in a 20-step task, an agent can get lost in the minute details of a single step, or it can lose track of the overall goal.

"We argue that when faced with a problem with an abstract structure… [goal-oriented exploration] This is what you want," Schimpf said. By solving the problem at the first abstraction level, the agent commits to a path, ensuring that it does not "get lost in one of the stages of the argument" and fails to complete comprehensive workflows.

To address this, the field has long looked to hierarchical reinforcement learning (HRL). HRL attempts to solve complex problems by temporally decomposing a task into a hierarchy of abstract actions (high-level subroutines that represent different steps of the solution) rather than treating it as a flat string of tokens.

However, discovering suitable subroutines remains a long-standing challenge. Current HRL methods often fail to find appropriate policies, instead "collapsing into degenerate alternatives" that do not represent meaningful behavior. Even sophisticated modern methods like GRPO (a popular RL algorithm used for sparse-reward tasks) fail in complex environments because they cannot effectively bridge the gap between low-level execution and high-level planning.

Steering the LLM's inner thoughts

To overcome these limitations, the Google team proposed internal RL. The key insight is that advanced autoregressive models already "know" how to perform complex, multi-step tasks internally, even if they have not been explicitly trained to do so.

Because these complex behaviors are hidden inside the model's residual stream (the numerical values that carry information through the layers of the network), the researchers introduced an "internal neural network controller," or metacontroller. Instead of monitoring and changing output tokens, the metacontroller steers the model's behavior by applying changes to its internal activations in the middle layers.

This nudge moves the model into a useful internal state. The base model then automatically generates the sequence of individual steps needed to achieve that goal, because it has already seen those patterns during pretraining.
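The mechanics of such a nudge can be sketched with a toy model. The architecture below is my own illustration, not the paper's actual design: a "model" is a stack of residual layers passing a vector forward, and a metacontroller adds a steering vector at a middle layer, changing everything downstream without touching the frozen weights.

```python
import numpy as np

# Toy sketch of activation steering (hypothetical architecture; the
# paper's metacontroller almost certainly differs in detail).

rng = np.random.default_rng(0)
DIM, NUM_LAYERS, STEER_LAYER = 8, 6, 3

# Frozen base model: each layer is a fixed linear map plus a residual.
weights = [rng.normal(scale=0.1, size=(DIM, DIM)) for _ in range(NUM_LAYERS)]

def forward(x, steering_vector=None):
    """Run the residual stream through all layers, optionally adding
    the metacontroller's steering vector after STEER_LAYER."""
    h = x
    for i, W in enumerate(weights):
        h = h + W @ h                      # residual update (base model)
        if steering_vector is not None and i == STEER_LAYER:
            h = h + steering_vector        # metacontroller's nudge
    return h

x = rng.normal(size=DIM)
goal_vector = rng.normal(size=DIM)         # stands in for a learned goal

baseline = forward(x)
steered = forward(x, steering_vector=goal_vector)
# The nudge alters all downstream activations, and hence the output,
# while the frozen base weights stay untouched.
print(np.allclose(baseline, steered))  # False
```

In a real LLM the equivalent intervention would be a hook on a transformer block's output, but the principle is the same: steer the representation, not the tokens.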

The metacontroller learns without supervision and requires no human-labeled training examples. Instead, the researchers use a self-supervised framework in which the model analyzes the full sequence of behavior and works backwards to infer the hidden, higher-level intention that best explains the actions.

During the internal RL phase, updates are applied to the metacontroller, shifting training from next-token prediction to learning higher-level actions that lead to solutions.

To understand its practical value, consider an enterprise agent doing code generation. Today, there is a tough trade-off: you want low temperature (predictability) for correct syntax, but high temperature (creativity) for solving logic puzzles.

"Internal RL can facilitate this by allowing the model to explore the space of abstract actions, i.e., structuring arguments and method calls, while delegating the token-level realization of those actions to the robust, low-temperature distribution of the base model." Schimpf said. The agent finds the solution without breaking the syntax.

The researchers investigated two ways of implementing this controller. In the first, the base autoregressive model is pre-trained on a behavioral dataset and then frozen, while the metacontroller is trained to steer the residual stream of the frozen model. In the second, the metacontroller and the base model are jointly optimized, with the parameters of both networks updated simultaneously.

Internal RL in action

To evaluate the effectiveness of internal RL, the researchers ran experiments in hierarchical environments designed to defeat traditional learners: a discrete grid world and a continuous-control task in which a quadruped "ant" robot must coordinate its joints. Both environments used sparse rewards and very long action sequences.

While baselines like GRPO and CompILE failed to learn the tasks within a million episodes due to the difficulty of long-horizon credit assignment, internal RL achieved higher success rates in far fewer training episodes. By choosing high-level goals instead of tiny steps, the metacontroller drastically reduced the search space. This let the model identify which high-level decisions led to success, making credit assignment tractable enough to solve the sparse-reward problem.
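The scale of that search-space reduction is easy to quantify with made-up but plausible numbers (illustrative only, not figures from the paper): compare the number of sequences a token-level explorer must search against the number of sequences at the goal level.

```python
# Hypothetical comparison of exploration at the token level versus
# the goal level; all four constants are invented for illustration.

TOKEN_CHOICES = 50    # plausible next tokens at each low-level step
TOKEN_STEPS = 20      # low-level steps in the full solution
GOAL_CHOICES = 5      # candidate high-level goals at each decision
GOAL_STEPS = 4        # high-level goals needed to finish the task

token_space = TOKEN_CHOICES ** TOKEN_STEPS  # token-level sequences
goal_space = GOAL_CHOICES ** GOAL_STEPS     # goal-level sequences

print(f"token-level search space: {token_space:.2e}")
print(f"goal-level search space:  {goal_space}")   # 625
```

A reward signal spread over hundreds of candidate goal sequences is learnable; the same signal spread over ~10^33 token sequences is not, which is the credit-assignment gap the article describes.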

Notably, the researchers found that the "frozen" approach worked best. When the base model and metacontroller were trained jointly from scratch, the system failed to develop meaningful abstractions. Applied to a frozen model, however, the metacontroller successfully discovered key checkpoints without any human labels, aligning its internal switching mechanism with the ground-truth moments when the agent completed one subtask and started the next.

While the industry currently focuses on reasoning models that produce explicit "chains of thought" to solve problems, Google's research points to a different, perhaps more efficient future.

"Our study adds to a growing body of work that suggests that ‘internal reasoning’ is not only possible but potentially more efficient than token-based approaches." Schimpf said. "Furthermore, these tacit ‘thoughts’ can be dissociated from specific input modalities – a property that may be particularly relevant to the future of multi-modal AI."

If internal reasoning can be directed without being externalized, the future of AI agents may depend less on prompting strategies and more on how well we can access and steer the abstractions models already represent internally. For enterprises betting on autonomous systems that must plan, adapt, and act over long horizons, that shift may matter more than any new reasoning benchmark.


