
Researchers at Google Cloud and UCLA have proposed a new reinforcement learning framework that significantly improves the ability of language models to learn very challenging multi-step reasoning tasks. Supervised reinforcement learning (SRL) reformulates problem-solving as a sequence of logical "actions," providing rich learning signals during the training process.
This approach enables smaller models to learn complex problems that were previously out of reach of other common training techniques. Experiments show that SRL not only excels on mathematical reasoning benchmarks but also generalizes effectively to agentic software engineering tasks.
SRL is a versatile training framework that can elevate smaller, less expensive models to higher reasoning capabilities.
Limitations of current LLM reasoning training
Recent progress in training large language models (LLMs) for reasoning has been largely driven by reinforcement learning with verifiable rewards (RLVR), a method in which a model is rewarded based on the correctness of its final answer. By repeatedly attempting to solve problems and receiving feedback on the end result, the model gradually learns effective problem-solving strategies.
However, the success of this outcome-based approach depends on the model's ability to find the correct solution within a limited number of attempts, or "rollouts." Since each rollout is computationally expensive, models cannot try indefinitely. When problems become so difficult that the model rarely, if ever, finds the right answer within its budget, the learning signal all but disappears.
This poses a significant barrier to learning. In many multi-step reasoning problems, a model may solve several steps correctly but be derailed by a single mistake, leading to a wrong final answer. With RLVR, all of that effort earns a negative reward, and the model learns nothing from its partially correct work. It is an all-or-nothing approach that fails to provide granular feedback and offers only sparse rewards.
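To illustrate why this is so punishing, here is a minimal sketch of an outcome-only reward of the kind RLVR relies on. The function name and the simple string-matching check are illustrative assumptions, not the paper's implementation.

```python
# Illustrative sketch (not the paper's code): an outcome-only reward.
# A rollout that gets nine of ten steps right scores exactly the same as
# one that gets none right, so hard problems yield almost no signal.

def rlvr_outcome_reward(final_answer: str, reference_answer: str) -> float:
    """All-or-nothing reward: 1.0 only if the verifiable final answer matches."""
    return 1.0 if final_answer.strip() == reference_answer.strip() else 0.0

# A mostly-correct attempt that slips on the last step still earns 0.0,
# which is the sparse-reward failure mode described above.
print(rlvr_outcome_reward("x = 41", "x = 42"))  # -> 0.0
```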
An alternative method is supervised fine-tuning (SFT), where the model learns from examples containing complete reasoning processes laid out by experts. While SFT can instill reasoning capabilities, it often leads to overfitting: the model learns to mimic the trajectories in its training data rather than generalizing to problems beyond the examples it has seen. The issue is compounded by the fact that high-quality, human-created training data is both scarce and expensive to produce.
As the paper notes, these limitations leave a critical gap for training small open-source models to learn hard problems effectively.
How does supervised reinforcement learning work?
SRL introduces a framework that reformulates problem-solving as a "sequential decision-making process," striking a balance between pure outcome-based RL and pure imitation learning. Rather than optimizing only for the final answer or forcing the model to mimic an expert's entire thought process, SRL teaches the model to reproduce the sequence of key actions that form the backbone of expert reasoning. This allows the model to learn to act like an expert while developing its own internal reasoning style.
In the SRL framework, expert demonstrations are broken down into a series of intermediate, concrete actions, each of which represents a meaningful step. For a math problem, an action might be an algebraic manipulation. For a software engineering agent, it might be a command executed in a code repository. To generate training data, SRL uses a powerful teacher model to create solution trajectories, which are then used to train a smaller model. A minimal sketch of that data-preparation idea follows.
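In the sketch below, each expert trajectory becomes a series of (context, next expert action) pairs, so the student model is trained to predict one intermediate action at a time. The step format and field names are assumptions for illustration, not the paper's exact schema.

```python
# Illustrative sketch: turn one expert (teacher-generated) trajectory into
# per-step training examples. Field names are assumptions for illustration.

from typing import Dict, List

def expand_trajectory(problem: str, expert_actions: List[str]) -> List[Dict[str, str]]:
    """Create one training example per intermediate action.

    The context for step i is the problem plus the expert actions taken so far;
    the target is the next expert action the student should learn to produce.
    """
    examples = []
    for i, action in enumerate(expert_actions):
        context = problem + "\n" + "\n".join(expert_actions[:i])
        examples.append({"context": context, "target_action": action})
    return examples

# Example: a math trajectory broken into algebraic manipulation steps.
steps = ["Expand (x+1)^2 to x^2 + 2x + 1", "Subtract 1 from both sides", "Factor out x"]
for ex in expand_trajectory("Solve (x+1)^2 = 1", steps):
    print(ex["target_action"])
```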
According to I-Hung Hsu, a research scientist at Google and co-author of the paper, this middle-ground approach is key to its effectiveness in real-world scenarios. "SRL sits in the middle: it captures the structured flexibility of real-world problem solving, where there are many valid strategies but also clear notions of what 'good reasoning' looks like at each step," Hsu told VentureBeat. "This makes SRL suitable for domains like data science automation or possibly supply chain optimization – tasks that reward solid intermediate reasoning rather than just final answers."
During training, the model first generates an "inner monologue" (its internal reasoning process, enclosed in <think> tags) before committing to an action. At each step, SRL provides a reward based on the similarity between the model's predicted action and the expert's action. This step-wise reward scheme provides dense, fine-grained feedback, allowing the model to learn and improve even when its overall solution is not perfect. This addresses the sparse-reward problem that hampers RLVR.
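The sketch below shows that step-wise reward in miniature. The use of difflib's sequence similarity and the exact <think>-tag parsing are illustrative assumptions; the paper's actual similarity metric and formatting may differ.

```python
# Illustrative sketch: a dense, per-step reward based on how closely the
# model's predicted action matches the expert's action. The similarity
# metric (difflib ratio) and <think>-tag parsing are assumptions, not the
# paper's exact implementation.

import re
from difflib import SequenceMatcher

def extract_action(model_output: str) -> str:
    """Drop the private <think>...</think> monologue; keep only the committed action."""
    return re.sub(r"<think>.*?</think>", "", model_output, flags=re.DOTALL).strip()

def step_reward(model_output: str, expert_action: str) -> float:
    """Reward in [0, 1] for a single step, so partially correct work still earns signal."""
    action = extract_action(model_output)
    return SequenceMatcher(None, action, expert_action).ratio()

output = "<think>I should isolate x, so divide both sides.</think>Divide both sides by 2"
print(step_reward(output, "Divide both sides by 2"))  # close to 1.0
print(step_reward(output, "Take the square root"))    # much lower, but not zero
```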
SRL in action
The researchers' experiments show that SRL outperforms strong baselines on both challenging mathematical reasoning and agentic software engineering benchmarks. They also observed that SRL encourages more flexible and sophisticated reasoning patterns in models, such as interleaved planning and self-verification, which improve solution quality without lengthening the output.
For enterprise leaders, performance gains are valuable only if they do not come with excessive costs. Hsu clarified that SRL-trained models are more efficient in their reasoning. "The benefits come from better reasoning quality and structure, not from verbosity," he said. "In terms of efficiency, SRL-trained models are equivalent to the base model in token usage… While SRL is not designed to reduce inference cost, it achieves strong reasoning performance without increasing it."
For the math tasks, the team fine-tuned Qwen2.5-7B-Instruct on a dataset of 1,000 difficult math questions. They compared its performance against models trained with SFT and RLVR (using the GRPO algorithm common in models like DeepSeek-R1) on four competition-level mathematics benchmarks. The SRL-trained model achieved a substantial 3.0% average performance increase over the other methods.
The team then extended SRL to agentic software engineering, a domain critical for enterprise automation. They trained a coding-specialized model, Qwen2.5-Coder-7B-Instruct, on 5,000 expert trajectories of agents interacting with a coding environment. The SRL-trained model was benchmarked against the original base model and SWE-Gym-7B, a strong baseline fine-tuned with SFT. SRL achieved a 14.8% task-resolution rate, a 74% relative improvement over the SFT-based model. This demonstrates SRL's ability to train more capable AI agents for complex, real-world programming tasks.
A new standard for high-stakes AI?
The paper's strongest results came from combining the two methods: first using SRL to teach foundational reasoning, then using RLVR to refine that skill. In their experiments, when the researchers used SRL as pre-training and applied RLVR in post-training, they observed an average 3.7% increase, demonstrating a powerful curriculum-learning strategy.
This raises the question of whether this combination could become a new blueprint for building specialized AI.
"We see SRL as a strong foundation," Hsu said. "In a sense, SRL provides a curriculum – teaching the model to think and act step by step – before we refine those behaviors with outcome-based reinforcement learning. This SRL-first approach not only stabilizes the subsequent RL stage, but also makes the reasoning more interpretable and generalizable, which is critical for high-stakes applications."
Looking ahead, Hsu acknowledged that challenges remain in scaling this pipeline, particularly the high cost and complexity of end-to-end RLVR for agentic tasks. Still, he is optimistic about the road ahead. "While high-quality expert trajectories remain important," he concluded, "we believe the next big leap will come from automating their generation and filtering – leveraging strong teacher models or even self-improving student models to bootstrap new data."
