Beyond math and coding: New RL framework helps train LLM agents for complex, real-world tasks

Researchers at the University of Science and Technology of China have developed a new reinforcement learning (RL) framework that helps train large language models (LLMs) for complex agentic tasks beyond well-defined problems like mathematics and coding.

According to the researchers, Agent-R1 is compatible with popular RL algorithms and shows considerable improvement on reasoning tasks that require multiple retrieval steps and multi-turn interactions with tools.

The framework is built on a redefinition of the RL paradigm that takes into account the dynamic nature of agentic applications that require interactions with evolving environments and imperfect information. This framing is similar to real-world applications and may have important uses for agentic tasks in enterprise settings.

Rethinking reinforcement learning for agents

RL has become the cornerstone of training LLMs for well-defined reasoning tasks. In areas like mathematics and coding, models get a clear signal: the answer is either right or wrong. This makes it relatively simple to reward or punish the model's behavior.

But this approach struggles with agentic tasks that require models to work in interactive environments, develop dynamic memories during interactions, perform multi-step reasoning, and respond to unexpected feedback. Training agents with RL for these scenarios presents unique challenges, especially in multi-turn interactions where designing effective rewards is complex and the trained agent often fails to generalize to the messy, unpredictable nature of real-world environments.

To address these challenges, the researchers at the University of Science and Technology of China revisited the fundamental framework of RL, the Markov decision process (MDP). An MDP models decision-making with four key components: a state space (the set of possible states an agent can be in); an action space (what the agent can do); a state transition probability (the likelihood that an action moves the agent to a given state); and a reward function (how good or bad an outcome is). The paper proposes extending this framework to better suit LLM agents.
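The four MDP components can be sketched in a few lines of code. This is a generic, hypothetical illustration of the classic MDP definition, not code from the Agent-R1 codebase; the toy two-state example and all names are invented for clarity.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class MDP:
    states: set            # state space: configurations the agent can be in
    actions: set           # action space: what the agent can do
    transition: Callable   # transition(state, action) -> distribution over next states
    reward: Callable       # reward(state, action, next_state) -> float

# A toy two-state MDP to make the pieces concrete: "work" usually
# moves the agent from "start" to "done"; "wait" leaves the state unchanged.
toy = MDP(
    states={"start", "done"},
    actions={"work", "wait"},
    transition=lambda s, a: {"done": 0.9, "start": 0.1} if a == "work" else {s: 1.0},
    reward=lambda s, a, s2: 1.0 if s2 == "done" else 0.0,
)
```

The key point is that each component is an explicit, separate object; the paper's extension changes what goes into each slot rather than the overall shape.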

In the new formulation, the state space is expanded to include not only the current state (the sequence of tokens the model has generated so far) but also the entire history of interactions and environmental feedback. Actions are still fundamentally about generating text, but specific sequences of text can now trigger external tools such as API calls. State transitions become "stochastic," because the outcome depends not only on the tokens the model predicts but also on the environment's reaction, which is governed by external factors. Finally, the reward system becomes more elaborate, incorporating intermediate "process rewards" for successfully completing steps along the way, rather than a single reward at the end. This provides more consistent and accurate guidance to the agent during training.
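The idea that "specific sequences of text trigger tools" can be made concrete with a small parser. This is a hypothetical sketch; the `<tool>name(args)</tool>` tag format is an assumption for illustration, not the format Agent-R1 actually uses.

```python
import re

# Assumed tag format for illustration: <tool>name(arguments)</tool>
TOOL_PATTERN = re.compile(r"<tool>(\w+)\((.*?)\)</tool>")

def parse_action(text: str):
    """Return (tool_name, argument) if the generated text invokes a tool,
    otherwise None (plain text, no environment interaction)."""
    match = TOOL_PATTERN.search(text)
    if match:
        return match.group(1), match.group(2)
    return None
```

Whenever the parser finds a match, the rollout hands control to the environment instead of continuing pure text generation, which is exactly where the stochastic transition enters: the next state depends on what the tool returns.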

This last bit is particularly important and solves the “sparse reward” problem that most RL frameworks face. When the agent receives a single reward signal based on the final outcome, it does not learn from the right and wrong intermediate steps taken along the way. Process rewards solve this problem by providing feedback signals at these intermediate steps, making the learning process more efficient.
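The difference between sparse and process rewards can be shown with a simple discounted-return calculation. This is a generic illustration of the concept with made-up reward values, not the reward scheme from the paper.

```python
def discounted(rewards, gamma):
    """Standard discounted return: sum of gamma^t * r_t."""
    return sum(r * gamma**t for t, r in enumerate(rewards))

def sparse_return(steps, outcome_reward, gamma=0.99):
    # Only the final outcome carries signal; every intermediate step gets 0,
    # so the agent learns nothing about which steps helped.
    rewards = [0.0] * (steps - 1) + [outcome_reward]
    return discounted(rewards, gamma)

def process_return(step_rewards, outcome_reward, gamma=0.99):
    # Each intermediate step (e.g. a successful retrieval or tool call)
    # contributes a small reward on top of the final outcome.
    rewards = list(step_rewards) + [outcome_reward]
    return discounted(rewards, gamma)
```

With process rewards, a trajectory that takes good intermediate steps scores higher than one judged only on its ending, giving the optimizer a denser gradient signal.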

“These extensions are critical to enabling reinforcement learning algorithms to train sophisticated agents capable of engaging in complex, multi-step reasoning and interacting in dynamic environments,” the researchers write in their paper.

Agent-R1 Framework

Based on the expanded MDP definition, the researchers developed Agent-R1, a flexible and user-friendly training platform for RL-based LLM agents. It extends traditional single-turn RL frameworks to handle the multi-turn, interactive nature of agentic tasks, allowing seamless integration with diverse environments.

The most important difference lies in the "rollout phase," where the agent produces responses. In single-turn RL, the model generates a response once. In multi-turn RL, the process involves a series of back-and-forth interactions with the environment.
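The contrast between the two rollout styles can be sketched as a loop. This is a hypothetical skeleton; `model`, `env`, and the method names are placeholders, not the actual Agent-R1 API.

```python
def multi_turn_rollout(model, env, max_turns=8):
    """Single-turn RL would be one model.generate() call; multi-turn RL
    alternates generation with environment feedback until the task ends."""
    state = env.reset()          # initial prompt / task description
    trajectory = []
    for _ in range(max_turns):
        action_text = model.generate(state)          # LLM produces text
        state, reward, done = env.step(action_text)  # env may run a tool
        trajectory.append((action_text, reward))     # record for training
        if done:
            break
    return trajectory
```

The trajectory collected here, with a reward attached to each turn, is what the RL algorithm later optimizes over.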

Agent-R1 achieves this flexible multi-turn rollout with two main modules: Tool and ToolEnv. The Tool module acts as an executor that performs specific tasks, such as calling an API or accessing a database. When invoked, a tool performs its action and returns the direct, raw result. In contrast, ToolEnv is the orchestrator and interpreter. It takes the output from a tool and determines how that result affects the agent's state and overall task progress. ToolEnv manages state changes, calculates reward signals based on tool results, and packages the new state information for the agent.

In short, when an action is completed, the Tool reports "what happened," while ToolEnv dictates "what this result means for the agent and the task."
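The executor/orchestrator split described above can be sketched in a few lines. The class and method names here are illustrative assumptions, not the real Agent-R1 interfaces, and the search tool is stubbed rather than hitting a real API.

```python
class SearchTool:
    """Executor: performs the raw action and returns its direct result."""
    def run(self, query: str) -> list:
        # In practice this would call a search API; stubbed for illustration.
        return [f"doc about {query}"]

class ToolEnv:
    """Orchestrator: interprets tool output, updates state, assigns reward."""
    def __init__(self, tool):
        self.tool = tool
        self.history = []

    def step(self, query: str):
        docs = self.tool.run(query)        # "what happened"
        self.history.append(docs)          # fold the result into agent state
        reward = 0.1 if docs else 0.0      # process reward for a useful step
        new_state = {"history": self.history, "latest": docs}
        return new_state, reward           # "what this result means"
```

Keeping the two roles separate means a new tool only needs to produce raw results; all state-tracking and reward logic stays in one place.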

Agent-R1 in action

The researchers tested Agent-R1 on the challenging task of multi-hop question answering, which requires complex reasoning, information retrieval across multiple documents, and multi-step decision-making. They trained Qwen2.5-3B-Instruct on multi-hop QA data and evaluated its performance on the HotpotQA and 2WikiMultiHopQA datasets. They also tested it on the MuSiQue dataset, which was outside the scope of the tasks the agent was trained on.

They compared different RL algorithms trained with Agent-R1 against two baselines: Naive RAG, a single-pass retrieval method where the LLM produces answers based on a set of retrieved documents, and Base Tool Call, which uses the model’s native function-calling capability without special RL training.

The results showed that all RL-trained agents performed significantly better than the baselines. GRPO, the RL algorithm used in advanced reasoning models such as DeepSeek-R1, delivered the best performance overall.

“These results strongly validate the efficacy of Agent-R1 in training powerful LLM agents via end-to-end RL, showing consistent, substantial gains over baseline across different datasets and RL algorithms,” the researchers wrote.

These findings may be important for the enterprise, where there is strong pressure to apply RL and reasoning beyond well-defined domains. A framework designed to handle messy, multi-turn interactions with users and dynamic environments could pave the way for new agents capable of solving complex problems in real-world settings.

“We hope that Agent-R1 will provide a foundation for future work on scalable and integrated RL training for agentic LLMs,” the researchers concluded.
