
Every hardcoded LangChain pipeline your team builds starts breaking as soon as the query distribution changes – and it’s always changing. Sakana AI aims to overcome this hurdle.
Sakana AI researchers have introduced "RL Conductor," a small language model trained via reinforcement learning to automatically orchestrate a diverse pool of worker LLMs. The conductor dynamically analyzes the input, distributes labor among workers, and coordinates communication between agents.
This automated coordination achieves state-of-the-art results on difficult logic and coding benchmarks, outperforming individual frontier models such as GPT-5 and Claude Sonnet 4, as well as expensive human-designed multi-agent pipelines. It achieves this performance at a lower cost than competitors and with fewer API calls. RL Conductor is also the backbone of Fugu, Sakana AI’s commercial multi-agent orchestration service.
Limitations of Manual Agentic Frameworks
Large language models have strong latent capabilities. But making full use of these capabilities is a big challenge. Extracting this level of performance relies heavily on manually designed agentic workflows, which serve as critical components in commercial AI products.
However, these structures fall short because they are inherently rigid and constrained. In comments to VentureBeat, paper co-author Yujin Tang explained the exact breaking point of current systems: "While using frameworks with hard-coded pipelines like LangChain and Mixture-of-Agents can work well for specific use cases… in production, an inherent bottleneck arises when targeting domains with large user bases with very heterogeneous demands."
Tang added: "Real-world generalization to such heterogeneous applications naturally requires going beyond human-hardcoded designs."
Another obstacle to building robust agentic systems is that no single model is optimal for all tasks. Different models are fine-tuned to specialize in different domains. One model may excel at scientific reasoning, while another excels at code generation, mathematical reasoning, or high-level planning.
Since models have these different characteristics and complementary skills, it is practically impossible to manually predict and hard-code the ideal combination of models for each query. An optimal agentic framework should be able to analyze a problem and assign subtasks to the most appropriate expert in the pool.
Conducting an Orchestra of Agents
RL Conductor is designed to overcome the limitations of rigid, human-designed frameworks. As the name suggests, it conducts an orchestra of agents: breaking challenging problems into parts, assigning targeted subtasks, and designing communication topologies for a set of worker LLMs.
Instead of relying on fixed code or static routing, Conductor orchestrates these models by generating a customized workflow. For each step in the workflow, the model generates a natural language instruction for a specific aspect of the task, assigns an agent to complete it, and defines an "access list" that determines which previous subtasks and responses from other agents are included in that agent's context.
By defining everything in natural language, Conductor creates flexible workflows tailored to every input. It can create simple sequential chains, parallel tree structures, or even recursive loops depending on the demands of the problem.
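To make the idea concrete, here is a minimal sketch of what one such workflow step might look like as a data structure. The field names (`instruction`, `agent`, `access_list`) and the helper function are illustrative assumptions; the paper's actual schema is not described in the article.

```python
from dataclasses import dataclass, field

# Hypothetical sketch of a Conductor-style workflow step. Each step carries
# a natural-language instruction, an assigned worker agent, and an access
# list naming which earlier steps this agent is allowed to see.
@dataclass
class WorkflowStep:
    instruction: str                 # natural-language subtask description
    agent: str                       # worker LLM assigned to this step
    access_list: list[int] = field(default_factory=list)  # indices of visible prior steps

# A simple two-step sequential chain: a planner, then a coder that can
# read the planner's step. Agent names are placeholders.
workflow = [
    WorkflowStep("Outline an approach to the problem.", "planner-llm"),
    WorkflowStep("Implement the outlined approach in Python.",
                 "coder-llm", access_list=[0]),
]

def visible_context(steps: list[WorkflowStep], step_index: int) -> list[str]:
    """Return the instructions of the prior steps this step may read."""
    return [steps[i].instruction for i in steps[step_index].access_list]
```

Parallel trees or recursive loops fall out of the same representation: multiple steps sharing an empty access list run independently, while a step whose access list points back at earlier rounds forms a refinement loop.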
Importantly, the model learns these strategies not by human design but through reinforcement learning (RL) and reward maximization. During training, the model is given a task, a pool of workers, and a reward signal based on whether its answer and output format are correct.
Through a simple trial-and-error RL algorithm, the model systematically discovers which combinations of instructions and communication structures yield the highest reward. As a result, it automatically adopts advanced orchestration strategies such as targeted prompt engineering, iterative refinement, and meta-prompt optimization.
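The trial-and-error dynamic can be illustrated with a toy bandit loop: a learner samples candidate workflow templates, receives a binary reward, and shifts toward the high-reward template. This epsilon-greedy sketch is purely illustrative (the template names and success rates are invented), not Sakana AI's actual RL algorithm.

```python
import random

# Toy illustration of reward-driven discovery of workflow structures.
# The learner does not know the "true" success rates; it only observes
# binary rewards, yet converges on the best template.
random.seed(0)
templates = ["single-step", "two-agent", "plan-implement-verify"]
true_reward = {"single-step": 0.2, "two-agent": 0.5, "plan-implement-verify": 0.9}

counts = {t: 0 for t in templates}
values = {t: 0.0 for t in templates}   # running-mean reward estimates

for _ in range(2000):
    # Epsilon-greedy: explore 10% of the time, otherwise exploit.
    if random.random() < 0.1:
        chosen = random.choice(templates)
    else:
        chosen = max(templates, key=lambda t: values[t])
    reward = 1.0 if random.random() < true_reward[chosen] else 0.0
    counts[chosen] += 1
    values[chosen] += (reward - values[chosen]) / counts[chosen]

best = max(templates, key=lambda t: values[t])
print(best, {t: round(v, 2) for t, v in values.items()})
```

In the real system the "arms" are far richer (instructions, agent assignments, topologies) and the policy is a trained language model, but the underlying principle of reward maximization over sampled structures is the same.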
The model learns to dynamically adjust its strategies and take advantage of the specific strengths of its worker agents without having to hard-code the process by any human developer.
Conductor in Action
To put RL Conductor to the test, the researchers fine-tuned the 7-billion-parameter Qwen2.5-7B using the framework. During training, the conductor was tasked with designing agentic workflows consisting of up to five steps. It was given access to a worker pool of seven different models: three closed-source giants (Gemini 2.5 Pro, Claude Sonnet 4, and GPT-5) and four open-source models (including DeepSeek-R1-Distill-Qwen-32B, Gemma3-27B, and Qwen3-32B).
The team evaluated Conductor against a variety of highly challenging benchmarks, comparing it to individual frontier models acting alone, self-reflection agents iteratively prompted to improve their own answers, and state-of-the-art multi-agent routing frameworks such as MASRouter, Mixture-of-Agents (MoA), RouterDC, and Smoothie. The small 7B conductor set new marks across the board. According to the researchers, it achieved an average score of 77.27% across all tasks, 93.3% on the AIME25 math benchmark, 87.5% on GPQA-Diamond, and 83.93% on LiveCodeBench.
Remarkably, it achieved these marks while remaining highly efficient. While baselines like MoA consumed 11,203 tokens per query, Conductor used only 1,820 tokens on average, taking an average of only three steps per workflow.
A closer look at the experimental details reveals why the framework is so effective. The conductor learned to automatically gauge the difficulty of the task. For simple factual-recall questions, it often solved the problem in a single step or used a basic two-agent setup. For complex coding problems, however, it created a comprehensive workflow involving four agents with dedicated planning, implementation, and verification stages.
The conductor also learned that frontier models have different strengths. To achieve record scores on coding benchmarks, Conductor often assigned Gemini 2.5 Pro and Claude Sonnet 4 to act as high-level planners, and only brought in GPT-5 at the end to write the final optimized code. In a particularly clever display of adaptability, the conductor sometimes relinquished its role entirely, delegating the entire planning process to Gemini 2.5 Pro and allowing it to set subtasks for the rest of the pool.
Beyond math and coding benchmarks, Sakana AI is already putting the underlying architecture to work in practical settings. "We are using our Fugu model based on Conductor technology internally for a variety of practical enterprise applications: software development, in-depth research, strategy development and even visualization tasks like slide creation," Tang said.
Bringing Orchestration into the Enterprise: Sakana Fugu
While the 7B model described in the research paper was an exploratory blueprint and is not publicly available, Sakana AI has adapted the Conductor framework into its flagship commercial AI product, Sakana Fugu. Now in its beta phase, Fugu serves as a multi-agent orchestration system accessible through a standard OpenAI-compliant API.
Tang noted that Fugu targets "large markets of industries where AI adoption has not yet led to large productivity gains due to generalization limitations of current hard-coded pipelines, such as finance and defense."
For enterprise developers, this allows seamless integration into existing applications without the headache of managing multiple API keys or manually routing tasks between different vendors. Behind the API interface, Fugu automates complex collaboration topologies and role assignments across a pool of models. To support varying business needs, Sakana released two variants: Fugu Mini, designed for low-latency operation, and Fugu Ultra, designed for maximum performance on demanding workloads.
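Because Fugu exposes an OpenAI-compliant API, integration amounts to building the same chat-completions request body any OpenAI-style client sends, then pointing it at a different base URL. The sketch below constructs such a payload with the standard library only; the model name `"fugu-mini"` is a placeholder inferred from the article, not a confirmed identifier.

```python
import json

def chat_completion_payload(model: str, user_message: str) -> str:
    """Build the JSON body for a standard /v1/chat/completions request.

    The body shape follows the OpenAI Chat Completions format; only the
    base URL and API key differ between vendors, so no per-vendor
    routing code is needed on the application side.
    """
    return json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
    })

# Placeholder model name -- the article names "Fugu Mini" and "Fugu Ultra"
# as variants but does not give their API identifiers.
payload = chat_completion_payload("fugu-mini", "Summarize this quarterly report.")
print(payload)
```

In practice a developer would pass this body (or use an existing OpenAI-compatible client with a custom base URL) and let Fugu handle agent selection and workflow construction behind the endpoint.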
Addressing governance concerns around autonomous agents generating opaque workflows, Tang explained that the interpretability risks are functionally similar to the hidden reasoning traces of today's top closed APIs, and that the system ships with guardrails installed to reduce hallucinations.
For enterprise architects weighing RL-based orchestration against traditional routing, the decision often comes down to engineering resources. "We believe the most suitable moments come when users and their teams feel like they’re spending too much time guiding their hand-built agents," Tang said. However, he cautioned that the framework is not necessary for everything. "For simple queries it is hard to beat the economic proposition of a local model running directly on the user’s machine."
As the diversity of specialized open- and closed-source AI models continues to grow, static hardcoded pipelines will inevitably become obsolete. Looking ahead, this dynamic orchestration will likely extend beyond text and code environments. "There is indeed a huge potential to fill this gap with the cross-modal conductor framework, which will become the foundation of more autonomous, self-coordinated physical AI systems," Tang said.