
As enterprise AI systems scale to handle complex workflows, practitioners face the challenge of routing sub-tasks to the right tools and skills. Agents may have access to hundreds of tools and skills and become confused about which one to use at each step of a workflow.
To address this challenge, Alibaba researchers developed SkillWeaver, a framework that builds an execution graph for a given task and chooses the right skill for each node. They also introduce Skill-Aware Decomposition (SAD), a technique that uses feedback loops to let the agent iteratively fetch and examine relevant tool candidates. This compositional approach and feedback loop distinguish SkillWeaver from other tool-routing frameworks that choose tools in a one-shot fashion.
SkillWeaver is relevant to real-world AI applications where agents autonomously orchestrate multi-tool ecosystems, such as those built on the Model Context Protocol (MCP), to execute multi-step business operations such as downloading datasets, transforming information, and creating visual reports.
In the researchers’ experiments, this retrieve-and-route approach significantly increases accuracy while reducing token consumption by more than 99% compared to naively exposing agents to the entire tool library.
For practitioners building AI agents, the main takeaway is that the granularity of task decomposition is the biggest bottleneck for accurate tool retrieval.
The challenge of skill routing
Skills are a key pattern in modern LLM agent architectures. A skill is a modular, reusable tool specification described in structured natural language documentation.
As enterprise agents integrate with larger tool ecosystems, accurately routing user queries to the right skill becomes a daunting task. Exposing the entire library to an LLM so it can find the right tool is highly inefficient, quickly exceeds context limits, and consumes hundreds of thousands of tokens.
Most existing tool-usage frameworks attempt to solve this through API retrieval, documentation matching, or hierarchical structures that treat routing strictly as a single-skill selection or per-step problem.
However, this single-skill paradigm is inadequate for enterprise environments because real-world queries are inherently compositional. A standard business request such as "Download datasets, transform them, and create visual reports" cannot be accomplished with one tool. It requires breaking up the prompt and sequencing an API client, a data processor, and a visualization tool into a coherent, multi-step execution plan.
How SkillWeaver and SAD work
To address this, the researchers formulate the problem of handling complex tasks that require multiple skills as "compositional skill routing." Given a complex user prompt and a vast library of tools, an agent must simultaneously figure out how to break the request down into a sequence of atomic subtasks, how to map each subtask to the single best available skill, and how to compile those skills into an executable plan.
SkillWeaver organizes this process into three distinct stages: decompose, retrieve, and compose. In the first stage, the LLM acts as a task decomposer that breaks the user’s complex query into a sequence of subtasks, each of which requires one skill. Once the subtasks are clearly defined, the system uses an embedding model to compare each subtask against the skill library and extract a shortlist of top candidate tools for each step.
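The retrieve stage can be illustrated with a minimal sketch. It uses a toy bag-of-words vector as a stand-in for a real embedding model, and the skill names and descriptions are hypothetical, so it shows the shape of the step rather than the authors' implementation.

```python
import math
from collections import Counter

# Hypothetical skill library: name -> natural-language description.
SKILLS = {
    "api-client": "download data from a remote http api endpoint",
    "csv-parser": "parse and transform csv data into clean tables",
    "chart-gen": "generate charts and visual reports from tables",
    "email-sender": "send email messages to a recipient list",
}

def embed(text):
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(subtask, k=2):
    """Return the k skill names whose descriptions are most similar to the subtask."""
    query = embed(subtask)
    ranked = sorted(SKILLS, key=lambda name: cosine(query, embed(SKILLS[name])), reverse=True)
    return ranked[:k]

subtasks = ["download data from an api", "transform csv data", "generate visual reports"]
shortlists = {s: retrieve(s) for s in subtasks}
```

In a real system, `embed` would call a sentence-embedding model and the sorted scan would be replaced by a vector index lookup.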
In the final stage, a planner evaluates the retrieved candidates based on how well they work together. It checks inter-skill compatibility to ensure that the output of one tool flows naturally into the input of the next tool. It then creates a final execution plan in the form of a Directed Acyclic Graph (DAG) that maps dependencies so that independent tasks can potentially execute in parallel.
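To make the DAG idea concrete, here is a small sketch using Python's standard `graphlib` module. The plan itself is hypothetical (two independent downloads feeding one transform), and the batching function is an illustration of how dependencies expose parallelism, not SkillWeaver's planner.

```python
from graphlib import TopologicalSorter

# Hypothetical plan: each step maps to the set of steps it depends on.
plan = {
    "download_sales": set(),
    "download_inventory": set(),
    "transform": {"download_sales", "download_inventory"},
    "report": {"transform"},
}

def execution_batches(dag):
    """Group steps into batches; steps in the same batch do not depend on each other."""
    sorter = TopologicalSorter(dag)
    sorter.prepare()  # raises graphlib.CycleError if the plan contains a cycle
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # all steps whose dependencies are done
        batches.append(ready)
        sorter.done(*ready)
    return batches

batches = execution_batches(plan)
```

Here the two downloads land in the first batch and could run concurrently, while the transform and report steps must wait for their predecessors.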
For example, consider a user asking an AI agent to "Download datasets, transform them, and create visual reports." In the decompose phase, the decomposer LLM breaks the request down into three distinct subtasks: downloading the dataset, transforming the data, and generating reports.
In the retrieve phase, the system searches the library and finds candidates such as “api-client” or “http-fetch” for task one, “csv-parser” or “ETL-pipeline” for task two, and so on. Finally, the compose phase evaluates these options, selects the specific combination of “api-client,” “csv-parser,” and “chart-gen” as the most compatible, and combines them into a final, execution-ready workflow.
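One way to picture the compose phase is as a search over candidate chains scored by how well adjacent skills fit together. The sketch below assumes each skill declares an input and output type; those types, and the exhaustive search, are illustrative assumptions and not the scoring method described in the paper.

```python
from itertools import product

# Hypothetical I/O types for each candidate skill: (input_type, output_type).
SKILL_IO = {
    "api-client": (None, "csv"),
    "http-fetch": (None, "html"),
    "csv-parser": ("csv", "table"),
    "ETL-pipeline": ("db", "table"),
    "chart-gen": ("table", "image"),
}

# Shortlists per subtask, in retrieval-rank order.
candidates = [
    ["api-client", "http-fetch"],
    ["csv-parser", "ETL-pipeline"],
    ["chart-gen"],
]

def compatibility(chain):
    """Count adjacent pairs where one skill's output type matches the next skill's input type."""
    return sum(SKILL_IO[a][1] == SKILL_IO[b][0] for a, b in zip(chain, chain[1:]))

def compose(shortlists):
    """Pick the chain with the most compatible hand-offs between consecutive skills."""
    return max(product(*shortlists), key=compatibility)

best = compose(candidates)
```

Exhaustive search is fine for three short shortlists; longer plans would need pruning or a learned compatibility score.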
A major challenge of this pipeline is that LLMs often produce generic step descriptions that fail to match the specific, technical terminology of the actual skills available in the library. To fix this, SkillWeaver introduces Iterative Skill-Aware Decomposition (SAD), a feedback loop. SAD works by having the LLM draft an initial plan, running an initial search to find loosely matching skills, and then feeding those retrieved skills back to the LLM as hints. This allows the LLM to rewrite its decomposition so that the granularity and terminology align with the tools that actually exist.
SkillWeaver in action
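The feedback loop can be sketched as below. The `decompose` and `retrieve` callables, the round limit, and the stopping rule (stop when the plan no longer changes) are assumptions made for illustration; the toy stand-ins replace what would be an LLM call and an embedding search in practice.

```python
def skill_aware_decompose(query, decompose, retrieve, max_rounds=3):
    """Alternate between decomposing the query and retrieving skills until the plan stabilizes.

    decompose(query, hints) -> list of subtask strings (an LLM call in practice)
    retrieve(subtask)       -> list of candidate skill names
    """
    plan = decompose(query, [])  # initial draft without any knowledge of the library
    for _ in range(max_rounds):
        # Collect candidate skills for the current plan, de-duplicated in order of appearance.
        hints = list(dict.fromkeys(skill for step in plan for skill in retrieve(step)))
        revised = decompose(query, hints)
        if revised == plan:  # no change: the plan already matches the available skills
            break
        plan = revised
    return plan

# Toy stand-ins (hypothetical) so the loop can run without an LLM.
KEYWORDS = {
    "api-client": ["download", "connection"],
    "csv-parser": ["transform"],
    "chart-gen": ["report"],
}

def toy_retrieve(subtask):
    return [s for s, kws in KEYWORDS.items() if s in subtask or any(k in subtask for k in kws)]

def toy_decompose(query, hints):
    if not hints:  # without hints, the draft is too fine-grained
        return ["open connection", "download dataset", "transform data", "create report"]
    return [f"use {skill}" for skill in hints]  # with hints, one step per matching skill

plan = skill_aware_decompose("Download datasets, transform them, and create visual reports",
                             toy_decompose, toy_retrieve)
```

In this toy run the four-step draft collapses to three steps once the retrieved skills show that opening a connection and downloading belong to the same tool.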
To evaluate how Skillweaver performs in realistic enterprise scenarios, the researchers created a custom benchmark called CompSkillBench. It contains 300 multi-step questions of different difficulty levels. To reflect real-world environments, they used a library of 2,209 real-world skills obtained from the public MCP ecosystem, covering 24 functional categories such as cloud infrastructure, finance, and databases.
For the core engine, the researchers mainly used a lightweight 7-billion-parameter model (Qwen2.5-7B-Instruct) for task decomposition, combined with a standard semantic search retriever (MiniLM with a FAISS index) to find tools. SkillWeaver was evaluated against three main setups: a brute-force "LLM-Direct" method that stuffs all the tool names into a large model's prompt, a vanilla LLM-based decomposition without SAD, and a ReAct-style agent loop.
The experiments indicate that task decomposition is the main bottleneck. Standard LLM behavior falls short when working with large tool libraries, but the SAD feedback loop moves the needle dramatically. In the vanilla setup, the 7B model achieved decomposition accuracy (i.e., predicted the correct number of steps) only 51.0% of the time. With the SAD feedback loop activated, accuracy increased to 67.7% (with the larger Qwen-Max model, accuracy reached 92%). On "hard" tasks requiring four to five different skills, SAD improved accuracy by 50%.
One striking finding was that larger models can actually perform worse without guidance. When tested in the vanilla setup, the accuracy of a larger 14-billion-parameter model fell below that of the 7B model because it decomposed tasks into microscopic, redundant steps. Once SAD was introduced, the retrieved tool signals grounded the model and increased its accuracy. This suggests that aligning an agent with the terminology of specific tools is often more impactful than paying for larger, more expensive LLMs.
Another important measure is token savings. The LLM-Direct baseline, which used the much larger Qwen-Max model, showed that feeding all tools into the prompt of a larger model fails. Despite nearly perfect task decomposition capabilities, the giant model was only able to retrieve the correct tool category 21.1% of the time when its prompt was loaded with options. SkillWeaver’s targeted retrieve-and-route approach significantly outperformed it in accuracy while reducing context window consumption from an estimated 884,000 tokens to approximately 1,160 tokens per query, a 99.9% reduction. For practitioners, this translates directly into lower API costs and faster response times.
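The reported reduction follows directly from the two token counts quoted above, as this quick check shows:

```python
full_library_tokens = 884_000  # estimated prompt size when every tool is exposed
skillweaver_tokens = 1_160     # approximate prompt size per query with targeted retrieval

reduction = 1 - skillweaver_tokens / full_library_tokens
ratio = full_library_tokens / skillweaver_tokens

print(f"reduction: {reduction:.1%}")
print(f"the full-library prompt is about {ratio:.0f}x larger")
```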
Finally, the traditional ReAct baseline failed completely, scoring 0% decomposition accuracy. Its loop naturally collapses multi-step plans into separate tasks rather than explicitly mapping out a coherent, multi-tool sequence.
Ideas for developers
Although the researchers have not yet released the source code for SkillWeaver, their work was built on off-the-shelf tools and can be easily reproduced.
Skill-Aware Decomposition (SAD), the key innovation at the heart of the framework, is essentially a prompt-engineering and retrieval loop. The authors have shared the prompt template in their paper, and developers can implement it themselves using standard orchestration libraries like LangChain, LlamaIndex, or even raw Python scripts.
For the retrieval component, the authors built the core framework using an open-source embedding model, all-MiniLM-L6-v2. They found that swapping in a slightly stronger off-the-shelf encoder (bge-base-en-v1.5) immediately increased accuracy without any fine-tuning. While an off-the-shelf bi-encoder gets a relevant tool into the top 10 candidates about 70% of the time, it consistently struggles to rank the correct tool at number one, and only gets there 37% of the time. To bridge this gap, teams will need to implement a secondary cross-encoder or LLM-based reranker to re-order those top 10 candidates.
One required preparation step is to vectorize the tool library and build the FAISS index in advance. In practice, this is a negligible obstacle. It took just 15 seconds to embed and index all 2,209 skills in the benchmark. Once built, retrieving tools from the index adds less than 15 milliseconds of latency per query. For enterprise environments, syncing the tool index is a minor background task.
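The two-stage pattern (a fast retriever for the shortlist, a slower scorer for the final order) can be sketched as follows. The `toy_score` function is a placeholder for a cross-encoder or LLM judgment, and the skill descriptions are hypothetical.

```python
# Hypothetical descriptions for three shortlisted skills.
DESCRIPTIONS = {
    "http-fetch": "fetch a web page over http",
    "file-reader": "read a local file from disk",
    "api-client": "download a dataset from a rest api",
}

def rerank(subtask, candidates, score):
    """Re-order a retrieved shortlist by a finer-grained relevance score."""
    return sorted(candidates, key=lambda skill: score(subtask, skill), reverse=True)

def toy_score(subtask, skill):
    """Placeholder scorer: number of words shared by the subtask and the skill description."""
    return len(set(subtask.lower().split()) & set(DESCRIPTIONS[skill].split()))

# Suppose the bi-encoder returned the right tool, but only in third place.
shortlist = ["http-fetch", "file-reader", "api-client"]
reranked = rerank("download the dataset from the api", shortlist, toy_score)
```

Because the reranker only sees the shortlist, its cost scales with the number of candidates per step rather than with the size of the whole library.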
A current limitation of SkillWeaver is the lack of error recovery. While SkillWeaver successfully maps a coherent DAG for execution, the authors’ pilot study revealed the fragility of multi-step tool chains. For example, if an API call in step two fails, the entire chain breaks. The main contribution of the paper is limited to the routing and planning phase. For actual production deployment, practitioners must build their own error recovery, fallback, and retry mechanisms on top of the compose stage to handle real-world API timeouts or malformed output.
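As a rough illustration of what such a layer might look like, the sketch below wraps a single plan step with retries and an ordered list of fallback skills. The function names and the simulated failure are hypothetical, and this is not part of SkillWeaver itself.

```python
import time

def run_step(skills, payload, retries=2, delay=0.0):
    """Try each skill in order; retry a failing skill before falling back to the next one."""
    last_error = None
    for skill in skills:
        for attempt in range(retries + 1):
            try:
                return skill(payload)
            except Exception as err:  # in production, catch narrower exception types
                last_error = err
                time.sleep(delay * (2 ** attempt))  # exponential backoff between attempts
    raise RuntimeError("all skills failed for this step") from last_error

calls = []

def flaky_api(payload):
    """Hypothetical primary skill that always times out."""
    calls.append("flaky_api")
    raise TimeoutError("simulated timeout")

def backup_api(payload):
    """Hypothetical fallback skill, e.g. the runner-up from the retrieval shortlist."""
    calls.append("backup_api")
    return f"fetched {payload}"

result = run_step([flaky_api, backup_api], "sales.csv")
```

A natural source of fallbacks is the retrieval shortlist: the second-ranked candidate for a step can stand in when the first one fails.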