You’re probably using Agent Skills wrong

But thankfully I’m smart enough to show you how it’s done

Closeup of a hive with some intense orange light. When I originally picked it up I knew what I was going to do with it.
photo by lin dai / unsplash

The entire ecosystem around Claude Code is quite confusing: the naming conventions are messy and the speed of change is beyond any production tool I’ve seen. Of all its pieces, though, the skill is probably the most misused. I’ve seen tons of writing on this, but there’s a paper on Hacker News right now:

SkillsBench: Benchmarking how well agent skills work across different tasks

Agent skills are structured packages of procedural knowledge that augment LLM agents at inference time. Despite rapid adoption, there is no standard way to measure whether they actually help. We introduce SkillsBench, a benchmark of 86 tasks across 11 domains, paired with curated skills and deterministic validators. Each task is evaluated under three conditions: no skill, curated skill, and self-generated skill. We test 7 agent-model configurations on 7,308 trajectories. Curated skills increase the average pass rate by 16.2 percentage points (pp), but the impact varies widely by domain (+4.5pp for Software Engineering to +51.9pp for Healthcare) and 16 out of 84 tasks show a negative delta. Self-generated skills provide no benefit on average, indicating that models cannot reliably author the procedural knowledge they benefit from consuming. Focused skills with 2-3 modules outperform extensive documentation, and smaller models with skills can match larger models without them.


The paper prompted me to write this post.

The HN title is editorialized for some reason: “Study: Self-built agent skills are useless”. But it immediately grabbed me, because I get massive value from skills written by agents, yet I also see them constantly misused by my peers. The concept is great, and I’ve been looking at benchmarking specific parts of the agentic ecosystem myself, so it was highly relevant to me. Overall the paper is good, but one bullet invalidates the whole thing:

Self-generated skills: No skill is provided, but the agent is prompted to generate relevant procedural knowledge before solving the task. This isolates the effect of the LLM’s latent domain knowledge.

So what they’re doing is taking a problem that a model can’t solve well on its own, and asking it to write about the task before attempting it. They’ve just reinvented thinking blocks, but worse!

anti-skill pattern

What they did is a very common mistake that I see all the time: “My agent was bad at this thing, so I asked the agent to write a skill on this thing.” I’ll repeat: that is just a thinking block. To create something meaningful for your agent, you need to make sure it has seen where it fails. It’s like the classic intro CS exercise where you ask someone to write down the steps for making a PB&J: you don’t really understand what makes the problem hard until you’ve struggled to solve it.

This leads directly to the biggest mistake of the AI era: passing someone else’s question to an LLM verbatim, and pasting the LLM’s answer as your response. If I ask you how you did something good with an agent, and you spin up a fresh agent to create a SKILL.md for me from my question, I will kill you.

What are skills?

Before I get into proper usage, I just want to explain what skills are. As a primitive they are just Markdown files with some metadata on top that helps agents and tools know when to use them; the rest of the file is the skill’s documentation. Each skill has its own folder, so it can not only teach your agent how to do something but also give it better tools.

.claude/skills/
└── monitor-gitlab-ci/
    ├── SKILL.md # The file mentioned above
    ├── monitor_ci.sh # Complicated command
    └── references/ # Additional references 
        ├── api_commands.md
        ├── log_analysis.md
        └── troubleshooting.md

The above is a skill I used to get older versions of Claude to work with my GitLab CI. It’s a folder with a simple Markdown skill that explains the setup and what the agent needs in order to watch CI until a job fails or everything passes, a simple CLI so the agent doesn’t have to write its own scripts, and additional context for edge cases.
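To make the "watch CI until a job fails or everything passes" idea concrete, here is a minimal sketch of that loop in Python. It is not the contents of my monitor_ci.sh: `wait_for_ci`, `fetch_statuses`, and the status strings `"failed"` and `"success"` are assumptions made for illustration, and the fetcher is injected so no particular CI API is implied.

```python
import time
from typing import Callable, Dict


def wait_for_ci(
    fetch_statuses: Callable[[], Dict[str, str]],
    poll_seconds: float = 30.0,
    max_polls: int = 120,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll until any job fails or every job succeeds.

    fetch_statuses is a hypothetical callable returning {job_name: status}.
    Returns "failed", "passed", or "timeout".
    """
    for _ in range(max_polls):
        statuses = fetch_statuses()
        # Stop on the first failure so the agent can go read the logs.
        if any(status == "failed" for status in statuses.values()):
            return "failed"
        # An empty dict means no jobs were reported yet, so keep waiting.
        if statuses and all(status == "success" for status in statuses.values()):
            return "passed"
        sleep(poll_seconds)
    # Any other status (still running, skipped, ...) ends up here eventually.
    return "timeout"


# Demo with a canned sequence of poll results instead of a real CI server.
feed = iter([
    {"build": "running", "test": "pending"},
    {"build": "success", "test": "running"},
    {"build": "success", "test": "failed"},
])
print(wait_for_ci(lambda: next(feed), sleep=lambda _: None))  # prints "failed"
```

The point of shipping something like this inside the skill folder is that the agent calls one command and gets one of three answers, instead of improvising a new polling script every session.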

skills for context

Agents are completely stateless, meaning every new interaction is like meeting the model for the first time: it has no idea what your project is or what you were working on 10 minutes ago. CLAUDE.md does a lot to fix this, but for a large project it can’t cover everything. If I open a monorepo and ask Claude to run a SIL test, it has to scramble around to figure out how. It has to work out what language the project is in, then look for common test patterns for that language, then find a complex Docker Compose setup, then discover that the containers require x86 but we’re running on a Mac, then go looking at CI, and so on.

All of this can be solved by writing skills for patterns that are common in your project but not universal. Whenever a model struggles to do something in your project that you know is simple and basic, ask it to create a skill that covers the knowledge gap for that task.

skills for repetition

Another simple use of skills is to capture tasks you perform frequently. For example, I often ask my agents to make sure that my docs/, MR details, points, and codebase are all aligned. So I created a simple skill for this, which saves me from typing it out every time.
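A repetition skill like that is small enough to scaffold by hand or with a few lines of Python. The sketch below writes a SKILL.md with a `name` and `description` header and reads the header back. The skill name `align-docs`, the description text, the body steps, and both helper functions are made up for illustration, and the header parser only handles flat `key: value` lines, so treat it as a sketch rather than a real YAML reader.

```python
import tempfile
from pathlib import Path


def write_skill(root: Path, name: str, description: str, body: str) -> Path:
    """Create <root>/.claude/skills/<name>/SKILL.md and return its path."""
    skill_dir = root / ".claude" / "skills" / name
    skill_dir.mkdir(parents=True, exist_ok=True)
    skill_file = skill_dir / "SKILL.md"
    # Metadata block on top, documentation underneath.
    header = f"---\nname: {name}\ndescription: {description}\n---\n\n"
    skill_file.write_text(header + body.strip() + "\n", encoding="utf-8")
    return skill_file


def read_frontmatter(skill_file: Path) -> dict:
    """Parse the flat `key: value` lines between the two `---` markers."""
    lines = skill_file.read_text(encoding="utf-8").splitlines()
    if not lines or lines[0].strip() != "---":
        return {}
    meta = {}
    for line in lines[1:]:
        if line.strip() == "---":
            break
        key, sep, value = line.partition(":")
        if sep:
            meta[key.strip()] = value.strip()
    return meta


# Demo in a temporary directory so nothing touches a real project.
with tempfile.TemporaryDirectory() as tmp:
    path = write_skill(
        Path(tmp),
        name="align-docs",  # hypothetical skill name
        description="Use when asked to check that docs, MR details, and code agree",
        body="1. Read docs/ and the MR details.\n2. Diff their claims against the codebase.\n3. List every mismatch.",
    )
    print(read_frontmatter(path)["name"])  # prints "align-docs"
```

The description line matters most: it is the part the agent sees when deciding whether the skill applies, so write it as a trigger ("use when...") and keep the procedure in the body.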

skills for difficult problems

Claude can solve some difficult problems, but it may take $500 in tokens and you may have to yell at it for reward hacking several times. Almost any time I have to intervene on a problem, once the agent has solved it I ask what was missing that prevented it from solving the problem on its own. Sometimes it’s something silly, but sometimes it’s something really enlightening, and I have Claude write a skill to fill the gap.

conclusion

I edited the original benchmark to generate skills my way, and the results were as I suspected: agents passed the tests once they had appropriate skills. I don’t have the money to fully verify this result, but the first pass was good enough to make me happy. I think this approach essentially doubles the amount of data the benchmark needs, so I guess that’s why the authors didn’t include it.

Remember, there are two reasons to build a skill: to remember how a new problem was solved, and to avoid repetition. If you open a fresh session with your agent and ask it for a skill on X, the result probably has no value. A useful skill requires knowing something that isn’t in the fresh model, which can come from quickly explaining a common process yourself, compiling knowledge gained from a difficult problem, or even going off and doing your own research on something new.

Happy hacking.


