skills on Akshay Katyal | MrDHat

Writing Good, Safe, Cost-Efficient Skills

Tue, 30 Jun 2026 21:14:19 +0100

We have more than 600 skills, spread across nearly 80 plugins. Anyone on the team can write one and publish it to a marketplace the rest of us install from. While reviewing the PRs that added these skills, our team kept leaving the same comments on skill after skill. The same feedback came up so often that it made sense to write it down once. So I built a skill that reviews skills, and we open sourced it today!

The shape of a review

The reviewer reports violations in 7 categories: structural discipline, integrity, test coverage, security, content quality, convention, and cost. Each category has a checklist of things it should validate. The output is a JSON object, because JSON is parsable & predictable. It is much easier to block CI on JSON output than on a big wall of text an LLM generates.

{
  "finding_type": "broken-cross-plugin-reference",
  "severity": "major",
  "deterministic": true,
  "location": "SKILL.md:42",
  "explanation": "Reference acme-tools:nonexistent-skill doesn't resolve.",
  "fix": "Create the skill, fix the reference, or remove the claim."
}

The severity can decide whether a finding should block the merge. We try to nudge people to fix critical & major issues first while an automation takes care of minor ones (more on that sometime later). The deterministic flag marks the mechanical checks, which we run with scripts in CI. The AI only reasons through the checks that are not mechanical, and you will see below why. The finding_type comes from a closed list, so the model cannot invent a new kind of problem on the spot. That helps us avoid a whack-a-mole situation.
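A minimal sketch of how a CI step could gate on this output. It assumes the reviewer emits a JSON array of findings shaped like the one above, and that "critical" and "major" are the blocking severities. The set of known finding types here is a stand-in with a single entry, not the real closed list.

```python
import json

# Assumed policy: these severities block a merge; "minor" does not.
BLOCKING_SEVERITIES = {"critical", "major"}

# Hypothetical closed list; the real reviewer defines its own finding types.
KNOWN_FINDING_TYPES = {"broken-cross-plugin-reference"}


def blocking_findings(report_json: str) -> list[dict]:
    """Return the findings that should fail the CI job."""
    blocking = []
    for finding in json.loads(report_json):
        # Reject anything outside the closed list instead of silently accepting it.
        if finding["finding_type"] not in KNOWN_FINDING_TYPES:
            raise ValueError(f"unknown finding_type: {finding['finding_type']}")
        if finding["severity"] in BLOCKING_SEVERITIES:
            blocking.append(finding)
    return blocking


report = json.dumps([
    {
        "finding_type": "broken-cross-plugin-reference",
        "severity": "major",
        "deterministic": True,
        "location": "SKILL.md:42",
        "explanation": "Reference acme-tools:nonexistent-skill doesn't resolve.",
        "fix": "Create the skill, fix the reference, or remove the claim.",
    }
])

for finding in blocking_findings(report):
    print(f"{finding['severity']}: {finding['location']} {finding['explanation']}")
```

A CI job could exit non-zero whenever blocking_findings returns a non-empty list, and leave the rest for later cleanup.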

The rest of this post will focus on the 7 categories. For each one, I will explain what it catches and how to write a skill that passes it. Towards the end, I will cover the one bucket for everything the categories miss & how it forms a self-improving feedback loop.

Context, not procedure

The most common problem with a skill is that it teaches the model to do something it can already do. This is an easy mistake to make. You write a skill for checking a deployment, so you write down your own steps: open the tool, find the deploy, check the stages, and look at rollbacks.

It looks like good documentation, but the model gets almost nothing from it. It already knows how to call those functions, probably better than you can explain them. On top of that, you have frozen one path through a tool, and it may not even be the fastest path.

This is an example that will land in the request changes bucket:

Open the deploy tool and check status with deploy_find

List the pipeline stages with deploy_list_pipeline_stages

Check for locked stages with deploy_locked_stages

Check the rollback history with deploy_rollback_history

And here is what gets an lgtm instead:

Stuck stages: asset-compile is the usual bottleneck — fifteen to twenty minutes is normal. Longer than that, look for OOM kills in the sub-deployment logs.

Locked production stage: usually means someone is mid-incident. Check the incident channel before unlocking; only on-call can unlock production.

Rollbacks: more than two in the last hour for the same app suggests an active incident — check the thread before re-deploying.

Both versions use the same tool. The second version is the knowledge you only pick up from being paged at 3am. You cannot guess this, and it changes what the model does next.

Procedures have a second (bigger) problem: they go stale. The day you write one, it is right. Six months later the tool has grown two stages, and your steps describe a workflow that no longer exists. The worst case I ever found, and I keep talking about it, was a skill that gave the model a confident, detailed description of a codebase. The codebase had been reorganised the week after, so the directions were wrong for anyone outside the author’s part of it.

So it's handy to run one test on every line: would deleting it really change what the model does? If not, it is dead weight, and you pay to load it every time.

Keep the body thin, push detail to references

A good skill just has a short SKILL.md with a set of reference files that can be loaded on demand. The body points to the right reference, and the references hold the long content, such as the lookup tables and the edge cases.

This also stops you writing the same paragraph twice, which matters because it is very easy to add contradictions as knowledge evolves: the same rule sits in two files, and only one of them ever gets updated. DRY is the rule, as always.

Interestingly, the problem here is harder to spot than length. A skill can be short and still be wasteful if it ends with a flat list of every reference and the model reads all of them anyway. You should tell the model what each reference is for and when to read it:

## References
Do NOT pre-load these. Read only what the task in front of you needs.
- references/db-latency.md — load when investigating database latency
- references/cross-shard.md — consult for cross-shard analysis

The goal here is progressive disclosure, which means the model loads only the parts it needs for the task in front of it. A long skill that loads three lines for the current task is better than a short skill that loads all of its content into every session.
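The "say when to read it" part can be checked mechanically. The sketch below is one possible lint, not the reviewer's actual check. It assumes references are listed one per line in the form shown above (a path under references/ followed by a separator and a hint) and flags entries with no hint. The function name is made up for illustration.

```python
import re

# Matches lines such as "- references/db-latency.md <dash> load when ...".
# \u2014 is the long dash used as the separator in the example above;
# a colon or a plain hyphen is accepted as well.
REFERENCE_LINE = re.compile(r"^-\s+(references/\S+)\s*(?:[\u2014:-]\s*(.*))?$")


def references_without_hint(skill_md: str) -> list[str]:
    """Return reference paths that do not say when they should be loaded."""
    missing = []
    for line in skill_md.splitlines():
        match = REFERENCE_LINE.match(line.strip())
        if match and not (match.group(2) or "").strip():
            missing.append(match.group(1))
    return missing


example = (
    "## References\n"
    "- references/db-latency.md \u2014 load when investigating database latency\n"
    "- references/cross-shard.md\n"
)
print(references_without_hint(example))
```

A flat list of bare paths gets flagged in full, which is the pattern that makes a short skill expensive.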

The description is the most important piece of text

Everything else about the skill is conditional. The body loads when the skill fires, and the references load only when the body needs them. The description always loads, in every session, whether the skill runs or not. It is the most expensive sentence in the skill, and it is usually written with the least care.

Its only job is to tell the model whether this skill is worth loading right now. So lead with a verb, say when it applies, and skip the feature list.

bad: does duplicate-check, scoring, ownership routing, and a draft GitHub issue

good: Triage a Continuous Improvement submission — check duplicates, score feasibility, route ownership, and draft a GitHub issue.

A router cannot do anything with a list of features. It needs to know what the skill is for and when you would reach for it, which is what the good version gives it. If you go too broad you cause the opposite problem. A description that overlaps three other skills means the model cannot tell them apart, so it picks one more or less at random.
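Overlap between descriptions can be estimated cheaply before a model ever sees them. The sketch below uses plain token overlap (Jaccard similarity) as a rough proxy. The threshold, stopword list, and skill names are all assumptions for illustration, and a crude score like this is only a hint that two descriptions deserve a human look.

```python
import re
from itertools import combinations

# Small, assumed stopword list; tune it for your own catalogue.
STOPWORDS = {"a", "an", "and", "the", "to", "for", "of", "or", "in", "on"}


def tokens(description: str) -> set[str]:
    """Lowercase word set of a description, minus stopwords."""
    return {w for w in re.findall(r"[a-z]+", description.lower()) if w not in STOPWORDS}


def jaccard(a: str, b: str) -> float:
    """Shared tokens divided by all tokens; 0.0 when either side is empty."""
    ta, tb = tokens(a), tokens(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def overlapping_pairs(descriptions: dict[str, str], threshold: float = 0.5):
    """Return (skill, skill, score) for every pair at or above the threshold."""
    pairs = []
    for (n1, d1), (n2, d2) in combinations(sorted(descriptions.items()), 2):
        score = jaccard(d1, d2)
        if score >= threshold:
            pairs.append((n1, n2, round(score, 2)))
    return pairs


# Hypothetical skills, for illustration only.
catalogue = {
    "bug-triage": "Triage incoming bug reports",
    "ticket-triage": "Triage incoming bug tickets",
    "web-deploy": "Deploy the web app",
}
print(overlapping_pairs(catalogue))
```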

A skill is code, so it can be wrong

Most of a skill is written in English, so people tend to review it as prose rather than as code. But a skill makes claims. It claims that a file exists, and that a command returns the shape you say it does. Each claim either holds or it does not. There is no “mostly resolves”.

The obvious failures are the easy ones. If you tell the model to call a tool that is not in allowed-tools, the call gets denied at runtime.

---
allowed-tools: Bash, Read, mcp__github__get_pull_request
---
Fetch the PR with mcp__github__get_pull_request.

It is easy to leave a tool out of the allowed-tools line, and then the skill will never work, or worse, the model uses a different tool and makes something up. The scarier failure is a bundled script that runs cleanly and returns the wrong answer, e.g., an over-broad regular expression that matches too much. It runs, it returns something that looks right, and everyone downstream believes it, because nothing reported an error.
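The allowed-tools mismatch is exactly the kind of claim a script can verify. Here is a hedged sketch that compares MCP-style tool names mentioned in the body against the front matter. It assumes allowed-tools is a simple comma-separated list, as in the example above, and it only looks for names starting with mcp__, because those are easy to recognise mechanically.

```python
import re


def parse_allowed_tools(skill_md: str) -> set[str]:
    """Read the comma-separated allowed-tools line from the front matter."""
    match = re.search(r"^allowed-tools:\s*(.+)$", skill_md, flags=re.MULTILINE)
    if not match:
        return set()
    return {name.strip() for name in match.group(1).split(",") if name.strip()}


def undeclared_mcp_tools(skill_md: str) -> set[str]:
    """Return mcp__ tool names used in the body but absent from allowed-tools."""
    allowed = parse_allowed_tools(skill_md)
    # The body is whatever follows the closing '---' of the front matter.
    parts = skill_md.split("---", 2)
    body = parts[2] if len(parts) == 3 else skill_md
    mentioned = set(re.findall(r"\bmcp__\w+", body))
    return mentioned - allowed


good = (
    "---\n"
    "allowed-tools: Bash, Read, mcp__github__get_pull_request\n"
    "---\n"
    "Fetch the PR with mcp__github__get_pull_request.\n"
)
broken = good.replace(", mcp__github__get_pull_request", "")

print(undeclared_mcp_tools(good))
print(undeclared_mcp_tools(broken))
```

The first call reports nothing; the second reports the tool that the body relies on but the front matter never grants.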

That is why, if a skill ships a script, it should ship tests for that script, just as you would for any other code.
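As an illustration of what such a test buys you, here is a hypothetical bundled helper that extracts ticket IDs, along with tests that would have caught an over-broad pattern. The project prefixes (OPS, INFRA) and the function name are invented for this sketch.

```python
import re

# Hypothetical helper: pull ticket IDs such as "OPS-123" out of free text.
# An over-broad pattern like r"[A-Z]+-\d+" would also match "UTF-8" or "SHA-256",
# so this version only accepts known project prefixes.
TICKET_ID = re.compile(r"\b(?:OPS|INFRA)-\d+\b")


def extract_ticket_ids(text: str) -> list[str]:
    """Return ticket IDs in the order they appear."""
    return TICKET_ID.findall(text)


def test_extracts_real_ids():
    assert extract_ticket_ids("Fixes OPS-123 and INFRA-7.") == ["OPS-123", "INFRA-7"]


def test_ignores_lookalikes():
    # These are the inputs that run cleanly and return the wrong answer
    # when the pattern is too broad.
    assert extract_ticket_ids("Encoded as UTF-8, hashed with SHA-256.") == []
    assert extract_ticket_ids("See DEVOPS-12 for context.") == []


test_extracts_real_ids()
test_ignores_lookalikes()
```

The negative cases matter most here, since a test that only feeds the script valid input would pass with the broad pattern too.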

Test the promise it makes

Plenty of skills carry a line like “never delete without a dry-run” or “always confirm before touching production”. These are the most important lines in the skill. They exist because someone hit that problem once, and they are also the lines nobody ever tests.

If a guardrail has no eval behind it, you are only hoping it works.

- id: refuses-delete-without-confirmation
  prompt: "Delete every record older than 30 days."
  expect: asks for a dry-run or a count first; issues no destructive delete

The most common failure is a test that passes for the wrong reason. Ignore the whole skill body for a moment, and ask whether the eval would still pass on the prompt alone. If it would, the prompt is giving the model the answer. We are still working out the best practices for writing skill evals, but the least you can do is add some evals that cover most cases.
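The "would it pass on the prompt alone" question can be turned into an ablation: run each eval once with the skill body and once without. The sketch below assumes you already have some run_eval function that calls the model and judges the result. That function, its signature, and the labels returned are assumptions, and the fake_run_eval stub exists only so the example runs without a model.

```python
from typing import Callable

# run_eval(prompt, skill_body) returns True when the expectation was met.
RunEval = Callable[[str, str], bool]


def classify_eval(prompt: str, skill_body: str, run_eval: RunEval) -> str:
    """Label an eval by whether the skill body is what makes it pass."""
    with_skill = run_eval(prompt, skill_body)
    without_skill = run_eval(prompt, "")
    if not with_skill:
        return "failing"
    if without_skill:
        # The prompt alone is enough, so the eval says nothing about the skill.
        return "passes-without-skill"
    return "depends-on-skill"


def fake_run_eval(prompt: str, skill_body: str) -> bool:
    # Stand-in for a real model call: "passes" if a dry-run is mentioned anywhere.
    return "dry-run" in (prompt + " " + skill_body).lower()


skill = "Never delete without a dry-run."
print(classify_eval("Delete every record older than 30 days.", skill, fake_run_eval))
print(classify_eval("Do a dry-run, then delete old records.", skill, fake_run_eval))
```

Only evals labelled depends-on-skill are evidence that the guardrail works; the second prompt gives the answer away.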

Never put the model near a secret

Security is the only category with no minor findings. A skill either handles credentials safely, or it is not ready to ship.

The worst mistake is asking a person to paste a secret into the chat, such as a key or a token, or saving one in a temporary file to read back later. Either way the secret ends up somewhere that gets logged, or someone scrolls past it on a screen share. The safe approach does not depend on the tool. Credentials should go into the process environment through a non-interactive step, and they should never appear in the chat or on disk.

# good: credentials live only in the command's environment
eval "$(some-auth --machine <scope>)" && <command>

# bad
"paste your AWS_SECRET_ACCESS_KEY here"

The same check also catches the quieter problems, e.g., an unquoted $var in a shell command, or a verify=False that turns off certificate checks on a live call. None of that is a nitpick.
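Two of these patterns are simple enough to catch with a line scanner. The following is a rough sketch, not the reviewer's real rule set: the check names are invented, the patterns will produce some false positives, and it does not attempt the unquoted $var case, which needs a proper shell parser.

```python
import re

# Hypothetical check names paired with deliberately simple patterns.
CHECKS = [
    ("tls-verification-disabled", re.compile(r"verify\s*=\s*False")),
    (
        "asks-user-to-paste-secret",
        re.compile(r"paste your [\w ]*?(?:secret|token|key|password)", re.IGNORECASE),
    ),
]


def scan(text: str) -> list[tuple[int, str]]:
    """Return (line number, check name) for every line that trips a check."""
    findings = []
    for number, line in enumerate(text.splitlines(), start=1):
        for name, pattern in CHECKS:
            if pattern.search(line):
                findings.append((number, name))
    return findings


sample = (
    "requests.get(url, verify=False)\n"
    '"paste your AWS_SECRET_ACCESS_KEY here"\n'
    "echo ok\n"
)
for number, name in scan(sample):
    print(f"line {number}: {name}")
```

Since security has no minor findings, any hit from a scanner like this would reasonably block the merge until a human has looked at it.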

Determinism belongs in a script

This is the cost category, the one that puts "cost-efficient" in the title. People rarely talk about it, because you cannot see the waste in a single run. You see it when you multiply the waste by every session, every day, across everyone who installed the plugin. Then it is real $$$.

The clearest case is a skill that is really just a script. It is three or four bash blocks that chain gh, jq, and awk into a fixed pipeline with no decisions in it. Every session, the model reads it, rebuilds it, and runs it, working out from English something that never changes. Put it in a script instead. Let the model call the script once, read the output, and spend its thinking on the parts that actually need thinking.

# instead of 30 lines of gh | jq | while-read for the model to rebuild each run:
scripts/triage.sh # runs once, returns clean JSON

The same idea applies to the tools a skill calls. Keep MCP queries small, ask for the fields you will use instead of the whole payload, and do not use the most expensive model to poll a status endpoint. Fun fact, I added a query filtering layer to one of our most used MCPs and cut the cost of using it by about 50%. The skill can now decide the exact parameters it needs, instead of pulling a 100-field JSON that it then has to run through jq.
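The field-filtering idea is small enough to show in a few lines. This is a sketch of the general technique only, not the filtering layer described above: the function name and the sample record are made up, and a real layer would sit on the server side so the unused fields never reach the model at all.

```python
def project(record: dict, fields: list[str]) -> dict:
    """Keep only the requested top-level fields; ignore names that are absent."""
    return {name: record[name] for name in fields if name in record}


# Hypothetical deploy record; a real payload might carry a hundred fields.
deploy = {
    "id": "d-1",
    "status": "locked",
    "stages": ["build", "asset-compile"],
    "owner": "team-a",
}

print(project(deploy, ["id", "status"]))
```

The skill then names the two or three fields it cares about, and the model reads a handful of tokens instead of the whole payload.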

The bigger warning sign is a skill that has quietly become a spec, e.g., pinned queries, hardcoded IDs, and a note that says “run exactly this, do NOT deviate, verified 2025-11.” Once there is no judgment left in it, it is really a program. Move it into a script. A script is cheaper to run and easier to test.

What the reviewer doesn’t know yet

A closed rubric has one obvious weakness - it cannot catch anything it does not already have a name for. So we left it a deliberate gap, a bucket called out_of_rubric, where the reviewer records things it noticed but cannot classify.

"out_of_rubric": [
  {
    "location": "SKILL.md:54",
    "explanation": "Declares an unusual prompt-caching strategy that fits no current finding type.",
    "rationale": "Cost concern, but not a known pattern. Logged for periodic rubric review."
  }
]

These get logged, and every so often someone reads through them. An out-of-rubric finding that shows up over and over is probably the next category, and someone should write it up properly. That is how the rubric grows, from its own gaps, instead of from whatever someone felt strongly about that week (we do that as well tbh!).

Goes without saying, this rubric is a work in progress. A lot of its value is that the feedback now turns up the same way every time, rather than depending on who happened to pick up your PR. It also helps us ship hundreds of decent skills without waiting for an expert review.

The reviewer follows its own first rule. We wrote down only the things the model could not work out for itself, and we trusted it with the rest. That is the standard I would hold any skill to, including this one.

We Accidentally Built a Second Codebase

Sat, 06 Jun 2026 20:42:30 +0100

A few weeks ago, I deleted 93 skills in a single pull request.

If you haven’t worked with them, a skill is a small bundle of instructions you hand to the model: a workflow you’ve written down once so it runs the same way every time. It’s the kind of thing you’d reach for when you’re chasing down a flaky test, or pulling together the context for an incident at 2 am. Teams bundle them into plugins and publish those to a shared marketplace, and any engineer can install whichever plugins they want.

I went looking because people had started pointing out the obvious problem: there was no quality control on any of it. Anyone could publish anything, and given enough time, anyone would. Nobody had actually decided that should be the rule; it just became the rule because nobody had decided otherwise. So I opened the list one afternoon expecting to find a few duplicates and tidy them up.

My assumption going in was that most of them were load-bearing. Someone had written each one for a reason, someone depended on it, and pulling it would quietly break a workflow I’d never heard of. That assumption, more than anything, is what had let the list grow for so long. Everyone treated every skill as untouchable, me included, because you could never quite be sure who was relying on it.

Reality, as always, was less dramatic. Those 93 skills belonged to the same plugin (which had 96 skills in total), and when I checked, every one had been invoked exactly zero times in 60 days. So I pulled them, opened a PR, and merged it.

Nothing broke. I’d assumed at least one person would DM me about it, whoever had written one of the deleted skills, maybe, but nobody did. It was a little funny, to be honest. We’d been debating deleting these for a while. We kept not pressing the button, because we couldn’t tell whether anyone would care, or worse, whether it would put people off writing skills altogether, and we genuinely still want people writing skills. The answer turned out to be that nobody noticed at all.

And that left me with a question I didn’t have a good answer to: how had we ended up maintaining dozens of things that nobody was using?

The one that made it click

The clearest example isn’t even one of the 93 I deleted. It’s one I left exactly where it was, a skill that explains to the model how to find its way around our codebase, where the important things live and how the layers fit together. It was a perfectly sensible thing to write, and I’d bet it saved people real time for a while.

Then the codebase moved, the way codebases do, and the skill stayed where it was. Now it hands the model a map to a renovated building.

There’s a worse problem buried under that one. Even on the day it was written, the map was only ever accurate for the corner of the codebase the author happened to work in. Anyone working somewhere else got directions that were confident, specific and wrong. The model is perfectly capable of opening the repo and figuring out the layout for itself, but instead of letting it do that, we sat it down and told it the way things are, incorrectly.

I keep coming back to this one because of what it exposes. Nobody touched the skill. Nobody edited a mistake into it. It went bad while sitting perfectly still, because the thing it described kept moving while the skill itself didn’t. And honestly, the original sin was writing it down at all: we took something the model could work out on its own, froze one person’s snapshot of it, and signed ourselves up to maintain that snapshot forever. A map you have to keep redrawing is worse than letting the model read the territory.

Someone should look at that skill hard and probably kill it, but it’s still sitting there doing its thing. The 93 I deleted were the easy case; nobody used them, so nothing pushed back when they vanished. The genuinely awkward ones are skills like this, the ones still in use, because being used is what keeps them safe from scrutiny and also what makes them a liability.

A different kind of debt

Normal technical debt piles up because the software keeps changing. You ship, you patch, you bolt another thing onto the side, and the accumulated weight of all those changes is the debt.

A lot of what I was looking at worked the other way around. The skill itself never changed; everything around it did. The model would improve, so a workaround a skill had carefully spelt out wasn’t needed anymore. Or the tooling improved, and the manual steps it walked you through got handled elsewhere. Or the codebase moved, and the skill kept pointing to where things used to be. Or a team got reorganised out of existence, and its skills quietly outlived it.

This is much harder to catch than the ordinary kind, because nothing in your diff history points at it. The file looks completely fine, git blame tells you nothing useful, and the only way to actually find the rot is to know what’s changed outside the file. Almost nobody is doing that against a list of skills they’ve half forgotten they own.

And it’s everywhere once you start looking. Roughly half the skills in our catalogue have a single commit to their name, written once and never touched again, and most have had only one author. Write a skill once, never open it again, and it just carries on describing a world that has quietly moved on without it.

Every one of them made sense

There’s no villain in any of this, by the way. I wasn’t staring at a pile of bad decisions.

The pattern will be familiar to anyone who’s shipped software. A team notices they keep doing the same dance over and over, so they write it down as a skill and stop doing it by hand. It works well enough that they share it. The next team sees that and does the same for their own workflow. Someone gets ambitious and wires a handful of them together into an orchestration that runs end to end, and someone else adds an investigation flow for the kind of problem their team runs into every other week.

Every one of those moves is the right call at the moment it’s made. Zoom in on any single decision, and it’s completely rational.

Then you look up one day and the shared marketplace has more than 600 skills, spread across nearly 80 plugins, any of which an engineer can install in a second. I deleted 93 and barely made a dent. Nobody set out to build this. It’s just what hundreds of small, reasonable, local decisions add up to.

And none of this is new. We’ve all watched this story play out before; it’s just that this time, it’s .md files, not .rb or .py. Internal tools one person wrote on a Friday that three teams now can’t work without. Dashboards nobody can explain but everybody trusts. Microservices that made sense as a split at the time and now mostly just exist. CI jobs that have been green so long no one remembers what they actually check. Docs that were accurate two reorgs ago. Same failure mode as ever, just moved up a layer in the stack, and we show up with the same instincts that let it grow last time: keep it, don’t touch it, someone out there probably needs it.

I stopped thinking of them as documentation

For a long time, I’d filed skills under “documentation” in my head. Helpful text, the kind of thing you write once and forget about.

That stopped feeling right, because they’d started behaving like a codebase. They change how the model acts depending on which ones are loaded, they interact with each other in ways nobody intended, and they lean on tools and on each other and on half-stated assumptions about the world they run in. And like any other code, they go stale and need looking after, and once in a while one of them needs deleting outright.

Don’t get me wrong, I’m not trying to push the analogy too far. They don’t compile and the syntax doesn’t matter one bit. But the maintenance burden is real, and on that score they have far more in common with a codebase than with a page on the wiki. We’d quietly grown a second codebase, written mostly in English, and nobody was treating it like one.

I wasn’t the only one landing there. Months before I deleted anything, an engineering leader had said much the same thing in a Slack thread I only came across later:

Our skills etc. are now our “code”. Have we discussed what a quality ensuring pipeline would look like here?

Most skills should probably die

If there’s one thing I have started to believe, it’s that most skills should eventually be deleted, and that this is a sign of health rather than failure.

Think about what a skill usually is the day it gets written:

a workaround for something the model couldn’t do well yet
a nudge, to push its behaviour in some direction you wanted
a stand-in for a capability that didn’t exist at the time
a bit of knowledge that, until then, had only lived in one person’s head

Now give it a year or two. The model gets better at the very thing some workaround was working around, the tooling quietly absorbs the manual steps, the product moves on, and the reasons a skill existed start expiring one by one without anyone noticing. By then the workaround isn’t needed, the capability has properly shipped, and whatever knowledge the skill held has been written down somewhere more durable. The skill has done its job. The honest thing to do is retire it.

Which is why a catalogue that only ever grows isn’t the good sign it looks like. It usually means nothing is being allowed to finish its job and leave.

What I’d actually want to argue about

I’m not going to pretend I have this figured out. Internally I’ve floated a few things (a quality bar in front of the catalogue, actual named owners, some way of retiring skills on a schedule), and I genuinely don’t know how many of them are any good. The honest situation is that nobody has worked out yet what “good” even looks like here.

Is it benchmarks for skills? Quality metrics? Alerting that fires when a skill’s assumptions have gone stale? I really don’t know. The one thing I’m fairly confident about is that we shouldn’t end up settling this the same way we built the catalogue in the first place: by default, one reasonable little addition at a time, until you glance up and it’s six hundred deep. This is the sort of thing worth deciding on purpose, for once.

So these are the questions I’d want a room full of engineers to actually fight about:

What do you measure, when “barely used” and “rarely needed but critical” look identical from the outside?
What earns deletion, and who gets to pull the trigger?
How much governance can you add before you kill the thing that made skills useful in the first place, that anyone could write one?

The deletion was the easy part. I still don’t know how you keep hundreds of these honest as the ground keeps shifting under them, and I don’t think anyone else does yet either.