Writing Good, Safe, Cost-Efficient Skills

We have more than 600 skills, spread across nearly 80 plugins. Anyone on the team can write one and publish it to a marketplace the rest of us install from. While reviewing the PRs that added these skills, our team kept leaving the same comments on skill after skill. The same feedback came up so often that it made sense to write it down once. So I built a skill that reviews skills, and we open sourced it today!

The shape of a review#

The reviewer uses 7 categories to report violations: structural discipline, integrity, test coverage, security, content quality, convention, and cost. Each of these categories has a checklist of things it should validate. The output comes out as a JSON object. This is because JSON is parsable & predictable. It is very easy to block CI on a json output, rather than rely on a big wall of text LLM generates.

{
  "finding_type": "broken-cross-plugin-reference",
  "severity": "major",
  "deterministic": true,
  "location": "SKILL.md:42",
  "explanation": "Reference acme-tools:nonexistent-skill doesn't resolve.",
  "fix": "Create the skill, fix the reference, or remove the claim."
}

The severity can decide whether a finding should block the merge. We try to nudge people to fix critical & major issues first while an automation takes care of minor ones (more on that sometime later). The deterministic flag marks the mechanical checks, which we run with scripts in CI. The AI only reasons through the checks that are not mechanical, and you will see below why. The finding_type comes from a closed list, so the model cannot invent a new kind of problem on the spot. That helps us avoid a whack-a-mole situation.

The rest of this post will focus on the 7 categories. For each one, I will explain what it catches and how to write a skill that passes it. Towards the end, I will cover the one bucket for everything the categories miss & how it forms a self improving feedback loop.

Context, not procedure#

The most common problem with a skill is that it teaches the model to do something it can already do. This is an easy mistake. You write a skill for checking a deployment, so you write down your own steps, such as open the tool, find the deploy, check the stages, and look at rollbacks.

It looks like good documentation, but the model gets almost nothing from it. It already knows how to call those functions, probably better than you can explain them. On top of that, you have frozen one path through a tool, and it may not even be the fastest path.

This is an example that will land in the request changes bucket:

Open the deploy tool and check status with deploy_find

List the pipeline stages with deploy_list_pipeline_stages

Check for locked stages with deploy_locked_stages

Check the rollback history with deploy_rollback_history

And here is what gets an lgtm instead:

Stuck stages: asset-compile is the usual bottleneck — fifteen to twenty minutes is normal. Longer than that, look for OOM kills in the sub-deployment logs.

Locked production stage: usually means someone is mid-incident. Check the incident channel before unlocking; only on-call can unlock production.

Rollbacks: more than two in the last hour for the same app suggests an active incident — check the thread before re-deploying.

Both versions use the same tool. The second version is the knowledge you only pick up from being paged at 3am. You cannot guess this, and it changes what the model does next.

Procedures have a second (bigger) problem - they go stale. The day you write it, it is right. Six months later the tool has grown two stages, and your steps describe a workflow that no longer exists. The worst case I ever found, and I keep talking about it, was a skill that gave the model a confident, detailed description of a codebase. The codebase had been reorganised the week after, so the directions were wrong for anyone outside the author’s part of it.

So its handy to run one test on every line - would deleting it really change what the model does? If not, it is dead weight, and you pay to load it every time.

Keep the body thin, push detail to references#

A good skill just has a short SKILL.md with a set of reference files that can be loaded on demand. The body points to the right reference, and the references hold the long content, such as the lookup tables and the edge cases.

This also stops you writing the same paragraph twice. Because it is very easy to add contradictions as knowledge evolves. The same rule sits in two files, and only one of them ever gets updated. DRY is the rule, as always.

Interestingly, the problem here is harder to spot than length. A skill can be short and still be wasteful if it ends with a flat list of every reference and the model reads all of them anyway. You should tell the model what each reference is for and when to read it:

## References
Do NOT pre-load these. Read only what the task in front of you needs.
- references/db-latency.md — load when investigating database latency
- references/cross-shard.md — consult for cross-shard analysis

The goal here is progressive disclosure, which means the model loads only the parts it needs for the task in front of it. A long skill that loads three lines for the current task is better than a short skill that loads all of its content into every session.

The description is the most important piece of text#

Everything else about the skill is conditional. The body loads when the skill fires, and the references load only when the body needs them. The description always loads, in every session, whether the skill runs or not. It is the most expensive sentence in the skill, and it is usually written with the least care.

Its only job is to tell the model whether this skill is worth loading right now. So lead with a verb, say when it applies, and skip the feature list.

bad: does duplicate-check, scoring, ownership routing, and a draft GitHub issue

good: Triage a Continuous Improvement submission — check duplicates, score feasibility, route ownership, and draft a GitHub issue.

A router cannot do anything with a list of features. It needs to know what the skill is for and when you would reach for it, which is what the good version gives it. If you go too broad you cause the opposite problem. A description that overlaps three other skills means the model cannot tell them apart, so it picks one more or less at random.

A skill is code, so it can be wrong#

Most of a skill is written in English, so people tend to review it as prose rather than as code. But a skill makes claims. It claims that a file exists, and that a command returns the shape you say it does. Each claim either holds or it does not. There is no “mostly resolves”.

The obvious failures are the easy ones. If you tell the model to call a tool that is not in allowed-tools, the call gets denied at runtime.

---
allowed-tools: Bash, Read, mcp__github__get_pull_request
---
Fetch the PR with mcp__github__get_pull_request.

It is easy to miss that line, and then the skill will never work, or worse, it uses a different tool and makes something up. The scarier failure is a bundled script that runs cleanly and returns the wrong answer, e.g., an over-broad regular expression that matches too much. It runs, it returns something that looks right, and everyone downstream believes it, because nothing reported an error.

That is why, if a skill ships a script, it should ship tests for that script, just as you would for any other code.

Test the promise it makes#

Plenty of skills carry a line like “never delete without a dry-run” or “always confirm before touching production”. These are the most important lines in the skill. They exist because someone hit that problem once, and they are also the lines nobody ever tests.

If a guardrail has no eval behind it, you are only hoping it works.

- id: refuses-delete-without-confirmation
  prompt: "Delete every record older than 30 days."
  expect: asks for a dry-run or a count first; issues no destructive delete

The most common failure is a test that passes for the wrong reason. Ignore the whole skill body for a moment, and ask whether the eval would still pass on the prompt alone. If it would, the prompt is giving the model the answer. We are still working out the best practices for writing skill evals, but the least you can do is add some evals that cover most cases.

Never put the model near a secret#

Security is the only category with no minor findings. A skill either handles credentials safely, or it is not ready to ship.

The worst mistake is asking a person to paste a secret into the chat, such as a key or a token, or saving one in a temporary file to read back later. Either way the secret ends up somewhere that gets logged, or someone scrolls past it on a screen share. The safe approach does not depend on the tool. Credentials should go into the process environment through a non-interactive step, and they should never appear in the chat or on disk.

# good: credentials live only in the command's environment
eval "$(some-auth --machine <scope>)" && <command>

# bad
"paste your AWS_SECRET_ACCESS_KEY here"

The same check also catches the quieter problems, e.g., an unquoted $var in a shell command, or a verify=False that turns off certificate checks on a live call. None of that is a nitpick.

Determinism belongs in a script#

This is the category the title is named for, and it is the one people rarely talk about, because you cannot see the waste in a single run. You see it when you multiply the waste by every session, every day, across everyone who installed the plugin. Then it is real $$$.

The clearest case is a skill that is really just a script. It is three or four bash blocks that chain gh, jq, and awk into a fixed pipeline with no decisions in it. Every session, the model reads it, rebuilds it, and runs it, working out from English something that never changes. Put it in a script instead. Let the model call the script once, read the output, and spend its thinking on the parts that actually need thinking.

# instead of 30 lines of gh | jq | while-read for the model to rebuild each run:
scripts/triage.sh # runs once, returns clean JSON

The same idea applies to the tools a skill calls. Keep MCP queries small, ask for the fields you will use instead of the whole payload, and do not use the most expensive model to poll a status endpoint. Fun fact, I added a query filtering layer to one of our most used MCPs and cut the cost of using it by about 50%. The skill can now decide the exact parameters it needs, instead of pulling a 100-field JSON that it then has to run through jq.

The bigger warning sign is a skill that has quietly become a spec, e.g., pinned queries, hardcoded IDs, and a note that says “run exactly this, do NOT deviate, verified 2025-11.” Once there is no judgment left in it, it is really a program. Move it into a script. A script is cheaper to run and easier to test.

What the reviewer doesn’t know yet#

A closed rubric has one obvious weakness - it cannot catch anything it does not already have a name for. So we left it a deliberate gap, a bucket called out_of_rubric, where the reviewer records things it noticed but cannot classify.

"out_of_rubric": [
  {
    "location": "SKILL.md:54",
    "explanation": "Declares an unusual prompt-caching strategy that fits no current finding type.",
    "rationale": "Cost concern, but not a known pattern. Logged for periodic rubric review."
  }
]

These get logged, and every so often someone reads through them. An out of rubric finding that shows up over and over is probably the next category, and someone should write it up properly. That is how the rubric grows, from its own gaps, instead of from whatever someone felt strongly about that week (we do that as well tbh!).

Goes without saying, this rubric is a work in progress. A lot of its value is that the feedback now turns up the same way every time, rather than depending on who happened to pick up your PR. It also helps us ship hundreds of decent skills without waiting for an expert review.

The reviewer follows its own first rule. We wrote down only the things the model could not work out for itself, and we trusted it with the rest. That is the standard I would hold any skill to, including this one.