# Writing skills

How to write SKILL.md instruction packs that trigger when needed and stay useful as the capability grows.
A skill that the agent never invokes — or invokes for the wrong job — is dead weight. This page covers the craft of writing skills that trigger reliably, use context efficiently, and stay useful as the capability evolves.
For the file format and frontmatter reference, see Skills.
## The progressive disclosure ladder

Every installed skill has three loading layers. Each layer’s budget is a hard constraint to design around.
| Layer | When loaded | Budget | What goes here |
|---|---|---|---|
| Metadata (name + description) | Always, for every conversation | ~100 tokens per skill — and every installed skill contributes | Trigger conditions only |
| SKILL.md body | On trigger, when the agent decides the skill applies | Aim under ~500 lines | Strategic guidance, decision points, pointers |
| Bundled references/, scripts/, assets/ | On demand, when the agent reads or executes them | Effectively unlimited | Reference detail, deterministic logic, templates |
The metadata budget is the one most authors miss. With dozens of skills installed, descriptions compete for the same trigger budget — bloated descriptions hide each other.
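As a concrete sketch, the three layers map onto a skill’s files like this (the skill name and file names are hypothetical, for illustration only):

```
pdf-redaction/
├── SKILL.md               # layer 2: body, loaded on trigger
│                          #   (frontmatter name + description = layer 1, always loaded)
├── references/
│   └── pdf-internals.md   # layer 3: read on demand
└── scripts/
    └── strip_metadata.py  # layer 3: executed without being read into context
```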
## Descriptions: the single most important field

The description determines whether the agent invokes the skill at all. It is read for every user turn. Treat it like a search query, not a summary.
Describe when to use it, not what it does. The agent isn’t browsing a catalog; it’s matching a user request to a tool.
| Weak | Strong |
|---|---|
| "Helps with security testing" | "Use when asked to run red team assessments against LLMs, test model safety guardrails, or evaluate prompt injection resistance" |
| "A guide for analyzing Docker registries" | "Use when running container registry security research, analyzing Docker images for leaked secrets, or mapping build infrastructure through image metadata" |
| "Capability to format reports" | "Use when finalizing a security assessment, exporting findings to PDF, or producing client-ready report markdown" |
Front-load trigger keywords. The first half of the description carries the most weight. Lead with the verbs and nouns the user is likely to type.
Cover formal and casual phrasings. “Database migration” and “update the db schema.” Users don’t write the way docs do.
Be slightly pushy. Agents tend to undertrigger. If a skill is genuinely the right move for a class of tasks, say so plainly: “Use this skill whenever the user asks for X” reads better than “may help with X-adjacent tasks.”
Keep it under ~200 characters. Every installed skill’s description sits in the same shared budget. A 400-character description pushes other skills’ triggers below the model’s attention.
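Put together, a description that follows these rules might look like the following (a hypothetical skill; see the Skills reference for the frontmatter format):

```yaml
---
name: db-migrations
description: >-
  Use when the user asks to create or review a database migration,
  update the db schema, add or drop a column or table, or write
  rollback steps for a schema change.
---
```

It leads with trigger verbs and nouns, covers both “database migration” and “db schema” phrasings, and stays well under the ~200-character budget.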
## Body structure: match the kind of work

Different jobs want different skill shapes. Forcing a checklist onto research fails, and so does forcing hypotheses onto rote process.
| Kind of work | Body shape | Agent freedom |
|---|---|---|
| Domain research (security assessment, threat modeling) | Hypotheses and approaches, each with “how to test” and “when to abandon” | High — the agent forms theories and pivots on findings |
| Tool integration (wrapping Semgrep, Nmap, a CLI) | Workflow patterns, common invocations, output interpretation | Medium — the agent follows patterns, adapts to context |
| Process automation (report generation, NDA review) | Step-by-step recipe with validation gates | Low — the agent follows the recipe |
Hybrids are fine. A security-tool integration has tool-mechanics on top and domain-research strategy underneath; reflect both.
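One way such a hybrid body might be laid out, with tool mechanics on top and research strategy underneath (the section names and file references are illustrative, not a prescribed template):

```markdown
## Running the scanner
Invoke with the baseline profile first. For flag combinations and
output interpretation, read references/flags.md.

## Forming hypotheses
- Exposed admin panel? Test with the default-credentials checklist;
  abandon if the target enforces SSO.
- Stale base image? Check the build date in the image metadata before
  investing time in CVE lookups.
```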
## Explain why, not what

The model already knows what to do for most things. What it doesn’t have is your domain context — why one approach works in a specific situation. Skills add value where they encode that context.
| Heavy-handed | Reasoned |
|---|---|
| "ALWAYS use a try-catch around database calls" | "DB calls fail on connection loss, timeouts, or constraint violations — wrap them so users see a clear message instead of a stack trace" |
| "NEVER skip the verification step" | "Skip verification only when running interactively — the verifier is what gates publish, so skipping it in CI hides real bugs" |
| "MUST run the linter before commit" | "The linter catches the same patterns reviewers flag manually; running it first cuts review cycles in half" |
Heavy MUST/ALWAYS/NEVER is a code smell. Each one constrains the model’s ability to adapt to context. Save them for genuinely invariant rules — security gates, output contracts, things that must never bend.
## What goes in the body vs. references vs. scripts

The body is loaded every time the skill triggers. Anything not needed every time should live elsewhere.
- **Body** — workflow, decision points, pointers to references and scripts.
- **References** — depth the agent reaches for selectively: domain-specific data, framework-specific instructions, long examples, edge case documentation. In your skill body, name each reference and say when to read it.
- **Scripts** — deterministic work that should produce the same output every time: validation, formatting, data transformation. Scripts are more reliable than asking the model to do mechanical work, save tokens, and work consistently across model sizes. They can be executed without being read into context.
| Use a script when | Use instructions when |
|---|---|
| Same input → same output | Output depends on context |
| Programmatically verifiable | Needs human or model judgment |
| Costs significant tokens to walk through | Token cost is negligible |
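As an illustration of script-worthy work, a deterministic validation gate for a report-generation skill might look like this (the file name and required section headings are assumptions for the sketch, not part of any real skill):

```python
# scripts/check_sections.py — hypothetical validation gate.
# Same draft in -> same verdict out, so this belongs in scripts/
# rather than being re-derived by the model on every run.

REQUIRED_SECTIONS = ("## Summary", "## Findings", "## Recommendations")

def missing_sections(text: str) -> list[str]:
    """Return the required headings absent from a report draft."""
    return [s for s in REQUIRED_SECTIONS if s not in text]

draft = "## Summary\n...\n## Findings\n..."
print(missing_sections(draft))  # -> ['## Recommendations']
```

The agent can execute this without ever reading it into context; it only sees the verdict.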
## Multi-domain organization

When one skill genuinely supports multiple variants — frameworks, cloud providers, target systems — split the variant detail into references and route from the body:
```
cloud-deploy/
├── SKILL.md        # workflow + which-reference-to-read
└── references/
    ├── aws.md
    ├── gcp.md
    └── azure.md
```

In SKILL.md, the body routes to the matching reference:

```markdown
## Provider-specific guidance

Read the matching reference based on the user's target:

- AWS / EC2 / Lambda / S3 → `references/aws.md`
- GCP / GCE / Cloud Run → `references/gcp.md`
- Azure / VMs / Functions → `references/azure.md`

Read only the file for the current target. Do not pre-load.
```

The body stays compact; the agent reads only what it needs.
## Iterating against real prompts

A skill you haven’t tested against a real prompt is a guess.
- Draft. Write a first pass. Don’t polish.
- Test with realistic prompts. Pick three things a real user would actually say — not abstract test inputs.
- Read the transcripts, not just the outputs. Intermediate steps reveal whether the skill is making the agent waste time or skip important things.
- Cut what isn’t pulling weight. If the agent ignores a section, remove it. Shorter skills are better skills.
- Sharpen at decision points. If the agent went off-track at a specific step, that step’s guidance was unclear. Add a sentence explaining why, not a paragraph of new rules.
- Bundle repeated work. If every test run independently produces the same helper script, drop it in `scripts/`. Write it once.
Complexity should decrease over iterations. If the skill grows with each round, you’re patching rather than fixing root causes.
For evaluation-driven scaling — formal datasets, scorers, the optimization loop — see the capability optimization loop.
## Common failure modes

- Description summarizes the skill instead of triggering it. “Helps with X” tells the agent what the skill is, not when to use it. Rewrite as “Use when…”.
- Body duplicates reference material. If something is in `--help` or a file the agent can read, point to it; don’t restate it. Duplicated content drifts and wastes tokens.
- Heavy MUST/ALWAYS/NEVER everywhere. Reframe each one as reasoning. The model adapts better to “X works because Y” than to “X is required.”
- One giant body for a multi-variant skill. Split into references and route from the body. The agent reads only what’s relevant.
- Skill never tested against real prompts. Run two or three realistic asks before declaring done. Read the transcripts.
- Skill grows on every iteration. Healthy iteration cuts; unhealthy iteration patches. If the body is getting longer, look for the section that should be a reference or a script.