# Writing skills

How to write SKILL.md instruction packs that trigger when needed and stay useful as the capability grows.
A skill that the agent never invokes — or invokes for the wrong job — is dead weight. This page covers the craft of writing skills that trigger reliably, use context efficiently, and stay useful as the capability evolves.
For the file format and frontmatter reference, see Skills.
## The progressive disclosure ladder

Every installed skill has three loading layers. Each layer’s budget is a hard constraint to design around.
| Layer | When loaded | Budget | What goes here |
|---|---|---|---|
| Metadata (name + description) | Always, for every conversation | ~100 tokens per skill — and every installed skill contributes | Trigger conditions only |
| SKILL.md body | On trigger, when the agent decides the skill applies | Aim under ~500 lines | Strategic guidance, decision points, pointers |
| Bundled references/, scripts/, assets/ | On demand, when the agent reads or executes them | Effectively unlimited | Reference detail, deterministic logic, templates |
The metadata budget is the one most authors miss. With dozens of skills installed, descriptions compete for the same trigger budget — bloated descriptions hide each other.
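As a concrete sketch, the three layers map onto a skill’s files like this (the skill name and file names are hypothetical, for illustration only):

```
pdf-redaction/
├── SKILL.md               # layer 2: body, loaded on trigger
│                          #   (frontmatter name + description = layer 1, always loaded)
├── references/
│   └── pdf-internals.md   # layer 3: read on demand
└── scripts/
    └── strip_metadata.py  # layer 3: executed without being read into context
```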
## Descriptions: the single most important field

The description determines whether the agent invokes the skill at all. It is read for every user turn. Treat it like a search query, not a summary.
Describe when to use it, not what it does. The agent isn’t browsing a catalog; it’s matching a user request to a tool.
| Weak | Strong |
|---|---|
| "Helps with security testing" | "Use when asked to run red team assessments against LLMs, test model safety guardrails, or evaluate prompt injection resistance" |
| "A guide for analyzing Docker registries" | "Use when running container registry security research, analyzing Docker images for leaked secrets, or mapping build infrastructure through image metadata" |
| "Capability to format reports" | "Use when finalizing a security assessment, exporting findings to PDF, or producing client-ready report markdown" |
Front-load trigger keywords. The first half of the description carries the most weight. Lead with the verbs and nouns the user is likely to type.
Cover formal and casual phrasings. “Database migration” and “update the db schema.” Users don’t write the way docs do.
Be slightly pushy. Agents tend to undertrigger. If a skill is genuinely the right move for a class of tasks, say so plainly: “Use this skill whenever the user asks for X” reads better than “may help with X-adjacent tasks.”
Keep it under ~200 characters. Every installed skill’s description sits in the same shared budget. A 400-character description pushes other skills’ triggers below the model’s attention.
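Put together, a description that follows these rules might look like the following (a hypothetical skill; see the Skills reference for the frontmatter format):

```yaml
---
name: db-migrations
description: >-
  Use when the user asks to create or review a database migration,
  update the db schema, add or drop a column or table, or write
  rollback steps for a schema change.
---
```

It leads with trigger verbs and nouns, covers both “database migration” and “db schema” phrasings, and stays well under the ~200-character budget.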
## Body structure: match the kind of work

Different jobs want different skill shapes. Forcing a checklist onto research fails, and so does forcing hypotheses onto rote process.
| Kind of work | Body shape | Agent freedom |
|---|---|---|
| Domain research (security assessment, threat modeling) | Hypotheses and approaches, each with “how to test” and “when to abandon” | High — the agent forms theories and pivots on findings |
| Tool integration (wrapping Semgrep, Nmap, a CLI) | Workflow patterns, common invocations, output interpretation | Medium — the agent follows patterns, adapts to context |
| Process automation (report generation, NDA review) | Step-by-step recipe with validation gates | Low — the agent follows the recipe |
Hybrids are fine. A security-tool integration has tool-mechanics on top and domain-research strategy underneath; reflect both.
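One way such a hybrid body might be laid out, with tool mechanics on top and research strategy underneath (the section names and file references are illustrative, not a prescribed template):

```markdown
## Running the scanner
Invoke with the baseline profile first. For flag combinations and
output interpretation, read references/flags.md.

## Forming hypotheses
- Exposed admin panel? Test with the default-credentials checklist;
  abandon if the target enforces SSO.
- Stale base image? Check the build date in the image metadata before
  investing time in CVE lookups.
```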
## Explain why, not what

The model already knows what to do for most things. What it doesn’t have is your domain context — why one approach works in a specific situation. Skills add value where they encode that context.
| Heavy-handed | Reasoned |
|---|---|
| "ALWAYS use a try-catch around database calls" | "DB calls fail on connection loss, timeouts, or constraint violations — wrap them so users see a clear message instead of a stack trace" |
| "NEVER skip the verification step" | "Skip verification only when running interactively — the verifier is what gates publish, so skipping it in CI hides real bugs" |
| "MUST run the linter before commit" | "The linter catches the same patterns reviewers flag manually; running it first cuts review cycles in half" |
Heavy MUST/ALWAYS/NEVER is a code smell. Each one constrains the model’s ability to adapt to context. Save them for genuinely invariant rules — security gates, output contracts, things that must never bend.
## What goes in the body vs. references vs. scripts

The body is loaded every time the skill triggers. Anything not needed every time should live elsewhere.
- **Body** — workflow, decision points, pointers to references and scripts.
- **References** — depth the agent reaches for selectively: domain-specific data, framework-specific instructions, long examples, edge case documentation. In your skill body, name each reference and say when to read it.
- **Scripts** — deterministic work that should produce the same output every time: validation, formatting, data transformation. Scripts are more reliable than asking the model to do mechanical work, save tokens, and work consistently across model sizes. They can be executed without being read into context.
| Use a script when | Use instructions when |
|---|---|
| Same input → same output | Output depends on context |
| Programmatically verifiable | Needs human or model judgment |
| Costs significant tokens to walk through | Token cost is negligible |
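As an illustration of script-worthy work, a deterministic validation gate for a report-generation skill might look like this (the file name and required section headings are assumptions for the sketch, not part of any real skill):

```python
# scripts/check_sections.py — hypothetical validation gate.
# Same draft in -> same verdict out, so this belongs in scripts/
# rather than being re-derived by the model on every run.

REQUIRED_SECTIONS = ("## Summary", "## Findings", "## Recommendations")

def missing_sections(text: str) -> list[str]:
    """Return the required headings absent from a report draft."""
    return [s for s in REQUIRED_SECTIONS if s not in text]

draft = "## Summary\n...\n## Findings\n..."
print(missing_sections(draft))  # -> ['## Recommendations']
```

The agent can execute this without ever reading it into context; it only sees the verdict.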
## Multi-domain organization

When one skill genuinely supports multiple variants — frameworks, cloud providers, target systems — split the variant detail into references and route from the body:
```
cloud-deploy/
├── SKILL.md        # workflow + which-reference-to-read
└── references/
    ├── aws.md
    ├── gcp.md
    └── azure.md
```

In SKILL.md, the body routes to the matching reference:

```markdown
## Provider-specific guidance

Read the matching reference based on the user's target:

- AWS / EC2 / Lambda / S3 → `references/aws.md`
- GCP / GCE / Cloud Run → `references/gcp.md`
- Azure / VMs / Functions → `references/azure.md`

Read only the file for the current target. Do not pre-load.
```

The body stays compact; the agent reads only what it needs.
## Iterating against real prompts

A skill you haven’t tested against a real prompt is a guess.
- Draft. Write a first pass. Don’t polish.
- Test with realistic prompts. Pick three things a real user would actually say — not abstract test inputs.
- Read the transcripts, not just the outputs. Intermediate steps reveal whether the skill is making the agent waste time or skip important things.
- Cut what isn’t pulling weight. If the agent ignores a section, remove it. Shorter skills are better skills.
- Sharpen at decision points. If the agent went off-track at a specific step, that step’s guidance was unclear. Add a sentence explaining why, not a paragraph of new rules.
- Bundle repeated work. If every test run independently produces the same helper script, drop it in `scripts/`. Write it once.
Complexity should decrease over iterations. If the skill grows with each round, you’re patching rather than fixing root causes.
For evaluation-driven scaling — formal datasets, scorers, the optimization loop — see the capability optimization loop.
## Common failure modes

- Description summarizes the skill instead of triggering it. “Helps with X” tells the agent what the skill is, not when to use it. Rewrite as “Use when…”.
- Body duplicates reference material. If something is in `--help` or a file the agent can read, point to it; don’t restate it. Duplicated content drifts and wastes tokens.
- Heavy MUST/ALWAYS/NEVER everywhere. Reframe each one as reasoning. The model adapts better to “X works because Y” than to “X is required.”
- One giant body for a multi-variant skill. Split into references and route from the body. The agent reads only what’s relevant.
- Skill never tested against real prompts. Run two or three realistic asks before declaring done. Read the transcripts.
- Skill grows on every iteration. Healthy iteration cuts; unhealthy iteration patches. If the body is getting longer, look for the section that should be a reference or a script.