Capability improvement
Use `dn capability improve` to optimize a local capability against a local dataset and land a promotable candidate.
`dn capability improve` is the on-machine optimization loop for capabilities you haven’t published
yet. You point it at a capability directory, a local dataset, and one or more scorers; it runs a
GEPA search over the capability’s own prompt and skill files, keeps or discards the winner based
on a holdout score, and writes an audit-friendly ledger to disk.
```shell
dn capability improve ./capabilities/support-agent \
  --dataset ./datasets/support-train.jsonl \
  --holdout-dataset ./datasets/support-holdout.jsonl \
  --scorer ./scorers.py:answer_contains_expected \
  --model openai/gpt-4o-mini \
  --objective "Make answers more specific without getting longer." \
  --max-metric-calls 40
```

The command runs in-process, so an LLM key (not a Dreadnode workspace) is all you need. Use it
while the capability is still local, before you `dn capability push`, to keep the search loop
fast and the scoring logic editable.
When to use this loop
Reach for `capability improve` when:
- the capability lives on your machine as a directory with `capability.yaml` and friends
- you can express “better” with one or more scorers you already wrote
- you want the winning candidate to be a drop-in replacement for the source files, not a prompt
For a published capability, move to hosted jobs. For a plain string prompt, use local search.
The four surfaces
By default the optimizer can edit four things in the capability:
| Surface | What it covers |
|---|---|
| `agent_prompt` | The agent’s instructions field. |
| `capability_prompt` | The capability-level prompt text. |
| `skill_descriptions` | The description string on each skill. |
| `skill_bodies` | The body of each skill file. |
Use `--surface` to narrow the allowed edits (`--surface agent_prompt` to only change the agent
instructions, for example). Pass it repeatedly to allow more than one.
Scorers and the dataset
The dataset is a local file (JSONL or a dataset directory). Each row becomes a task invocation. Scorers receive the agent output and the dataset row and return a numeric score.
Pass scorers with `--scorer PATH:NAME` (module path plus callable name); the flag is repeatable
for multiple metrics. When you pass more than one, add `--score-name` to pick the one the
optimizer should actually maximize.
```shell
dn capability improve ./capabilities/support-agent \
  --dataset ./datasets/support-train.jsonl \
  --scorer ./scorers.py:answer_contains_expected \
  --scorer ./scorers.py:answer_under_120_chars \
  --score-name answer_contains_expected \
  --goal-field question
```

If your dataset fields don’t line up with the agent’s task parameters, map them with repeatable
`--dataset-input DATASET_KEY=TASK_PARAM` flags. `--goal-field` picks the column that becomes the
agent goal.
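As a concrete sketch of what a scorer file can contain, the two scorers referenced above might look like this. The two-argument `(output, row) -> float` signature is an assumption inferred from the description above, not a documented contract; check your existing scorers for the exact shape:

```python
# scorers.py -- hypothetical implementations of the two scorers used above.
# Assumed signature: scorer(output, row) -> float (an assumption, not confirmed).

def answer_contains_expected(output: str, row: dict) -> float:
    """1.0 when the row's expected answer text appears in the agent output."""
    expected = str(row.get("expected", ""))
    return 1.0 if expected and expected.lower() in output.lower() else 0.0

def answer_under_120_chars(output: str, row: dict) -> float:
    """Reward brevity: 1.0 at or under 120 characters, else 0.0."""
    return 1.0 if len(output) <= 120 else 0.0
```

Both return values in `[0, 1]`, which keeps a multi-scorer run easy to compare against the one metric `--score-name` selects.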
Holdout gating
`--holdout-dataset` is what turns an optimization result into a promotable one. The optimizer
accepts the best candidate only when:
- the training score improves over the baseline (or ties while shrinking the edited surface), and
- the holdout score does not regress against the baseline (within a small tolerance).
A candidate that only ties on training is rejected: a flat metric is not evidence of improvement. Without a holdout, the optimizer can only judge fit to the training set. That’s fine while you’re exploring, but not enough to justify overwriting the capability’s files.
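The two rules can be sketched as a predicate over the baseline and candidate scores. This is an illustrative reconstruction of the gate described above, not the CLI’s actual code; the tolerance value and the surface-shrink flag are assumptions:

```python
def accept_candidate(
    train_base: float,
    train_best: float,
    holdout_base: float,
    holdout_best: float,
    shrank_surface: bool = False,
    tolerance: float = 1e-6,  # assumed "small tolerance" for holdout regression
) -> bool:
    """Mirror the two gating rules: training must improve (or tie while
    shrinking the edited surface), and holdout must not regress."""
    train_ok = train_best > train_base or (
        train_best == train_base and shrank_surface
    )
    holdout_ok = holdout_best >= holdout_base - tolerance
    return train_ok and holdout_ok
```

Note that both conditions must hold: a big training win with a holdout regression is still rejected.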
The proposer capability
By default, proposals come from the GEPA backend’s own reflection. You can override that with a local proposer capability:
```shell
dn capability improve ./capabilities/support-agent \
  --dataset ./datasets/support-train.jsonl \
  --scorer ./scorers.py:answer_contains_expected \
  --proposer-capability dreadnode/capability-improver \
  --proposer-model openai/gpt-4o-mini
```

The proposer is a capability that suggests candidate edits; the CLI still owns scoring and the
accept/reject decision. Use `--proposer-agent` when the proposer capability exports more than one
agent.
The loader resolves `--proposer-capability` against the directories in
`DREADNODE_CAPABILITY_DIRS` (or `DREADNODE_CAPABILITIES_DIR`). When the ref can’t be resolved
locally, the run falls back to the backend’s own reflection without a warning, so install the
proposer capability into one of those directories first if you need it active.
Reading the output
Each run writes to `<capability>/.dreadnode/improve/<timestamp>/` (override with `--output-dir`).
The output directory must not already exist — pick a new path when rerunning.
| File | What’s in it |
|---|---|
| `ledger.json` | Run metadata, baseline and best scores, accept/reject decision. |
| `baseline-candidate.json` | The starting candidate before optimization. |
| `best-candidate.json` | The best candidate the search found. |
| `winner-candidate.json` | Baseline or best, depending on the gating decision. |
| `history.json` | Every trial the search evaluated. |
| `best-capability/` | A materialized capability directory with the winning edits applied. |
`ledger.json`’s decision block spells out accept or reject with a human-readable reason. The
terminal output prints the same summary.
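For scripting around a run, the decision can be read straight from `ledger.json`. The decision block’s exact field names aren’t documented here, so `accepted` and `reason` below are assumptions; inspect a real ledger before relying on them:

```python
import json
from pathlib import Path

def summarize_run(output_dir: str) -> str:
    """One-line verdict from a run's ledger (field names are assumed)."""
    ledger = json.loads((Path(output_dir) / "ledger.json").read_text())
    decision = ledger["decision"]
    verdict = "accepted" if decision["accepted"] else "rejected"
    return f"{verdict}: {decision['reason']}"
```

A helper like this is handy in CI, where you want a run to fail loudly when the candidate was rejected.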
Hand `best-capability/` to `dn capability push` once you’ve read the diff. Don’t push
automatically: the ledger tells you the candidate cleared the holdout gate, but it can’t tell you
whether the new instructions are ones you’d want to ship.
Budget flags
| Flag | Default | What it bounds |
|---|---|---|
| `--max-metric-calls` | 40 | Total evaluator calls the search can make. |
| `--max-trials` | 8 | Number of candidate trials. |
| `--max-trials-without-improvement` | 3 | Stop after this many finished trials without a new best. |
All three are upper bounds; the search stops at whichever is hit first. For short runs, keep the
defaults; raise `--max-metric-calls` when the search is still finding new bests at the end.
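The interaction of the three budgets amounts to a simple disjunction. A sketch of the stopping rule under the defaults above (illustrative, not the CLI’s internals):

```python
def should_stop(metric_calls: int, trials: int, trials_since_best: int,
                max_metric_calls: int = 40, max_trials: int = 8,
                max_trials_without_improvement: int = 3) -> bool:
    """Stop when any budget is exhausted: whichever bound hits first wins."""
    return (metric_calls >= max_metric_calls
            or trials >= max_trials
            or trials_since_best >= max_trials_without_improvement)
```

In practice the plateau bound (`--max-trials-without-improvement`) usually fires first on easy metrics, and the metric-call bound first on expensive ones.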
Other useful flags not covered above: `--agent` (pick which capability agent to optimize when the
capability exports more than one), `--reflection-model` (override the model GEPA uses for
reflection proposals), `--seed`, and `--json`. Run `dn capability improve --help` for the full
list.
- Move to hosted jobs when the capability is ready to publish.
- Read custom search loops for the `Study`/`Sampler` primitives the improvement adapter drives.
- Scorers and datasets cover the inputs this loop feeds on.