
# Capability improvement

Use `dn capability improve` to optimize a local capability against a local dataset and land a promotable candidate.

`dn capability improve` is the on-machine optimization loop for capabilities you haven’t published yet. You point it at a capability directory, a local dataset, and one or more scorers; it runs a GEPA search over the capability’s own prompt and skill files, keeps or discards the winner based on a holdout score, and writes an audit-friendly ledger to disk.

```shell
dn capability improve ./capabilities/support-agent \
  --dataset ./datasets/support-train.jsonl \
  --holdout-dataset ./datasets/support-holdout.jsonl \
  --scorer ./scorers.py:answer_contains_expected \
  --model openai/gpt-4o-mini \
  --objective "Make answers more specific without getting longer." \
  --max-metric-calls 40
```

The command runs in-process, so an LLM key (not a Dreadnode workspace) is all you need. Use it while the capability is still local, before you `dn capability push`, to keep the search loop fast and the scoring logic editable.

Reach for capability improve when:

  • the capability lives on your machine as a directory with `capability.yaml` and friends
  • you can express “better” with one or more scorers you already wrote
  • you want the winning candidate to be a drop-in replacement for the source files, not a prompt

For a published capability, move to hosted jobs. For a plain string prompt, use local search.

By default the optimizer can edit four things in the capability:

| Surface | What it covers |
| --- | --- |
| `agent_prompt` | The agent’s `instructions` field. |
| `capability_prompt` | The capability-level prompt text. |
| `skill_descriptions` | The `description` string on each skill. |
| `skill_bodies` | The body of each skill file. |

Use `--surface` to narrow the allowed edits; for example, `--surface agent_prompt` changes only the agent instructions. Pass the flag repeatedly to allow more than one surface.

The dataset is a local file (JSONL or a dataset directory). Each row becomes a task invocation. Scorers receive the agent output and the row and return a numeric score.
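As a concrete illustration, a dataset row might carry a `question` column for the goal and an `expected` column for the scorer to check. The exact parameter names a scorer receives are not spelled out in this page, so the signature below is an assumption; what the docs do state is that a scorer gets the agent output plus the dataset row and returns a numeric score:

```python
import json

# Hypothetical scorer in scorers.py. The parameter names (output, row) are
# illustrative assumptions; per the docs, a scorer receives the agent output
# and the dataset row and returns a number.
def answer_contains_expected(output: str, row: dict) -> float:
    """Return 1.0 when the row's expected answer text appears in the output."""
    expected = row.get("expected", "")
    return 1.0 if expected and expected.lower() in output.lower() else 0.0

# A JSONL dataset is one JSON object per line; each row becomes a task invocation.
example_row = json.loads('{"question": "What is the refund window?", "expected": "30 days"}')
```

A binary scorer like this keeps the search signal unambiguous; graded scorers (partial credit) work too, as long as higher means better.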

Pass scorers with `--scorer PATH:NAME` (module path plus callable name); the flag is repeatable for multiple metrics. When you pass more than one, add `--score-name` to pick the metric the optimizer should actually maximize.

```shell
dn capability improve ./capabilities/support-agent \
  --dataset ./datasets/support-train.jsonl \
  --scorer ./scorers.py:answer_contains_expected \
  --scorer ./scorers.py:answer_under_120_chars \
  --score-name answer_contains_expected \
  --goal-field question
```

If your dataset fields don’t line up with the agent’s task parameters, map them with repeatable `--dataset-input DATASET_KEY=TASK_PARAM` flags. `--goal-field` picks the column that becomes the agent goal.
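Conceptually, the mapping is a column rename plus goal extraction. The sketch below is purely illustrative (the CLI’s internals are not documented here, and the `"goal"` key name is an assumption), but it shows what a `--dataset-input tier=customer_tier --goal-field question` combination would do to each row:

```python
# Illustrative sketch of --dataset-input / --goal-field semantics.
# Mapping keys are dataset columns; values are task parameter names.
def map_row(row: dict, mappings: dict[str, str], goal_field: str) -> dict:
    """Rename dataset columns to task parameters and pull out the goal column."""
    params = {task_param: row[dataset_key] for dataset_key, task_param in mappings.items()}
    params["goal"] = row[goal_field]  # "goal" is an assumed internal key name
    return params
```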

`--holdout-dataset` is what turns an optimization result into a promotable one. The optimizer accepts the best candidate only when:

  • the training score improves over the baseline (or ties while shrinking the edited surface), and
  • the holdout score does not regress against the baseline (within a small tolerance).

A candidate that only ties on training is rejected; a flat metric is not evidence of improvement. Without a holdout, the optimizer can only judge fit to the training set. That is fine while you’re exploring, but not enough to justify overwriting the capability’s files.
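The two gating rules above can be sketched as a small predicate. This is an illustrative model, not the CLI’s implementation; the real tolerance value and the exact definition of “shrinking the edited surface” live inside the tool:

```python
# Sketch of the accept/reject gate described above. All names and the
# tolerance default are illustrative assumptions.
def accept(train_base: float, train_best: float,
           holdout_base: float, holdout_best: float,
           shrinks_surface: bool = False, tolerance: float = 0.01) -> bool:
    # Rule 1: training score improves, or ties while shrinking the edited surface.
    trained_better = (train_best > train_base
                      or (train_best == train_base and shrinks_surface))
    # Rule 2: holdout score does not regress beyond a small tolerance.
    holdout_ok = holdout_best >= holdout_base - tolerance
    return trained_better and holdout_ok
```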

By default, proposals come from the GEPA backend’s own reflection. You can override that with a local proposer capability:

```shell
dn capability improve ./capabilities/support-agent \
  --dataset ./datasets/support-train.jsonl \
  --scorer ./scorers.py:answer_contains_expected \
  --proposer-capability dreadnode/capability-improver \
  --proposer-model openai/gpt-4o-mini
```

The proposer is a capability that suggests candidate edits; the CLI still owns scoring and the accept/reject decision. Use `--proposer-agent` when the proposer capability exports more than one agent.

The loader resolves `--proposer-capability` against the directories in `DREADNODE_CAPABILITY_DIRS` (or `DREADNODE_CAPABILITIES_DIR`). When the ref can’t be resolved locally, the run falls back to the backend’s own reflection without a warning, so install the proposer capability into one of those directories first if you need it active.

Each run writes to `<capability>/.dreadnode/improve/<timestamp>/` (override with `--output-dir`). The output directory must not already exist; pick a new path when rerunning.

| File | What’s in it |
| --- | --- |
| `ledger.json` | Run metadata, baseline and best scores, accept/reject decision. |
| `baseline-candidate.json` | The starting candidate before optimization. |
| `best-candidate.json` | The best candidate the search found. |
| `winner-candidate.json` | Baseline or best, depending on the gating decision. |
| `history.json` | Every trial the search evaluated. |
| `best-capability/` | A materialized capability directory with the winning edits applied. |

`ledger.json`’s decision block spells out accept or reject with a human-readable reason. The terminal output prints the same summary.

Hand `best-capability/` to `dn capability push` once you’ve read the diff. Don’t push automatically: the ledger tells you the candidate cleared the holdout gate, but it can’t tell you whether the new instructions are ones you’d want to ship.

| Flag | Default | What it bounds |
| --- | --- | --- |
| `--max-metric-calls` | 40 | Total evaluator calls the search can make. |
| `--max-trials` | 8 | Number of candidate trials. |
| `--max-trials-without-improvement` | 3 | Stop after this many finished trials without a new best. |

All three are upper bounds; the search stops at whichever hits first. For short runs, keep the defaults; raise `--max-metric-calls` when the search is still finding new bests at the end.
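The “whichever hits first” semantics amount to a disjunction of the three bounds. Again a conceptual sketch with assumed names, not the CLI’s code:

```python
# Illustrative stop condition for the search loop: any one bound being
# reached ends the run. Defaults mirror the flag defaults in the table above.
def should_stop(metric_calls: int, trials: int, trials_without_improvement: int,
                max_metric_calls: int = 40, max_trials: int = 8,
                max_stale: int = 3) -> bool:
    return (metric_calls >= max_metric_calls
            or trials >= max_trials
            or trials_without_improvement >= max_stale)
```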

Other useful flags not covered above: `--agent` (pick which capability agent to optimize when the capability exports more than one), `--reflection-model` (override the model GEPA uses for reflection proposals), `--seed`, and `--json`. Run `dn capability improve --help` for the full list.