
# Capability improvement

Use `dn capability improve` to optimize a local capability against a local dataset and land a promotable candidate.

`dn capability improve` is the on-machine optimization loop for capabilities you haven’t published yet. You point it at a capability directory, a local dataset, and one or more scorers; it runs a GEPA search over the capability’s own prompt and skill files, keeps or discards the winner based on a holdout score, and writes an audit-friendly ledger to disk.

```shell
dn capability improve ./capabilities/support-agent \
  --dataset ./datasets/support-train.jsonl \
  --holdout-dataset ./datasets/support-holdout.jsonl \
  --scorer ./scorers.py:answer_contains_expected \
  --model openai/gpt-4o-mini \
  --objective "Make answers more specific without getting longer." \
  --max-metric-calls 40
```

The command runs in-process, so an LLM key (not a Dreadnode workspace) is all you need. Use it while the capability is still local, before you `dn capability push`, to keep the search loop fast and the scoring logic editable.

Reach for capability improve when:

  • the capability lives on your machine as a directory with `capability.yaml` and friends
  • you can express “better” with one or more scorers you already wrote
  • you want the winning candidate to be a drop-in replacement for the source files, not a prompt

For a published capability, move to hosted jobs. For a plain string prompt, use local search.

By default the optimizer can edit four things in the capability:

| Surface | What it covers |
| --- | --- |
| `agent_prompt` | The agent’s `instructions` field. |
| `capability_prompt` | The capability-level prompt text. |
| `skill_descriptions` | The `description` string on each skill. |
| `skill_bodies` | The body of each skill file. |

Use `--surface` to narrow the allowed edits; for example, `--surface agent_prompt` changes only the agent instructions. Pass the flag repeatedly to allow more than one surface.

The dataset is a local file (JSONL or a dataset directory). Each row becomes a task invocation. Scorers receive the agent output and the row and return a numeric score.
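As a concrete illustration, a dataset row might carry a `question` column for the goal and an `expected` column for the scorer to check. The exact parameter names a scorer receives are not spelled out in this page, so the signature below is an assumption; what the docs do state is that a scorer gets the agent output plus the dataset row and returns a numeric score:

```python
import json

# Hypothetical scorer in scorers.py. The parameter names (output, row) are
# illustrative assumptions; per the docs, a scorer receives the agent output
# and the dataset row and returns a number.
def answer_contains_expected(output: str, row: dict) -> float:
    """Return 1.0 when the row's expected answer text appears in the output."""
    expected = row.get("expected", "")
    return 1.0 if expected and expected.lower() in output.lower() else 0.0

# A JSONL dataset is one JSON object per line; each row becomes a task invocation.
example_row = json.loads('{"question": "What is the refund window?", "expected": "30 days"}')
```

A binary scorer like this keeps the search signal unambiguous; graded scorers (partial credit) work too, as long as higher means better.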

Pass scorers with `--scorer PATH:NAME` (module path plus callable name); the flag is repeatable for multiple metrics. When you pass more than one, add `--score-name` to pick the metric the optimizer should actually maximize.

```shell
dn capability improve ./capabilities/support-agent \
  --dataset ./datasets/support-train.jsonl \
  --scorer ./scorers.py:answer_contains_expected \
  --scorer ./scorers.py:answer_under_120_chars \
  --score-name answer_contains_expected \
  --goal-field question
```

If your dataset fields don’t line up with the agent’s task parameters, map them with repeatable `--dataset-input DATASET_KEY=TASK_PARAM` flags. `--goal-field` picks the column that becomes the agent goal.
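Conceptually, the mapping is a column rename plus goal extraction. The sketch below is purely illustrative (the CLI’s internals are not documented here, and the `"goal"` key name is an assumption), but it shows what a `--dataset-input tier=customer_tier --goal-field question` combination would do to each row:

```python
# Illustrative sketch of --dataset-input / --goal-field semantics.
# Mapping keys are dataset columns; values are task parameter names.
def map_row(row: dict, mappings: dict[str, str], goal_field: str) -> dict:
    """Rename dataset columns to task parameters and pull out the goal column."""
    params = {task_param: row[dataset_key] for dataset_key, task_param in mappings.items()}
    params["goal"] = row[goal_field]  # "goal" is an assumed internal key name
    return params
```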

`--holdout-dataset` is what turns an optimization result into a promotable one. The optimizer accepts the best candidate only when:

  • the training score improves over the baseline (or ties while shrinking the edited surface), and
  • the holdout score does not regress against the baseline (within a small tolerance).

A candidate that only ties on training is rejected; a flat metric is not evidence of improvement. Without a holdout, the optimizer can only judge fit to the training set. That is fine while you’re exploring, but not enough to justify overwriting the capability’s files.
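The two gating rules above can be sketched as a small predicate. This is an illustrative model, not the CLI’s implementation; the real tolerance value and the exact definition of “shrinking the edited surface” live inside the tool:

```python
# Sketch of the accept/reject gate described above. All names and the
# tolerance default are illustrative assumptions.
def accept(train_base: float, train_best: float,
           holdout_base: float, holdout_best: float,
           shrinks_surface: bool = False, tolerance: float = 0.01) -> bool:
    # Rule 1: training score improves, or ties while shrinking the edited surface.
    trained_better = (train_best > train_base
                      or (train_best == train_base and shrinks_surface))
    # Rule 2: holdout score does not regress beyond a small tolerance.
    holdout_ok = holdout_best >= holdout_base - tolerance
    return trained_better and holdout_ok
```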

By default, proposals come from the GEPA backend’s own reflection. You can override that with a local proposer capability:

```shell
dn capability improve ./capabilities/support-agent \
  --dataset ./datasets/support-train.jsonl \
  --scorer ./scorers.py:answer_contains_expected \
  --proposer-capability dreadnode/capability-improver \
  --proposer-model openai/gpt-4o-mini
```

The proposer is a capability that suggests candidate edits; the CLI still owns scoring and the accept/reject decision. Use `--proposer-agent` when the proposer capability exports more than one agent.

The loader resolves `--proposer-capability` against the directories in `DREADNODE_CAPABILITY_DIRS` (or `DREADNODE_CAPABILITIES_DIR`). When the ref can’t be resolved locally, the run falls back to the backend’s own reflection without a warning, so install the proposer capability into one of those directories first if you need it active.

Each run writes to `<capability>/.dreadnode/improve/<timestamp>/` (override with `--output-dir`). The output directory must not already exist; pick a new path when rerunning.

| File | What’s in it |
| --- | --- |
| `ledger.json` | Run metadata, baseline and best scores, accept/reject decision. |
| `baseline-candidate.json` | The starting candidate before optimization. |
| `best-candidate.json` | The best candidate the search found. |
| `winner-candidate.json` | Baseline or best, depending on the gating decision. |
| `history.json` | Every trial the search evaluated. |
| `best-capability/` | A materialized capability directory with the winning edits applied. |

`ledger.json`’s decision block spells out accept or reject with a human-readable reason. The terminal output prints the same summary.

Hand `best-capability/` to `dn capability push` once you’ve read the diff. Don’t push automatically: the ledger tells you the candidate cleared the holdout gate, but it can’t tell you whether the new instructions are ones you’d want to ship.

| Flag | Default | What it bounds |
| --- | --- | --- |
| `--max-metric-calls` | 40 | Total evaluator calls the search can make. |
| `--max-trials` | 8 | Number of candidate trials. |
| `--max-trials-without-improvement` | 3 | Stop after this many finished trials without a new best. |

All three are upper bounds; the search stops at whichever hits first. For short runs, keep the defaults; raise `--max-metric-calls` when the search is still finding new bests at the end.
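The “whichever hits first” semantics amount to a disjunction of the three bounds. Again a conceptual sketch with assumed names, not the CLI’s code:

```python
# Illustrative stop condition for the search loop: any one bound being
# reached ends the run. Defaults mirror the flag defaults in the table above.
def should_stop(metric_calls: int, trials: int, trials_without_improvement: int,
                max_metric_calls: int = 40, max_trials: int = 8,
                max_stale: int = 3) -> bool:
    return (metric_calls >= max_metric_calls
            or trials >= max_trials
            or trials_without_improvement >= max_stale)
```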

Other useful flags not covered above: `--agent` (pick which capability agent to optimize when the capability exports more than one), `--reflection-model` (override the model GEPA uses for reflection proposals), `--seed`, and `--json`. Run `dn capability improve --help` for the full list.