Running evaluations

Launch, automate, retry, cancel, export, and compare hosted evaluations — from one-off commands to CI pipelines.

Once you’ve run your first evaluation, the next questions are operational: how do I check this into source control, inject secrets, block CI on completion, retry failures, and compare runs? This page is the playbook.

For the exhaustive command and flag list, see dn evaluation.

Keep the evaluation definition in evaluation.yaml when you want it in source control, when the request grows past a readable command line, or when you need per-row inputs.

evaluation.yaml
name: nightly-regression
project: sandbox
task_names:
  - corp-recon
  - local-enum
model: openai/gpt-4.1-mini
secret_ids:
  - 11111111-2222-3333-4444-555555555555
concurrency: 4
cleanup_policy: on_success
Terminal window
dn evaluation create --file evaluation.yaml

Explicit CLI flags override values from the file. Use secret_ids in the manifest for exact source-controlled configuration; use repeatable --secret flags to resolve names against your user-configured secrets at runtime.

--secret injects user-configured secrets into both the runtime sandbox and the task environment sandbox.

Terminal window
# exact name: strict, must exist
dn evaluation create my-eval --task corp-recon --model openai/gpt-4.1-mini \
--secret OPENROUTER_API_KEY
# glob: best-effort, zero matches is allowed
dn evaluation create my-eval --task corp-recon --model openai/gpt-4.1-mini \
--secret 'OPENROUTER_*'
Selector        Behavior
Exact name      Strict — fails fast when the secret isn’t configured.
Glob pattern    Best-effort — silently skips when nothing matches.
Duplicates      De-duplicated before the request is submitted.
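The resolution rules above can be sketched in Python. This is illustrative only — it models the documented semantics, not the CLI's internals, and `configured` stands in for your user-configured secret names:

```python
from fnmatch import fnmatch

def resolve_secrets(selectors: list[str], configured: list[str]) -> list[str]:
    """Model the documented --secret resolution rules.

    Exact names are strict (a missing secret is an error), glob patterns
    are best-effort (zero matches is fine), and the final list is
    de-duplicated while preserving order.
    """
    resolved: list[str] = []
    for sel in selectors:
        if any(ch in sel for ch in "*?["):  # glob selector: best-effort
            resolved.extend(name for name in configured if fnmatch(name, sel))
        elif sel in configured:             # exact selector: strict
            resolved.append(sel)
        else:
            raise KeyError(f"secret {sel!r} is not configured")
    seen: set[str] = set()
    return [n for n in resolved if not (n in seen or seen.add(n))]
```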

Use --wait on create or the standalone wait command to gate CI or scripts on results. Both exit non-zero if the evaluation didn’t complete successfully.

Terminal window
# block at creation time
dn evaluation create my-eval --task corp-recon --model openai/gpt-4.1-mini --wait
# or wait on an existing evaluation
dn evaluation wait 9ab81fc1 --timeout-sec 3600
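In a pipeline you can wrap the wait in a small helper that forwards the exit code unchanged — a sketch that assumes only what the page states, namely that the wait command exits non-zero on anything but success (the command list shown in the docstring is illustrative):

```python
import subprocess

def gate_on_evaluation(cmd: list[str]) -> int:
    """Run a blocking command and return its exit code unchanged.

    With cmd = ["dn", "evaluation", "wait", "9ab81fc1",
    "--timeout-sec", "3600"], a non-zero code means the evaluation
    didn't complete successfully, so CI should fail on it.
    """
    return subprocess.run(cmd).returncode
```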

--cleanup-policy is easy to ignore until compute is left running.

  • always (default) — clean up even when the evaluation fails. Use for clean automation.
  • on_success — failed runs leave sandboxes up for inspection. Use when you need to drop into a failing item. Expect to clean up with dn sandbox after.
Terminal window
# requeue failed, timed-out, cancelled, and errored samples without recreating the evaluation
dn evaluation retry 9ab81fc1
# cancel a running evaluation (terminates active sandboxes)
dn evaluation cancel 9ab81fc1

retry is most useful after a terminal run when you want to requeue only the samples that ended in failed, timed-out, cancelled, or infrastructure-error states.
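As a mental model, the retry selection behaves like filtering samples down to those terminal states. The state strings below are assumptions for illustration — check `dn evaluation list-samples` output for the exact values your server returns:

```python
# Illustrative state names; verify against your server's actual output.
RETRYABLE_STATES = {"failed", "timed_out", "cancelled", "infrastructure_error"}

def retryable_samples(samples: list[dict]) -> list[str]:
    """Return the sample IDs a retry would requeue, per the rules above."""
    return [s["id"] for s in samples if s["state"] in RETRYABLE_STATES]
```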

Terminal window
# export samples as JSONL (optionally include transcripts)
dn evaluation export 9ab81fc1 --format jsonl
# compare two evaluations side by side
dn evaluation compare 9ab81fc1 b2c34de5

Use compare to see how a different model, prompt, or task version performs against the same workload.
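If you'd rather diff runs offline, the JSONL exports are easy to post-process. A minimal sketch — the `passed` field name is an assumption; inspect your own export for the real schema:

```python
import json
from pathlib import Path

def pass_rate(path: Path) -> float:
    """Fraction of exported samples marked passed (field name assumed)."""
    rows = [json.loads(line) for line in path.read_text().splitlines() if line]
    return sum(1 for r in rows if r.get("passed")) / len(rows)

def compare_exports(a: Path, b: Path) -> float:
    """Pass-rate delta between two `dn evaluation export` JSONL files."""
    return pass_rate(b) - pass_rate(a)
```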

Terminal window
dn evaluation get-transcript 9ab81fc1/75e4914f

The transcript is available mid-run — the session link is established as soon as the runtime creates it, before the agent begins streaming. Samples without a linked session return 404 (old evaluations, or runtime session-registration failures); export --transcripts skips those with a warning instead of failing. For the payload shape, see dreadnode.sessions.

Sample references use eval/sample slash syntax (for example 9ab81fc1/75e4914f). Both IDs support prefix matching — the first 8 characters are enough.
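The slash syntax is trivial to handle in scripts — a sketch of splitting a reference into its two prefix-matchable IDs:

```python
def parse_sample_ref(ref: str) -> tuple[str, str]:
    """Split an eval/sample reference like '9ab81fc1/75e4914f'.

    Both halves may be prefixes (the first 8 characters are enough);
    the server does the prefix matching, so nothing beyond shape is
    validated here.
    """
    eval_id, sep, sample_id = ref.partition("/")
    if not sep or not eval_id or not sample_id:
        raise ValueError(f"expected eval/sample slash syntax, got {ref!r}")
    return eval_id, sample_id
```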

Evaluation commands use the standard platform context from Authentication: --profile, --server, --api-key, --organization, --workspace, --project.

Terminal window
dn evaluation get 9ab81fc1 --json
dn evaluation list-samples 9ab81fc1
dn sandbox list --state running

Together, those three views triangulate whether you’re looking at a control-plane problem, a task failure, or a cleanup-policy surprise. For deeper failure triage, see Security Evaluation Operations.