Running evaluations
Launch, automate, retry, cancel, export, and compare hosted evaluations — from one-off commands to CI pipelines.
Once you’ve run your first evaluation, the next questions are operational: how do I check this into source control, inject secrets, block CI on completion, retry failures, and compare runs? This page is the playbook.
For the exhaustive command and flag list, see `dn evaluation`.
File-backed manifests
Keep the evaluation definition in `evaluation.yaml` when you want it in source control, when the request grows past a readable command line, or when you need per-row inputs.

```yaml
name: nightly-regression
project: sandbox
task_names:
  - corp-recon
  - local-enum
model: openai/gpt-4.1-mini
secret_ids:
  - 11111111-2222-3333-4444-555555555555
concurrency: 4
cleanup_policy: on_success
```

```shell
dn evaluation create --file evaluation.yaml
```

Explicit CLI flags override values from the file. Use `secret_ids` in the manifest for exact source-controlled configuration; use repeatable `--secret` flags to resolve names against your user-configured secrets at runtime.
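That precedence rule (explicit CLI flags win over manifest values) can be sketched as a simple merge. This is an illustration of the stated behavior only, not the CLI's actual internals; the field names mirror the manifest above:

```python
def effective_config(manifest: dict, cli_flags: dict) -> dict:
    """Merge a manifest with CLI flags; explicitly set flags win.

    Illustrative sketch of the precedence rule, not the real implementation.
    """
    merged = dict(manifest)
    # Only flags the user actually passed (non-None) override the file.
    merged.update({k: v for k, v in cli_flags.items() if v is not None})
    return merged


# A flag that was passed overrides the manifest; one left unset does not.
print(effective_config(
    {"model": "openai/gpt-4.1-mini", "concurrency": 4},
    {"model": "openai/gpt-4.1", "concurrency": None},
))
```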
Injecting secrets
`--secret` injects user-configured secrets into both the runtime sandbox and the task environment sandbox.

```shell
# exact name: strict, must exist
dn evaluation create my-eval --task corp-recon --model openai/gpt-4.1-mini \
  --secret OPENROUTER_API_KEY

# glob: best-effort, zero matches is allowed
dn evaluation create my-eval --task corp-recon --model openai/gpt-4.1-mini \
  --secret 'OPENROUTER_*'
```

| Selector | Behavior |
|---|---|
| Exact name | Strict — fails fast when the secret isn’t configured. |
| Glob pattern | Best-effort — silently skips when nothing matches. |
| Duplicates | De-duplicated before the request is submitted. |
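The table's three rules can be modeled in a few lines. This is a hypothetical sketch of the selector semantics, not the CLI's code:

```python
from fnmatch import fnmatch


def select_secrets(selectors: list[str], configured: list[str]) -> list[str]:
    """Resolve --secret selectors against configured secret names."""
    chosen: list[str] = []
    for sel in selectors:
        if any(ch in sel for ch in "*?["):
            # Glob: best-effort, zero matches is allowed.
            chosen.extend(name for name in configured if fnmatch(name, sel))
        elif sel in configured:
            chosen.append(sel)
        else:
            # Exact name: strict, fail fast.
            raise KeyError(f"secret {sel!r} is not configured")
    # De-duplicate while preserving first-seen order.
    seen: set[str] = set()
    return [s for s in chosen if not (s in seen or seen.add(s))]
```

Under this model, `["OPENROUTER_*", "API_KEY", "API_KEY"]` against configured secrets `["OPENROUTER_API_KEY", "API_KEY"]` yields each name once, and a glob with no matches is silently dropped.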
Blocking on completion
Use `--wait` on create or the standalone `wait` command to gate CI or scripts on results. Both exit non-zero if the evaluation didn’t complete successfully.

```shell
# block at creation time
dn evaluation create my-eval --task corp-recon --model openai/gpt-4.1-mini --wait

# or wait on an existing evaluation
dn evaluation wait 9ab81fc1 --timeout-sec 3600
```

Cleanup policy
`--cleanup-policy` is easy to ignore until compute is left running.

- `always` (default) — clean up even when the evaluation fails. Use for clean automation.
- `on_success` — failed runs leave sandboxes up for inspection. Use when you need to drop into a failing item. Expect to clean up with `dn sandbox` after.
Retry and cancel
```shell
# requeue failed, timed-out, and errored samples without recreating the evaluation
dn evaluation retry 9ab81fc1

# cancel a running evaluation (terminates active sandboxes)
dn evaluation cancel 9ab81fc1
```

`retry` is most useful after a terminal run when you want to requeue only the samples that ended in failed, timed-out, cancelled, or infrastructure-error states.
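That selection rule can be sketched as a filter. The state names follow the prose above, but the sample payload shape here is an assumption for illustration, not the documented API schema:

```python
# Terminal states named above as eligible for a retry.
RETRYABLE_STATES = {"failed", "timed-out", "cancelled", "infrastructure-error"}


def samples_to_requeue(samples: list[dict]) -> list[str]:
    """Pick sample ids whose terminal state qualifies for a retry.

    The {"id": ..., "state": ...} shape is assumed for illustration.
    """
    return [s["id"] for s in samples if s.get("state") in RETRYABLE_STATES]
```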
Export and compare
```shell
# export samples as JSONL (optionally include transcripts)
dn evaluation export 9ab81fc1 --format jsonl

# compare two evaluations side by side
dn evaluation compare 9ab81fc1 b2c34de5
```

Use `compare` to see how a different model, prompt, or task version performs against the same workload.
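If you post-process the JSONL export yourself, a comparison can be as simple as a pass rate per file. The `state` field and `completed` value below are assumptions for illustration; check your actual export for the real schema:

```python
import json


def pass_rate(jsonl_text: str) -> float:
    """Fraction of exported samples that finished successfully.

    Assumes one JSON object per line with a "state" field (illustrative).
    """
    rows = [json.loads(line) for line in jsonl_text.splitlines() if line.strip()]
    if not rows:
        return 0.0
    return sum(1 for r in rows if r.get("state") == "completed") / len(rows)
```

Running this over the exports of two evaluations gives a rough side-by-side number when you don't need the full `compare` output.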
Transcripts
```shell
dn evaluation get-transcript 9ab81fc1/75e4914f
```

The transcript is available mid-run — the session link is established as soon as the runtime creates it, before the agent begins streaming. Samples without a linked session return 404 (old evaluations, or runtime session-registration failures); `export --transcripts` skips those with a warning instead of failing. For the payload shape, see dreadnode.sessions.

Sample references use eval/sample slash syntax (for example 9ab81fc1/75e4914f). Both IDs support prefix matching — the first 8 characters are enough.
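Prefix matching works the way it does in most content-addressed tools: a prefix must resolve to exactly one id. A sketch of that resolution rule, for illustration only:

```python
def resolve_prefix(prefix: str, ids: list[str]) -> str:
    """Resolve a short id prefix to exactly one full id (illustrative)."""
    matches = [i for i in ids if i.startswith(prefix)]
    if len(matches) != 1:
        raise LookupError(f"{prefix!r} matched {len(matches)} ids, need exactly 1")
    return matches[0]
```

Eight hex characters are enough in practice because collisions among a run's handful of ids are vanishingly unlikely; an ambiguous prefix fails rather than guessing.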
Shared scope
Evaluation commands use the standard platform context from Authentication: `--profile`, `--server`, `--api-key`, `--organization`, `--workspace`, `--project`.
When a run feels stuck
```shell
dn evaluation get 9ab81fc1 --json
dn evaluation list-samples 9ab81fc1
dn sandbox list --state running
```

That triangulates whether you’re looking at a control-plane problem, a task failure, or a cleanup-policy surprise. For deeper failure triage, see Security Evaluation Operations.