Running evaluations

Launch, automate, retry, cancel, export, and compare hosted evaluations — from one-off commands to CI pipelines.

Once you’ve run your first evaluation, the next questions are operational: how do I check this into source control, inject secrets, block CI on completion, retry failures, and compare runs? This page is the playbook.

For the exhaustive command and flag list, see dn evaluation.

Keep the evaluation definition in evaluation.yaml when you want it in source control, when the request grows past a readable command line, or when you need per-row inputs.

evaluation.yaml
name: nightly-regression
project: sandbox
task_names:
  - corp-recon
  - local-enum
model: openai/gpt-4.1-mini
secret_ids:
  - 11111111-2222-3333-4444-555555555555
concurrency: 4
cleanup_policy: on_success
Terminal window
dn evaluation create --file evaluation.yaml

Explicit CLI flags override values from the file. Use secret_ids in the manifest for exact source-controlled configuration; use repeatable --secret flags to resolve names against your user-configured secrets at runtime.

--secret injects user-configured secrets into both the runtime sandbox and the task environment sandbox.

Terminal window
# exact name: strict, must exist
dn evaluation create my-eval --task corp-recon --model openai/gpt-4.1-mini \
--secret OPENROUTER_API_KEY
# glob: best-effort, zero matches is allowed
dn evaluation create my-eval --task corp-recon --model openai/gpt-4.1-mini \
--secret 'OPENROUTER_*'
Selector        Behavior
Exact name      Strict — fails fast when the secret isn’t configured.
Glob pattern    Best-effort — silently skips when nothing matches.
Duplicates      De-duplicated before the request is submitted.
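The resolution rules above can be sketched in Python. This is illustrative only — it models the documented semantics, not the CLI's internals, and `configured` stands in for your user-configured secret names:

```python
from fnmatch import fnmatch

def resolve_secrets(selectors: list[str], configured: list[str]) -> list[str]:
    """Model the documented --secret resolution rules.

    Exact names are strict (a missing secret is an error), glob patterns
    are best-effort (zero matches is fine), and the final list is
    de-duplicated while preserving order.
    """
    resolved: list[str] = []
    for sel in selectors:
        if any(ch in sel for ch in "*?["):  # glob selector: best-effort
            resolved.extend(name for name in configured if fnmatch(name, sel))
        elif sel in configured:             # exact selector: strict
            resolved.append(sel)
        else:
            raise KeyError(f"secret {sel!r} is not configured")
    seen: set[str] = set()
    return [n for n in resolved if not (n in seen or seen.add(n))]
```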

Use --wait on create or the standalone wait command to gate CI or scripts on results. Both exit non-zero if the evaluation didn’t complete successfully.

Terminal window
# block at creation time
dn evaluation create my-eval --task corp-recon --model openai/gpt-4.1-mini --wait
# or wait on an existing evaluation
dn evaluation wait 9ab81fc1 --timeout-sec 3600
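In a pipeline you can wrap the wait in a small helper that forwards the exit code unchanged — a sketch that assumes only what the page states, namely that the wait command exits non-zero on anything but success (the command list shown in the docstring is illustrative):

```python
import subprocess

def gate_on_evaluation(cmd: list[str]) -> int:
    """Run a blocking command and return its exit code unchanged.

    With cmd = ["dn", "evaluation", "wait", "9ab81fc1",
    "--timeout-sec", "3600"], a non-zero code means the evaluation
    didn't complete successfully, so CI should fail on it.
    """
    return subprocess.run(cmd).returncode
```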

--cleanup-policy is easy to ignore until compute is left running.

  • always (default) — clean up even when the evaluation fails. Use for clean automation.
  • on_success — failed runs leave sandboxes up for inspection. Use when you need to drop into a failing item. Expect to clean up with dn sandbox after.
Terminal window
# requeue failed, timed-out, cancelled, and errored samples without recreating the evaluation
dn evaluation retry 9ab81fc1
# cancel a running evaluation (terminates active sandboxes)
dn evaluation cancel 9ab81fc1

retry is most useful after a terminal run when you want to requeue only the samples that ended in failed, timed-out, cancelled, or infrastructure-error states.
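As a mental model, the retry selection behaves like filtering samples down to those terminal states. The state strings below are assumptions for illustration — check `dn evaluation list-samples` output for the exact values your server returns:

```python
# Illustrative state names; verify against your server's actual output.
RETRYABLE_STATES = {"failed", "timed_out", "cancelled", "infrastructure_error"}

def retryable_samples(samples: list[dict]) -> list[str]:
    """Return the sample IDs a retry would requeue, per the rules above."""
    return [s["id"] for s in samples if s["state"] in RETRYABLE_STATES]
```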

Terminal window
# export samples as JSONL (optionally include transcripts)
dn evaluation export 9ab81fc1 --format jsonl
# compare two evaluations side by side
dn evaluation compare 9ab81fc1 b2c34de5

Use compare to see how a different model, prompt, or task version performs against the same workload.
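If you'd rather diff runs offline, the JSONL exports are easy to post-process. A minimal sketch — the `passed` field name is an assumption; inspect your own export for the real schema:

```python
import json
from pathlib import Path

def pass_rate(path: Path) -> float:
    """Fraction of exported samples marked passed (field name assumed)."""
    rows = [json.loads(line) for line in path.read_text().splitlines() if line]
    return sum(1 for r in rows if r.get("passed")) / len(rows)

def compare_exports(a: Path, b: Path) -> float:
    """Pass-rate delta between two `dn evaluation export` JSONL files."""
    return pass_rate(b) - pass_rate(a)
```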

Terminal window
dn evaluation get-transcript 9ab81fc1/75e4914f

The transcript is available mid-run — the session link is established as soon as the runtime creates it, before the agent begins streaming. Samples without a linked session return 404 (old evaluations, or runtime session-registration failures); export --transcripts skips those with a warning instead of failing. For the payload shape, see dreadnode.sessions.

Sample references use eval/sample slash syntax (for example 9ab81fc1/75e4914f). Both IDs support prefix matching — the first 8 characters are enough.
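The slash syntax is trivial to handle in scripts — a sketch of splitting a reference into its two prefix-matchable IDs:

```python
def parse_sample_ref(ref: str) -> tuple[str, str]:
    """Split an eval/sample reference like '9ab81fc1/75e4914f'.

    Both halves may be prefixes (the first 8 characters are enough);
    the server does the prefix matching, so nothing beyond shape is
    validated here.
    """
    eval_id, sep, sample_id = ref.partition("/")
    if not sep or not eval_id or not sample_id:
        raise ValueError(f"expected eval/sample slash syntax, got {ref!r}")
    return eval_id, sample_id
```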

Evaluation commands use the standard platform context from Authentication: --profile, --server, --api-key, --organization, --workspace, --project.

Terminal window
dn evaluation get 9ab81fc1 --json
dn evaluation list-samples 9ab81fc1
dn sandbox list --state running

Together, those three views triangulate whether you’re looking at a control-plane problem, a task failure, or a cleanup-policy surprise. For deeper failure triage, see Security Evaluation Operations.