Evaluations

Batch evaluation of agents against security tasks — measure capability, track regressions, and compare models.

$ dn evaluation <command>
create

$ dn evaluation create

Launch an evaluation against one or more security tasks.
Builds the evaluation request from CLI flags, an evaluation.yaml manifest (--file), or both (flags override the manifest). Use --wait to block until the evaluation completes and print a results summary.
Options
<name>, --name — Evaluation name (e.g. my-eval-v3). Optional when set in --file.
--task — Security task to evaluate on (repeatable).
--file — Path to evaluation.yaml request manifest.
--runtime-id — Runtime record ID for tracking; does not select a model.
--model — Model identifier. Required unless --capability provides one.
--capability — Capability to load. Also pass --model if it has no entry-agent model.
--secret — Secret selector to inject into evaluation sandboxes. Repeatable. Exact names are strict; glob selectors are best-effort.
--concurrency — Maximum concurrent evaluation samples.
--task-timeout-sec — Timeout per task in seconds.
--cleanup-policy — Sandbox cleanup policy. [choices: always, on_success]
--wait (default False) — Block until the evaluation reaches a terminal state.
--poll-interval-sec (default 10.0) — Seconds between status polls when --wait is set.
--timeout-sec — Maximum seconds to wait before timing out.
--json (default False) — Output as JSON.
list

Aliases: ls

$ dn evaluation list

Show evaluations in your workspace.
Options
--status — Filter by evaluation status (e.g. running, completed, failed). [choices: queued, running, completed, partial, failed, cancelled]
--project-id — Filter by project ID.
--limit (default 50) — Maximum results to show.
--json (default False) — Output as JSON.
get

$ dn evaluation get <evaluation-id>

Show evaluation configuration, progress, and results.
Displays configuration, current sample progress, and timing. When the evaluation has finished, also shows pass rates, per-task breakdown, and duration percentiles from the analytics snapshot.
Options
<evaluation-id>, --evaluation-id (Required) — The evaluation ID (e.g. 0fe36a23-…).
--json (default False) — Output as JSON.
list-samples

$ dn evaluation list-samples <evaluation-id>

List samples in an evaluation.
Each sample represents one agent run against a security task.
Use --status failed to drill into failures.
Options
<evaluation-id>, --evaluation-id (Required) — The evaluation ID.
--status — Filter by sample status (e.g. passed, failed, timed_out). [choices: queued, claiming, provisioning, agent_running, agent_finished, verifying, passed, failed, timed_out, cancelled, infra_error]
--json (default False) — Output as JSON.
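As a sketch of the drill-into-failures workflow, the samples from --json output can be grouped by the status vocabulary above. The per-sample JSON shape used here (objects with "id" and "status") is an assumption for illustration; only the status names come from the documented choices:

```python
# Sketch: triage samples listed by `dn evaluation list-samples <id> --json`.
# The {"id", "status"} object shape is an assumption; the status names
# come from the documented --status choices.
from collections import Counter

# Terminal sample states per the --status choices above.
TERMINAL = {"passed", "failed", "timed_out", "cancelled", "infra_error"}

def triage(samples):
    """Count samples per status and collect the IDs still in flight."""
    counts = Counter(s["status"] for s in samples)
    in_flight = [s["id"] for s in samples if s["status"] not in TERMINAL]
    return counts, in_flight

samples = [
    {"id": "75e4914f", "status": "passed"},
    {"id": "8c2d01aa", "status": "failed"},
    {"id": "1f9e77b3", "status": "agent_running"},
]
counts, in_flight = triage(samples)
print(dict(counts))   # {'passed': 1, 'failed': 1, 'agent_running': 1}
print(in_flight)      # ['1f9e77b3']
```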
get-sample

$ dn evaluation get-sample <eval/sample>

Show details of a single evaluation sample.
Displays the sample’s lifecycle status, timing breakdown, sandbox IDs, error details, and verification result.
Options
<eval/sample>, --eval/sample (Required) — Sample reference as EVAL_ID/SAMPLE_ID (e.g. 9ab81fc1/75e4914f).
--json (default False) — Output as JSON.
get-transcript

$ dn evaluation get-transcript <eval/sample>

Download the agent conversation transcript for a sample.
Returns the session transcript linked to this evaluation item as raw JSON. The payload is a SessionTranscriptResponse with the following top-level fields:

session: session metadata (id, title, model, agent, project, timestamps)
messages: ordered list of messages, each with id, seq, parent_id, role, content, tool_calls, tool_call_id, metadata, agent, model, created_at, and compacted_at
current_system_prompt: the active system prompt for restore
has_more: pagination flag
Returns 404 if the item has no linked session (old evals, or items where the runtime’s session registration failed). The transcript is available mid-run: the link is established as soon as the runtime creates the session, before the agent begins streaming.
Options
<eval/sample>, --eval/sample (Required) — Sample reference as EVAL_ID/SAMPLE_ID (e.g. 9ab81fc1/75e4914f).
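The transcript payload described above can be walked in a few lines. A minimal sketch based on the field list — the sample payload here is fabricated for illustration, not real CLI output:

```python
# Sketch: reconstruct the conversation from a SessionTranscriptResponse
# payload saved via `dn evaluation get-transcript <eval/sample>`.
# The top-level fields mirror the description above; this payload is
# fabricated for illustration.
import json

payload = json.loads("""
{
  "session": {"id": "s-1", "title": "demo", "model": "m", "agent": "a"},
  "messages": [
    {"id": "m2", "seq": 2, "parent_id": "m1", "role": "assistant",
     "content": "Scanning target...", "tool_calls": []},
    {"id": "m1", "seq": 1, "parent_id": null, "role": "user",
     "content": "Start the task.", "tool_calls": []}
  ],
  "current_system_prompt": "You are a security agent.",
  "has_more": false
}
""")

# Messages are documented as ordered, but sorting by seq is a cheap guard
# against out-of-order storage.
for msg in sorted(payload["messages"], key=lambda m: m["seq"]):
    print(f'{msg["seq"]:>3} {msg["role"]:<10} {msg["content"]}')
```

Note the has_more pagination flag: a False here means the full conversation is in this payload.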
wait

$ dn evaluation wait <evaluation-id>

Block until an evaluation reaches a terminal state.
Polls the evaluation status and exits when it completes, fails, or is cancelled. Exits non-zero if the evaluation did not complete successfully.
Options
<evaluation-id>, --evaluation-id (Required) — The evaluation ID.
--poll-interval-sec (default 10.0) — Seconds between status polls.
--timeout-sec — Maximum seconds to wait before timing out.
--json (default False) — Output as JSON.
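The wait semantics (poll until terminal, fail on timeout) can be sketched as a small loop. Here get_status is a hypothetical stand-in for however you fetch status, e.g. by parsing dn evaluation get --json; the terminal states come from the evaluation status choices listed under list:

```python
# Sketch of the polling loop behind `dn evaluation wait`: poll until a
# terminal state, honoring an optional timeout. `get_status` is a
# hypothetical stand-in for your own status lookup.
import time

# Terminal evaluation states per the documented status choices.
TERMINAL = {"completed", "partial", "failed", "cancelled"}

def wait_for(get_status, poll_interval_sec=10.0, timeout_sec=None):
    start = time.monotonic()
    while True:
        status = get_status()
        if status in TERMINAL:
            return status
        if timeout_sec is not None and time.monotonic() - start >= timeout_sec:
            raise TimeoutError(f"still {status} after {timeout_sec}s")
        time.sleep(poll_interval_sec)

# Stubbed status source: running twice, then completed.
states = iter(["running", "running", "completed"])
print(wait_for(lambda: next(states), poll_interval_sec=0.01))  # completed
```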
cancel

$ dn evaluation cancel <evaluation-id>

Cancel a running evaluation.
Requests cancellation and terminates active sandboxes. Samples that are already in progress will be marked as cancelled.
Options
<evaluation-id>, --evaluation-id (Required) — The evaluation ID.
--yes, -y (default False) — Skip the confirmation prompt.
--json (default False) — Output as JSON.
retry

$ dn evaluation retry <evaluation-id>

Retry failed and errored samples in an evaluation.
Resets samples that ended in failed, timed_out, or infra_error back to queued so they are picked up by workers again.
Options
<evaluation-id>, --evaluation-id (Required) — The evaluation ID.
--json (default False) — Output as JSON.
export

$ dn evaluation export <evaluation-id>

Export evaluation results, samples, and transcripts.

Writes evaluation metadata, per-sample results, and agent transcripts to a directory. Transcripts are included by default; use --no-transcripts to skip them.
Each transcript file is a SessionTranscriptResponse JSON payload — see dn evaluation get-transcript --help for the shape. Samples without a linked session (old evals, or items where the runtime’s session registration failed) are skipped with a warning.
Options
<evaluation-id>, --evaluation-id (Required) — The evaluation ID (full or 8-char prefix).
--output, -o — Output directory (default: ./eval-<short-id>/).
--transcripts, --no-transcripts (default True) — Include agent transcripts.
--status — Only export samples with this status (e.g. failed, timed_out). [choices: queued, claiming, provisioning, agent_running, agent_finished, verifying, passed, failed, timed_out, cancelled, infra_error]
--json (default False) — Dump combined JSON to stdout instead of writing files.
compare

$ dn evaluation compare <eval-a> <eval-b>

Compare two evaluation runs side by side.
Shows pass rate delta, per-task breakdown, duration changes, and error pattern differences between two evaluations.
Options
<eval-a>, --eval-a (Required) — First evaluation ID (baseline).
<eval-b>, --eval-b (Required) — Second evaluation ID (comparison).
--json (default False) — Output as JSON.
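The pass-rate delta that compare reports can also be reproduced by hand from two exported sample lists. A sketch under stated assumptions: the sample dicts are an assumed shape, and "pass rate" here counts passed over terminal outcomes, which may differ from the CLI's exact definition:

```python
# Sketch: a pass-rate delta like the one `dn evaluation compare` shows,
# computed from two lists of sample statuses. The sample shape is an
# assumption; "pass rate" = passed / terminal outcomes.
def pass_rate(samples):
    terminal = [s for s in samples
                if s["status"] in {"passed", "failed", "timed_out", "infra_error"}]
    if not terminal:
        return 0.0
    return sum(s["status"] == "passed" for s in terminal) / len(terminal)

baseline = [{"status": "passed"}, {"status": "failed"}]
comparison = [{"status": "passed"}, {"status": "passed"}]
delta = pass_rate(comparison) - pass_rate(baseline)
print(f"pass rate: {pass_rate(baseline):.0%} -> {pass_rate(comparison):.0%} ({delta:+.0%})")
```

Excluding cancelled and still-running samples from the denominator keeps a partially-cancelled baseline from skewing the comparison.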