Evaluations

Batch evaluation of agents against security tasks.

$ dn evaluation <command>

Batch evaluation of agents against security tasks — measure capability, track regressions, and compare models.

$ dn evaluation create

Launch an evaluation against one or more security tasks.

Builds the evaluation request from CLI flags, an evaluation.yaml manifest (--file), or both (flags override the manifest). Use --wait to block until the evaluation completes and print a results summary.

Options

  • <name>, --name — Evaluation name (e.g. my-eval-v3). Optional when set in --file.
  • --task — Security task to evaluate on (repeatable).
  • --file — Path to evaluation.yaml request manifest.
  • --runtime-id — Runtime record ID for tracking; does not select a model.
  • --model — Model identifier. Required unless --capability provides one.
  • --capability — Capability to load. Also pass --model if it has no entry-agent model.
  • --secret — Secret selector to inject into evaluation sandboxes. Repeatable. Exact names are strict; glob selectors are best-effort.
  • --concurrency — Maximum concurrent evaluation samples.
  • --task-timeout-sec — Timeout per task in seconds.
  • --cleanup-policy — Sandbox cleanup policy. [choices: always, on_success]
  • --wait (default False) — Block until the evaluation reaches a terminal state.
  • --poll-interval-sec (default 10.0) — Seconds between status polls when --wait is set.
  • --timeout-sec — Maximum seconds to wait before timing out.
  • --json (default False) — Output as JSON.
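
The create request above merges two sources, with flags taking precedence over the manifest. A minimal sketch of that precedence rule (not the CLI's actual implementation; the field names are illustrative):

```python
# Illustrative merge: values from evaluation.yaml form the base request,
# and any explicitly-set CLI flag overrides the manifest entry.
def build_request(manifest: dict, flags: dict) -> dict:
    request = dict(manifest)
    for key, value in flags.items():
        if value is not None:  # unset flags leave the manifest value alone
            request[key] = value
    return request

manifest = {"name": "my-eval-v3", "model": "model-a", "concurrency": 4}
flags = {"model": "model-b", "concurrency": None}
print(build_request(manifest, flags))
# {'name': 'my-eval-v3', 'model': 'model-b', 'concurrency': 4}
```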

$ dn evaluation list

Aliases: ls

Show evaluations in your workspace.

Options

  • --status — Filter by evaluation status (e.g. running, completed, failed). [choices: queued, running, completed, partial, failed, cancelled]
  • --project-id — Filter by project ID.
  • --limit (default 50) — Maximum results to show.
  • --json (default False) — Output as JSON.

$ dn evaluation get <evaluation-id>

Show evaluation configuration, progress, and results.

Displays configuration, current sample progress, and timing. When the evaluation has finished, also shows pass rates, per-task breakdown, and duration percentiles from the analytics snapshot.

Options

  • <evaluation-id>, --evaluation-id (Required) — The evaluation ID (e.g. 0fe36a23-…).
  • --json (default False) — Output as JSON.
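
As a sketch of the summary a finished run produces (overall pass rate plus duration percentiles), here is one way to derive those figures from per-sample records. The record shape is an assumption for illustration, not the CLI's JSON schema:

```python
# Hypothetical sample records; field names are assumptions.
import statistics

samples = [
    {"task": "recon-1", "status": "passed", "duration_sec": 41.0},
    {"task": "recon-1", "status": "failed", "duration_sec": 65.0},
    {"task": "pivot-2", "status": "passed", "duration_sec": 38.0},
    {"task": "pivot-2", "status": "passed", "duration_sec": 52.0},
]

pass_rate = sum(s["status"] == "passed" for s in samples) / len(samples)
p50 = statistics.median(s["duration_sec"] for s in samples)
print(f"pass rate {pass_rate:.0%}, p50 {p50:.1f}s")
# pass rate 75%, p50 46.5s
```
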

$ dn evaluation list-samples <evaluation-id>

List samples in an evaluation.

Each sample represents one agent run against a security task. Use --status failed to drill into failures.

Options

  • <evaluation-id>, --evaluation-id (Required) — The evaluation ID.
  • --status — Filter by sample status (e.g. passed, failed, timed_out). [choices: queued, claiming, provisioning, agent_running, agent_finished, verifying, passed, failed, timed_out, cancelled, infra_error]
  • --json (default False) — Output as JSON.

$ dn evaluation get-sample <eval/sample>

Show details of a single evaluation sample.

Displays the sample’s lifecycle status, timing breakdown, sandbox IDs, error details, and verification result.

Options

  • <eval/sample>, --eval/sample (Required) — Sample reference as EVAL_ID/SAMPLE_ID (e.g. 9ab81fc1/75e4914f).
  • --json (default False) — Output as JSON.

$ dn evaluation get-transcript <eval/sample>

Download the agent conversation transcript for a sample.

Returns the session transcript linked to this evaluation item as raw JSON. The payload is a SessionTranscriptResponse with the following top-level fields:

  • session: session metadata (id, title, model, agent, project, timestamps)
  • messages: ordered list of messages, each with id, seq, parent_id, role, content, tool_calls, tool_call_id, metadata, agent, model, created_at, and compacted_at
  • current_system_prompt: the active system prompt for restore
  • has_more: pagination flag

Returns 404 if the item has no linked session (old evals or items where the runtime’s session registration failed). Available mid-run — the link is established as soon as the runtime creates the session, before the agent begins streaming.
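
A minimal sketch of consuming a payload with the fields listed above: order messages by seq and read role and content from each. The example payload here is invented for illustration:

```python
# Stand-in for the raw JSON the command returns; only a few of the
# documented message fields are shown.
import json

payload = json.loads(json.dumps({
    "session": {"id": "9ab81fc1", "title": "demo"},
    "messages": [
        {"id": "m2", "seq": 2, "role": "assistant", "content": "done"},
        {"id": "m1", "seq": 1, "role": "user", "content": "start"},
    ],
    "current_system_prompt": "",
    "has_more": False,
}))

# Messages carry a seq field, so sort before rendering the conversation.
ordered = sorted(payload["messages"], key=lambda m: m["seq"])
lines = [f'{m["role"]}: {m["content"]}' for m in ordered]
print("\n".join(lines))
# user: start
# assistant: done
```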

Options

  • <eval/sample>, --eval/sample (Required) — Sample reference as EVAL_ID/SAMPLE_ID (e.g. 9ab81fc1/75e4914f).

$ dn evaluation wait <evaluation-id>

Block until an evaluation reaches a terminal state.

Polls the evaluation status and exits when it completes, fails, or is cancelled. Exits non-zero if the evaluation did not complete successfully.

Options

  • <evaluation-id>, --evaluation-id (Required) — The evaluation ID.
  • --poll-interval-sec (default 10.0) — Seconds between status polls.
  • --timeout-sec — Maximum seconds to wait before timing out.
  • --json (default False) — Output as JSON.
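
The polling behaviour described above can be sketched as follows. fetch_status is a stand-in for a status lookup, not a real API, and the set of terminal states is an assumption based on the status choices listed for dn evaluation list:

```python
import time

# Assumed terminal states, taken from the documented status choices.
TERMINAL = {"completed", "partial", "failed", "cancelled"}

def wait_for_evaluation(fetch_status, poll_interval_sec=10.0, timeout_sec=None):
    """Poll until the evaluation reaches a terminal state or the wait times out."""
    start = time.monotonic()
    while True:
        status = fetch_status()
        if status in TERMINAL:
            return status
        if timeout_sec is not None and time.monotonic() - start > timeout_sec:
            raise TimeoutError(f"evaluation still {status} after {timeout_sec}s")
        time.sleep(poll_interval_sec)

statuses = iter(["queued", "running", "completed"])
print(wait_for_evaluation(lambda: next(statuses), poll_interval_sec=0.01))
# completed
```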

$ dn evaluation cancel <evaluation-id>

Cancel a running evaluation.

Requests cancellation and terminates active sandboxes. Samples that are already in progress will be marked as cancelled.

Options

  • <evaluation-id>, --evaluation-id (Required) — The evaluation ID.
  • --yes, -y (default False) — Skip the confirmation prompt.
  • --json (default False) — Output as JSON.

$ dn evaluation retry <evaluation-id>

Retry failed and errored samples in an evaluation.

Resets samples that ended in failed, timed_out, or infra_error back to queued so they are picked up by workers again.

Options

  • <evaluation-id>, --evaluation-id (Required) — The evaluation ID.
  • --json (default False) — Output as JSON.
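
The reset rule above, sketched as a filter (sample shape is an assumption for illustration):

```python
# Statuses the retry command resets, per the description above.
RETRYABLE = {"failed", "timed_out", "infra_error"}

def reset_for_retry(samples):
    """Return samples with retryable failures moved back to queued."""
    return [
        {**s, "status": "queued"} if s["status"] in RETRYABLE else s
        for s in samples
    ]

samples = [{"id": "a", "status": "passed"}, {"id": "b", "status": "timed_out"}]
print([s["status"] for s in reset_for_retry(samples)])
# ['passed', 'queued']
```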

$ dn evaluation export <evaluation-id>

Export evaluation results, samples, and transcripts.

Writes evaluation metadata, per-sample results, and agent transcripts to a directory. Transcripts are included by default; use --no-transcripts to skip them.

Each transcript file is a SessionTranscriptResponse JSON payload — see dn evaluation get-transcript --help for the shape. Samples without a linked session (old evals or items where the runtime’s session registration failed) are skipped with a warning.

Options

  • <evaluation-id>, --evaluation-id (Required) — The evaluation ID (full or 8-char prefix).
  • --output, -o — Output directory (default: ./eval-<short-id>/).
  • --transcripts, --no-transcripts (default True) — Include agent transcripts (default: yes).
  • --status — Only export samples with this status (e.g. failed, timed_out). [choices: queued, claiming, provisioning, agent_running, agent_finished, verifying, passed, failed, timed_out, cancelled, infra_error]
  • --json (default False) — Dump combined JSON to stdout instead of writing files.

$ dn evaluation compare <eval-a> <eval-b>

Compare two evaluation runs side by side.

Shows pass rate delta, per-task breakdown, duration changes, and error pattern differences between two evaluations.

Options

  • <eval-a>, --eval-a (Required) — First evaluation ID (baseline).
  • <eval-b>, --eval-b (Required) — Second evaluation ID (comparison).
  • --json (default False) — Output as JSON.
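
The pass-rate delta is the core of the comparison. An illustrative sketch of the per-task delta (input shape is an assumption, not the CLI's JSON schema):

```python
# Per-task sample statuses for a baseline and a comparison run (invented data).
def pass_rate(statuses):
    return sum(s == "passed" for s in statuses) / len(statuses)

baseline = {"t1": ["passed", "failed"], "t2": ["passed", "passed"]}
candidate = {"t1": ["passed", "passed"], "t2": ["failed", "passed"]}

for task in sorted(baseline):
    delta = pass_rate(candidate[task]) - pass_rate(baseline[task])
    print(f"{task}: {delta:+.0%}")
# t1: +50%
# t2: -50%
```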