Evaluations

Batch evaluation of agents against security tasks.

$ dn evaluation <command>

Batch evaluation of agents against security tasks — measure capability, track regressions, and compare models.

$ dn evaluation create

Launch an evaluation against one or more security tasks.

Builds the evaluation request from CLI flags, an evaluation.yaml manifest (--file), or both (flags override the manifest). Use --wait to block until the evaluation completes and print a results summary.

Options

  • <name>, --name — Evaluation name (e.g. my-eval-v3). Optional when set in --file.
  • --task — Security task to evaluate on (repeatable).
  • --file — Path to evaluation.yaml request manifest.
  • --runtime-id — Runtime record ID for tracking; does not select a model.
  • --model — Model identifier. Required unless --capability provides one.
  • --capability — Capability to load. Also pass --model if it has no entry-agent model.
  • --secret — Secret selector to inject into evaluation sandboxes. Repeatable. Exact names are strict; glob selectors are best-effort.
  • --concurrency — Maximum concurrent evaluation samples.
  • --task-timeout-sec — Timeout per task in seconds.
  • --cleanup-policy — Sandbox cleanup policy. [choices: always, on_success]
  • --wait (default False) — Block until the evaluation reaches a terminal state.
  • --poll-interval-sec (default 10.0) — Seconds between status polls when --wait is set.
  • --timeout-sec — Maximum seconds to wait before timing out.
  • --json (default False) — Output as JSON.
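
The create request above merges two sources, with flags taking precedence over the manifest. A minimal sketch of that precedence rule (not the CLI's actual implementation; the field names are illustrative):

```python
# Illustrative merge: values from evaluation.yaml form the base request,
# and any explicitly-set CLI flag overrides the manifest entry.
def build_request(manifest: dict, flags: dict) -> dict:
    request = dict(manifest)
    for key, value in flags.items():
        if value is not None:  # unset flags leave the manifest value alone
            request[key] = value
    return request

manifest = {"name": "my-eval-v3", "model": "model-a", "concurrency": 4}
flags = {"model": "model-b", "concurrency": None}
print(build_request(manifest, flags))
# {'name': 'my-eval-v3', 'model': 'model-b', 'concurrency': 4}
```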

$ dn evaluation list

Aliases: ls

Show evaluations in your workspace.

Options

  • --status — Filter by evaluation status (e.g. running, completed, failed). [choices: queued, running, completed, partial, failed, cancelled]
  • --project-id — Filter by project ID.
  • --limit (default 50) — Maximum results to show.
  • --json (default False) — Output as JSON.

$ dn evaluation get <evaluation-id>

Show evaluation configuration, progress, and results.

Displays configuration, current sample progress, and timing. When the evaluation has finished, also shows pass rates, per-task breakdown, and duration percentiles from the analytics snapshot.

Options

  • <evaluation-id>, --evaluation-id (Required) — The evaluation ID (e.g. 0fe36a23-…).
  • --json (default False) — Output as JSON.
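
As a sketch of the summary a finished run produces (overall pass rate plus duration percentiles), here is one way to derive those figures from per-sample records. The record shape is an assumption for illustration, not the CLI's JSON schema:

```python
# Hypothetical sample records; field names are assumptions.
import statistics

samples = [
    {"task": "recon-1", "status": "passed", "duration_sec": 41.0},
    {"task": "recon-1", "status": "failed", "duration_sec": 65.0},
    {"task": "pivot-2", "status": "passed", "duration_sec": 38.0},
    {"task": "pivot-2", "status": "passed", "duration_sec": 52.0},
]

pass_rate = sum(s["status"] == "passed" for s in samples) / len(samples)
p50 = statistics.median(s["duration_sec"] for s in samples)
print(f"pass rate {pass_rate:.0%}, p50 {p50:.1f}s")
# pass rate 75%, p50 46.5s
```
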

$ dn evaluation list-samples <evaluation-id>

List samples in an evaluation.

Each sample represents one agent run against a security task. Use --status failed to drill into failures.

Options

  • <evaluation-id>, --evaluation-id (Required) — The evaluation ID.
  • --status — Filter by sample status (e.g. passed, failed, timed_out). [choices: queued, claiming, provisioning, agent_running, agent_finished, verifying, passed, failed, timed_out, cancelled, infra_error]
  • --json (default False) — Output as JSON.

$ dn evaluation get-sample <eval/sample>

Show details of a single evaluation sample.

Displays the sample’s lifecycle status, timing breakdown, sandbox IDs, error details, and verification result.

Options

  • <eval/sample>, --eval/sample (Required) — Sample reference as EVAL_ID/SAMPLE_ID (e.g. 9ab81fc1/75e4914f).
  • --json (default False) — Output as JSON.

$ dn evaluation get-transcript <eval/sample>

Download the agent conversation transcript for a sample.

Returns the session transcript linked to this evaluation item as raw JSON. The payload is a SessionTranscriptResponse with the following top-level fields:

  • session: session metadata (id, title, model, agent, project, timestamps)
  • messages: ordered list of messages, each with id, seq, parent_id, role, content, tool_calls, tool_call_id, metadata, agent, model, created_at, and compacted_at
  • current_system_prompt: the active system prompt for restore
  • has_more: pagination flag

Returns 404 if the item has no linked session (old evals or items where the runtime’s session registration failed). Available mid-run — the link is established as soon as the runtime creates the session, before the agent begins streaming.
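
A minimal sketch of consuming a payload with the fields listed above: order messages by seq and read role and content from each. The example payload here is invented for illustration:

```python
# Stand-in for the raw JSON the command returns; only a few of the
# documented message fields are shown.
import json

payload = json.loads(json.dumps({
    "session": {"id": "9ab81fc1", "title": "demo"},
    "messages": [
        {"id": "m2", "seq": 2, "role": "assistant", "content": "done"},
        {"id": "m1", "seq": 1, "role": "user", "content": "start"},
    ],
    "current_system_prompt": "",
    "has_more": False,
}))

# Messages carry a seq field, so sort before rendering the conversation.
ordered = sorted(payload["messages"], key=lambda m: m["seq"])
lines = [f'{m["role"]}: {m["content"]}' for m in ordered]
print("\n".join(lines))
# user: start
# assistant: done
```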

Options

  • <eval/sample>, --eval/sample (Required) — Sample reference as EVAL_ID/SAMPLE_ID (e.g. 9ab81fc1/75e4914f).

$ dn evaluation wait <evaluation-id>

Block until an evaluation reaches a terminal state.

Polls the evaluation status and exits when it completes, fails, or is cancelled. Exits non-zero if the evaluation did not complete successfully.

Options

  • <evaluation-id>, --evaluation-id (Required) — The evaluation ID.
  • --poll-interval-sec (default 10.0) — Seconds between status polls.
  • --timeout-sec — Maximum seconds to wait before timing out.
  • --json (default False) — Output as JSON.
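
The polling behaviour described above can be sketched as follows. fetch_status is a stand-in for a status lookup, not a real API, and the set of terminal states is an assumption based on the status choices listed for dn evaluation list:

```python
import time

# Assumed terminal states, taken from the documented status choices.
TERMINAL = {"completed", "partial", "failed", "cancelled"}

def wait_for_evaluation(fetch_status, poll_interval_sec=10.0, timeout_sec=None):
    """Poll until the evaluation reaches a terminal state or the wait times out."""
    start = time.monotonic()
    while True:
        status = fetch_status()
        if status in TERMINAL:
            return status
        if timeout_sec is not None and time.monotonic() - start > timeout_sec:
            raise TimeoutError(f"evaluation still {status} after {timeout_sec}s")
        time.sleep(poll_interval_sec)

statuses = iter(["queued", "running", "completed"])
print(wait_for_evaluation(lambda: next(statuses), poll_interval_sec=0.01))
# completed
```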

$ dn evaluation cancel <evaluation-id>

Cancel a running evaluation.

Requests cancellation and terminates active sandboxes. Samples that are already in progress will be marked as cancelled.

Options

  • <evaluation-id>, --evaluation-id (Required) — The evaluation ID.
  • --yes, -y (default False) — Skip the confirmation prompt.
  • --json (default False) — Output as JSON.

$ dn evaluation retry <evaluation-id>

Retry failed and errored samples in an evaluation.

Resets samples that ended in failed, timed_out, or infra_error back to queued so they are picked up by workers again.

Options

  • <evaluation-id>, --evaluation-id (Required) — The evaluation ID.
  • --json (default False) — Output as JSON.
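
The reset rule above, sketched as a filter (sample shape is an assumption for illustration):

```python
# Statuses the retry command resets, per the description above.
RETRYABLE = {"failed", "timed_out", "infra_error"}

def reset_for_retry(samples):
    """Return samples with retryable failures moved back to queued."""
    return [
        {**s, "status": "queued"} if s["status"] in RETRYABLE else s
        for s in samples
    ]

samples = [{"id": "a", "status": "passed"}, {"id": "b", "status": "timed_out"}]
print([s["status"] for s in reset_for_retry(samples)])
# ['passed', 'queued']
```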

$ dn evaluation export <evaluation-id>

Export evaluation results, samples, and transcripts.

Writes evaluation metadata, per-sample results, and agent transcripts to a directory. Transcripts are included by default; use --no-transcripts to skip them.

Each transcript file is a SessionTranscriptResponse JSON payload — see dn evaluation get-transcript --help for the shape. Samples without a linked session (old evals or items where the runtime’s session registration failed) are skipped with a warning.

Options

  • <evaluation-id>, --evaluation-id (Required) — The evaluation ID (full or 8-char prefix).
  • --output, -o — Output directory (default: ./eval-<short-id>/).
  • --transcripts, --no-transcripts (default True) — Include agent transcripts (default: yes).
  • --status — Only export samples with this status (e.g. failed, timed_out). [choices: queued, claiming, provisioning, agent_running, agent_finished, verifying, passed, failed, timed_out, cancelled, infra_error]
  • --json (default False) — Dump combined JSON to stdout instead of writing files.

$ dn evaluation compare <eval-a> <eval-b>

Compare two evaluation runs side by side.

Shows pass rate delta, per-task breakdown, duration changes, and error pattern differences between two evaluations.

Options

  • <eval-a>, --eval-a (Required) — First evaluation ID (baseline).
  • <eval-b>, --eval-b (Required) — Second evaluation ID (comparison).
  • --json (default False) — Output as JSON.
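
The pass-rate delta is the core of the comparison. An illustrative sketch of the per-task delta (input shape is an assumption, not the CLI's JSON schema):

```python
# Per-task sample statuses for a baseline and a comparison run (invented data).
def pass_rate(statuses):
    return sum(s == "passed" for s in statuses) / len(statuses)

baseline = {"t1": ["passed", "failed"], "t2": ["passed", "passed"]}
candidate = {"t1": ["passed", "passed"], "t2": ["failed", "passed"]}

for task in sorted(baseline):
    delta = pass_rate(candidate[task]) - pass_rate(baseline[task])
    print(f"{task}: {delta:+.0%}")
# t1: +50%
# t2: -50%
```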