# Runtime and Evaluations
Inspect runtime records in the platform and create, inspect, and retry evaluations from the dn CLI.
This page covers two related but different control-plane surfaces:
- `dn runtime ...` for hosted runtime records
- `dn evaluation ...` for evaluations and their samples
They are related because evaluations often point at a runtime record, but they answer different questions:
- runtime commands answer “what runtime record exists in the workspace?”
- evaluation commands answer “what happened when the platform ran this workload?”
## Runtime records

The `runtime` subcommand is for workspace runtime records, not for starting a local server or talking to a runtime process directly.

```sh
dn runtime list --profile staging --workspace lab
dn runtime create sandbox --profile staging --workspace lab
dn runtime create --key analyst --name "Analyst Runtime" --profile staging --workspace lab
dn runtime start sandbox --profile staging --workspace lab
dn runtime get <runtime-id> --profile staging --workspace lab
```

`dn runtime create` is an idempotent ensure/create call:
- if you pass `<project>` or already have an active project scope, it ensures a runtime in that project
- if no project is resolved, pass `--key` and `--name` and the platform will create or return the runtime in the workspace default project
The call returns the existing runtime instead of failing when the same runtime key already exists. That matters now that a project may have more than one runtime: the list output includes the runtime name and key so each one is identifiable.
`dn runtime create` only ensures the durable runtime record. If you want live compute, use `dn runtime start`.
`dn runtime start` is the one-command path to get a sandbox:

- `dn runtime start <runtime-id>` starts that exact runtime and never creates a different one
- `dn runtime start <project>` starts the only runtime in the project, or creates the first one when the project has none
- if a project has multiple runtimes, pass `--runtime-id` or ensure a specific runtime with `--key` and `--name`
You can also bootstrap a runtime from `runtime.yaml`:

```yaml
key: analyst
name: Analyst Runtime

defaults:
  agent: planner
  model: openai/gpt-5.2

runtime_server:
  env:
    LOG_LEVEL: debug
```

```sh
dn runtime create --file runtime.yaml --profile staging --workspace lab
dn runtime start --file runtime.yaml --profile staging --workspace lab
```

The CLI reads the YAML, resolves any secret selectors, and sends normalized JSON to the API. If the runtime already exists with a different durable config, the ensure/create call fails instead of silently mutating it.
If you want to start a local runtime server, use `dn serve` instead. That is covered in /cli/launch-and-runtime/.
## Evaluation lifecycle

Use `dn evaluation ...` when the platform should run the workload for you and keep the resulting job history.
| Command | What it does |
|---|---|
| `dn evaluation create` | launch a new evaluation |
| `dn evaluation list` | list evaluations in a workspace |
| `dn evaluation get` | inspect one evaluation’s config and results |
| `dn evaluation list-samples` | list individual samples in an evaluation |
| `dn evaluation get-sample` | inspect one sample’s detail and telemetry |
| `dn evaluation get-transcript` | download a sample’s agent transcript |
| `dn evaluation wait` | block until an evaluation finishes |
| `dn evaluation cancel` | cancel a running evaluation |
| `dn evaluation retry` | retry failed and errored samples |
## Before you create one

Make sure you already know four things:
- which task or tasks should run
- which model should execute them
- which secrets should be injected into the evaluation sandboxes
- whether failed runs should keep their sandboxes for debugging
That fourth choice is what `--cleanup-policy` controls, and it is one of the most important evaluation flags in practice.
## Create an evaluation

The shortest useful mental model is:
- create the evaluation
- inspect the top-level record
- inspect the sample list
- inspect a transcript when one sample needs debugging
```sh
dn evaluation create nightly-regression \
  --task corp-recon \
  --task local-enum \
  --runtime-id 11111111-2222-3333-4444-555555555555 \
  --model openai/gpt-4.1-mini \
  --secret OPENROUTER_API_KEY \
  --secret 'OPENROUTER_*' \
  --concurrency 4 \
  --cleanup-policy on_success
```

In that example:
- two tasks will become two evaluation samples under one evaluation
- `--runtime-id` links the run to a runtime record, but does not choose the model by itself
- `--model` is the one reliably required field for public create requests; pass it explicitly even when you also use `--capability`
- `--secret` selects user-configured secrets by environment-variable name or glob pattern
- `--cleanup-policy on_success` keeps failed compute around for inspection
The common create flags are:
| Flag | Meaning |
|---|---|
| `--file <path>` | load request fields from `evaluation.yaml`; explicit CLI flags override file values |
| `--task <name>` | task to run; repeatable |
| `--runtime-id <id>` | runtime record ID for tracking and association |
| `--model <id>` | model identifier; treat it as required |
| `--capability <name>` | capability to load in addition to the explicit model |
| `--secret <selector>` | secret name or glob pattern to inject; repeatable |
| `--concurrency <n>` | max concurrent evaluation samples |
| `--task-timeout-sec <n>` | per-task timeout in seconds |
| `--cleanup-policy <always\|on_success>` | cleanup behavior for task resources |
| `--wait` | block until the evaluation completes and print a results summary |
| `--json` | print raw JSON |
`dn evaluation create` should always be given `--model`. `--runtime-id` alone does not choose the execution model, and `--capability` should be treated as additive runtime context rather than as a replacement for an explicit model choice.
## Secret selectors

Use `--secret` when your evaluation needs user-configured environment variables in the runtime and task sandboxes.
```sh
# exact name: strict, must exist
dn evaluation create nightly-regression \
  --task corp-recon \
  --model openrouter/qwen/qwen3-coder-next \
  --secret OPENROUTER_API_KEY

# glob: best-effort, zero matches is allowed
dn evaluation create nightly-regression \
  --task corp-recon \
  --model openrouter/qwen/qwen3-coder-next \
  --secret 'OPENROUTER_*'
```

The rule is:
- exact selectors like `OPENROUTER_API_KEY` fail fast if the secret is not configured
- glob selectors like `OPENROUTER_*` are best-effort and silently skip when nothing matches
- repeated selectors are de-duplicated before the CLI submits the evaluation request
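Those three rules can be sketched in a few lines. This is an illustrative model of the selector handling, not the CLI's real code; `resolve_secrets` and its signature are invented for this example:

```python
from fnmatch import fnmatchcase


def resolve_secrets(selectors: list[str], configured: list[str]) -> list[str]:
    """Resolve --secret selectors against configured secret names.

    - exact selectors must match a configured secret, or it's an error
    - glob selectors are best-effort and may match nothing
    - results are de-duplicated, preserving first-seen order
    """
    resolved: list[str] = []
    for sel in selectors:
        if any(ch in sel for ch in "*?["):
            # glob: keep whatever matches, including nothing
            matches = [name for name in configured if fnmatchcase(name, sel)]
        else:
            # exact: fail fast on a missing secret
            if sel not in configured:
                raise KeyError(f"secret {sel!r} is not configured")
            matches = [sel]
        for name in matches:
            if name not in resolved:
                resolved.append(name)
    return resolved
```

With both `OPENROUTER_API_KEY` and `OPENROUTER_*` passed, the exact match and the glob overlap, and de-duplication ensures the key is injected only once.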
## Create from a file

Use `--file` when the evaluation definition should live in source control or when the request is too large to keep readable on one shell line.

You can define the request in `evaluation.yaml`:
```yaml
name: nightly-regression
project: sandbox
task_names:
  - corp-recon
  - local-enum
model: openai/gpt-4.1-mini
secret_ids:
  - 11111111-2222-3333-4444-555555555555
concurrency: 4
cleanup_policy: on_success
```

```sh
dn evaluation create --file evaluation.yaml
dn --project sandbox evaluation create nightly-regression --task corp-recon --model openai/gpt-4.1-mini
```

The second command shows the override rule: explicit CLI flags still win over values loaded from the file.
Use `secret_ids` in the manifest when you want exact control from source-controlled configuration. Use repeatable `--secret` flags when you want the CLI to resolve names against your configured user secrets at runtime.
## Dataset-backed manifests

If you want hosted dataset rows, define them in `evaluation.yaml`. The CLI does not expose row-data flags directly.
```yaml
name: mixed-regression
project: sandbox
model: openai/gpt-4.1-mini
dataset:
  rows:
    - task_name: corp-recon
      tenant: acme
    - task_name: local-enum
      tenant: bravo
cleanup_policy: always
```

Two rules matter:
- every dataset row must include `task_name`
- if `task_names` and `dataset` are both present, the current service uses `task_names`
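The two rules above can be sketched as a small expansion function. This is illustrative only; `effective_samples` is not a real CLI or service function, just a model of the documented precedence:

```python
def effective_samples(manifest: dict) -> list[dict]:
    """Expand a manifest into the sample rows the service would run.

    `task_names` wins when both it and `dataset` are present; otherwise
    every dataset row must carry its own `task_name`.
    """
    if manifest.get("task_names"):
        return [{"task_name": t} for t in manifest["task_names"]]
    rows = manifest.get("dataset", {}).get("rows", [])
    for row in rows:
        if "task_name" not in row:
            raise ValueError("every dataset row must include task_name")
    return rows
```

The practical consequence: if you add a `dataset` block to a manifest that still has `task_names`, the dataset rows are silently ignored, so remove `task_names` when switching to dataset-backed runs.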
## Inspect results

Once the evaluation exists, drill down in layers:

```sh
# find your evaluation
dn evaluation list --status running

# overview: config, progress, pass rates, duration percentiles
dn evaluation get 9ab81fc1

# which samples failed?
dn evaluation list-samples 9ab81fc1 --status failed

# drill into one sample's lifecycle, timing, and telemetry
dn evaluation get-sample 9ab81fc1/75e4914f

# read the full agent conversation
dn evaluation get-transcript 9ab81fc1/75e4914f

# operational controls
dn evaluation cancel 9ab81fc1
dn evaluation retry 9ab81fc1
```

The natural flow is:
- `list` finds the evaluation you care about
- `get` tells you overall status, configuration, and aggregate results
- `list-samples` tells you which samples passed, failed, or are still running
- `get-sample` gives you the lifecycle breakdown and agent telemetry for one sample
- `get-transcript` is the debugging surface when you need the full agent conversation
- `retry` requeues failed and errored samples without recreating the evaluation
Sample references use `eval/sample` slash syntax, for example `9ab81fc1/75e4914f`. Both IDs support prefix matching, so you only need the first 8 characters.
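The reference handling can be sketched as two tiny helpers. Both are hypothetical models of the documented behavior, not the CLI's actual code; the function names are invented for this example:

```python
def split_sample_ref(ref: str) -> tuple[str, str]:
    """Split the eval/sample slash syntax into its two ID parts."""
    eval_id, _, sample_id = ref.partition("/")
    if not eval_id or not sample_id:
        raise ValueError("expected <eval-id>/<sample-id>")
    return eval_id, sample_id


def resolve_ref(prefix: str, known_ids: list[str]) -> str:
    """Resolve one ID by unambiguous prefix, e.g. the first 8 characters."""
    matches = [k for k in known_ids if k.startswith(prefix)]
    if len(matches) != 1:
        raise LookupError(f"{prefix!r} matched {len(matches)} ids, need exactly 1")
    return matches[0]
```

Prefix resolution has to be unambiguous: if two evaluations share the first 8 characters, a longer prefix is needed.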
## Transcript payload shape

`get-transcript` returns a `SessionTranscriptResponse`, the same shape the platform sessions API serves. The top-level payload is:

```json
{
  "session": { "id": "...", "model": "...", "message_count": 12, "..." },
  "messages": [
    { "id": "...", "seq": 0, "role": "user", "content": "...", "tool_calls": null, "..." },
    { "id": "...", "seq": 1, "role": "assistant", "content": "...", "tool_calls": [...], "..." }
  ],
  "current_system_prompt": "...",
  "has_more": false
}
```

Each message includes `id`, `seq`, `parent_id`, `role`, `content`, `tool_calls`, `tool_call_id`, `metadata`, and timestamps. The transcript is available mid-run: the link to the session is established as soon as the runtime creates it, before the agent begins streaming.
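A consumer of that payload can reconstruct the conversation by ordering on `seq`. A minimal sketch, assuming you already hold the decoded JSON as a Python dict (`render_transcript` is an illustrative helper, not part of any SDK):

```python
def render_transcript(payload: dict) -> list[str]:
    """Turn a SessionTranscriptResponse-shaped dict into readable lines.

    Messages are ordered by `seq`; `tool_calls` may be null in the payload,
    so it is normalized to an empty list before counting.
    """
    lines = []
    for msg in sorted(payload.get("messages", []), key=lambda m: m["seq"]):
        tools = msg.get("tool_calls") or []
        suffix = f" [{len(tools)} tool call(s)]" if tools else ""
        lines.append(f"{msg['seq']:>3} {msg['role']}: {msg.get('content') or ''}{suffix}")
    return lines
```

Sorting on `seq` rather than trusting array order is deliberate: mid-run transcripts may grow between fetches, and `seq` is the stable ordering key.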
Samples without a linked session return 404 (old evaluations, or items where the runtime’s session registration failed). `export --transcripts` skips those items with a warning instead of failing the export.
## Cleanup policy matters

`--cleanup-policy` is easy to ignore until compute is left running.

- `always` means clean up even when the evaluation fails
- `on_success` means failed runs can leave sandboxes behind for inspection
If you choose `on_success`, expect to use `dn sandbox ...` sometimes.
This is one of the most useful operational distinctions in the CLI:
- choose `always` when you want clean automation
- choose `on_success` when failed runs are valuable to inspect
## Shared scope

These commands use the standard platform context from /cli/authentication-and-profiles/:

- `--profile`
- `--server`
- `--api-key`
- `--organization`
- `--workspace`
- `--project`
## Blocking on completion

Use `--wait` on create or the standalone `wait` command to block until the evaluation finishes. This is useful for CI pipelines or scripts that need to gate on evaluation results.

```sh
# block at creation time
dn evaluation create nightly-regression --task corp-recon --model openai/gpt-4.1-mini --wait

# or wait on an existing evaluation
dn evaluation wait 9ab81fc1 --timeout-sec 3600
```

Both exit non-zero if the evaluation did not complete successfully.
## When an evaluation feels stuck

If the evaluation record and the underlying compute seem out of sync, inspect both surfaces:

```sh
dn evaluation get 9ab81fc1 --json
dn evaluation list-samples 9ab81fc1
dn sandbox list --state running
```

That usually tells you whether you are looking at a control-plane problem, a task failure, or a cleanup-policy surprise.