Runtime and Evaluations

Inspect runtime records in the platform and create, inspect, and retry evaluations from the dn CLI.

This page covers two related but different control-plane surfaces:

  • dn runtime ... for hosted runtime records
  • dn evaluation ... for evaluations and their samples

They are related because evaluations often point at a runtime record, but they answer different questions:

  • runtime commands answer “what runtime record exists in the workspace?”
  • evaluation commands answer “what happened when the platform ran this workload?”

The runtime subcommand is for workspace runtime records, not for starting a local server or talking to a runtime process directly.

```sh
dn runtime list --profile staging --workspace lab
dn runtime create sandbox --profile staging --workspace lab
dn runtime create --key analyst --name "Analyst Runtime" --profile staging --workspace lab
dn runtime start sandbox --profile staging --workspace lab
dn runtime get <runtime-id> --profile staging --workspace lab
```

dn runtime create is an idempotent ensure/create call:

  • if you pass <project> or already have an active project scope, it ensures a runtime in that project
  • if no project is resolved, pass --key and --name and the platform will create or return the runtime in the workspace default project

The call returns the existing runtime instead of failing when the same runtime key already exists. That matters now that a project may have more than one runtime: the list output includes the runtime name and key so each one is identifiable.

dn runtime create only ensures the durable runtime record. If you want live compute, use dn runtime start.

dn runtime start is the one-command path to get a sandbox:

  • dn runtime start <runtime-id> starts that exact runtime and never creates a different one
  • dn runtime start <project> starts the only runtime in the project, or creates the first one when the project has none
  • if a project has multiple runtimes, pass --runtime-id or ensure a specific runtime with --key and --name
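
The resolution rules above can be summarized as a small decision function. This is an illustration of the documented behavior, not the CLI's actual implementation; the `project_runtimes` map and the return shape are assumptions made for the sketch:

```python
def resolve_start_target(runtime_id=None, key=None, project_runtimes=None):
    """Pick what `dn runtime start` should do, per the rules above.

    project_runtimes maps runtime key -> runtime id for the resolved
    project (an assumed shape for this sketch).
    """
    project_runtimes = project_runtimes or {}
    if runtime_id:                        # exact id: start it, never create
        return ("start", runtime_id)
    if key:                               # ensure a specific runtime by key
        return ("ensure", key)
    if len(project_runtimes) == 1:        # sole runtime in the project
        return ("start", next(iter(project_runtimes.values())))
    if not project_runtimes:              # empty project: create the first
        return ("create", None)
    raise ValueError("multiple runtimes; pass --runtime-id or --key/--name")

print(resolve_start_target(project_runtimes={"sandbox": "rt-1"}))
# → ('start', 'rt-1')
```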

You can also bootstrap a runtime from runtime.yaml:

```yaml
key: analyst
name: Analyst Runtime
defaults:
  agent: planner
  model: openai/gpt-5.2
runtime_server:
  env:
    LOG_LEVEL: debug
```

```sh
dn runtime create --file runtime.yaml --profile staging --workspace lab
dn runtime start --file runtime.yaml --profile staging --workspace lab
```

The CLI reads YAML, resolves any secret selectors, and sends normalized JSON to the API. If the runtime already exists with a different durable config, the ensure/create call fails instead of silently mutating it.
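
A sketch of that ensure/create contract, under the assumption that the durable config is a flat mapping of fields (field names and shapes here are illustrative, not the API's actual schema):

```python
def ensure_runtime(existing, requested):
    """Create the runtime, return it unchanged if the durable config
    matches, or fail if the same key exists with a different config."""
    if existing is None:
        return {"created": True, **requested}
    durable = {k: existing.get(k) for k in requested}
    if durable != requested:
        raise RuntimeError("runtime exists with a different durable config")
    return {"created": False, **existing}

requested = {"key": "analyst", "name": "Analyst Runtime"}
print(ensure_runtime(None, requested)["created"])            # → True
print(ensure_runtime({"key": "analyst", "name": "Analyst Runtime",
                      "id": "rt-1"}, requested)["created"])  # → False
```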

If you want to start a local runtime server, use dn serve instead. That is covered in /cli/launch-and-runtime/.

Use dn evaluation ... when the platform should run the workload for you and keep the resulting job history.

| Command | What it does |
| --- | --- |
| `dn evaluation create` | launch a new evaluation |
| `dn evaluation list` | list evaluations in a workspace |
| `dn evaluation get` | inspect one evaluation’s config & results |
| `dn evaluation list-samples` | list individual samples in an evaluation |
| `dn evaluation get-sample` | inspect one sample’s detail & telemetry |
| `dn evaluation get-transcript` | download a sample’s agent transcript |
| `dn evaluation wait` | block until an evaluation finishes |
| `dn evaluation cancel` | cancel a running evaluation |
| `dn evaluation retry` | retry failed and errored samples |

Before you create an evaluation, make sure you already know four things:

  1. which task or tasks should run
  2. which model should execute them
  3. which secrets should be injected into the evaluation sandboxes
  4. whether failed runs should keep their sandboxes for debugging

That fourth choice is what --cleanup-policy controls, and it is one of the most important evaluation flags in practice.

The shortest useful mental model is:

  1. create the evaluation
  2. inspect the top-level record
  3. inspect the sample list
  4. inspect a transcript when one sample needs debugging
```sh
dn evaluation create nightly-regression \
  --task corp-recon \
  --task local-enum \
  --runtime-id 11111111-2222-3333-4444-555555555555 \
  --model openai/gpt-4.1-mini \
  --secret OPENROUTER_API_KEY \
  --secret 'OPENROUTER_*' \
  --concurrency 4 \
  --cleanup-policy on_success
```

In that example:

  • two tasks will become two evaluation samples under one evaluation
  • --runtime-id links the run to a runtime record, but does not choose the model by itself
  • --model is the reliable required field for public create requests; pass it explicitly even when you also use --capability
  • --secret selects user-configured secrets by environment-variable name or glob pattern
  • --cleanup-policy on_success keeps failed compute around for inspection

The common create flags are:

| Flag | Meaning |
| --- | --- |
| `--file <path>` | load request fields from `evaluation.yaml`; explicit CLI flags override file values |
| `--task <name>` | task to run; repeatable |
| `--runtime-id <id>` | runtime record ID for tracking and association |
| `--model <id>` | model identifier; treat it as required |
| `--capability <name>` | capability to load in addition to the explicit model |
| `--secret <selector>` | secret name or glob pattern to inject; repeatable |
| `--concurrency <n>` | max concurrent evaluation samples |
| `--task-timeout-sec <n>` | per-task timeout |
| `--cleanup-policy <always\|on_success>` | cleanup behavior for task resources |
| `--wait` | block until the evaluation completes and print a results summary |
| `--json` | print raw JSON |

Always pass --model to dn evaluation create: --runtime-id alone does not choose the execution model, and --capability is additive runtime context, not a replacement for an explicit model choice.

Use --secret when your evaluation needs user-configured environment variables in the runtime and task sandboxes.

```sh
# exact name: strict, must exist
dn evaluation create nightly-regression \
  --task corp-recon \
  --model openrouter/qwen/qwen3-coder-next \
  --secret OPENROUTER_API_KEY

# glob: best-effort, zero matches is allowed
dn evaluation create nightly-regression \
  --task corp-recon \
  --model openrouter/qwen/qwen3-coder-next \
  --secret 'OPENROUTER_*'
```

The rule is:

  • exact selectors like OPENROUTER_API_KEY fail fast if the secret is not configured
  • glob selectors like OPENROUTER_* are best-effort and silently skip when nothing matches
  • repeated selectors are de-duplicated before the CLI submits the evaluation request
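
The three rules fit in a short resolution function. This is a sketch of the documented behavior, not the CLI's source; the configured-secret list is an assumed input:

```python
import fnmatch

def resolve_selectors(selectors, configured):
    """Resolve --secret selectors against configured secret names.

    Exact names must exist; glob patterns are best-effort; the result
    is de-duplicated while preserving first-seen order.
    """
    resolved = []
    for sel in selectors:
        if any(ch in sel for ch in "*?["):          # glob: best-effort
            resolved.extend(n for n in configured if fnmatch.fnmatch(n, sel))
        elif sel in configured:                      # exact: strict
            resolved.append(sel)
        else:
            raise KeyError(f"secret not configured: {sel}")
    return list(dict.fromkeys(resolved))             # de-duplicate, keep order

configured = ["OPENROUTER_API_KEY", "OPENROUTER_BASE_URL", "GH_TOKEN"]
print(resolve_selectors(["OPENROUTER_API_KEY", "OPENROUTER_*"], configured))
# → ['OPENROUTER_API_KEY', 'OPENROUTER_BASE_URL']
```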

Use --file when the evaluation definition should live in source control or when the request is too large to keep readable on one shell line.

You can define the request in evaluation.yaml:

```yaml
name: nightly-regression
project: sandbox
task_names:
  - corp-recon
  - local-enum
model: openai/gpt-4.1-mini
secret_ids:
  - 11111111-2222-3333-4444-555555555555
concurrency: 4
cleanup_policy: on_success
```

```sh
dn evaluation create --file evaluation.yaml
dn --project sandbox evaluation create nightly-regression --task corp-recon --model openai/gpt-4.1-mini
```

The second command shows the override rule: explicit CLI flags still win over values loaded from the file.
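
That precedence reduces to a simple merge. This is an illustration of the rule, not the CLI's internals; treating "flag not passed" as `None` is an assumption of the sketch:

```python
def merge_request(file_values, flag_values):
    """Build the final create request: start from --file values,
    then let explicitly passed CLI flags win."""
    merged = dict(file_values)
    merged.update({k: v for k, v in flag_values.items() if v is not None})
    return merged

file_values = {"name": "nightly-regression", "model": "openai/gpt-4.1-mini",
               "concurrency": 4}
flags = {"model": "openai/gpt-5.2", "concurrency": None}  # only --model passed
print(merge_request(file_values, flags))
# → {'name': 'nightly-regression', 'model': 'openai/gpt-5.2', 'concurrency': 4}
```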

Use secret_ids in the manifest when you want exact control from source-controlled configuration. Use repeatable --secret flags when you want the CLI to resolve names against your configured user secrets at runtime.

If you want hosted dataset rows, define them in evaluation.yaml. The CLI does not expose row data flags directly.

```yaml
name: mixed-regression
project: sandbox
model: openai/gpt-4.1-mini
dataset:
  rows:
    - task_name: [email protected]
      tenant: acme
    - task_name: [email protected]
      tenant: bravo
cleanup_policy: always
```

Two rules matter:

  • every dataset row must include task_name
  • if task_names and dataset are both present, the current service uses task_names
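
Both rules fit in a few lines. This is a validation sketch of the documented behavior, not service code:

```python
def resolve_rows(request):
    """task_names wins when both task_names and dataset are present;
    otherwise every dataset row must carry task_name."""
    if request.get("task_names"):
        return [{"task_name": n} for n in request["task_names"]]
    rows = request.get("dataset", {}).get("rows", [])
    for i, row in enumerate(rows):
        if "task_name" not in row:
            raise ValueError(f"dataset.rows[{i}] is missing task_name")
    return rows

both = {"task_names": ["corp-recon"],
        "dataset": {"rows": [{"task_name": "other", "tenant": "acme"}]}}
print(resolve_rows(both))
# → [{'task_name': 'corp-recon'}]
```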

Once the evaluation exists, drill down in layers:

```sh
# find your evaluation
dn evaluation list --status running

# overview: config, progress, pass rates, duration percentiles
dn evaluation get 9ab81fc1

# which samples failed?
dn evaluation list-samples 9ab81fc1 --status failed

# drill into one sample's lifecycle, timing, and telemetry
dn evaluation get-sample 9ab81fc1/75e4914f

# read the full agent conversation
dn evaluation get-transcript 9ab81fc1/75e4914f

# operational controls
dn evaluation cancel 9ab81fc1
dn evaluation retry 9ab81fc1
```

The natural flow is:

  1. list finds the evaluation you care about
  2. get tells you overall status, configuration, and aggregate results
  3. list-samples tells you which samples passed, failed, or are still running
  4. get-sample gives you the lifecycle breakdown and agent telemetry for one sample
  5. get-transcript is the debugging surface when you need the full agent conversation
  6. retry requeues failed and errored samples without recreating the evaluation

Sample references use eval/sample slash syntax — for example 9ab81fc1/75e4914f. Both IDs support prefix matching, so you only need the first 8 characters.
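
The slash syntax and prefix matching can be illustrated like this. The resolution code and the full IDs are hypothetical; only the behavior (split on `/`, unique-prefix match on each half) comes from the rules above:

```python
def resolve_prefix(ref, ids):
    """Resolve a possibly abbreviated ID against known full IDs."""
    matches = [full for full in ids if full.startswith(ref)]
    if not matches:
        raise LookupError(f"no ID starts with {ref!r}")
    if len(matches) > 1:
        raise LookupError(f"ambiguous prefix {ref!r}: {matches}")
    return matches[0]

def resolve_sample_ref(ref, evals):
    """Split an eval/sample reference and prefix-match both halves.

    evals maps full evaluation IDs to their full sample IDs.
    """
    eval_part, sample_part = ref.split("/", 1)
    eval_id = resolve_prefix(eval_part, list(evals))
    sample_id = resolve_prefix(sample_part, evals[eval_id])
    return eval_id, sample_id

evals = {"9ab81fc1-aaaa-bbbb-cccc-000000000000":
         ["75e4914f-dddd-eeee-ffff-111111111111"]}
print(resolve_sample_ref("9ab81fc1/75e4914f", evals))
```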

get-transcript returns a SessionTranscriptResponse — the same shape the platform sessions API serves. The top-level payload is:

```json
{
  "session": { "id": "...", "model": "...", "message_count": 12, "..." },
  "messages": [
    { "id": "...", "seq": 0, "role": "user", "content": "...", "tool_calls": null, "..." },
    { "id": "...", "seq": 1, "role": "assistant", "content": "...", "tool_calls": [...], "..." }
  ],
  "current_system_prompt": "...",
  "has_more": false
}
```

Each message includes id, seq, parent_id, role, content, tool_calls, tool_call_id, metadata, and timestamps. The transcript is available mid-run — the link to the session is established as soon as the runtime creates it, before the agent begins streaming.
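
A short consumer of that shape might look like this. The payload values and the `run_nmap` tool name are made up for the example; only the field names follow the description above:

```python
import json

# Minimal transcript payload in the SessionTranscriptResponse shape
# described above (all values are illustrative).
payload = json.loads("""
{
  "session": {"id": "sess-1", "model": "openai/gpt-4.1-mini", "message_count": 3},
  "messages": [
    {"id": "m2", "seq": 2, "parent_id": "m1", "role": "tool", "content": "ok",
     "tool_calls": null, "tool_call_id": "c1"},
    {"id": "m0", "seq": 0, "parent_id": null, "role": "user",
     "content": "scan the host", "tool_calls": null, "tool_call_id": null},
    {"id": "m1", "seq": 1, "parent_id": "m0", "role": "assistant", "content": "",
     "tool_calls": [{"id": "c1", "name": "run_nmap"}], "tool_call_id": null}
  ],
  "current_system_prompt": "...",
  "has_more": false
}
""")

# Messages may arrive in any order; seq gives the linear conversation order.
for msg in sorted(payload["messages"], key=lambda m: m["seq"]):
    calls = ", ".join(c["name"] for c in msg["tool_calls"] or [])
    print(f'{msg["seq"]:>3} {msg["role"]:<9} {calls or msg["content"]}')
```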

Samples without a linked session return 404 (old evaluations, or items where the runtime’s session registration failed). export --transcripts skips those items with a warning instead of failing the export.

--cleanup-policy is easy to ignore until compute is left running.

  • always means clean up even when the evaluation fails
  • on_success means failed runs can leave sandboxes behind for inspection

If you choose on_success, expect to use dn sandbox ... sometimes.

This is one of the most useful operational distinctions in the CLI:

  • choose always when you want clean automation
  • choose on_success when failed runs are valuable to inspect
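
The whole policy reduces to one predicate (an illustrative restatement of the two bullets above):

```python
def should_cleanup(policy, sample_passed):
    """Tear down task compute? `always` always does; `on_success`
    keeps failed sandboxes around for inspection."""
    if policy == "always":
        return True
    if policy == "on_success":
        return sample_passed
    raise ValueError(f"unknown cleanup policy: {policy}")

print(should_cleanup("on_success", sample_passed=False))  # → False: sandbox kept
```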

These commands use the standard platform context from /cli/authentication-and-profiles/:

  • --profile
  • --server
  • --api-key
  • --organization
  • --workspace
  • --project

Use --wait on create or the standalone wait command to block until the evaluation finishes. This is useful for CI pipelines or scripts that need to gate on evaluation results.

```sh
# block at creation time
dn evaluation create nightly-regression --task corp-recon --model openai/gpt-4.1-mini --wait

# or wait on an existing evaluation
dn evaluation wait 9ab81fc1 --timeout-sec 3600
```

Both exit non-zero if the evaluation did not complete successfully.
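
A script that cannot simply block on the CLI can approximate the same gate by polling. This is a sketch: the status names and poll cadence are assumptions, and a real script would obtain the status from `dn evaluation get --json` rather than an injected callable:

```python
import time

def wait_for_evaluation(fetch_status, timeout_sec=3600, poll_sec=15):
    """Poll an evaluation until it reaches a terminal status.

    fetch_status is any callable returning the current status string.
    Returns True only when the evaluation completed successfully.
    """
    deadline = time.monotonic() + timeout_sec
    while time.monotonic() < deadline:
        status = fetch_status()
        if status == "completed":
            return True
        if status in ("failed", "cancelled", "errored"):
            return False
        time.sleep(poll_sec)
    return False  # timed out

statuses = iter(["running", "running", "completed"])
print(wait_for_evaluation(lambda: next(statuses), poll_sec=0))
# → True
```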

If the evaluation record and the underlying compute seem out of sync, inspect both surfaces:

Terminal window
dn evaluation get 9ab81fc1 --json
dn evaluation list-samples 9ab81fc1
dn sandbox list --state running

That usually tells you whether you are looking at a control-plane problem, a task failure, or a cleanup-policy surprise.