Runtime and Evaluations

Inspect runtime records in the platform and create, inspect, and retry evaluations from the dn CLI.

This page covers two related but different control-plane surfaces:

  • dn runtime ... for hosted runtime records
  • dn evaluation ... for evaluations and their samples

They are related because evaluations often point at a runtime record, but they answer different questions:

  • runtime commands answer “what runtime record exists in the workspace?”
  • evaluation commands answer “what happened when the platform ran this workload?”

The runtime subcommand is for workspace runtime records, not for starting a local server or talking to a runtime process directly.

```sh
dn runtime list --profile staging --workspace lab
dn runtime create sandbox --profile staging --workspace lab
dn runtime create --key analyst --name "Analyst Runtime" --profile staging --workspace lab
dn runtime start sandbox --profile staging --workspace lab
dn runtime get <runtime-id> --profile staging --workspace lab
```

dn runtime create is an idempotent ensure/create call:

  • if you pass <project> or already have an active project scope, it ensures a runtime in that project
  • if no project is resolved, pass --key and --name and the platform will create or return the runtime in the workspace default project

The call returns the existing runtime instead of failing when the same runtime key already exists. That matters now that a project may have more than one runtime: the list output includes the runtime name and key so each one is identifiable.

dn runtime create only ensures the durable runtime record. If you want live compute, use dn runtime start.

dn runtime start is the one-command path to get a sandbox:

  • dn runtime start <runtime-id> starts that exact runtime and never creates a different one
  • dn runtime start <project> starts the only runtime in the project, or creates the first one when the project has none
  • if a project has multiple runtimes, pass --runtime-id or ensure a specific runtime with --key and --name
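
The resolution rules above can be summarized as a small decision function. This is an illustration of the documented behavior, not the CLI's actual implementation; the `project_runtimes` map and the return shape are assumptions made for the sketch:

```python
def resolve_start_target(runtime_id=None, key=None, project_runtimes=None):
    """Pick what `dn runtime start` should do, per the rules above.

    project_runtimes maps runtime key -> runtime id for the resolved
    project (an assumed shape for this sketch).
    """
    project_runtimes = project_runtimes or {}
    if runtime_id:                        # exact id: start it, never create
        return ("start", runtime_id)
    if key:                               # ensure a specific runtime by key
        return ("ensure", key)
    if len(project_runtimes) == 1:        # sole runtime in the project
        return ("start", next(iter(project_runtimes.values())))
    if not project_runtimes:              # empty project: create the first
        return ("create", None)
    raise ValueError("multiple runtimes; pass --runtime-id or --key/--name")

print(resolve_start_target(project_runtimes={"sandbox": "rt-1"}))
# → ('start', 'rt-1')
```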

You can also bootstrap a runtime from runtime.yaml:

```yaml
key: analyst
name: Analyst Runtime
defaults:
  agent: planner
  model: openai/gpt-5.2
runtime_server:
  env:
    LOG_LEVEL: debug
```

```sh
dn runtime create --file runtime.yaml --profile staging --workspace lab
dn runtime start --file runtime.yaml --profile staging --workspace lab
```

The CLI reads YAML, resolves any secret selectors, and sends normalized JSON to the API. If the runtime already exists with a different durable config, the ensure/create call fails instead of silently mutating it.
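
A sketch of that ensure/create contract, under the assumption that the durable config is a flat mapping of fields (field names and shapes here are illustrative, not the API's actual schema):

```python
def ensure_runtime(existing, requested):
    """Create the runtime, return it unchanged if the durable config
    matches, or fail if the same key exists with a different config."""
    if existing is None:
        return {"created": True, **requested}
    durable = {k: existing.get(k) for k in requested}
    if durable != requested:
        raise RuntimeError("runtime exists with a different durable config")
    return {"created": False, **existing}

requested = {"key": "analyst", "name": "Analyst Runtime"}
print(ensure_runtime(None, requested)["created"])            # → True
print(ensure_runtime({"key": "analyst", "name": "Analyst Runtime",
                      "id": "rt-1"}, requested)["created"])  # → False
```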

If you want to start a local runtime server, use dn serve instead. That is covered in /cli/launch-and-runtime/.

Use dn evaluation ... when the platform should run the workload for you and keep the resulting job history.

| Command | What it does |
| --- | --- |
| `dn evaluation create` | launch a new evaluation |
| `dn evaluation list` | list evaluations in a workspace |
| `dn evaluation get` | inspect one evaluation’s config & results |
| `dn evaluation list-samples` | list individual samples in an evaluation |
| `dn evaluation get-sample` | inspect one sample’s detail & telemetry |
| `dn evaluation get-transcript` | download a sample’s agent transcript |
| `dn evaluation wait` | block until an evaluation finishes |
| `dn evaluation cancel` | cancel a running evaluation |
| `dn evaluation retry` | retry failed and errored samples |

Before you create an evaluation, make sure you already know four things:

  1. which task or tasks should run
  2. which model should execute them
  3. which secrets should be injected into the evaluation sandboxes
  4. whether failed runs should keep their sandboxes for debugging

That fourth choice is what --cleanup-policy controls, and it is one of the most important evaluation flags in practice.

The shortest useful mental model is:

  1. create the evaluation
  2. inspect the top-level record
  3. inspect the sample list
  4. inspect a transcript when one sample needs debugging
```sh
dn evaluation create nightly-regression \
  --task corp-recon \
  --task local-enum \
  --runtime-id 11111111-2222-3333-4444-555555555555 \
  --model openai/gpt-4.1-mini \
  --secret OPENROUTER_API_KEY \
  --secret 'OPENROUTER_*' \
  --concurrency 4 \
  --cleanup-policy on_success
```

In that example:

  • two tasks will become two evaluation samples under one evaluation
  • --runtime-id links the run to a runtime record, but does not choose the model by itself
  • --model is the reliable required field for public create requests; pass it explicitly even when you also use --capability
  • --secret selects user-configured secrets by environment-variable name or glob pattern
  • --cleanup-policy on_success keeps failed compute around for inspection

The common create flags are:

| Flag | Meaning |
| --- | --- |
| `--file <path>` | load request fields from `evaluation.yaml`; explicit CLI flags override file values |
| `--task <name>` | task to run; repeatable |
| `--runtime-id <id>` | runtime record ID for tracking and association |
| `--model <id>` | model identifier; treat it as required |
| `--capability <name>` | capability to load in addition to the explicit model |
| `--secret <selector>` | secret name or glob pattern to inject; repeatable |
| `--concurrency <n>` | max concurrent evaluation samples |
| `--task-timeout-sec <n>` | per-task timeout |
| `--cleanup-policy <always\|on_success>` | cleanup behavior for task resources |
| `--wait` | block until the evaluation completes and print a results summary |
| `--json` | print raw JSON |

Always pass --model to dn evaluation create: --runtime-id alone does not choose the execution model, and --capability is additive runtime context, not a replacement for an explicit model choice.

Use --secret when your evaluation needs user-configured environment variables in the runtime and task sandboxes.

```sh
# exact name: strict, must exist
dn evaluation create nightly-regression \
  --task corp-recon \
  --model openrouter/qwen/qwen3-coder-next \
  --secret OPENROUTER_API_KEY

# glob: best-effort, zero matches is allowed
dn evaluation create nightly-regression \
  --task corp-recon \
  --model openrouter/qwen/qwen3-coder-next \
  --secret 'OPENROUTER_*'
```

The rule is:

  • exact selectors like OPENROUTER_API_KEY fail fast if the secret is not configured
  • glob selectors like OPENROUTER_* are best-effort and silently skip when nothing matches
  • repeated selectors are de-duplicated before the CLI submits the evaluation request
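
The three rules fit in a short resolution function. This is a sketch of the documented behavior, not the CLI's source; the configured-secret list is an assumed input:

```python
import fnmatch

def resolve_selectors(selectors, configured):
    """Resolve --secret selectors against configured secret names.

    Exact names must exist; glob patterns are best-effort; the result
    is de-duplicated while preserving first-seen order.
    """
    resolved = []
    for sel in selectors:
        if any(ch in sel for ch in "*?["):          # glob: best-effort
            resolved.extend(n for n in configured if fnmatch.fnmatch(n, sel))
        elif sel in configured:                      # exact: strict
            resolved.append(sel)
        else:
            raise KeyError(f"secret not configured: {sel}")
    return list(dict.fromkeys(resolved))             # de-duplicate, keep order

configured = ["OPENROUTER_API_KEY", "OPENROUTER_BASE_URL", "GH_TOKEN"]
print(resolve_selectors(["OPENROUTER_API_KEY", "OPENROUTER_*"], configured))
# → ['OPENROUTER_API_KEY', 'OPENROUTER_BASE_URL']
```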

Use --file when the evaluation definition should live in source control or when the request is too large to keep readable on one shell line.

You can define the request in evaluation.yaml:

```yaml
name: nightly-regression
project: sandbox
task_names:
  - corp-recon
  - local-enum
model: openai/gpt-4.1-mini
secret_ids:
  - 11111111-2222-3333-4444-555555555555
concurrency: 4
cleanup_policy: on_success
```

```sh
dn evaluation create --file evaluation.yaml
dn --project sandbox evaluation create nightly-regression --task corp-recon --model openai/gpt-4.1-mini
```

The second command shows the override rule: explicit CLI flags still win over values loaded from the file.
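
That precedence reduces to a simple merge. This is an illustration of the rule, not the CLI's internals; treating "flag not passed" as `None` is an assumption of the sketch:

```python
def merge_request(file_values, flag_values):
    """Build the final create request: start from --file values,
    then let explicitly passed CLI flags win."""
    merged = dict(file_values)
    merged.update({k: v for k, v in flag_values.items() if v is not None})
    return merged

file_values = {"name": "nightly-regression", "model": "openai/gpt-4.1-mini",
               "concurrency": 4}
flags = {"model": "openai/gpt-5.2", "concurrency": None}  # only --model passed
print(merge_request(file_values, flags))
# → {'name': 'nightly-regression', 'model': 'openai/gpt-5.2', 'concurrency': 4}
```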

Use secret_ids in the manifest when you want exact control from source-controlled configuration. Use repeatable --secret flags when you want the CLI to resolve names against your configured user secrets at runtime.

If you want hosted dataset rows, define them in evaluation.yaml. The CLI does not expose row data flags directly.

```yaml
name: mixed-regression
project: sandbox
model: openai/gpt-4.1-mini
dataset:
  rows:
    - task_name: [email protected]
      tenant: acme
    - task_name: [email protected]
      tenant: bravo
cleanup_policy: always
```

Two rules matter:

  • every dataset row must include task_name
  • if task_names and dataset are both present, the current service uses task_names
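
Both rules fit in a few lines. This is a validation sketch of the documented behavior, not service code:

```python
def resolve_rows(request):
    """task_names wins when both task_names and dataset are present;
    otherwise every dataset row must carry task_name."""
    if request.get("task_names"):
        return [{"task_name": n} for n in request["task_names"]]
    rows = request.get("dataset", {}).get("rows", [])
    for i, row in enumerate(rows):
        if "task_name" not in row:
            raise ValueError(f"dataset.rows[{i}] is missing task_name")
    return rows

both = {"task_names": ["corp-recon"],
        "dataset": {"rows": [{"task_name": "other", "tenant": "acme"}]}}
print(resolve_rows(both))
# → [{'task_name': 'corp-recon'}]
```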

Once the evaluation exists, drill down in layers:

```sh
# find your evaluation
dn evaluation list --status running

# overview: config, progress, pass rates, duration percentiles
dn evaluation get 9ab81fc1

# which samples failed?
dn evaluation list-samples 9ab81fc1 --status failed

# drill into one sample's lifecycle, timing, and telemetry
dn evaluation get-sample 9ab81fc1/75e4914f

# read the full agent conversation
dn evaluation get-transcript 9ab81fc1/75e4914f

# operational controls
dn evaluation cancel 9ab81fc1
dn evaluation retry 9ab81fc1
```

The natural flow is:

  1. list finds the evaluation you care about
  2. get tells you overall status, configuration, and aggregate results
  3. list-samples tells you which samples passed, failed, or are still running
  4. get-sample gives you the lifecycle breakdown and agent telemetry for one sample
  5. get-transcript is the debugging surface when you need the full agent conversation
  6. retry requeues failed and errored samples without recreating the evaluation

Sample references use eval/sample slash syntax — for example 9ab81fc1/75e4914f. Both IDs support prefix matching, so you only need the first 8 characters.
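
The slash syntax and prefix matching can be illustrated like this. The resolution code and the full IDs are hypothetical; only the behavior (split on `/`, unique-prefix match on each half) comes from the rules above:

```python
def resolve_prefix(ref, ids):
    """Resolve a possibly abbreviated ID against known full IDs."""
    matches = [full for full in ids if full.startswith(ref)]
    if not matches:
        raise LookupError(f"no ID starts with {ref!r}")
    if len(matches) > 1:
        raise LookupError(f"ambiguous prefix {ref!r}: {matches}")
    return matches[0]

def resolve_sample_ref(ref, evals):
    """Split an eval/sample reference and prefix-match both halves.

    evals maps full evaluation IDs to their full sample IDs.
    """
    eval_part, sample_part = ref.split("/", 1)
    eval_id = resolve_prefix(eval_part, list(evals))
    sample_id = resolve_prefix(sample_part, evals[eval_id])
    return eval_id, sample_id

evals = {"9ab81fc1-aaaa-bbbb-cccc-000000000000":
         ["75e4914f-dddd-eeee-ffff-111111111111"]}
print(resolve_sample_ref("9ab81fc1/75e4914f", evals))
```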

get-transcript returns a SessionTranscriptResponse — the same shape the platform sessions API serves. The top-level payload is:

```json
{
  "session": { "id": "...", "model": "...", "message_count": 12, "..." },
  "messages": [
    { "id": "...", "seq": 0, "role": "user", "content": "...", "tool_calls": null, "..." },
    { "id": "...", "seq": 1, "role": "assistant", "content": "...", "tool_calls": [...], "..." }
  ],
  "current_system_prompt": "...",
  "has_more": false
}
```

Each message includes id, seq, parent_id, role, content, tool_calls, tool_call_id, metadata, and timestamps. The transcript is available mid-run — the link to the session is established as soon as the runtime creates it, before the agent begins streaming.
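
A short consumer of that shape might look like this. The payload values and the `run_nmap` tool name are made up for the example; only the field names follow the description above:

```python
import json

# Minimal transcript payload in the SessionTranscriptResponse shape
# described above (all values are illustrative).
payload = json.loads("""
{
  "session": {"id": "sess-1", "model": "openai/gpt-4.1-mini", "message_count": 3},
  "messages": [
    {"id": "m2", "seq": 2, "parent_id": "m1", "role": "tool", "content": "ok",
     "tool_calls": null, "tool_call_id": "c1"},
    {"id": "m0", "seq": 0, "parent_id": null, "role": "user",
     "content": "scan the host", "tool_calls": null, "tool_call_id": null},
    {"id": "m1", "seq": 1, "parent_id": "m0", "role": "assistant", "content": "",
     "tool_calls": [{"id": "c1", "name": "run_nmap"}], "tool_call_id": null}
  ],
  "current_system_prompt": "...",
  "has_more": false
}
""")

# Messages may arrive in any order; seq gives the linear conversation order.
for msg in sorted(payload["messages"], key=lambda m: m["seq"]):
    calls = ", ".join(c["name"] for c in msg["tool_calls"] or [])
    print(f'{msg["seq"]:>3} {msg["role"]:<9} {calls or msg["content"]}')
```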

Samples without a linked session return 404 (old evaluations, or items where the runtime’s session registration failed). export --transcripts skips those items with a warning instead of failing the export.

--cleanup-policy is easy to ignore until compute is left running.

  • always means clean up even when the evaluation fails
  • on_success means failed runs can leave sandboxes behind for inspection

If you choose on_success, expect to use dn sandbox ... sometimes.

This is one of the most useful operational distinctions in the CLI:

  • choose always when you want clean automation
  • choose on_success when failed runs are valuable to inspect
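
The whole policy reduces to one predicate (an illustrative restatement of the two bullets above):

```python
def should_cleanup(policy, sample_passed):
    """Tear down task compute? `always` always does; `on_success`
    keeps failed sandboxes around for inspection."""
    if policy == "always":
        return True
    if policy == "on_success":
        return sample_passed
    raise ValueError(f"unknown cleanup policy: {policy}")

print(should_cleanup("on_success", sample_passed=False))  # → False: sandbox kept
```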

These commands use the standard platform context from /cli/authentication-and-profiles/:

  • --profile
  • --server
  • --api-key
  • --organization
  • --workspace
  • --project

Use --wait on create or the standalone wait command to block until the evaluation finishes. This is useful for CI pipelines or scripts that need to gate on evaluation results.

```sh
# block at creation time
dn evaluation create nightly-regression --task corp-recon --model openai/gpt-4.1-mini --wait

# or wait on an existing evaluation
dn evaluation wait 9ab81fc1 --timeout-sec 3600
```

Both exit non-zero if the evaluation did not complete successfully.
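
A script that cannot simply block on the CLI can approximate the same gate by polling. This is a sketch: the status names and poll cadence are assumptions, and a real script would obtain the status from `dn evaluation get --json` rather than an injected callable:

```python
import time

def wait_for_evaluation(fetch_status, timeout_sec=3600, poll_sec=15):
    """Poll an evaluation until it reaches a terminal status.

    fetch_status is any callable returning the current status string.
    Returns True only when the evaluation completed successfully.
    """
    deadline = time.monotonic() + timeout_sec
    while time.monotonic() < deadline:
        status = fetch_status()
        if status == "completed":
            return True
        if status in ("failed", "cancelled", "errored"):
            return False
        time.sleep(poll_sec)
    return False  # timed out

statuses = iter(["running", "running", "completed"])
print(wait_for_evaluation(lambda: next(statuses), poll_sec=0))
# → True
```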

If the evaluation record and the underlying compute seem out of sync, inspect both surfaces:

Terminal window
dn evaluation get 9ab81fc1 --json
dn evaluation list-samples 9ab81fc1
dn sandbox list --state running

That usually tells you whether you are looking at a control-plane problem, a task failure, or a cleanup-policy surprise.