Quickstart
Launch a hosted evaluation, watch it run, and drill into a failing sample — all from the CLI.
Prerequisites

- The Dreadnode CLI authenticated (dn login) — see Authentication
- A published task (scaffold with dn task init, validate, then dn task push)
- A model identifier like openai/gpt-4.1-mini
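The prerequisite steps can be chained into a small setup sketch. This is illustrative only: it assumes the `dn` CLI is on your PATH, and each call is echoed instead of run when it is not.

```shell
# Setup sketch: the prerequisite commands from above, chained.
run_dn() {
  if command -v dn >/dev/null 2>&1; then
    dn "$@"
  else
    echo "[dn not installed, would run] dn $*"
  fi
}

run_dn login        # authenticate once
run_dn task init    # scaffold a task
# ...edit and validate the task locally, then publish it:
run_dn task push    # publish so evaluations can reference it
```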
1. Launch the evaluation

```shell
dn evaluation create flag-file-check \
  --model openai/gpt-4.1-mini \
  --concurrency 1 \
  --cleanup-policy on_success \
  --wait
```

--wait blocks until the evaluation finishes and prints a summary. --cleanup-policy on_success keeps failed sandboxes around for inspection.
2. Check overall results

```shell
dn evaluation get 9ab81fc1
```

```
● completed flag-file-check
ID           9ab81fc1-...
Model        openai/gpt-4.1-mini
Concurrency  1
Cleanup      on_success
Progress     ████████████████████████████ 1/1 pass: 100.0% passed=1
Results      100.0% ✓ 1 passed
[email protected]  100.0% (1/1)  durations: p50=34s p95=34s max=34s
```

UUID prefix matching works everywhere — the first 8 characters are enough.
3. List samples and read a transcript

```shell
dn evaluation list-samples 9ab81fc1
dn evaluation get-transcript 9ab81fc1/75e4914f
```

list-samples shows status, task, and duration per sample. get-transcript returns the full agent conversation — every user message, assistant response, and tool call. Sample references use eval/sample slash syntax.
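The slash-syntax reference can be assembled from the two ID prefixes. A minimal sketch, using this walkthrough's hypothetical IDs and skipping the call when the `dn` CLI is not installed:

```shell
EVAL=9ab81fc1       # evaluation ID prefix (8 characters is enough)
SAMPLE=75e4914f     # sample ID prefix
REF="$EVAL/$SAMPLE" # eval/sample slash syntax

if command -v dn >/dev/null 2>&1; then
  dn evaluation get-transcript "$REF"
else
  echo "would run: dn evaluation get-transcript $REF"
fi
```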
4. Debug a failure

```shell
dn evaluation list-samples 9ab81fc1 --status failed
dn evaluation get-sample 9ab81fc1/75e4914f
```

get-sample adds the lifecycle breakdown — when the item was queued, provisioned, started, and finished — plus the error message and any verification result.

Because you ran with --cleanup-policy on_success, the failed item’s sandboxes are still up:

```shell
dn sandbox list --state running
```

See Inspecting compute for exec access and cleanup.
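The debug commands above can be bundled into one post-mortem sketch. IDs are this walkthrough's hypothetical prefixes, and every call falls back to an echo when the `dn` CLI is absent:

```shell
EVAL=9ab81fc1
SAMPLE=75e4914f

dbg() {
  if command -v dn >/dev/null 2>&1; then
    dn "$@"
  else
    echo "would run: dn $*"
  fi
}

dbg evaluation list-samples "$EVAL" --status failed # which samples failed?
dbg evaluation get-sample "$EVAL/$SAMPLE"           # lifecycle, error, verification
dbg sandbox list --state running                    # sandboxes kept by on_success
```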
5. Retry or compare

```shell
# requeue failed, timed-out, and errored samples
dn evaluation retry 9ab81fc1

# or launch a new evaluation with a different model
dn evaluation create flag-file-check-v2 \
  --model openai/o4-mini \
  --wait

dn evaluation compare 9ab81fc1 b2c34de5
```

What to reach for next
- Author your own task → Tasks
- Author verification logic → Verification
- Run many variants of the same task → Inputs
- Automate runs in CI or source-control them → Running evaluations
- Watch a long run live → Monitoring evaluations
- Browse every CLI command and flag → dn evaluation