
Quickstart

Launch your first hosted evaluation against a published task, inspect results, and debug a failure.

Launch a hosted evaluation, watch it run, and drill into a failing sample — all from the CLI.

You'll need:

  • The Dreadnode CLI authenticated (dn login) — see Authentication
  • A published task (scaffold with dn task init, validate, then dn task push)
  • A model identifier like openai/gpt-4.1-mini
dn evaluation create flag-file-check \
--model openai/gpt-4.1-mini \
--concurrency 1 \
--cleanup-policy on_success \
--wait

--wait blocks until the evaluation finishes and prints a summary. --cleanup-policy on_success keeps failed sandboxes around for inspection.
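Conceptually, a cleanup policy is just a decision over each finished sample. The sketch below illustrates how on_success-style cleanup behaves; the policy names other than on_success are hypothetical placeholders, not the CLI's documented set:

```python
def should_cleanup(policy: str, sample_passed: bool) -> bool:
    """Decide whether a sample's sandboxes are torn down after it finishes.

    `on_success` is the value used above; `always` and `never` are
    hypothetical placeholders for other possible policies.
    """
    if policy == "always":
        return True
    if policy == "never":
        return False
    if policy == "on_success":
        return sample_passed  # failed samples keep their sandboxes
    raise ValueError(f"unknown cleanup policy: {policy}")

print(should_cleanup("on_success", True))   # → True: sandboxes removed
print(should_cleanup("on_success", False))  # → False: kept for inspection
```

The on_success branch is what makes the debugging step later in this page possible: a failing sample leaves its sandboxes running.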

dn evaluation get 9ab81fc1
● completed flag-file-check
ID 9ab81fc1-...
Model openai/gpt-4.1-mini
Concurrency 1
Cleanup on_success
Progress ████████████████████████████ 1/1 pass: 100.0%
passed=1
Results 100.0% ✓ 1 passed
pass@1 100.0% (1/1)
durations: p50=34s p95=34s max=34s

UUID prefix matching works everywhere — the first 8 characters are enough.
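Prefix matching resolves a short prefix to exactly one known ID. This is a conceptual sketch of that behavior, not the CLI's actual implementation; the full UUIDs are made up for illustration:

```python
def resolve_prefix(prefix: str, known_ids: list[str]) -> str:
    """Resolve an ID prefix; ambiguous or unknown prefixes fail loudly."""
    matches = [i for i in known_ids if i.startswith(prefix)]
    if len(matches) != 1:
        raise LookupError(f"{prefix!r} matched {len(matches)} IDs, need exactly 1")
    return matches[0]

ids = [
    "9ab81fc1-3c1d-4e0a-9f2b-000000000001",  # hypothetical full UUIDs
    "75e4914f-88aa-4b2c-b0d3-000000000002",
]
print(resolve_prefix("9ab81fc1", ids))  # → the first full UUID
```

This is why eight characters are enough in practice: a collision in the first eight hex digits of two UUIDs is vanishingly unlikely.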

dn evaluation list-samples 9ab81fc1
dn evaluation get-transcript 9ab81fc1/75e4914f

list-samples shows status, task, and duration per sample. get-transcript returns the full agent conversation — every user message, assistant response, and tool call. Sample references use eval/sample slash syntax.

dn evaluation list-samples 9ab81fc1 --status failed
dn evaluation get-sample 9ab81fc1/75e4914f

get-sample adds the lifecycle breakdown — when the item was queued, provisioned, started, and finished — plus the error message and any verification result.
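Those lifecycle timestamps turn into durations with simple subtraction. A sketch, assuming hypothetical field names (queued_at, provisioned_at, started_at, finished_at are illustrative, not the CLI's actual schema):

```python
from datetime import datetime, timedelta

def lifecycle_durations(sample: dict) -> dict[str, timedelta]:
    """Break a sample's life into queue wait, provisioning, and run time."""
    return {
        "queued":       sample["provisioned_at"] - sample["queued_at"],
        "provisioning": sample["started_at"] - sample["provisioned_at"],
        "running":      sample["finished_at"] - sample["started_at"],
    }

sample = {  # hypothetical timestamps
    "queued_at":      datetime(2025, 1, 1, 12, 0, 0),
    "provisioned_at": datetime(2025, 1, 1, 12, 0, 5),
    "started_at":     datetime(2025, 1, 1, 12, 0, 8),
    "finished_at":    datetime(2025, 1, 1, 12, 0, 42),
}
for phase, d in lifecycle_durations(sample).items():
    print(f"{phase}: {d.total_seconds():.0f}s")
# → queued: 5s / provisioning: 3s / running: 34s
```

Separating queue and provisioning time from run time tells you whether a slow sample was the agent's fault or the infrastructure's.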

Because you ran with --cleanup-policy on_success, the failed item’s sandboxes are still up:

dn sandbox list --state running

See Inspecting compute for exec access and cleanup.

# requeue failed, timed-out, and errored samples
dn evaluation retry 9ab81fc1
# or launch a new evaluation with a different model
dn evaluation create flag-file-check-v2 \
--model openai/o4-mini \
--wait
dn evaluation compare 9ab81fc1 b2c34de5
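Per the comment above, retry requeues failed, timed-out, and errored samples while leaving passing ones alone. Conceptually (the status strings here are assumptions, not the CLI's documented values):

```python
RETRYABLE = {"failed", "timed_out", "errored"}  # assumed status names

def samples_to_retry(samples: list[dict]) -> list[str]:
    """Pick the sample IDs a retry would requeue."""
    return [s["id"] for s in samples if s["status"] in RETRYABLE]

samples = [
    {"id": "75e4914f", "status": "failed"},
    {"id": "1b2c3d4e", "status": "completed"},  # hypothetical passing sample
]
print(samples_to_retry(samples))  # → ['75e4914f']
```

Launching a fresh evaluation with a different model, as in the second command above, keeps the original results intact so compare has two complete runs to diff.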