Quickstart
Launch a hosted evaluation, watch it run, and drill into a failing sample — all from the CLI.
Prerequisites

- The Dreadnode CLI authenticated (dn login) — see Authentication
- A published task (scaffold with dn task init, validate, then dn task push)
- A model identifier like openai/gpt-4.1-mini
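The prerequisite steps can be chained into a small setup sketch. This is illustrative only: it assumes the `dn` CLI is on your PATH, and each call is echoed instead of run when it is not.

```shell
# Setup sketch: the prerequisite commands from above, chained.
run_dn() {
  if command -v dn >/dev/null 2>&1; then
    dn "$@"
  else
    echo "[dn not installed, would run] dn $*"
  fi
}

run_dn login        # authenticate once
run_dn task init    # scaffold a task
# ...edit and validate the task locally, then publish it:
run_dn task push    # publish so evaluations can reference it
```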
1. Launch the evaluation

```shell
dn evaluation create flag-file-check \
  --model openai/gpt-4.1-mini \
  --concurrency 1 \
  --cleanup-policy on_success \
  --wait
```

--wait blocks until the evaluation finishes and prints a summary. --cleanup-policy on_success keeps failed sandboxes around for inspection.
2. Check overall results

```shell
dn evaluation get 9ab81fc1
```

```
● completed flag-file-check
ID           9ab81fc1-...
Model        openai/gpt-4.1-mini
Concurrency  1
Cleanup      on_success
Progress     ████████████████████████████ 1/1 pass: 100.0% passed=1
Results      100.0% ✓ 1 passed
[email protected]  100.0% (1/1)  durations: p50=34s p95=34s max=34s
```

UUID prefix matching works everywhere — the first 8 characters are enough.
3. List samples and read a transcript

```shell
dn evaluation list-samples 9ab81fc1
dn evaluation get-transcript 9ab81fc1/75e4914f
```

list-samples shows status, task, and duration per sample. get-transcript returns the full agent conversation — every user message, assistant response, and tool call. Sample references use eval/sample slash syntax.
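The slash-syntax reference can be assembled from the two ID prefixes. A minimal sketch, using this walkthrough's hypothetical IDs and skipping the call when the `dn` CLI is not installed:

```shell
EVAL=9ab81fc1       # evaluation ID prefix (8 characters is enough)
SAMPLE=75e4914f     # sample ID prefix
REF="$EVAL/$SAMPLE" # eval/sample slash syntax

if command -v dn >/dev/null 2>&1; then
  dn evaluation get-transcript "$REF"
else
  echo "would run: dn evaluation get-transcript $REF"
fi
```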
4. Debug a failure

```shell
dn evaluation list-samples 9ab81fc1 --status failed
dn evaluation get-sample 9ab81fc1/75e4914f
```

get-sample adds the lifecycle breakdown — when the item was queued, provisioned, started, and finished — plus the error message and any verification result.

Because you ran with --cleanup-policy on_success, the failed item’s sandboxes are still up:

```shell
dn sandbox list --state running
```

See Inspecting compute for exec access and cleanup.
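The debug commands above can be bundled into one post-mortem sketch. IDs are this walkthrough's hypothetical prefixes, and every call falls back to an echo when the `dn` CLI is absent:

```shell
EVAL=9ab81fc1
SAMPLE=75e4914f

dbg() {
  if command -v dn >/dev/null 2>&1; then
    dn "$@"
  else
    echo "would run: dn $*"
  fi
}

dbg evaluation list-samples "$EVAL" --status failed # which samples failed?
dbg evaluation get-sample "$EVAL/$SAMPLE"           # lifecycle, error, verification
dbg sandbox list --state running                    # sandboxes kept by on_success
```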
5. Retry or compare

```shell
# requeue failed, timed-out, and errored samples
dn evaluation retry 9ab81fc1

# or launch a new evaluation with a different model
dn evaluation create flag-file-check-v2 \
  --model openai/o4-mini \
  --wait

dn evaluation compare 9ab81fc1 b2c34de5
```

What to reach for next
- Author your own task → Tasks
- Author verification logic → Verification
- Run many variants of the same task → Inputs
- Automate runs in CI or source-control them → Running evaluations
- Watch a long run live → Monitoring evaluations
- Browse every CLI command and flag → dn evaluation