Evaluations
Evaluations are the execution framework for judged task runs on Dreadnode.
What an evaluation is
An evaluation answers: How well does this execution configuration perform across a set of task definitions?
You provide:
- a dataset or list of task names
- execution settings such as model, concurrency, and timeout
- runtime configuration
- optional `secret_ids` to inject your selected user secrets into sandboxes for each evaluation item (see the sketch after this list)
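Putting those pieces together, an evaluation creation payload might look roughly like the sketch below. The field names and values are illustrative assumptions that mirror the list above, not the exact API schema for your deployment.

```python
# Illustrative evaluation configuration; every field name here is an assumption,
# not a documented schema.
evaluation_request = {
    # dataset: inline rows or a list of task names
    "tasks": ["example-task-a", "example-task-b"],
    # execution settings
    "model": "my-model",
    "concurrency": 4,
    "timeout_seconds": 1800,
    # runtime configuration for the agent loop
    "runtime": {"image": "my-runtime-image:latest"},
    # optional: user secrets injected into each item's sandboxes
    "secret_ids": ["sec_example_123"],
}
```

Submitting a configuration like this creates one evaluation item per dataset row, as described next.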
Execution model
Each dataset row becomes one evaluation item.
Each item:
- references one task definition
- provisions a task environment sandbox plus a runtime sandbox
- runs one judged execution
- records a pass/fail or infrastructure outcome (a minimal fan-out sketch follows this list)
Tasks used in evaluations must define a verification config. If a task has no verification step, evaluation creation is rejected before any items are provisioned.
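That rejection behaves like a pre-flight check. The sketch below is a hypothetical illustration of the rule, not the platform's actual validation code; the `verification` key is an assumed field name.

```python
# Hypothetical pre-flight check mirroring the rule above: any task without a
# verification config causes the whole evaluation to be rejected before
# provisioning starts.
def validate_tasks(tasks: list[dict]) -> None:
    missing = [t["name"] for t in tasks if not t.get("verification")]
    if missing:
        raise ValueError(
            f"Evaluation rejected: tasks without a verification config: {missing}"
        )
```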
Typical item states are:
`queued` → `claiming` → `provisioning` → `agent_running` → `agent_finished` → `verifying` → `passed`, `failed`, `timed_out`, `cancelled`, or `infra_error`
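For reference, the same lifecycle expressed as an enum; the grouping into terminal states is an interpretation of the list above, not an official classification.

```python
from enum import Enum

# State names taken verbatim from the lifecycle above.
class ItemState(str, Enum):
    QUEUED = "queued"
    CLAIMING = "claiming"
    PROVISIONING = "provisioning"
    AGENT_RUNNING = "agent_running"
    AGENT_FINISHED = "agent_finished"
    VERIFYING = "verifying"
    PASSED = "passed"
    FAILED = "failed"
    TIMED_OUT = "timed_out"
    CANCELLED = "cancelled"
    INFRA_ERROR = "infra_error"

# Assumed grouping: states after which an item does no further work.
TERMINAL_STATES = {
    ItemState.PASSED,
    ItemState.FAILED,
    ItemState.TIMED_OUT,
    ItemState.CANCELLED,
    ItemState.INFRA_ERROR,
}
```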
Task + Runtime
Evaluation items use two sandboxes:
- the runtime sandbox that hosts the agent loop
- the task environment sandbox derived from the task build
This separation lets the platform track compute costs, lifecycle, and telemetry independently for each concern.
Per-item chat transcripts are stored as workspace-scoped artifacts, with a pointer kept in the evaluation item's metadata.
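A rough picture of what a per-item record might hold, with the two sandboxes tracked separately and the transcript referenced by pointer; all identifiers and key names here are hypothetical.

```python
# Hypothetical evaluation-item record: the chat transcript itself lives as a
# workspace-scoped artifact and the item keeps only a pointer to it, while
# each sandbox carries its own cost and telemetry tracking.
item_record = {
    "item_id": "item_0001",
    "outcome": "passed",
    "sandboxes": {
        "runtime": {"id": "sbx_runtime_01", "cost_usd": 0.12},
        "task_env": {"id": "sbx_task_01", "cost_usd": 0.08},
    },
    "transcript_artifact": {
        "workspace_id": "ws_example",
        "artifact_id": "art_chat_0001",
        "kind": "chat_transcript",
    },
}
```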
Analytics
The analytics view includes `total_samples`, `passed_count`, `failed_count`, `error_count`, and `pass_rate`, along with task breakdowns, duration rollups, and grouped errors.
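As a sketch of how those counters relate, assuming `pass_rate` is computed over all samples (the exact denominator, such as whether infra errors are excluded, is an assumption to verify against your deployment):

```python
from collections import Counter

# Illustrative aggregation over per-item outcomes.
outcomes = ["passed", "passed", "failed", "infra_error"]
counts = Counter(outcomes)

analytics = {
    "total_samples": len(outcomes),
    "passed_count": counts["passed"],
    "failed_count": counts["failed"],
    "error_count": counts["infra_error"],
}
# Assumed definition: passed items over all samples.
analytics["pass_rate"] = analytics["passed_count"] / analytics["total_samples"]
print(analytics)  # pass_rate == 0.5 for this toy data
```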
From the TUI, open `/tui/evaluations` to inspect run status, progress, and detail metadata.