
Evaluations

Evaluations are the execution framework for judged task runs on Dreadnode.

An evaluation answers: How well does this execution configuration perform across a set of task definitions?

You provide (sketched in the example below):

  • a dataset or list of task names
  • execution settings such as model, concurrency, and timeout
  • runtime configuration
  • optional secret_ids to inject your selected user secrets into sandboxes for each evaluation item

Each dataset row becomes one evaluation item.
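
As a rough illustration, a creation request built from these inputs might look like the sketch below. Any field name beyond those called out above (tasks/dataset, model, concurrency, timeout, runtime configuration, secret_ids) is a hypothetical placeholder, not the platform's actual schema.

```python
# Hypothetical sketch of an evaluation creation payload; names and values
# beyond those described in the docs above are illustrative assumptions.
import json

evaluation_request = {
    # either a dataset reference or an explicit list of task names
    "tasks": ["example-task-a", "example-task-b"],
    "execution": {
        "model": "example-model",    # model used for the agent loop
        "concurrency": 4,            # how many items run at once
        "timeout_seconds": 1800,     # per-item wall-clock limit
    },
    "runtime": {},                   # runtime configuration (placeholder)
    "secret_ids": ["sec_123"],       # optional: injected into each item's sandboxes
}

print(json.dumps(evaluation_request, indent=2))
```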

Each item:

  • references one task definition
  • provisions a task environment sandbox plus a runtime sandbox
  • runs one judged execution
  • records a pass/fail result or an infrastructure outcome

Tasks used in evaluations must define a verification config. If a task has no verification step, evaluation creation is rejected before any items are provisioned.
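
A minimal sketch of that pre-flight check, assuming tasks are represented as plain dictionaries with an optional verification key; the exception name and task shape are illustrative, not the platform's API.

```python
# Illustrative pre-flight check: reject the whole evaluation before any items
# are provisioned if any referenced task lacks a verification config.
from typing import Any


class EvaluationRejected(Exception):
    """Hypothetical error raised before provisioning begins."""


def validate_verification(tasks: list[dict[str, Any]]) -> None:
    missing = [t["name"] for t in tasks if not t.get("verification")]
    if missing:
        raise EvaluationRejected(f"tasks missing a verification config: {missing}")


try:
    validate_verification([
        {"name": "task-a", "verification": {"type": "script"}},
        {"name": "task-b"},  # no verification step -> creation is rejected
    ])
except EvaluationRejected as err:
    print(err)
```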

Typical item states are:

  • queued
  • claiming
  • provisioning
  • agent_running
  • agent_finished
  • verifying
  • passed, failed, timed_out, cancelled, or infra_error
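
These states can be modeled as a simple enumeration; the sketch below encodes only the names listed above, and the split between in-flight and terminal states is an assumption for illustration.

```python
# The item states listed above, modeled as an enum. Which states are terminal
# is an assumption for illustration, not taken from the platform.
from enum import Enum


class ItemState(str, Enum):
    QUEUED = "queued"
    CLAIMING = "claiming"
    PROVISIONING = "provisioning"
    AGENT_RUNNING = "agent_running"
    AGENT_FINISHED = "agent_finished"
    VERIFYING = "verifying"
    PASSED = "passed"
    FAILED = "failed"
    TIMED_OUT = "timed_out"
    CANCELLED = "cancelled"
    INFRA_ERROR = "infra_error"


TERMINAL_STATES = {
    ItemState.PASSED,
    ItemState.FAILED,
    ItemState.TIMED_OUT,
    ItemState.CANCELLED,
    ItemState.INFRA_ERROR,
}


def is_terminal(state: ItemState) -> bool:
    """True once an item has reached a final pass/fail or infrastructure outcome."""
    return state in TERMINAL_STATES
```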

Evaluation items use two sandboxes:

  • the runtime sandbox that hosts the agent loop
  • the task environment sandbox derived from the task build

This separation lets the platform track compute costs, lifecycle, and telemetry independently for each concern.
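
One way to picture that per-item bookkeeping is a record carrying a separate usage entry for each sandbox, as in the sketch below; the field names are hypothetical.

```python
# Hypothetical per-item record with independent tracking for the runtime
# sandbox and the task environment sandbox (cost, lifecycle, telemetry).
from dataclasses import dataclass


@dataclass
class SandboxUsage:
    sandbox_id: str
    cpu_seconds: float     # compute cost attributed to this sandbox
    lifecycle_state: str   # e.g. "running" or "terminated"


@dataclass
class EvaluationItemInfra:
    runtime: SandboxUsage            # hosts the agent loop
    task_environment: SandboxUsage   # derived from the task build


item_infra = EvaluationItemInfra(
    runtime=SandboxUsage("sbx-runtime-001", cpu_seconds=412.5, lifecycle_state="terminated"),
    task_environment=SandboxUsage("sbx-taskenv-001", cpu_seconds=97.3, lifecycle_state="terminated"),
)
```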

Per-item chat transcripts are stored as workspace-scoped artifacts, with a pointer kept in the evaluation item's metadata.
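
A hedged sketch of how a consumer might follow that pointer; the metadata key and the artifact lookup callable are assumptions, and only the pointer-to-artifact relationship comes from the description above.

```python
# Hypothetical helper: resolve a per-item transcript from the artifact pointer
# kept in the item metadata. Key name and lookup callable are assumptions.
from typing import Any, Callable


def load_transcript(
    item_metadata: dict[str, Any],
    fetch_artifact: Callable[[str], list[dict[str, Any]]],
) -> list[dict[str, Any]]:
    pointer = item_metadata.get("transcript_artifact_id")  # hypothetical key
    if pointer is None:
        return []
    return fetch_artifact(pointer)  # workspace-scoped artifact lookup


# Example with an in-memory stand-in for the artifact store:
store = {"art-1": [{"role": "user", "content": "start the task"}]}
print(load_transcript({"transcript_artifact_id": "art-1"}, store.__getitem__))
```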

The analytics view includes total_samples, passed_count, failed_count, error_count, and pass_rate, along with task breakdowns, duration rollups, and grouped errors.
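
The counter fields line up with a straightforward rollup over item outcomes, as in the sketch below; the output keys come from the list above, while mapping timed_out into error_count is an assumption.

```python
# Rollup over item outcomes using the analytics field names listed above.
# Whether timed_out counts as an error or a failure is an assumption here.
from collections import Counter


def summarize(outcomes: list[str]) -> dict[str, float]:
    counts = Counter(outcomes)
    total = len(outcomes)
    passed = counts.get("passed", 0)
    return {
        "total_samples": total,
        "passed_count": passed,
        "failed_count": counts.get("failed", 0),
        "error_count": counts.get("infra_error", 0) + counts.get("timed_out", 0),
        "pass_rate": passed / total if total else 0.0,
    }


print(summarize(["passed", "passed", "failed", "infra_error"]))
# -> {'total_samples': 4, 'passed_count': 2, 'failed_count': 1, 'error_count': 1, 'pass_rate': 0.5}
```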

From the TUI, open /tui/evaluations to inspect run status, progress, and detail metadata.