Evaluations

Run AI agents against security tasks at scale, check pass/fail against ground truth, and compare models.

An evaluation answers the question: “How well does this agent solve these security tasks?”

You pick one or more published tasks, choose a model, and launch. The platform provisions isolated sandboxes, runs the agent against each task, checks pass/fail using the task’s own verification rules, and records every transcript, trace, and score. Compare across models, prompts, and configurations without running the infrastructure yourself.
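The pass/fail check can be pictured as a simple deterministic rule applied to each agent's output. The sketch below is illustrative only (the function, task IDs, and flags are hypothetical, not the platform's actual verifier): a task passes when the expected flag appears in the agent's output.

```python
# Illustrative sketch of deterministic verification -- NOT the platform's
# real verifier. Task IDs, outputs, and flags here are made up.

def check_flag(agent_output: str, expected_flag: str) -> bool:
    """Deterministic pass/fail: the flag must appear verbatim."""
    return expected_flag in agent_output

# One (output, expected_flag) pair per hypothetical task.
runs = {
    "web-01": ("...recovered FLAG{sqli-bypass} from the DB...", "FLAG{sqli-bypass}"),
    "pwn-02": ("no flag recovered", "FLAG{stack-smash}"),
}

results = {task_id: check_flag(out, flag) for task_id, (out, flag) in runs.items()}
# "web-01" passes, "pwn-02" fails
```

Because the rule is deterministic, the same transcript always yields the same verdict, which is what makes cross-model comparison on the same tasks meaningful.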

Dreadnode supports two evaluation shapes for different stages of work:

| Shape | When to reach for it | Where it lives |
| --- | --- | --- |
| Hosted | Production-grade benchmarks against published tasks with full sandbox isolation. | Launched from CLI, TUI, App, or API. |
| Local SDK | Iterating on prompts, scorers, or agent logic during development. | Your Python process via Evaluation(...). |

Hosted evaluations use deterministic verification (scripts, flag checks). Local SDK evaluations bring their own task function, dataset, and scorers — and support LLM-as-judge patterns through custom scorers. The two combine well: run hosted for pass/fail, then score transcripts with SDK scorers.
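The custom-scorer pattern can be sketched in plain Python. This is a minimal illustration of the idea, not the SDK's actual Evaluation(...) API (all names here are hypothetical): a scorer is just a callable from a transcript to a numeric score, and deterministic scorers and judge-style scorers share the same shape.

```python
# Minimal sketch of the custom-scorer pattern; names are hypothetical
# and the real SDK interface may differ.
from typing import Callable

Scorer = Callable[[str], float]

def length_penalty(transcript: str) -> float:
    """Deterministic scorer: shorter transcripts score higher."""
    steps = transcript.count("\n") + 1
    return max(0.0, 1.0 - steps / 100)

def keyword_judge(transcript: str) -> float:
    """Stand-in for an LLM-as-judge scorer: a simple keyword check here,
    but the same callable shape could wrap a model call."""
    return 1.0 if "privilege escalation" in transcript.lower() else 0.0

def score_transcript(transcript: str, scorers: list[Scorer]) -> dict[str, float]:
    """Apply every scorer to one transcript, keyed by scorer name."""
    return {s.__name__: s(transcript) for s in scorers}
```

This is also the shape of the hosted-plus-SDK combination described above: take transcripts from a hosted run, then pass them through score_transcript with your own scorers.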

Full CLI reference: dn evaluation. The App offers the same operations visually, with richer sample-level analytics.