Evaluations

Run AI agents against security tasks at scale, check pass/fail against ground truth, and compare models.

An evaluation answers the question: “How well does this agent solve these security tasks?”

You pick one or more published tasks, choose a model, and launch. The platform provisions isolated sandboxes, runs the agent against each task, checks pass/fail using the task’s own verification rules, and records every transcript, trace, and score. Compare across models, prompts, and configurations without running the infrastructure yourself.
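The pass/fail check can be pictured as a simple deterministic rule applied to each agent's output. The sketch below is illustrative only (the function, task IDs, and flags are hypothetical, not the platform's actual verifier): a task passes when the expected flag appears in the agent's output.

```python
# Illustrative sketch of deterministic verification -- NOT the platform's
# real verifier. Task IDs, outputs, and flags here are made up.

def check_flag(agent_output: str, expected_flag: str) -> bool:
    """Deterministic pass/fail: the flag must appear verbatim."""
    return expected_flag in agent_output

# One (output, expected_flag) pair per hypothetical task.
runs = {
    "web-01": ("...recovered FLAG{sqli-bypass} from the DB...", "FLAG{sqli-bypass}"),
    "pwn-02": ("no flag recovered", "FLAG{stack-smash}"),
}

results = {task_id: check_flag(out, flag) for task_id, (out, flag) in runs.items()}
# "web-01" passes, "pwn-02" fails
```

Because the rule is deterministic, the same transcript always yields the same verdict, which is what makes cross-model comparison on the same tasks meaningful.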

Dreadnode supports two evaluation shapes for different stages of work:

| Shape | When to reach for it | Where it lives |
| --- | --- | --- |
| Hosted | Production-grade benchmarks against published tasks with full sandbox isolation. | Launched from CLI, TUI, App, or API. |
| Local SDK | Iterating on prompts, scorers, or agent logic during development. | Your Python process via Evaluation(...). |

Hosted evaluations use deterministic verification (scripts, flag checks). Local SDK evaluations bring their own task function, dataset, and scorers — and support LLM-as-judge patterns through custom scorers. The two combine well: run hosted for pass/fail, then score transcripts with SDK scorers.
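The custom-scorer pattern can be sketched in plain Python. This is a minimal illustration of the idea, not the SDK's actual Evaluation(...) API (all names here are hypothetical): a scorer is just a callable from a transcript to a numeric score, and deterministic scorers and judge-style scorers share the same shape.

```python
# Minimal sketch of the custom-scorer pattern; names are hypothetical
# and the real SDK interface may differ.
from typing import Callable

Scorer = Callable[[str], float]

def length_penalty(transcript: str) -> float:
    """Deterministic scorer: shorter transcripts score higher."""
    steps = transcript.count("\n") + 1
    return max(0.0, 1.0 - steps / 100)

def keyword_judge(transcript: str) -> float:
    """Stand-in for an LLM-as-judge scorer: a simple keyword check here,
    but the same callable shape could wrap a model call."""
    return 1.0 if "privilege escalation" in transcript.lower() else 0.0

def score_transcript(transcript: str, scorers: list[Scorer]) -> dict[str, float]:
    """Apply every scorer to one transcript, keyed by scorer name."""
    return {s.__name__: s(transcript) for s in scorers}
```

This is also the shape of the hosted-plus-SDK combination described above: take transcripts from a hosted run, then pass them through score_transcript with your own scorers.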

Full CLI reference: dn evaluation. The App offers the same operations visually, with richer sample-level analytics.