# Evaluations
Run AI agents against security tasks at scale, check pass/fail against ground truth, and compare models.
An evaluation answers the question: “How well does this agent solve these security tasks?”
You pick one or more published tasks, choose a model, and launch. The platform provisions isolated sandboxes, runs the agent against each task, checks pass/fail using the task’s own verification rules, and records every transcript, trace, and score. Compare across models, prompts, and configurations without running the infrastructure yourself.
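The pass/fail step is purely deterministic: each task ships its own verification rule, and the platform compares the agent's result against stored ground truth. A minimal sketch of what a flag-check verifier could look like (names and structure here are illustrative, not the platform's actual implementation):

```python
# Hypothetical sketch of a deterministic pass/fail check: a task stores a
# ground-truth digest, and verification hashes the agent's submitted flag
# and compares. Illustrative only, not the real platform code.
import hashlib

def check_flag(submitted: str, expected_sha256: str) -> bool:
    """Hash the agent's submitted flag and compare it against the
    ground-truth digest stored with the task."""
    digest = hashlib.sha256(submitted.strip().encode()).hexdigest()
    return digest == expected_sha256

# Ground truth for one task: the digest of the correct flag.
expected = hashlib.sha256(b"FLAG{demo}").hexdigest()

print(check_flag("FLAG{demo}\n", expected))  # trailing whitespace tolerated
print(check_flag("FLAG{wrong}", expected))
```

Because the rule is a pure function of the agent's output, the same transcript always scores the same way, which is what makes hosted results comparable across models and runs.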
## Two paths

Dreadnode supports two evaluation shapes for different stages of work:
| Shape | When to reach for it | Where it lives |
|---|---|---|
| Hosted | Production-grade benchmarks against published tasks with full sandbox isolation. | Launched from CLI, TUI, App, or API. |
| Local SDK | Iterating on prompts, scorers, or agent logic during development. | Your Python process via `Evaluation(...)`. |
Hosted evaluations use deterministic verification (scripts, flag checks). Local SDK evaluations bring their own task function, dataset, and scorers — and support LLM-as-judge patterns through custom scorers. The two combine well: run hosted for pass/fail, then score transcripts with SDK scorers.
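Conceptually, a local SDK run wires a task function, a dataset, and one or more scorers together in your own process. The sketch below shows that shape generically; every name in it (including the `judge` stand-in for an LLM-as-judge scorer) is illustrative, and the real `Evaluation(...)` API will differ:

```python
# Generic sketch of the local-SDK evaluation shape: a task function runs over
# a dataset, and each scorer grades every output. All names are illustrative;
# consult the Dreadnode SDK docs for the actual Evaluation(...) signature.
from typing import Callable

def task(sample: dict) -> str:
    """Stand-in agent logic; a real task function would call a model."""
    return f"answer to: {sample['prompt']}"

def exact_match(output: str, sample: dict) -> float:
    """Deterministic scorer: 1.0 on an exact ground-truth match."""
    return 1.0 if output == sample["expected"] else 0.0

def judge(output: str, sample: dict) -> float:
    """LLM-as-judge stand-in: a real scorer would prompt a judge model
    with the output and rubric, then parse its verdict."""
    return 1.0 if sample["prompt"] in output else 0.0

def run_eval(dataset: list[dict], scorers: dict[str, Callable]) -> dict:
    """Run the task over every sample and average each scorer's results."""
    totals = {name: 0.0 for name in scorers}
    for sample in dataset:
        output = task(sample)
        for name, scorer in scorers.items():
            totals[name] += scorer(output, sample)
    return {name: total / len(dataset) for name, total in totals.items()}

dataset = [
    {"prompt": "q1", "expected": "answer to: q1"},
    {"prompt": "q2", "expected": "something else"},
]
print(run_eval(dataset, {"exact_match": exact_match, "judge": judge}))
```

The same scorer functions can be pointed at transcripts downloaded from a hosted run, which is the combined workflow described above: hosted pass/fail first, richer SDK scoring second.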
## Working with hosted evaluations

## Local evaluations

## Operating an evaluation

Full CLI reference: `dn evaluation`. The App offers the same operations visually, with richer sample-level analytics.