Security Evaluation Operations

Run security tasks as repeatable evaluations, review failures, and widen the investigation in analytics.

Use this recipe when you already know what should be tested and need a repeatable pass/fail run. The short version is: choose the right task or dataset, launch in the right scope, inspect one failing sample, then widen into analytics only if the failure looks systemic.

Use this recipe when:

  • you need repeatable security verification instead of one transcript
  • you already have a Security Task or a Dataset
  • you want a clean path from one failing sample to traces and analytics

You will need:

  • the correct workspace and project
  • a task or a dataset-backed manifest
  • a clear idea of whether you are checking product behavior, environment setup, or both
1. Choose the right input

Use this input    When it is the primary driver
task              you need the right environment and verification logic
dataset           you need pinned per-sample rows and each row already carries its own task_name

Choose the task when the environment and verifier are the main thing you care about. Choose the dataset when the sample set itself is the main thing you care about and each row already names the task to run.

2. Launch the evaluation in the right scope

From the CLI:

dn evaluation create regression-check \
--task corp-recon \
--model openai/gpt-4.1-mini \
--concurrency 4 \
--wait

Or from the TUI:

dreadnode
# 1. switch to the target workspace with /workspace <key>
# 2. press Ctrl+E to open evaluations
# 3. submit the evaluation against the chosen task or dataset

Keep workspace and project selection explicit. The same scope determines what you will see later in evaluations, traces, and Agents.

For dataset-backed hosted runs from the CLI, use --file evaluation.yaml rather than trying to encode rows as flags.
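For example, a dataset-backed manifest might look like the sketch below. The field names are assumptions for illustration, not the platform's documented schema; the one thing the recipe guarantees is that each row carries its own task_name.

```shell
# Illustrative manifest for a dataset-backed hosted run -- field names
# are assumptions, not the platform's documented schema.
cat > evaluation.yaml <<'EOF'
name: regression-check
model: openai/gpt-4.1-mini
samples:
  - task_name: corp-recon
    input: "enumerate exposed services on the staging host"
  - task_name: corp-recon
    input: "check default credentials on the admin panel"
EOF
# Then launch it with:
#   dn evaluation create regression-check --file evaluation.yaml --wait
```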

3. Inspect one representative failure while the run is live

Do not jump straight to broad analytics. First look at one sample:

dn evaluation wait 9ab81fc1
dn evaluation list-samples 9ab81fc1 --status failed
dn evaluation get-sample 9ab81fc1/75e4914f
dn evaluation get-transcript 9ab81fc1/75e4914f
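If several samples fail, it can help to script the transcript pulls. A minimal sketch, assuming failing sample IDs can be collected one per line; the ID list here is stubbed, not real CLI output:

```shell
# Stub: in practice these IDs would come from
# `dn evaluation list-samples 9ab81fc1 --status failed`.
printf '75e4914f\n8c2d0a11\n' > failed-ids.txt

# Print the command for each failing sample; drop the echo to run it.
while read -r sid; do
  echo "dn evaluation get-transcript 9ab81fc1/${sid}"
done < failed-ids.txt
```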

Focus on:

  • pass rate versus failure rate
  • verification failures versus infrastructure or runtime errors
  • whether several failing samples are actually the same bug pattern
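The first comparison can be eyeballed with standard tools. A sketch, assuming statuses are gathered one per line (the data below is stubbed, not real CLI output):

```shell
# Stubbed per-sample statuses -- the listing format is an assumption;
# substitute however you actually collect statuses from the run.
printf 'passed\nfailed\npassed\nfailed\nerror\n' > statuses.txt

# Tally each status; sorting gives a stable summary order.
awk '{count[$1]++} END {for (s in count) printf "%s %d\n", s, count[s]}' statuses.txt | sort
```

Runtime or infrastructure errors (the `error` rows above) should be triaged separately from genuine verification failures.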

4. Escalate that sample into transcript and trace review

For one suspicious sample:

  • use the transcript to confirm what the agent actually did
  • use trace surfaces if the issue looks like tool use, environment state, or timing
  • keep workspace and project context identical between the evaluation and trace lookup

This is the step that prevents analytics from turning into an unfocused warehouse search.

5. Widen into analytics only if the failure looks systemic

Use Agents after you know what you are looking for:

  • Charts for trend questions
  • Data for exact SQL and CSV export
  • Notebook when you need runs, spans, and evaluation outcomes together
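If you do widen into Data, save the exact query you used so the investigation stays reproducible. A sketch with hypothetical table and column names (the real warehouse schema may differ):

```shell
# Table and column names are hypothetical -- adjust to the real schema.
cat > failing-samples.sql <<'EOF'
SELECT sample_id, status, error_type
FROM evaluation_samples
WHERE evaluation_id = '9ab81fc1'
  AND status = 'failed'
ORDER BY error_type;
EOF
```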

Typical next actions:

  • tighten the task verification logic
  • fix missing runtime config or Secrets
  • rerun after capability or prompt changes
  • promote the pattern into a tracked regression workflow

Record before you move on:
  • the evaluation ID
  • one or more failing sample IDs
  • the representative transcript or trace that explains the failure
  • the saved query or export if you widened into Agents

Keep these heuristics in mind:
  • if failures are mostly runtime or infrastructure errors, debug environment setup before blaming the task or prompt
  • if one failing sample is ambiguous, keep drilling into transcript and trace detail before querying the warehouse
  • if the same failure repeats across runs, turn it into a named regression workflow rather than a one-off investigation
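The last point can start as a simple wrapper script that a scheduler or CI job reruns on a cadence. An illustrative sketch; the scheduling wiring is an assumption, not a platform feature, and only the dn command itself comes from this recipe:

```shell
# Write a rerunnable regression script; `set -e` makes the job exit
# non-zero when the dn command fails, so a scheduler can alert on it.
cat > nightly-regression.sh <<'EOF'
#!/bin/sh
set -e
dn evaluation create regression-check \
  --task corp-recon \
  --model openai/gpt-4.1-mini \
  --wait
EOF
chmod +x nightly-regression.sh
```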