Security Evaluation Operations

Run security tasks as repeatable evaluations, review failures, and widen the investigation in analytics.

Use this recipe when you already know what should be tested and need a repeatable pass/fail run. The short version is: choose the right task or dataset, launch in the right scope, inspect one failing sample, then widen into analytics only if the failure looks systemic.

Use this recipe when:

  • you need repeatable security verification instead of one transcript
  • you already have a Security Task or a Dataset
  • you want a clean path from one failing sample to traces and analytics

You will need:

  • the correct workspace and project
  • a task or a dataset-backed manifest
  • a clear idea of whether you are checking product behavior, environment setup, or both
1. Choose the right input

Use this input    When it is the primary driver
task              you need the right environment and verification logic
dataset           you need pinned per-sample rows and each row already carries its own task_name

Choose the task when the environment and verifier are the main thing you care about. Choose the dataset when the sample set itself is the main thing you care about and each row already names the task to run.

2. Launch the evaluation in the right scope

From the CLI:

dn evaluation create regression-check \
--task corp-recon \
--model openai/gpt-4.1-mini \
--concurrency 4 \
--wait

Or from the TUI:

dreadnode
# 1. switch to the target workspace with /workspace <key>
# 2. press Ctrl+E to open evaluations
# 3. submit the evaluation against the chosen task or dataset

Keep workspace and project selection explicit. The same scope determines what you will see later in evaluations, traces, and Agents.

For dataset-backed hosted runs from the CLI, use --file evaluation.yaml rather than trying to encode rows as flags.
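For example, a dataset-backed manifest might look like the sketch below. The field names are assumptions for illustration, not the platform's documented schema; the one thing the recipe guarantees is that each row carries its own task_name.

```shell
# Illustrative manifest for a dataset-backed hosted run -- field names
# are assumptions, not the platform's documented schema.
cat > evaluation.yaml <<'EOF'
name: regression-check
model: openai/gpt-4.1-mini
samples:
  - task_name: corp-recon
    input: "enumerate exposed services on the staging host"
  - task_name: corp-recon
    input: "check default credentials on the admin panel"
EOF
# Then launch it with:
#   dn evaluation create regression-check --file evaluation.yaml --wait
```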

3. Inspect one representative failure while the run is live

Do not jump straight to broad analytics. First look at one sample:

dn evaluation wait 9ab81fc1
dn evaluation list-samples 9ab81fc1 --status failed
dn evaluation get-sample 9ab81fc1/75e4914f
dn evaluation get-transcript 9ab81fc1/75e4914f
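If several samples fail, it can help to script the transcript pulls. A minimal sketch, assuming failing sample IDs can be collected one per line; the ID list here is stubbed, not real CLI output:

```shell
# Stub: in practice these IDs would come from
# `dn evaluation list-samples 9ab81fc1 --status failed`.
printf '75e4914f\n8c2d0a11\n' > failed-ids.txt

# Print the command for each failing sample; drop the echo to run it.
while read -r sid; do
  echo "dn evaluation get-transcript 9ab81fc1/${sid}"
done < failed-ids.txt
```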

Focus on:

  • pass rate versus failure rate
  • verification failures versus infrastructure or runtime errors
  • whether several failing samples are actually the same bug pattern
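The first comparison can be eyeballed with standard tools. A sketch, assuming statuses are gathered one per line (the data below is stubbed, not real CLI output):

```shell
# Stubbed per-sample statuses -- the listing format is an assumption;
# substitute however you actually collect statuses from the run.
printf 'passed\nfailed\npassed\nfailed\nerror\n' > statuses.txt

# Tally each status; sorting gives a stable summary order.
awk '{count[$1]++} END {for (s in count) printf "%s %d\n", s, count[s]}' statuses.txt | sort
```

Runtime or infrastructure errors (the `error` rows above) should be triaged separately from genuine verification failures.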

4. Escalate that sample into transcript and trace review

For one suspicious sample:

  • use the transcript to confirm what the agent actually did
  • use trace surfaces if the issue looks like tool use, environment state, or timing
  • keep workspace and project context identical between the evaluation and trace lookup

This is the step that prevents analytics from turning into an unfocused warehouse search.

5. Widen into analytics only if the failure looks systemic

Use Agents after you know what you are looking for:

  • Charts for trend questions
  • Data for exact SQL and CSV export
  • Notebook when you need runs, spans, and evaluation outcomes together
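If you do widen into Data, save the exact query you used so the investigation stays reproducible. A sketch with hypothetical table and column names (the real warehouse schema may differ):

```shell
# Table and column names are hypothetical -- adjust to the real schema.
cat > failing-samples.sql <<'EOF'
SELECT sample_id, status, error_type
FROM evaluation_samples
WHERE evaluation_id = '9ab81fc1'
  AND status = 'failed'
ORDER BY error_type;
EOF
```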

Typical next actions:

  • tighten the task verification logic
  • fix missing runtime config or Secrets
  • rerun after capability or prompt changes
  • promote the pattern into a tracked regression workflow

Record before you move on:
  • the evaluation ID
  • one or more failing sample IDs
  • the representative transcript or trace that explains the failure
  • the saved query or export if you widened into Agents

Keep these heuristics in mind:
  • if failures are mostly runtime or infrastructure errors, debug environment setup before blaming the task or prompt
  • if one failing sample is ambiguous, keep drilling into transcript and trace detail before querying the warehouse
  • if the same failure repeats across runs, turn it into a named regression workflow rather than a one-off investigation
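The last point can start as a simple wrapper script that a scheduler or CI job reruns on a cadence. An illustrative sketch; the scheduling wiring is an assumption, not a platform feature, and only the dn command itself comes from this recipe:

```shell
# Write a rerunnable regression script; `set -e` makes the job exit
# non-zero when the dn command fails, so a scheduler can alert on it.
cat > nightly-regression.sh <<'EOF'
#!/bin/sh
set -e
dn evaluation create regression-check \
  --task corp-recon \
  --model openai/gpt-4.1-mini \
  --wait
EOF
chmod +x nightly-regression.sh
```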