Security Evaluation Operations
Run security tasks as repeatable evaluations, review failures, and widen the investigation in analytics.
Use this recipe when you already know what should be tested and need a repeatable pass/fail run. The short version is: choose the right task or dataset, launch in the right scope, inspect one failing sample, then widen into analytics only if the failure looks systemic.
When to use this workflow
- you need repeatable security verification instead of one transcript
- you already have a Security Task or a Dataset
- you want a clean path from one failing sample to traces and analytics
What you need before you start
- the correct workspace and project
- a task or a dataset-backed manifest
- a clear idea of whether you are checking product behavior, environment setup, or both
| Use this input | When it is the primary driver |
|---|---|
| task | you need the right environment and verification logic |
| dataset | you need pinned per-sample rows and each row already carries its own task_name |
Recipe
1. Decide what drives the run
Choose the task when the environment and verifier are what you primarily care about. Choose the dataset when the sample set itself is the focus and each row already names the task to run.
2. Launch the evaluation in the right scope
From the CLI:
```shell
dn evaluation create regression-check \
  --task corp-recon \
  --model openai/gpt-4.1-mini \
  --concurrency 4 \
  --wait
```

Or from the TUI:
```shell
dreadnode
# 1. switch to the target workspace with /workspace <key>
# 2. press Ctrl+E to open evaluations
# 3. submit the evaluation against the chosen task or dataset
```

Keep workspace and project selection explicit. The same scope determines what you will see later in /tui/evaluations/, traces, and /platform/agents/.
For dataset-backed hosted runs from the CLI, use --file evaluation.yaml rather than trying to encode rows as flags.
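As a rough sketch of what a dataset-backed evaluation.yaml can carry, with each row naming its own task as described above. Every field name here is an illustrative assumption, not the documented schema; check the platform reference for the real manifest format:

```yaml
# Hypothetical manifest shape -- field names are assumptions.
name: regression-check
model: openai/gpt-4.1-mini
concurrency: 4
dataset:
  samples:
    - task_name: corp-recon        # each row carries its own task_name
      input: "enumerate exposed services on the staging host"
    - task_name: corp-recon
      input: "attempt credential reuse against the admin panel"
```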
3. Inspect one representative failure while the run is live
Do not jump straight to broad analytics. First look at one sample:
```shell
dn evaluation wait 9ab81fc1
dn evaluation list-samples 9ab81fc1 --status failed
dn evaluation get-sample 9ab81fc1/75e4914f
dn evaluation get-transcript 9ab81fc1/75e4914f
```

Focus on:
- pass rate versus failure rate
- verification failures versus infrastructure or runtime errors
- whether several failing samples are actually the same bug pattern
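A quick tally helps with the second and third points before you read any transcripts. Assuming you can export per-sample rows as CSV with a status column (the file name, columns, and status values here are assumptions, not a documented export format):

```shell
# samples.csv is a hypothetical export: sample_id,status
# status values are illustrative (passed, verification_failed, runtime_error)
cat > samples.csv <<'EOF'
75e4914f,verification_failed
a1b2c3d4,passed
e5f6a7b8,runtime_error
c9d0e1f2,verification_failed
EOF

# count samples per status: a skew toward runtime_error points at
# environment setup rather than the task or prompt
cut -d, -f2 samples.csv | sort | uniq -c | sort -rn
```

A cluster under one status, or several failures sharing a status, is the hint that you are looking at one bug pattern rather than many.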
4. Escalate that sample into transcript and trace review
For one suspicious sample:
- use the transcript to confirm what the agent actually did
- use trace surfaces if the issue looks like tool use, environment state, or timing
- keep workspace and project context identical between the evaluation and trace lookup
This is the step that prevents analytics from turning into an unfocused warehouse search.
5. Only then widen into Agents
Use Agents after you know what you are looking for:
- Charts for trend questions
- Data for exact SQL and CSV export
- Notebook when you need runs, spans, and evaluation outcomes together
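The trend question Charts answers can be reproduced in miniature once you have an export. Assuming a CSV of per-sample statuses keyed by run (file name and columns are assumptions), pass rate per run is a one-liner:

```shell
# runs.csv is a hypothetical export: run_id,status
cat > runs.csv <<'EOF'
run-01,passed
run-01,verification_failed
run-02,passed
run-02,passed
EOF

# pass rate per run, printed as "run_id rate"
awk -F, '{ total[$1]++; if ($2 == "passed") pass[$1]++ }
  END { for (r in total) printf "%s %.2f\n", r, pass[r] / total[r] }' runs.csv
```

If the rate is flat across runs, the failure is systemic and worth the warehouse query; if it moved with a recent change, you already have your answer.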
6. Choose the follow-up
Typical next actions:
- tighten the task verification logic
- fix missing runtime config or Secrets
- rerun after capability or prompt changes
- promote the pattern into a tracked regression workflow
What to keep
- the evaluation ID
- one or more failing sample IDs
- the representative transcript or trace that explains the failure
- the saved query or export if you widened into Agents
Branches and decisions
- if failures are mostly runtime or infrastructure errors, debug environment setup before blaming the task or prompt
- if one failing sample is ambiguous, keep drilling into transcript and trace detail before querying the warehouse
- if the same failure repeats across runs, turn it into a named regression workflow rather than a one-off investigation
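The last branch, promoting a repeated failure into a named regression workflow, can start as small as a scheduled script that reruns the evaluation from step 2 and gates on the failure count. The gate is sketched here as a plain function; how you obtain the failed-sample count from the run is up to your export path:

```shell
# gate: exit non-zero when failed samples exceed an allowed budget
regression_gate() {
  failed="$1"; budget="$2"
  if [ "$failed" -gt "$budget" ]; then
    echo "regression: $failed failures (budget $budget)"
    return 1
  fi
  echo "ok: $failed failures within budget $budget"
  return 0
}

# in a scheduled job, first rerun the evaluation, e.g.
#   dn evaluation create regression-check --task corp-recon --wait
# then feed the failed-sample count into the gate:
regression_gate 3 1 || echo "open an investigation with the failing sample IDs"
```

Keeping the evaluation ID and failing sample IDs from "What to keep" in the job's output means the next investigation starts where this one ended.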