
Security Evaluation Operations

Triage a failing evaluation — one sample, one transcript, one trace — before widening into analytics.

When an evaluation fails, the fastest path to a useful answer is one failing sample, one transcript, and one trace, examined before you touch analytics. Use this recipe when pass rates drop and you need to decide whether the cause is a product bug, a task bug, or an infrastructure problem.

Use this recipe when:

  • An evaluation you just ran has unexpected failures
  • You need to distinguish agent behavior from environment or runtime issues
  • You want to avoid turning triage into an unfocused warehouse search

Prerequisites:

  • A completed evaluation with at least one failed sample — see Quickstart
  • Workspace and project scoped correctly (scope mistakes cause most “evaluation disappeared” reports)

1. Look at the shape first, not the details

```shell
dn evaluation get 9ab81fc1
```

Focus on three things before drilling in:

  • pass rate vs failure rate — is this a trend or a one-off?
  • verification failures vs infra/runtime errors — they need different fixes
  • clustered failures — do multiple failing samples look like one bug?

If failures are mostly infra_error or timed_out, fix the environment before blaming the prompt or the model.
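To see whether failures cluster, you can tally failure reasons locally. This is a minimal sketch: the CSV format below (`sample_id,reason`) is an assumption, not the real export format, so substitute whatever fields your evaluation actually emits.

```shell
# Hypothetical failed-sample listing -- the real output format of the
# CLI is not shown here, so adjust the fields to match your export.
cat <<'EOF' > /tmp/failed_samples.csv
75e4914f,verification_failed
a1b2c3d4,infra_error
e5f6a7b8,infra_error
EOF

# Tally failures by reason before reading any transcript: a skew toward
# infra_error or timed_out points at the environment, not the agent.
awk -F',' '{count[$2]++} END {for (r in count) print r, count[r]}' \
    /tmp/failed_samples.csv | sort
# Prints:
#   infra_error 2
#   verification_failed 1
```

If one reason dominates, you have a cluster and can pick any one of its samples as the representative to study.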

2. Drill into one failing sample

```shell
dn evaluation list-samples 9ab81fc1 --status failed
dn evaluation get-sample 9ab81fc1/75e4914f
dn evaluation get-transcript 9ab81fc1/75e4914f
```

The sample lifecycle tells you where it broke. The transcript tells you what the agent thought it was doing. Read both before forming a theory.
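Transcripts are easier to reason about once saved to disk, where you can anchor a theory on the first error line. A sketch, with a purely illustrative transcript format that does not represent the real schema:

```shell
# Illustrative transcript -- a real saved transcript will have its own
# format; only the "find the first error" idea carries over.
cat <<'EOF' > /tmp/transcript_75e4914f.txt
[12:01:03] agent: calling tool read_file
[12:01:04] tool: error: ENOENT: no such file or directory
[12:01:05] agent: retrying with absolute path
EOF

# Locate the first error to anchor the timeline before forming a theory.
grep -n 'error' /tmp/transcript_75e4914f.txt | head -n 1
# Prints:
#   2:[12:01:04] tool: error: ENOENT: no such file or directory
```

The line number and timestamp give you a fixed point: everything before it is context, everything after it is the agent reacting to the failure.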

When the transcript is ambiguous (an unclear tool error, a timing question, a suspicious state transition), widen into traces:

  • Use trace surfaces when the issue looks like tool use, environment state, or timing
  • Keep workspace and project context identical between the evaluation and trace lookup

This is the step that keeps triage focused. A single failing sample, fully understood, is worth more than a hundred partially-understood ones.

3. Widen into Sessions

Once you know what you’re looking for, use Sessions to check whether the pattern appears across runs:

  • Charts for trend questions
  • Data for exact SQL and CSV export
  • Notebook when you need runs, spans, and evaluation outcomes together
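Cross-run checks can start from a plain CSV export before you reach for SQL. A sketch, assuming a hypothetical export schema of `run_id,sample_id,status,reason` (the real Data-surface schema may differ):

```shell
# Hypothetical CSV export -- the column layout is an assumption; map it
# to whatever your Data surface actually exports.
cat <<'EOF' > /tmp/runs.csv
run_01,75e4914f,failed,infra_error
run_01,c9d0e1f2,passed,
run_02,3fa2b1c0,failed,infra_error
run_02,8de4f6a1,failed,verification_failed
run_03,1a2b3c4d,passed,
EOF

# Count the distinct runs that hit each failure reason: a reason that
# recurs across runs is a candidate for a tracked regression.
awk -F',' '$3 == "failed" && !seen[$4 FS $1]++ {runs[$4]++}
           END {for (r in runs) print r, runs[r]}' /tmp/runs.csv | sort
# Prints:
#   infra_error 2
#   verification_failed 1
```

A reason confined to one run usually means a one-off environment hiccup; a reason spanning runs points at the task or the agent.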

| If the failure is… | Fix |
| --- | --- |
| Verification logic too strict or too loose | Update verification in the task |
| Missing API key or credential | Configure a secret |
| Infrastructure or runtime error | Debug environment setup; check sandbox provider |
| Consistent agent mistake | Update the prompt, capability, or task instruction |
| Same failure repeating across runs | Promote to a tracked regression workflow |
Capture these artifacts before you close the investigation or hand it off:

  • the evaluation ID
  • one or more failing sample IDs
  • the representative transcript or trace that explains the failure
  • any saved Sessions query or export