Security Evaluation Operations
Triage a failing evaluation — one sample, one transcript, one trace — before widening into analytics.
When an evaluation fails, the fastest path to a useful answer is one failing sample, one transcript, one trace — before you touch analytics. Use this recipe when pass rates drop and you need to decide whether it’s a product bug, a task bug, or infrastructure.
When to use this workflow
- An evaluation you just ran has unexpected failures
- You need to distinguish agent behavior from environment or runtime issues
- You want to avoid turning triage into an unfocused warehouse search
Prerequisites
- A completed evaluation with at least one failed sample — see Quickstart
- Workspace and project scoped correctly (scope mistakes cause most “evaluation disappeared” reports)
1. Look at the shape first, not the details
```sh
dn evaluation get 9ab81fc1
```

Focus on three things before drilling in:
- pass rate vs failure rate — is this a trend or a one-off?
- verification failures vs infra/runtime errors — they need different fixes
- clustered failures — do multiple failing samples look like one bug?
If failures are mostly `infra_error` or `timed_out`, fix the environment before blaming the prompt or the model.
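The "shape first" pass can be sketched as a small status tally. The sample fields and status names below (`passed`, `failed`, `infra_error`, `timed_out`) are assumptions for illustration, not the evaluation API's documented schema:

```python
from collections import Counter

def summarize_shape(samples):
    """Bucket sample statuses so infra noise is separated from real failures.

    `samples` is a list of dicts with a `status` field; the status names
    here are assumed, not taken from the real schema.
    """
    counts = Counter(s["status"] for s in samples)
    total = len(samples)
    infra = counts["infra_error"] + counts["timed_out"]
    return {
        "pass_rate": counts["passed"] / total,
        "verification_failures": counts["failed"],
        "infra_failures": infra,
        # If infra failures dominate, fix the environment before
        # blaming the prompt or the model.
        "fix_environment_first": infra > counts["failed"],
    }

samples = [{"status": s} for s in
           ["passed", "passed", "failed", "infra_error", "timed_out"]]
print(summarize_shape(samples))
```

With two of three failures being infra-shaped, the sketch routes you to the environment first, which is exactly the short-circuit this step is for.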
2. Drill into one representative failure
```sh
dn evaluation list-samples 9ab81fc1 --status failed
dn evaluation get-sample 9ab81fc1/75e4914f
dn evaluation get-transcript 9ab81fc1/75e4914f
```

The sample lifecycle tells you where it broke. The transcript tells you what the agent thought it was doing. Read both before forming a theory.
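"Read both before forming a theory" can be made concrete: find the first failed lifecycle stage, then the agent's last action before it. The lifecycle events and transcript turns below are invented shapes for illustration, not the CLI's real output format:

```python
def locate_break(lifecycle, transcript):
    """Pair the first failed lifecycle stage with the agent's final
    transcript turn, so the theory starts from evidence."""
    broken = next((e for e in lifecycle if e["ok"] is False), None)
    last_turn = transcript[-1]["content"] if transcript else None
    return {
        "broke_at": broken["stage"] if broken else None,
        "agent_last_said": last_turn,
    }

# Hypothetical data: the run completed but verification failed,
# while the agent believed it had succeeded.
lifecycle = [{"stage": "setup", "ok": True},
             {"stage": "agent_run", "ok": True},
             {"stage": "verification", "ok": False}]
transcript = [{"role": "assistant",
               "content": "Patched the handler; tests should pass now."}]
print(locate_break(lifecycle, transcript))
```

A break at `verification` with a confident final turn points at verification logic or a genuine agent mistake, not infrastructure.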
3. Escalate one sample into trace review
When the transcript is ambiguous — an unclear tool error, a timing question, a suspicious state transition — widen into traces:
- Use trace surfaces when the issue looks like tool use, environment state, or timing
- Keep workspace and project context identical between the evaluation and trace lookup
This is the step that keeps triage focused. A single failing sample, fully understood, is worth more than a hundred partially-understood ones.
4. Only now widen into Sessions analytics
Once you know what you’re looking for, use Sessions to check whether the pattern appears across runs:
- **Charts** for trend questions
- **Data** for exact SQL and CSV export
- **Notebook** when you need runs, spans, and evaluation outcomes together
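If you pull run-level rows out of the Data surface as CSV, the cross-run check is a small grouping pass. The column names below (`run_id`, `status`) are assumptions about the export, not its documented schema:

```python
from collections import defaultdict

def failure_rate_by_run(rows):
    """Group exported sample rows by run and compute each run's failure
    rate, to tell a one-off from a cross-run pattern."""
    by_run = defaultdict(lambda: [0, 0])  # run_id -> [failed, total]
    for row in rows:
        stats = by_run[row["run_id"]]
        stats[1] += 1
        if row["status"] == "failed":
            stats[0] += 1
    return {run: failed / total for run, (failed, total) in by_run.items()}

rows = [{"run_id": "run-1", "status": "failed"},
        {"run_id": "run-1", "status": "passed"},
        {"run_id": "run-2", "status": "failed"}]
print(failure_rate_by_run(rows))
```

A failure rate that is elevated in every run is a pattern worth promoting; one that spikes in a single run points back at that run's environment.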
5. Pick the right follow-up
| If the failure is | Fix |
|---|---|
| Verification logic too strict or too loose | Update verification in the task |
| Missing API key or credential | Configure a secret |
| Infrastructure or runtime error | Debug environment setup; check sandbox provider |
| Consistent agent mistake | Update the prompt, capability, or task instruction |
| Same failure repeating across runs | Promote to a tracked regression workflow |
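The routing in the table amounts to a lookup; a sketch where the category keys are made up for illustration, not a taxonomy the tooling defines:

```python
# Hypothetical failure categories mapped to the table's fixes.
FOLLOW_UPS = {
    "verification_too_strict": "update verification in the task",
    "missing_credential": "configure a secret",
    "infra_error": "debug environment setup; check sandbox provider",
    "agent_mistake": "update the prompt, capability, or task instruction",
    "repeats_across_runs": "promote to a tracked regression workflow",
}

def follow_up(category):
    # An unrecognized category means the triage isn't finished yet,
    # so fall back to re-triage rather than guessing a fix.
    return FOLLOW_UPS.get(category, "re-triage: category not recognized")

print(follow_up("missing_credential"))
```

The explicit fallback matters: forcing every failure into one of the five buckets prematurely is how mis-fixes happen.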
What to keep
- the evaluation ID
- one or more failing sample IDs
- the representative transcript or trace that explains the failure
- any saved Sessions query or export
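These artifacts travel better as one structured record than as loose IDs in a chat thread. A minimal sketch — the field names are mine, not a prescribed format, though the IDs reuse the examples from this page:

```python
from dataclasses import dataclass

@dataclass
class TriageRecord:
    """Everything a teammate needs to re-open the investigation."""
    evaluation_id: str
    failing_sample_ids: list
    representative_transcript: str  # transcript or trace that explains the failure
    sessions_query: str = ""        # saved Sessions query or export, if any

record = TriageRecord(
    evaluation_id="9ab81fc1",
    failing_sample_ids=["75e4914f"],
    representative_transcript="9ab81fc1/75e4914f",
)
print(record)
```

Keeping the representative transcript reference alongside the IDs is the part people skip, and it is the part that makes the record reproducible later.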