Security Evaluation Operations
Triage a failing evaluation — one sample, one transcript, one trace — before widening into analytics.
When an evaluation fails, the fastest path to a useful answer is one failing sample, one transcript, one trace — before you touch analytics. Use this recipe when pass rates drop and you need to decide whether it’s a product bug, a task bug, or infrastructure.
When to use this workflow
- An evaluation you just ran has unexpected failures
- You need to distinguish agent behavior from environment or runtime issues
- You want to avoid turning triage into an unfocused warehouse search
Prerequisites
- A completed evaluation with at least one failed sample — see Quickstart
- Workspace and project scoped correctly (scope mistakes cause most “evaluation disappeared” reports)
1. Look at the shape first, not the details
```sh
dn evaluation get 9ab81fc1
```

Focus on three things before drilling in:
- pass rate vs failure rate — is this a trend or a one-off?
- verification failures vs infra/runtime errors — they need different fixes
- clustered failures — do multiple failing samples look like one bug?
If failures are mostly `infra_error` or `timed_out`, fix the environment before blaming the prompt or the model.
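The "shape first" pass can be sketched as a small status tally. The sample fields and status names below (`passed`, `failed`, `infra_error`, `timed_out`) are assumptions for illustration, not the evaluation API's documented schema:

```python
from collections import Counter

def summarize_shape(samples):
    """Bucket sample statuses so infra noise is separated from real failures.

    `samples` is a list of dicts with a `status` field; the status names
    here are assumed, not taken from the real schema.
    """
    counts = Counter(s["status"] for s in samples)
    total = len(samples)
    infra = counts["infra_error"] + counts["timed_out"]
    return {
        "pass_rate": counts["passed"] / total,
        "verification_failures": counts["failed"],
        "infra_failures": infra,
        # If infra failures dominate, fix the environment before
        # blaming the prompt or the model.
        "fix_environment_first": infra > counts["failed"],
    }

samples = [{"status": s} for s in
           ["passed", "passed", "failed", "infra_error", "timed_out"]]
print(summarize_shape(samples))
```

With two of three failures being infra-shaped, the sketch routes you to the environment first, which is exactly the short-circuit this step is for.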
2. Drill into one representative failure
```sh
dn evaluation list-samples 9ab81fc1 --status failed
dn evaluation get-sample 9ab81fc1/75e4914f
dn evaluation get-transcript 9ab81fc1/75e4914f
```

The sample lifecycle tells you where it broke. The transcript tells you what the agent thought it was doing. Read both before forming a theory.
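"Read both before forming a theory" can be made concrete: find the first failed lifecycle stage, then the agent's last action before it. The lifecycle events and transcript turns below are invented shapes for illustration, not the CLI's real output format:

```python
def locate_break(lifecycle, transcript):
    """Pair the first failed lifecycle stage with the agent's final
    transcript turn, so the theory starts from evidence."""
    broken = next((e for e in lifecycle if e["ok"] is False), None)
    last_turn = transcript[-1]["content"] if transcript else None
    return {
        "broke_at": broken["stage"] if broken else None,
        "agent_last_said": last_turn,
    }

# Hypothetical data: the run completed but verification failed,
# while the agent believed it had succeeded.
lifecycle = [{"stage": "setup", "ok": True},
             {"stage": "agent_run", "ok": True},
             {"stage": "verification", "ok": False}]
transcript = [{"role": "assistant",
               "content": "Patched the handler; tests should pass now."}]
print(locate_break(lifecycle, transcript))
```

A break at `verification` with a confident final turn points at verification logic or a genuine agent mistake, not infrastructure.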
3. Escalate one sample into trace review
When the transcript is ambiguous — an unclear tool error, a timing question, a suspicious state transition — widen into traces:
- Use trace surfaces when the issue looks like tool use, environment state, or timing
- Keep workspace and project context identical between the evaluation and trace lookup
This is the step that keeps triage focused. A single failing sample, fully understood, is worth more than a hundred partially-understood ones.
4. Only now widen into Sessions analytics
Once you know what you’re looking for, use Sessions to check whether the pattern appears across runs:
- **Charts** for trend questions
- **Data** for exact SQL and CSV export
- **Notebook** when you need runs, spans, and evaluation outcomes together
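If you pull run-level rows out of the Data surface as CSV, the cross-run check is a small grouping pass. The column names below (`run_id`, `status`) are assumptions about the export, not its documented schema:

```python
from collections import defaultdict

def failure_rate_by_run(rows):
    """Group exported sample rows by run and compute each run's failure
    rate, to tell a one-off from a cross-run pattern."""
    by_run = defaultdict(lambda: [0, 0])  # run_id -> [failed, total]
    for row in rows:
        stats = by_run[row["run_id"]]
        stats[1] += 1
        if row["status"] == "failed":
            stats[0] += 1
    return {run: failed / total for run, (failed, total) in by_run.items()}

rows = [{"run_id": "run-1", "status": "failed"},
        {"run_id": "run-1", "status": "passed"},
        {"run_id": "run-2", "status": "failed"}]
print(failure_rate_by_run(rows))
```

A failure rate that is elevated in every run is a pattern worth promoting; one that spikes in a single run points back at that run's environment.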
5. Pick the right follow-up
| If the failure is | Fix |
|---|---|
| Verification logic too strict or too loose | Update verification in the task |
| Missing API key or credential | Configure a secret |
| Infrastructure or runtime error | Debug environment setup; check sandbox provider |
| Consistent agent mistake | Update the prompt, capability, or task instruction |
| Same failure repeating across runs | Promote to a tracked regression workflow |
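The routing in the table amounts to a lookup; a sketch where the category keys are made up for illustration, not a taxonomy the tooling defines:

```python
# Hypothetical failure categories mapped to the table's fixes.
FOLLOW_UPS = {
    "verification_too_strict": "update verification in the task",
    "missing_credential": "configure a secret",
    "infra_error": "debug environment setup; check sandbox provider",
    "agent_mistake": "update the prompt, capability, or task instruction",
    "repeats_across_runs": "promote to a tracked regression workflow",
}

def follow_up(category):
    # An unrecognized category means the triage isn't finished yet,
    # so fall back to re-triage rather than guessing a fix.
    return FOLLOW_UPS.get(category, "re-triage: category not recognized")

print(follow_up("missing_credential"))
```

The explicit fallback matters: forcing every failure into one of the five buckets prematurely is how mis-fixes happen.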
What to keep
- the evaluation ID
- one or more failing sample IDs
- the representative transcript or trace that explains the failure
- any saved Sessions query or export
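These artifacts travel better as one structured record than as loose IDs in a chat thread. A minimal sketch — the field names are mine, not a prescribed format, though the IDs reuse the examples from this page:

```python
from dataclasses import dataclass

@dataclass
class TriageRecord:
    """Everything a teammate needs to re-open the investigation."""
    evaluation_id: str
    failing_sample_ids: list
    representative_transcript: str  # transcript or trace that explains the failure
    sessions_query: str = ""        # saved Sessions query or export, if any

record = TriageRecord(
    evaluation_id="9ab81fc1",
    failing_sample_ids=["75e4914f"],
    representative_transcript="9ab81fc1/75e4914f",
)
print(record)
```

Keeping the representative transcript reference alongside the IDs is the part people skip, and it is the part that makes the record reproducible later.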