Traces

Inspect individual attack conversations, trial details, and scoring for AI red teaming runs.

Traces capture the full conversation history of every trial in an attack run. Use them to understand exactly what prompts were sent, how the target responded, and how each response was scored. Traces are the evidence of where the model is failing: they give model builders, and particularly post-safety-training teams, the exact data needed to build mitigations for the risks identified, including the winning adversarial prompt, the harmful response the model produced, and the judge's reasoning for why it scored as a jailbreak.

The Traces view shows all attack traces for the project, each tagged with its outcome:

Traces view showing studies list with jailbreak, refusal, and partial tags

Each trace entry shows:

  • Study name - the attack type (e.g., study:tap_attack)
  • Duration - how long the study took to execute
  • Type - study label
  • Outcome badge - color-coded result:
    • jailbreak (red) - attack succeeded
    • refusal (green) - target refused
    • partial (yellow) - partial success

Click any trace to expand its trace tree. The trace tree shows the hierarchical structure of the attack:

  • Trace span - top-level container for the attack
    • Trial spans - individual optimization iterations
      • Target call - the prompt sent and response received
      • Evaluator call - the judge model’s score

Each span includes:

  • Full prompt text sent to the target
  • Complete target response
  • Jailbreak score (0.0 to 1.0)
  • Timing information
  • Model configuration
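As a rough mental model of what one trial span carries, the sketch below bundles the fields above into a single record. The class name, field names, and outcome thresholds are illustrative assumptions, not the tool's actual trace schema:

```python
from dataclasses import dataclass

# Illustrative only: names and thresholds are assumptions, not dn's trace schema.
@dataclass
class TrialSpan:
    prompt: str          # full prompt text sent to the target
    response: str        # complete target response
    score: float         # jailbreak score, 0.0 to 1.0
    duration_ms: int     # timing information
    model_config: dict   # model configuration for the call

    def outcome(self, jailbreak_at: float = 0.8, refusal_at: float = 0.2) -> str:
        """Map a score to an outcome tag (hypothetical thresholds)."""
        if self.score >= jailbreak_at:
            return "jailbreak"
        if self.score <= refusal_at:
            return "refusal"
        return "partial"

trial = TrialSpan("...", "I can't help with that.", 0.05, 1200, {"model": "target-v1"})
print(trial.outcome())  # refusal
```

The point of the sketch is that everything needed to reproduce and triage a finding lives on the span itself: the prompt, the response, the score, and the configuration that produced them.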

Toggle between two view modes in the top-right:

  • Detail - structured view with expandable spans and formatted content
  • Timeline - chronological waterfall view showing execution timing across spans

Access trace data from the command line:

```sh
# Get trace statistics for an assessment
dn airt traces <assessment-id>

# Get attack-level spans
dn airt attacks <assessment-id>

# Get trial-level spans with filtering
dn airt trials <assessment-id> --min-score 0.8
dn airt trials <assessment-id> --attack-name tap --jailbreaks-only
dn airt trials <assessment-id> --limit 10
```
| Filter | Description |
| --- | --- |
| `--attack-name` | Filter by attack type (`tap`, `pair`, `crescendo`, etc.) |
| `--min-score` | Only show trials above this score threshold |
| `--jailbreaks-only` | Only show successful jailbreaks |
| `--limit` | Maximum number of trials to return |
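If you think of each trial as a record, the flags correspond roughly to the filtering below. The record shape and the assumption that flags combine with AND semantics are illustrative, not taken from dn's implementation:

```python
def filter_trials(trials, attack_name=None, min_score=None,
                  jailbreaks_only=False, limit=None):
    """Approximate the CLI filter flags over trial records (illustrative sketch)."""
    out = trials
    if attack_name is not None:
        out = [t for t in out if t["attack"] == attack_name]
    if min_score is not None:
        out = [t for t in out if t["score"] >= min_score]
    if jailbreaks_only:
        out = [t for t in out if t["outcome"] == "jailbreak"]
    if limit is not None:
        out = out[:limit]
    return out

trials = [
    {"attack": "tap", "score": 0.9, "outcome": "jailbreak"},
    {"attack": "tap", "score": 0.1, "outcome": "refusal"},
    {"attack": "pair", "score": 0.85, "outcome": "jailbreak"},
]
print(filter_trials(trials, attack_name="tap", jailbreaks_only=True))
# [{'attack': 'tap', 'score': 0.9, 'outcome': 'jailbreak'}]
```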

Traces help you answer:

  • What worked? - sort by score to find the highest-scoring trials and examine the prompts that succeeded
  • Why did it work? - read the full conversation to understand the attack path
  • Which transforms helped? - compare scores with and without specific transforms
  • Which attack is most effective? - compare outcomes across study types for the same goal
  • Is the model consistently vulnerable? - look at outcome distribution (jailbreak vs refusal ratio)
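For the last two questions in particular, a small sketch of how you might summarize exported trial records shows the idea: count outcomes for the distribution, and sort by score to surface the prompts that worked. The record fields are assumptions for illustration:

```python
from collections import Counter

def outcome_distribution(trials):
    """Count outcome tags and compute a jailbreak rate (illustrative helper)."""
    counts = Counter(t["outcome"] for t in trials)
    rate = counts["jailbreak"] / len(trials) if trials else 0.0
    return counts, rate

def top_trials(trials, n=3):
    """Sort by score descending to find the highest-scoring trials."""
    return sorted(trials, key=lambda t: t["score"], reverse=True)[:n]

trials = [
    {"attack": "tap", "score": 0.9, "outcome": "jailbreak"},
    {"attack": "tap", "score": 0.1, "outcome": "refusal"},
    {"attack": "crescendo", "score": 0.6, "outcome": "partial"},
    {"attack": "pair", "score": 0.95, "outcome": "jailbreak"},
]
counts, rate = outcome_distribution(trials)
print(rate)                              # 0.5
print(top_trials(trials, n=1)[0]["attack"])  # pair
```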