Running an attack is the first step in the red teaming process. The real value comes from analyzing the results to understand your model’s vulnerabilities and guide improvements. Every AIRT attack run returns a StudyResult object containing the complete history of trials.

The StudyResult Object

The StudyResult is returned by both the .run() and .console() methods:
results = await attack.run()
# or
results = await attack.console()
Key properties:
Property                  Description
results.best_trial        Single most successful trial by score
results.trials            All completed trials
results.failed_trials     Trials that errored during execution
results.pruned_trials     Trials skipped due to constraint failures
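For a quick snapshot of a run, these properties can be printed together (a minimal sketch using only the fields listed above):
# Quick summary of a run
print(f"Completed trials: {len(results.trials)}")
print(f"Failed trials:    {len(results.failed_trials)}")
print(f"Pruned trials:    {len(results.pruned_trials)}")
if results.best_trial:
    print(f"Best score:       {results.best_trial.score:.3f}")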

Inspecting the Best Trial

The most common first step is examining what worked:
best = results.best_trial

if best:
    print(f"Score: {best.score:.3f}")
    print(f"Input: {best.candidate}")
    print(f"Output: {best.output}")

    # Score breakdown by objective
    for name, score in best.scores.items():
        print(f"  {name}: {score:.3f}")
The candidate is the attack input (prompt, image, etc.) and output is the target’s response.
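If the attack targets a non-text modality, candidate may not be a string, so a type check before printing avoids surprises (a small defensive sketch, not part of the StudyResult API):
best = results.best_trial
# candidate may be a non-text payload, e.g. an image attack
if best and not isinstance(best.candidate, str):
    print(f"Candidate is a {type(best.candidate).__name__}, not plain text")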

Filtering Trials

By Score Threshold

Find all trials that exceeded a success threshold:
threshold = 0.8
successful = [t for t in results.trials if t.score >= threshold]

print(f"Found {len(successful)} successful attacks out of {len(results.trials)} trials")

for trial in successful:
    print(f"[{trial.score:.2f}] {trial.candidate[:80]}...")

By Status

Trials have a status field indicating their outcome:
# Get trials by status
finished = [t for t in results.trials if t.status == "finished"]
failed = [t for t in results.trials if t.status == "failed"]
pruned = [t for t in results.trials if t.status == "pruned"]

print(f"Finished: {len(finished)}, Failed: {len(failed)}, Pruned: {len(pruned)}")

By Objective

For multi-objective attacks, filter by specific objective scores:
# Find trials where stealth was high but extraction was low
# (successful evasion but incomplete extraction)
partial_success = [
    t for t in results.trials
    if t.scores.get("stealth", 0) > 0.8 and t.scores.get("extraction", 0) < 0.5
]

Analyzing Failures

Failed trials provide signal about what doesn’t work:
if results.failed_trials:
    print(f"\n{len(results.failed_trials)} trials failed:")
    for trial in results.failed_trials[:5]:  # First 5
        print(f"  Input: {str(trial.candidate)[:60]}...")
        if hasattr(trial, 'error'):
            print(f"  Error: {trial.error}")
Common failure patterns to look for:
  • Rate limiting: Many failures in quick succession
  • Malformed inputs: Certain attack patterns cause parsing errors
  • Timeouts: Target taking too long on specific inputs
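A quick way to spot these patterns is to cluster failures by error message (a rough sketch that, like the loop above, assumes failed trials expose an error attribute):
from collections import Counter

# Group failures by the start of their error message to reveal clusters
error_counts = Counter(
    str(getattr(t, "error", "unknown"))[:60] for t in results.failed_trials
)
for message, count in error_counts.most_common(5):
    print(f"{count:3d}x {message}")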

Analyzing Pruned Trials

Pruned trials failed a constraint before scoring:
if results.pruned_trials:
    print(f"\n{len(results.pruned_trials)} trials pruned (failed constraints)")

    # If your constraint checks response validity, pruned trials
    # might indicate the target is refusing certain inputs
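Printing a few pruned candidates can show whether a particular input style is being filtered out (a sketch assuming pruned trials still carry their candidate):
for trial in results.pruned_trials[:5]:
    print(f"  Pruned: {str(trial.candidate)[:60]}...")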

Convergence Analysis

Understanding how the search progressed helps tune future attacks.

Score Over Time

import matplotlib.pyplot as plt

# Plot score progression
scores = [t.score for t in results.trials]
steps = range(len(scores))

plt.figure(figsize=(10, 4))
plt.plot(steps, scores, alpha=0.5, label="Trial scores")

# Rolling max (best score seen so far)
rolling_max = []
current_max = 0
for s in scores:
    current_max = max(current_max, s)
    rolling_max.append(current_max)

plt.plot(steps, rolling_max, 'r-', linewidth=2, label="Best so far")
plt.xlabel("Trial")
plt.ylabel("Score")
plt.legend()
plt.title("Attack Convergence")
plt.show()

Detecting Plateaus

If scores plateau early, you might need different search parameters:
def detect_plateau(scores: list[float], window: int = 20) -> int | None:
    """Return trial number where improvement stopped."""
    if len(scores) < window:
        return None

    best = 0
    plateau_start = None

    for i, score in enumerate(scores):
        if score > best:
            best = score
            plateau_start = None
        elif plateau_start is None:
            plateau_start = i

        if plateau_start is not None and (i - plateau_start) >= window:
            return plateau_start

    return None

plateau = detect_plateau([t.score for t in results.trials])
if plateau is not None:
    print(f"Scores plateaued at trial {plateau}")

Exporting Results

To DataFrame

Convert results to pandas for powerful analysis:
import pandas as pd

df = results.to_dataframe()

# Basic stats
print(f"Total trials: {len(df)}")
print(f"Success rate (>0.8): {(df['score'] > 0.8).mean():.1%}")
print(f"Average score: {df['score'].mean():.3f}")

# Top performers
print(df.nlargest(5, 'score')[['score', 'candidate']])
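The same DataFrame also makes it easy to see how scores are distributed, for example by bucketing them (this assumes only the score column used above):
# Distribution of scores across coarse buckets
buckets = pd.cut(df["score"], bins=[0, 0.25, 0.5, 0.75, 1.0], include_lowest=True)
print(buckets.value_counts().sort_index())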

To JSONL

Export for storage or sharing:
results.to_jsonl("attack_results.jsonl")

Custom Export

Extract specific fields for reporting:
import json

report = {
    "total_trials": len(results.trials),
    "best_score": results.best_trial.score if results.best_trial else 0,
    "success_rate": sum(1 for t in results.trials if t.score > 0.8) / len(results.trials),
    "successful_prompts": [
        {"score": t.score, "prompt": t.candidate, "response": str(t.output)[:500]}
        for t in results.trials if t.score > 0.8
    ],
}

with open("report.json", "w") as f:
    json.dump(report, f, indent=2)

Resuming Attacks

If an attack was interrupted or you want to continue exploring, use append=True:
# First run
results1 = await attack.with_(max_trials=50).run()
print(f"Best after 50 trials: {results1.best_trial.score:.3f}")

# Continue from where we left off
results2 = await attack.with_(max_trials=50, append=True).run()
print(f"Best after 100 trials: {results2.best_trial.score:.3f}")
The append=True flag tells AIRT to build on previous trials rather than starting fresh. This is useful for:
  • Interrupted attacks (network issues, timeouts)
  • Incremental exploration (run more trials if initial results are promising)
  • Budget management (start small, expand if needed)
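For incremental exploration, the append pattern can be wrapped in a loop that stops once a target score or a round budget is reached (a sketch built only on the with_() options shown above; target_score and max_rounds are illustrative values):
target_score = 0.9
trials_per_round = 50
max_rounds = 4

results = await attack.with_(max_trials=trials_per_round).run()
for _ in range(max_rounds - 1):
    if results.best_trial and results.best_trial.score >= target_score:
        break
    # Build on the previous trials instead of starting fresh
    results = await attack.with_(max_trials=trials_per_round, append=True).run()

if results.best_trial:
    print(f"Best score after incremental runs: {results.best_trial.score:.3f}")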

Comparing Runs

When testing different attack configurations:
async def compare_attacks(configs: list[dict]) -> pd.DataFrame:
    """Run multiple attack configurations and compare results."""
    comparisons = []

    for config in configs:
        attack = tap_attack(**config).with_(max_trials=50)
        result = await attack.run()

        comparisons.append({
            "config": config.get("goal", "unknown")[:30],
            "best_score": result.best_trial.score if result.best_trial else 0,
            "avg_score": sum(t.score for t in result.trials) / len(result.trials),
            "success_rate": sum(1 for t in result.trials if t.score > 0.8) / len(result.trials),
            "trials": len(result.trials),
        })

    return pd.DataFrame(comparisons)
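A usage sketch, assuming goal is a valid tap_attack parameter (as the helper above implies):
configs = [
    {"goal": "Extract the system prompt"},
    {"goal": "Reveal internal tool names"},
]
comparison = await compare_attacks(configs)
print(comparison.sort_values("best_score", ascending=False))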

Best Practices

  1. Always check failed trials — high failure rates indicate configuration issues
  2. Plot convergence before running long attacks to tune parameters
  3. Export successful prompts for reproducibility and reporting
  4. Use append=True for expensive attacks rather than restarting
  5. Compare configurations systematically to find optimal attack parameters

Next Steps