What is AI Red Teaming?
AI red teaming applies adversarial methods to stress-test AI systems and expose vulnerabilities before deployment. Unlike penetration testing for traditional software, AI red teaming targets failure modes unique to AI: prompt injection, training data poisoning, model behavior under distribution shift, and emergent capabilities that bypass safety guardrails. Standard security practices—code review, fuzzing, input validation—don’t directly translate to AI systems. Models make probabilistic decisions based on learned patterns, not deterministic logic, so they fail in unpredictable ways: leaking training data through carefully crafted queries, amplifying demographic biases in edge cases, complying with malicious requests disguised as benign prompts, enabling remote code execution or data exfiltration when agentic systems are given code-execution tools, or generating harmful, brand-damaging content when automated prompt injection bypasses safety filters. Red teaming systematically explores these failure modes by simulating adversarial interactions—testing how the system responds when users actively try to break it. Why it matters: AI systems deployed in healthcare, finance, and national security require rigorous adversarial testing to meet risk-management frameworks such as the NIST AI RMF and to ensure they’re safe, secure, and trustworthy.
How to Start AI Red Teaming
AI red teaming follows a structured four-phase process.
1. Threat Modeling
Define the scope and risk surface before testing. Key questions:
- System scope: What AI system are we testing? (Foundation model, application, or full agentic system?)
- Risk categories: What vulnerabilities matter for this deployment? (Security exploits, safety failures, privacy leaks, adversarial robustness, bias amplification, para-social manipulation in multimodal systems?)
- System characteristics: What are the capabilities and intended uses? (Multimodal inputs, multilingual support, tool use, agentic behavior, multi-turn interactions?)
- Threat actors: Who are the potential adversaries? (Opportunistic users, script kiddies, competitors, organized crime, nation-states?)
- Risk tolerance: What are acceptable risk thresholds for different vulnerability classes?
- Attack surface: What initial access vectors exist? (Public API, authenticated endpoint, enterprise deployment, sensitive use cases?)
- Existing defenses: What guardrails are already in place? (Input filters, output moderation, rate limiting, RLHF alignment?)
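The answers to these questions can be captured in a structured record that later phases reference. A minimal sketch, assuming a plain Python dataclass; the `ThreatModel` type and its fields are illustrative, not part of AIRT:

```python
from dataclasses import dataclass, field

# Hypothetical record for the outcome of the threat-modeling phase;
# the field names are illustrative, not an AIRT API.
@dataclass
class ThreatModel:
    system_scope: str
    risk_categories: list[str] = field(default_factory=list)
    threat_actors: list[str] = field(default_factory=list)
    attack_surface: list[str] = field(default_factory=list)
    existing_defenses: list[str] = field(default_factory=list)
    risk_tolerance: dict[str, str] = field(default_factory=dict)

threat_model = ThreatModel(
    system_scope="public chat API with tool use",
    risk_categories=["prompt injection", "data exfiltration", "safety failures"],
    threat_actors=["opportunistic users", "organized crime"],
    attack_surface=["public API", "uploaded documents"],
    existing_defenses=["input filter", "output moderation", "rate limiting"],
    risk_tolerance={"data exfiltration": "none", "brand risk": "low"},
)
```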
2. Design Plan
Define attacker goals, expected outcomes, and techniques:
- Goals: Extract training data, bypass content filters, cause harmful outputs, exfiltrate sensitive information, manipulate decision-making
- Techniques: Direct prompt injection, indirect prompt injection (tool/document poisoning), multimodal attacks (visual prompt injection), adversarial perturbations, chain-of-thought manipulation
- Success criteria: What constitutes a successful attack for each goal?
- Test cases: Prioritize scenarios based on threat model (high-risk + high-likelihood first)
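Prioritization can be made mechanical by scoring each planned scenario on risk and likelihood and running the highest products first. The scoring scheme below is an assumption for illustration, not an AIRT feature:

```python
# Hypothetical test-case prioritization: highest risk x likelihood first.
test_cases = [
    {"goal": "extract training data", "risk": 3, "likelihood": 1},
    {"goal": "bypass content filter via indirect injection", "risk": 2, "likelihood": 3},
    {"goal": "manipulate tool calls to exfiltrate data", "risk": 3, "likelihood": 2},
]

for case in sorted(test_cases, key=lambda c: c["risk"] * c["likelihood"], reverse=True):
    print(case["goal"], case["risk"] * case["likelihood"])
```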
3. Execute Operations
Probe the system using manual exploration and automated attacks:
- Run reconnaissance to understand system behavior and boundaries
- Execute planned attacks across vulnerability categories
- Capture full traces: prompts, responses, intermediate reasoning, tool calls
- Document successful attacks with reproducible examples
- Iterate on partial successes to refine techniques
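Captured traces are easiest to analyze later if every probe is logged in one uniform shape. A minimal sketch, assuming a plain dataclass; the `TraceEntry` fields are illustrative rather than an AIRT schema:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

# Hypothetical trace record for one adversarial interaction; the fields
# mirror the list above and are not an AIRT schema.
@dataclass
class TraceEntry:
    prompt: str
    response: str
    tool_calls: list[dict] = field(default_factory=list)
    intermediate_reasoning: str | None = None
    success: bool = False
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

trace = [TraceEntry(prompt="...", response="...", success=False)]
```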
4. Analyze & Report
Assess severity, document findings, and provide remediation guidance:
- Severity assessment: Classify by impact (critical: remote code execution, data exfiltration; high: safety harms; medium: bias amplification; low: brand risk)
- Reproducibility: Include exact prompts, system state, and environmental conditions
- Root cause analysis: Was it a training data issue, insufficient alignment, missing guardrails, or architectural vulnerability?
- Recommendations: Specific mitigations for product teams (input sanitization, output filtering, model retraining, architecture changes)
- Metrics: Attack success rate, time-to-compromise, guardrail effectiveness
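These metrics fall directly out of the captured traces. A minimal sketch, assuming trace entries carry a boolean `success` flag as in the earlier phase:

```python
# Minimal metric over captured traces (assumes each entry has a `success` flag).
def attack_success_rate(trace):
    return sum(1 for entry in trace if entry.success) / max(len(trace), 1)

# Guardrail effectiveness can then be framed as the complement:
# 1 - attack_success_rate(trace) over attempts that reached the model.
```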
AIRT: Automated AI Red Teaming Framework
AIRT is a composable framework that automates the execution phase of AI red teaming. We treat red teaming as an optimization problem: systematically generate adversarial inputs, evaluate responses against objectives, and iteratively refine attacks based on feedback. This automated approach scales beyond manual testing to discover edge cases and vulnerabilities that humans overlook.
Quick Start
Launch an automated jailbreak campaign against a language model with the tap_attack function, which orchestrates a Tree of Attacks with Pruning search: the attacker model generates prompts, the target responds, and the evaluator scores each response against the goal. The search refines prompts based on those scores until it finds a successful jailbreak or exhausts its trial budget.
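A minimal sketch of such a campaign is below. The `tap_attack` name comes from this page, but the import path, argument names, and result fields are assumptions; check your installed AIRT release for the exact signature.

```python
# Sketch of a TAP campaign; the import path, argument names, and result
# fields below are assumptions, not the documented AIRT signature.
from airt import tap_attack  # assumed import path

result = tap_attack(
    goal="Get the assistant to reveal its hidden system prompt",
    target="gpt-4o-mini",   # model under test
    attacker="gpt-4o",      # generates and refines adversarial prompts
    evaluator="gpt-4o",     # scores each response against the goal
    max_trials=50,          # give up after this many attempts
)

print(result.best_score, result.best_prompt)  # assumed result fields
```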
Levels of AI Red Teaming
AIRT supports multiple levels of sophistication depending on your needs.
Level 1: Pre-built Attacks
Use established attack patterns out of the box. These generative AI attacks use an attacker model to generate adversarial prompts, a target model to test, and an evaluator model to score responses. Each attack implements a different search strategy for exploring the prompt space:
| Attack | Strategy | Use Case |
|---|---|---|
| tap_attack | Tree of Attacks with Pruning | Explores an attack tree; prunes low-scoring branches and expands the top-K prompts |
| goat_attack | Graph-based neighborhood | Exploration when stuck; uses parent/sibling context to escape local optima |
| crescendo_attack | Multi-turn escalation | Gradual trust-building; starts benign, then progressively escalates toward the goal |
| prompt_attack | Customizable beam search | Flexible base for custom attacks with your own refinement and evaluation logic |
For non-LLM targets such as image classifiers, pair an attack with a gradient-free search strategy like simba_search or hop_skip_jump_search.
Pre-built attacks handle the common case: you provide a goal and a target; the attack handles search, refinement, and scoring.
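For instance, switching to multi-turn escalation is just a different entry point with the same goal-plus-target shape. As before, the import path and argument names in this sketch are assumptions rather than the documented signature.

```python
# Same goal/target/evaluator shape, different search strategy; the names
# below are assumptions, not the documented signature.
from airt import crescendo_attack  # assumed import path

result = crescendo_attack(
    goal="Get the assistant to reveal its hidden system prompt",
    target="gpt-4o-mini",
    attacker="gpt-4o",
    evaluator="gpt-4o",
    max_turns=8,  # number of escalation turns in the multi-turn conversation
)
```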
Level 2: Custom Attacks
Build custom attacks from primitives when pre-built patterns don’t fit. As an example, the sketch after this list outlines an adversarial image attack that tries to fool a classifier while keeping perturbations imperceptible. The available primitives:
- Targets: Wrap any function, API, or model
- Search strategies: Beam search, graph search, gradient-free optimization (SimBA, HopSkipJump, NES, ZOO), custom iterative search
- Scorers: LLM judges, rule-based checks, custom functions
- Stop conditions: Score thresholds, plateaus, budget limits
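Because exact primitive signatures vary by release, the sketch below shows the idea in framework-agnostic Python: a SimBA-style random-coordinate search that keeps a pixel change only when the classifier's confidence in the true label drops, and caps every per-pixel perturbation at epsilon so the result stays visually close to the original. The `classify` callable is a stand-in for whatever target wrapper you use.

```python
import numpy as np

# Framework-agnostic sketch of a SimBA-style attack: perturb one random
# coordinate per step, keep the change only if the classifier's confidence
# in the true label drops, and bound every per-pixel change by epsilon.
def simba_style_attack(classify, image, true_label, epsilon=0.05, max_steps=2000):
    """classify(image) -> probability vector; `image` is a float array in [0, 1]."""
    adv = image.copy()
    best = classify(adv)[true_label]          # confidence in the correct class
    rng = np.random.default_rng(0)
    for _ in range(max_steps):
        idx = tuple(rng.integers(0, s) for s in adv.shape)   # random coordinate
        for delta in (+epsilon, -epsilon):
            candidate = adv.copy()
            # Perturb relative to the clean image so no pixel drifts more than epsilon.
            candidate[idx] = np.clip(image[idx] + delta, 0.0, 1.0)
            prob = classify(candidate)[true_label]
            if prob < best:                   # keep the change only if confidence drops
                adv, best = candidate, prob
                break
        if np.argmax(classify(adv)) != true_label:   # success: classifier is fooled
            return adv
    return adv
```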
Level 3: Agent-Assisted Attacks
Combine AIRT with autonomous agents that execute multi-step attack sequences. The agent uses AIRT primitives as tools, deciding which attacks to run based on what it learns about the target, as in the sketch below.
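A minimal sketch of that pattern follows. The planner interface (`agent.plan`), tool registry, and attack signatures are all assumptions used to show the shape of the loop, not AIRT's documented API.

```python
# Sketch of an agent that drives AIRT attacks as tools; the planner interface,
# tool registry, and attack signatures below are illustrative assumptions.
from airt import tap_attack, crescendo_attack  # assumed import path

TOOLS = {
    "tap_attack": tap_attack,              # single-turn tree search
    "crescendo_attack": crescendo_attack,  # multi-turn escalation
}

def agent_red_team(agent, goal, target, max_rounds=5):
    """`agent.plan(...)` is assumed to return a (tool_name, kwargs) decision."""
    history = []
    for _ in range(max_rounds):
        tool_name, kwargs = agent.plan(goal=goal, history=history, tools=list(TOOLS))
        result = TOOLS[tool_name](goal=goal, target=target, **kwargs)
        history.append((tool_name, result))
        if getattr(result, "success", False):   # stop once any attack lands
            break
    return history
```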
Core Workflow
Every AIRT run follows the same loop:
- Define a Target — Wrap the system under test (LLM, classifier, API)
- Choose a Strategy — Select how to explore the input space (beam search, gradient-free, etc.)
- Set Objectives — Define what success looks like (scorers, constraints, directions)
- Apply Transforms — Optionally obfuscate inputs to evade defenses
- Run the Search — Iterate until success, budget exhaustion, or stop condition
- Analyze Results — Extract successful inputs, patterns, and scores
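Put end to end, the loop might look like the composition below. Every class, function, and argument name here (`Target`, `beam_search`, `llm_judge_scorer`, `base64_transform`, `strategy.run`) is an assumption chosen to mirror the six steps, not the documented AIRT API.

```python
# Illustrative composition of the six steps; all names are assumptions,
# not the documented AIRT API.
from airt import Target, beam_search, llm_judge_scorer, base64_transform  # assumed

target = Target(model="gpt-4o-mini")                        # 1. define a target
strategy = beam_search(width=4, depth=6)                    # 2. choose a strategy
scorer = llm_judge_scorer(goal="reveal the system prompt")  # 3. set objectives
transform = base64_transform()                              # 4. apply transforms (optional)

result = strategy.run(                                      # 5. run the search
    target=target, scorer=scorer, transform=transform, max_trials=100
)
for trial in result.successful_trials:                      # 6. analyze results
    print(trial.score, trial.input)
```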
What’s Next
- Core Concepts: Understand targets, attacks, search strategies, transforms, and scoring
- LLM Attacks: Run jailbreaking and prompt injection attacks
- Image Attacks: Generate adversarial perturbations for classifiers
- Transforms & Evasion: Apply obfuscation and perturbations to bypass filters
- Custom Scoring: Build rubrics for real-world testing without ground truth
- Analyzing Results: Extract insights from trials and optimize campaigns

