Key Features
- Systematic: Encodes red teaming as a structured optimization loop rather than relying on manual effort.
- Model-Agnostic: Targets are abstract. You can attack LLMs, agents, image generation models, or any custom AI system you can represent as a function.
- Composable: Build complex attacks by mixing and matching search algorithms, refinement strategies, scoring functions, and constraints.
- Pragmatic: Includes pre-built attack templates (like
GoATandTAP) and standard LLM-as-a-judge scoring rubrics to get you started quickly. - Safe & Observable: Designed to test and measure your safety posture under controlled conditions, with results logged to the Dreadnode platform for audit and review.
The Big Picture: How an Attack Works
At a high level, every AIRT run follows these steps:- Define a Goal: State the desired outcome (e.g., “cause the model to write a phishing email”).
- Pick a Target: Specify the system under test (an LLM, a toolchain, an API, etc.).
- Choose an Attack: Select a pre-built attack template or compose your own strategy.
- Set Objectives & Constraints: Define what “success” looks like, what rules the attack must follow, and when it should stop.
- Run the Search: The framework iteratively proposes inputs, observes outputs, scores them, and refines its search to find the most effective adversarial examples.
Quick Start
Let’s run a generative attack to find a jailbreak in a language model. The goal is to make the model generate a harmful response that it would normally refuse. You can copy and run this code directly.Prerequisites
Ensure you have the SzDK installed and configured for your project. If you are using a third-party model provider, make sure your API keys (e.g.,OPENAI_API_KEY) are set as environment variables.
What’s Happening?
This example shows the standard AIRT workflow:- Target Definition: We specify
o3-minias the model we want to test usingLLMTarget. - Attack Configuration: We use the
tap_attacktemplate, providing agoal, thetarget, and models to act as the attacker (to generate prompts) and evaluator (to judge success). - Execution: The
.console()method runs the attack and streams live progress, showing each attempt (trial), its proposed prompt, and the score it received.
max_trials.
Troubleshooting
KeyError: OPENAI_API_KEY(or similar): Make sure you’ve set your API keys for the model providers you’re using in your environment.- Async Errors: Attacks are async. Ensure you’re running them inside an async function and using
asyncio.run(), as shown in the example. - No Progress / Low Scores:
- Try a more capable
attacker_model. - Increase
max_trialsto give the search more time. - If you’re getting close but not succeeding, you can lower the early stopping threshold (e.g.,
score > 0.7).
- Try a more capable
- False Positives: If the evaluator seems too lenient, you can tighten the
goalto be more specific or explore custom scoring rubrics.

