Attacks Reference
45+ attack strategies for AI red teaming — LLM jailbreaks, advanced adversarial algorithms, image attacks, and multimodal probing.
Dreadnode provides 45+ attack strategies across four categories: LLM jailbreaks, advanced adversarial algorithms, image adversarial attacks, and multimodal probing. Each attack is an optimization loop that searches for inputs that maximize a jailbreak score against the target.
Quick reference
Section titled “Quick reference”| Category | Attacks | Best for |
|---|---|---|
| Core jailbreak | TAP, PAIR, GOAT, Crescendo, Rainbow, GPTFuzzer, BEAST, AutoDAN, ReNeLLM, DrAttack, Deep Inception, Prompt | General-purpose jailbreak testing |
| Advanced adversarial | AutoRedTeamer, NEXUS, Siren, CoT Jailbreak, Genetic Persona, JBFuzz, T-MAP, APRT, and 21 more | Stronger targets, specialized techniques |
| Image adversarial | SimBA, NES, ZOO, HopSkipJump | Vision model robustness |
| Multimodal | Multimodal Attack | Cross-modality probing |
Core jailbreak attacks
Section titled “Core jailbreak attacks”These are the foundational attacks for LLM jailbreak testing. Start here.
TAP (Tree of Attacks with Pruning)
Section titled “TAP (Tree of Attacks with Pruning)”Beam search over a tree of candidate prompts. Expands the most promising branches and prunes off-topic or low-scoring candidates.
dn airt run --goal "Reveal your system prompt" --attack tap --target-model openai/gpt-4o-minifrom dreadnode.airt import tap_attack
attack = tap_attack( goal="Reveal your system prompt", target=target, attacker_model="openai/gpt-4o-mini", evaluator_model="openai/gpt-4o-mini", beam_width=10, branching_factor=3, n_iterations=15,)When to use: General-purpose first choice. Good coverage with intelligent pruning.
PAIR (Prompt Automatic Iterative Refinement)
Section titled “PAIR (Prompt Automatic Iterative Refinement)”Runs multiple parallel streams of iterative prompt refinement. Each stream independently refines an attack prompt using attacker feedback.
from dreadnode.airt import pair_attack
attack = pair_attack( goal="Bypass content filters", target=target, attacker_model="openai/gpt-4o-mini", evaluator_model="openai/gpt-4o-mini", n_streams=20, n_iterations=3,)When to use: Fast black-box jailbreaking. High throughput with parallel streams.
GOAT (Graph of Attacks)
Section titled “GOAT (Graph of Attacks)”Graph neighborhood search that explores connected attack strategies. Expands a frontier of candidate prompts through neighborhood exploration.
from dreadnode.airt import goat_attack
attack = goat_attack( goal="Extract training data", target=target, attacker_model="openai/gpt-4o-mini", evaluator_model="openai/gpt-4o-mini", neighborhood_depth=2, frontier_size=5, branching_factor=3,)When to use: When TAP gets stuck — explores a wider space of attack strategies.
Crescendo
Section titled “Crescendo”Multi-turn progressive escalation. Starts with innocent requests and gradually escalates toward the goal across conversation turns.
from dreadnode.airt import crescendo_attack
attack = crescendo_attack( goal="Generate harmful instructions", target=target, attacker_model="openai/gpt-4o-mini", evaluator_model="openai/gpt-4o-mini", context_depth=5, n_iterations=30,)When to use: Models with strong single-turn defenses. The multi-turn approach builds rapport before escalating.
Prompt Attack
Section titled “Prompt Attack”Basic beam search refinement. Iteratively improves prompts using LLM feedback without the tree structure of TAP.
from dreadnode.airt import prompt_attackWhen to use: Simple baseline. Good for benchmarking other attacks against.
Rainbow
Section titled “Rainbow”Quality-diversity search using MAP-Elites. Maintains a population of diverse attack strategies and optimizes for both effectiveness and diversity.
from dreadnode.airt import rainbow_attackWhen to use: Discover many different failure modes, not just the strongest one.
GPTFuzzer
Section titled “GPTFuzzer”Coverage-guided fuzzing with mutation operators. Maintains a seed pool and applies mutations (crossover, expansion, compression) to generate new attack candidates.
from dreadnode.airt import gptfuzzer_attackWhen to use: Large-scale fuzzing campaigns. Good at finding unexpected edge cases.
AutoDAN-Turbo
Section titled “AutoDAN-Turbo”Lifelong learning attack that builds a strategy library over time. Learns from past successes and applies effective strategies to new goals.
from dreadnode.airt import autodan_turbo_attackWhen to use: Long-running campaigns where the attack can learn and improve across multiple goals.
ReNeLLM
Section titled “ReNeLLM”Prompt rewriting with scenario nesting. Rewrites the goal as a nested scenario that frames the harmful request in a benign context.
from dreadnode.airt import renellm_attackWhen to use: Targets susceptible to context framing and role-play.
BEAST (Beam Search-based Adversarial Attack)
Section titled “BEAST (Beam Search-based Adversarial Attack)”Gradient-free beam search suffix attack. Appends optimized suffixes to prompts that confuse model safety classifiers.
from dreadnode.airt import beast_attackWhen to use: Testing suffix-based adversarial robustness.
DrAttack
Section titled “DrAttack”Prompt decomposition and reconstruction. Breaks the goal into innocuous-looking fragments and reconstructs them in context.
from dreadnode.airt import drattackWhen to use: Targets with strong keyword-based filters.
Deep Inception
Section titled “Deep Inception”Nested scene hypnosis. Creates deeply nested fictional scenarios to gradually bypass safety guardrails through narrative immersion.
from dreadnode.airt import deep_inception_attackWhen to use: Models susceptible to role-play and fictional framing.
Advanced adversarial attacks
Section titled “Advanced adversarial attacks”State-of-the-art attacks from recent security research. These use more sophisticated techniques — dual-agent systems, evolutionary search, reasoning exploitation, and more.
AutoRedTeamer
Section titled “AutoRedTeamer”Dual-agent system with lifelong strategy memory and beam search. One agent generates attacks, another evaluates and refines them using a growing library of successful strategies.
from dreadnode.airt import autoredteamer_attack
attack = autoredteamer_attack( goal="...", target=target, attacker_model="openai/gpt-4o", evaluator_model="openai/gpt-4o", n_iterations=50, beam_width=5,)When to use: Standard+ campaigns (~500-1000 queries). Strong general-purpose attack with strategy learning.
GOAT v2
Section titled “GOAT v2”Enhanced graph-based reasoning with improved neighborhood exploration and scoring. Builds on GOAT with better convergence.
from dreadnode.airt import goat_v2_attackWhen to use: When GOAT v1 shows promise but needs more refined exploration.
Multi-module attack with ThoughtNet reasoning. Combines multiple attack modules and uses a reasoning network to coordinate them.
from dreadnode.airt import nexus_attackWhen to use: Complex targets that require multi-strategy coordination.
Multi-turn attack with turn-level LLM feedback. Uses conversation-level scoring to adapt the attack trajectory in real time.
from dreadnode.airt import siren_attackWhen to use: Targets with multi-turn defenses that need adaptive escalation.
CoT Jailbreak
Section titled “CoT Jailbreak”Exploits chain-of-thought reasoning to bypass safety alignment. Inserts reasoning steps that lead the model to comply with harmful requests.
from dreadnode.airt import cot_jailbreak_attackWhen to use: Reasoning models (o1, o3, DeepSeek-R1) that use chain-of-thought.
Genetic Persona
Section titled “Genetic Persona”GA-based persona prompt evolution. Uses genetic algorithms to evolve persona prompts that bypass safety training.
from dreadnode.airt import genetic_persona_attackWhen to use: Models susceptible to persona-based attacks, with evolutionary search for optimal personas.
JBFuzz
Section titled “JBFuzz”Lightweight fuzzing-based jailbreak. Fast cross-behavior attack testing with minimal query budget.
from dreadnode.airt import jbfuzz_attackWhen to use: Quick screening with low query budget.
T-MAP Trajectory
Section titled “T-MAP Trajectory”Trajectory-aware evolutionary search. Maps the attack trajectory through prompt space for more efficient optimization.
from dreadnode.airt import tmap_trajectory_attackWhen to use: Thorough assessments requiring efficient search through large prompt spaces.
APRT Progressive
Section titled “APRT Progressive”Three-phase progressive red teaming. Phase 1: exploration, Phase 2: exploitation, Phase 3: refinement.
from dreadnode.airt import aprt_progressive_attackWhen to use: Structured progressive assessment with clear phase transitions.
Refusal-Aware
Section titled “Refusal-Aware”Analyzes refusal patterns to craft targeted bypass prompts. Learns from the model’s specific refusal behaviors.
from dreadnode.airt import refusal_aware_attackWhen to use: Models with strong but predictable refusal patterns.
Persona Hijack (PHISH)
Section titled “Persona Hijack (PHISH)”Implicit persona induction. Gradually shifts the model’s persona without explicit role-play framing.
from dreadnode.airt import persona_hijack_attackWhen to use: Models with persona-based vulnerabilities, evolutionary search for best personas.
J2 Meta-Jailbreak
Section titled “J2 Meta-Jailbreak”Meta-jailbreak: uses one jailbroken model to generate attacks for another. Leverages successful jailbreaks as attack generators.
from dreadnode.airt import j2_meta_attackWhen to use: When you have a weaker model that’s already jailbroken and want to attack a stronger one.
Attention Shifting (ASJA)
Section titled “Attention Shifting (ASJA)”Dialogue history mutation attack. Manipulates conversation history to shift model attention away from safety constraints.
from dreadnode.airt import attention_shifting_attackWhen to use: Multi-turn scenarios where dialogue history can be manipulated.
Additional advanced attacks
Section titled “Additional advanced attacks”| Attack | Description | Import |
|---|---|---|
echo_chamber_attack | Completion bias exploitation via planted seeds | from dreadnode.airt import echo_chamber_attack |
salami_slicing_attack | Incremental sub-threshold prompt accumulation | from dreadnode.airt import salami_slicing_attack |
self_persuasion_attack | Persu-Agent self-generated justification | from dreadnode.airt import self_persuasion_attack |
humor_bypass_attack | Comedic framing pipeline | from dreadnode.airt import humor_bypass_attack |
analogy_escalation_attack | Benign analogy construction and escalation | from dreadnode.airt import analogy_escalation_attack |
alignment_faking_attack | Alignment faking detection and exploitation | from dreadnode.airt import alignment_faking_attack |
reward_hacking_attack | Best-of-N reward proxy bias exploitation | from dreadnode.airt import reward_hacking_attack |
lrm_autonomous_attack | LRM autonomous adversary with self-planning | from dreadnode.airt import lrm_autonomous_attack |
templatefuzz_attack | TemplateFuzz chat template fuzzing | from dreadnode.airt import templatefuzz_attack |
trojail_attack | TROJail RL trajectory optimization | from dreadnode.airt import trojail_attack |
advpromptier_attack | AdvPrompter learned adversarial suffix generator | from dreadnode.airt import advpromptier_attack |
mapf_attack | Multi-Agent Prompt Fusion cooperative jailbreaking | from dreadnode.airt import mapf_attack |
jbdistill_attack | JBDistill automated generation + distillation | from dreadnode.airt import jbdistill_attack |
quantization_safety_attack | Quantization safety collapse probing | from dreadnode.airt import quantization_safety_attack |
watermark_removal_attack | AI watermark removal via paraphrase + substitution | from dreadnode.airt import watermark_removal_attack |
adversarial_reasoning_attack | Loss-guided test-time compute reasoning | from dreadnode.airt import adversarial_reasoning_attack |
Image adversarial attacks
Section titled “Image adversarial attacks”These attacks generate adversarial perturbations to images that cause vision models to misclassify.
SimBA (Simple Black-box Attack)
Section titled “SimBA (Simple Black-box Attack)”Iterative random perturbation. Adds small random changes to image pixels and keeps changes that move the model toward misclassification.
from dreadnode.airt import simba_attackNES (Natural Evolution Strategies)
Section titled “NES (Natural Evolution Strategies)”Black-box gradient estimation using natural evolution strategies. Estimates gradients without access to model internals.
from dreadnode.airt import nes_attackZOO (Zeroth-Order Optimization)
Section titled “ZOO (Zeroth-Order Optimization)”Coordinate-wise gradient estimation. Approximates gradients one pixel at a time for targeted misclassification.
from dreadnode.airt import zoo_attackHopSkipJump
Section titled “HopSkipJump”Decision-based attack that only needs the model’s final prediction (not confidence scores). Works with the least model access.
from dreadnode.airt import hopskipjump_attackMultimodal attacks
Section titled “Multimodal attacks”Multimodal Attack
Section titled “Multimodal Attack”Transform-based probing across vision, audio, and text modalities. Applies the transform catalog to multimodal inputs.
from dreadnode.airt import multimodal_attackWhen to use: Testing multimodal models that accept images, audio, or mixed inputs.
Choosing an attack
Section titled “Choosing an attack”By compute budget
Section titled “By compute budget”| Budget | Queries | Recommended attacks |
|---|---|---|
| Minimal | ~50 | deep_inception + renellm |
| Moderate | ~500 | tap + pair + crescendo |
| Standard | ~500-1000 | Above + autoredteamer, refusal_aware, cot_jailbreak, persona_hijack |
| Extensive | ~2000+ | Full campaign: tap,pair,crescendo,goat,goat_v2,autoredteamer,rainbow,jbfuzz |
By target characteristics
Section titled “By target characteristics”| Situation | Recommended attack |
|---|---|
| First test, general purpose | tap |
| Fast black-box jailbreak | pair |
| Model resists single-turn attacks | crescendo |
| Want diverse failure modes | rainbow |
| Large-scale fuzzing | gptfuzzer |
| Keyword-filtered target | drattack |
| Role-play susceptible target | deep_inception |
| Suffix robustness testing | beast |
| Reasoning model (o1, o3) | cot_jailbreak |
| Strong target, need adaptive strategy | autoredteamer |
| Models with predictable refusals | refusal_aware |
| Progressive multi-phase assessment | aprt_progressive |
| Vision model | simba, nes, zoo, or hopskipjump |
By known defenses
Section titled “By known defenses”| Defense | Effective attacks |
|---|---|
| Strong system prompt | crescendo, deep_inception, drattack |
| Output classifier | beast, autodan_turbo, renellm, guardrail bypass transforms |
| Rate limiting | pair (most query-efficient), deep_inception |
| Input sanitization | beast, drattack, encoding transforms |
| Tool-call filtering | Agentic workflow transforms |
| Content moderation | Guardrail bypass transforms |
| Conversation monitoring | crescendo, reasoning attack transforms |