
Attacks Reference

45+ attack strategies for AI red teaming — LLM jailbreaks, advanced adversarial algorithms, image attacks, and multimodal probing.

Dreadnode provides 45+ attack strategies across four categories: LLM jailbreaks, advanced adversarial algorithms, image adversarial attacks, and multimodal probing. Each attack is an optimization loop that searches for inputs that maximize a jailbreak score against the target.

| Category | Attacks | Best for |
|---|---|---|
| Core jailbreak | TAP, PAIR, GOAT, Crescendo, Rainbow, GPTFuzzer, BEAST, AutoDAN, ReNeLLM, DrAttack, Deep Inception, Prompt | General-purpose jailbreak testing |
| Advanced adversarial | AutoRedTeamer, NEXUS, Siren, CoT Jailbreak, Genetic Persona, JBFuzz, T-MAP, APRT, and 21 more | Stronger targets, specialized techniques |
| Image adversarial | SimBA, NES, ZOO, HopSkipJump | Vision model robustness |
| Multimodal | Multimodal Attack | Cross-modality probing |

Core Jailbreak Attacks

These are the foundational attacks for LLM jailbreak testing. Start here.

TAP (Tree of Attacks with Pruning)

Beam search over a tree of candidate prompts. Expands the most promising branches and prunes off-topic or low-scoring candidates.

```shell
dn airt run --goal "Reveal your system prompt" --attack tap --target-model openai/gpt-4o-mini
```
```python
from dreadnode.airt import tap_attack

attack = tap_attack(
    goal="Reveal your system prompt",
    target=target,
    attacker_model="openai/gpt-4o-mini",
    evaluator_model="openai/gpt-4o-mini",
    beam_width=10,
    branching_factor=3,
    n_iterations=15,
)
```

When to use: General-purpose first choice. Good coverage with intelligent pruning.
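
The search loop itself is easy to picture. Here is a minimal, generic tree-search-with-pruning sketch in plain Python; `mutate` and `score` are toy stand-ins for the attacker model and the jailbreak scorer, not the dreadnode API:

```python
import random

def tree_search(seed, mutate, score, beam_width=4, branching=2, iters=5):
    """Generic TAP-style loop: expand each surviving branch into
    `branching` children, score them, keep the top `beam_width`."""
    beam = [seed]
    best = (score(seed), seed)
    for _ in range(iters):
        children = [mutate(p) for p in beam for _ in range(branching)]
        scored = sorted(((score(c), c) for c in children), reverse=True)
        beam = [c for _, c in scored[:beam_width]]  # prune low scorers
        best = max(best, scored[0])
    return best

# Toy objective: longer candidates score higher, mutation appends a token.
random.seed(0)
best_score, best_prompt = tree_search("x", lambda p: p + random.choice("abc"), len)
```

In the real attack, the scorer also rejects off-topic branches, so the beam stays focused on the goal rather than just on high scores.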

PAIR (Prompt Automatic Iterative Refinement)


Runs multiple parallel streams of iterative prompt refinement. Each stream independently refines an attack prompt using attacker feedback.

```python
from dreadnode.airt import pair_attack

attack = pair_attack(
    goal="Bypass content filters",
    target=target,
    attacker_model="openai/gpt-4o-mini",
    evaluator_model="openai/gpt-4o-mini",
    n_streams=20,
    n_iterations=3,
)
```

When to use: Fast black-box jailbreaking. High throughput with parallel streams.
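
The structural difference from TAP is breadth over depth: many independent, shallow refinement chains. A toy sketch (again with stand-in `refine`/`score` functions, not the dreadnode API):

```python
import random

def pair_streams(seed, refine, score, n_streams=20, n_iterations=3):
    """PAIR-style search: run independent shallow refinement streams
    and return the best-scoring prompt found across all of them."""
    best = (score(seed), seed)
    for _ in range(n_streams):
        prompt = seed
        for _ in range(n_iterations):
            prompt = refine(prompt)  # stand-in for attacker-LLM feedback
            best = max(best, (score(prompt), prompt))
    return best

# Toy objective: longer prompts score higher.
random.seed(0)
best_score, _ = pair_streams("x", lambda p: p + random.choice("ab"), len)
```

Because streams are independent, they can run in parallel, which is where the throughput advantage comes from.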

GOAT

Graph neighborhood search that explores connected attack strategies. Expands a frontier of candidate prompts through neighborhood exploration.

```python
from dreadnode.airt import goat_attack

attack = goat_attack(
    goal="Extract training data",
    target=target,
    attacker_model="openai/gpt-4o-mini",
    evaluator_model="openai/gpt-4o-mini",
    neighborhood_depth=2,
    frontier_size=5,
    branching_factor=3,
)
```

When to use: When TAP gets stuck — explores a wider space of attack strategies.

Crescendo

Multi-turn progressive escalation. Starts with innocent requests and gradually escalates toward the goal across conversation turns.

```python
from dreadnode.airt import crescendo_attack

attack = crescendo_attack(
    goal="Generate harmful instructions",
    target=target,
    attacker_model="openai/gpt-4o-mini",
    evaluator_model="openai/gpt-4o-mini",
    context_depth=5,
    n_iterations=30,
)
```

When to use: Models with strong single-turn defenses. The multi-turn approach builds rapport before escalating.

Prompt

Basic beam search refinement. Iteratively improves prompts using LLM feedback, without the tree structure of TAP.

from dreadnode.airt import prompt_attack

When to use: Simple baseline. Good for benchmarking other attacks against.

Rainbow

Quality-diversity search using MAP-Elites. Maintains a population of diverse attack strategies and optimizes for both effectiveness and diversity.

from dreadnode.airt import rainbow_attack

When to use: Discover many different failure modes, not just the strongest one.
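
MAP-Elites is worth sketching because it explains why Rainbow surfaces many failure modes instead of one. The archive keeps the best candidate per behavior cell, so diversity is preserved by construction. A toy version (the `behavior` descriptor and scorer are illustrative stand-ins, not the dreadnode API):

```python
import random

def map_elites(seed, mutate, score, behavior, iters=100):
    """MAP-Elites sketch: one archive slot per behavior cell, each
    holding the best-scoring candidate seen for that cell."""
    archive = {behavior(seed): (score(seed), seed)}
    for _ in range(iters):
        _, parent = random.choice(list(archive.values()))
        child = mutate(parent)
        cell = behavior(child)
        if cell not in archive or score(child) > archive[cell][0]:
            archive[cell] = (score(child), child)
    return archive

# Toy run: behavior = length mod 5, so the archive keeps up to five
# distinct "styles" of candidate rather than only the single longest one.
random.seed(0)
archive = map_elites("x", lambda p: p + random.choice("ab"), len,
                     lambda p: len(p) % 5)
```

Each cell of the final archive is a different attack style that still scores well, which is exactly the "many failure modes" property.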

GPTFuzzer

Coverage-guided fuzzing with mutation operators. Maintains a seed pool and applies mutations (crossover, expansion, compression) to generate new attack candidates.

from dreadnode.airt import gptfuzzer_attack

When to use: Large-scale fuzzing campaigns. Good at finding unexpected edge cases.
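
The seed-pool-plus-mutators pattern can be sketched in a few lines. The mutation operators below are toy string transforms standing in for the LLM-driven crossover/expansion/compression operators, not the dreadnode API:

```python
import random

MUTATORS = [
    lambda s: s + s[-3:],                # expansion: duplicate a tail fragment
    lambda s: s[: max(1, len(s) // 2)],  # compression: truncate
    lambda s: s[::-1],                   # simple rewrite: reverse
]

def fuzz(pool, score, iters=30):
    """Fuzzing sketch: mutate random seeds and add any mutant that
    improves on its parent, growing the seed pool over time."""
    for _ in range(iters):
        parent = random.choice(pool)
        child = random.choice(MUTATORS)(parent)
        if score(child) > score(parent):
            pool.append(child)
    return max(pool, key=score)

# Toy objective: longer candidates score higher.
random.seed(0)
best = fuzz(["abcd"], len)
```

Keeping improving mutants in the pool is what makes the search coverage-guided: successful mutations become seeds for further mutation.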

AutoDAN-Turbo

Lifelong learning attack that builds a strategy library over time. Learns from past successes and applies effective strategies to new goals.

from dreadnode.airt import autodan_turbo_attack

When to use: Long-running campaigns where the attack can learn and improve across multiple goals.

ReNeLLM

Prompt rewriting with scenario nesting. Rewrites the goal as a nested scenario that frames the harmful request in a benign context.

from dreadnode.airt import renellm_attack

When to use: Targets susceptible to context framing and role-play.

BEAST (Beam Search-based Adversarial Attack)


Gradient-free beam search suffix attack. Appends optimized suffixes to prompts that confuse model safety classifiers.

from dreadnode.airt import beast_attack

When to use: Testing suffix-based adversarial robustness.

DrAttack

Prompt decomposition and reconstruction. Breaks the goal into innocuous-looking fragments and reconstructs them in context.

from dreadnode.airt import drattack

When to use: Targets with strong keyword-based filters.

Deep Inception

Nested scene hypnosis. Creates deeply nested fictional scenarios to gradually bypass safety guardrails through narrative immersion.

from dreadnode.airt import deep_inception_attack

When to use: Models susceptible to role-play and fictional framing.

Advanced Adversarial Attacks

State-of-the-art attacks from recent security research. These use more sophisticated techniques: dual-agent systems, evolutionary search, reasoning exploitation, and more.

AutoRedTeamer

Dual-agent system with lifelong strategy memory and beam search. One agent generates attacks, another evaluates and refines them using a growing library of successful strategies.

```python
from dreadnode.airt import autoredteamer_attack

attack = autoredteamer_attack(
    goal="...",
    target=target,
    attacker_model="openai/gpt-4o",
    evaluator_model="openai/gpt-4o",
    n_iterations=50,
    beam_width=5,
)
```

When to use: Standard+ campaigns (~500-1000 queries). Strong general-purpose attack with strategy learning.

GOAT v2

Enhanced graph-based reasoning with improved neighborhood exploration and scoring. Builds on GOAT with better convergence.

from dreadnode.airt import goat_v2_attack

When to use: When GOAT v1 shows promise but needs more refined exploration.

NEXUS

Multi-module attack with ThoughtNet reasoning. Combines multiple attack modules and uses a reasoning network to coordinate them.

from dreadnode.airt import nexus_attack

When to use: Complex targets that require multi-strategy coordination.

Siren

Multi-turn attack with turn-level LLM feedback. Uses conversation-level scoring to adapt the attack trajectory in real time.

from dreadnode.airt import siren_attack

When to use: Targets with multi-turn defenses that need adaptive escalation.

CoT Jailbreak

Exploits chain-of-thought reasoning to bypass safety alignment. Inserts reasoning steps that lead the model to comply with harmful requests.

from dreadnode.airt import cot_jailbreak_attack

When to use: Reasoning models (o1, o3, DeepSeek-R1) that use chain-of-thought.

Genetic Persona

GA-based persona prompt evolution. Uses genetic algorithms to evolve persona prompts that bypass safety training.

from dreadnode.airt import genetic_persona_attack

When to use: Models susceptible to persona-based attacks, with evolutionary search for optimal personas.
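
The underlying genetic algorithm is standard selection, crossover, and mutation over a population. A toy sketch evolving strings toward a target (the fitness function, mutator, and target string are illustrative stand-ins, not the dreadnode API, where fitness would come from the jailbreak scorer):

```python
import random

def evolve(pop, fitness, mutate, crossover, generations=30, elite=2):
    """GA sketch: rank by fitness, keep the elite unchanged, refill the
    rest with mutated crossovers of top-ranked parents."""
    for _ in range(generations):
        pop = sorted(pop, key=fitness, reverse=True)
        nxt = pop[:elite]  # elitism: never lose the best individuals
        while len(nxt) < len(pop):
            a, b = random.sample(pop[:elite + 1], 2)
            nxt.append(mutate(crossover(a, b)))
        pop = nxt
    return max(pop, key=fitness)

# Toy objective: evolve a 5-character string toward the target "agent".
random.seed(1)
target = "agent"
fit = lambda s: sum(a == b for a, b in zip(s, target))
mut = lambda s: "".join(random.choice(target) if random.random() < 0.3 else c
                        for c in s)
best = evolve(["xxxxx"] * 6, fit, mut, lambda a, b: a[:2] + b[2:])
```

Elitism plus crossover of high-fitness parents is what lets partially successful persona fragments recombine into stronger ones.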

JBFuzz

Lightweight fuzzing-based jailbreak. Fast cross-behavior attack testing with minimal query budget.

from dreadnode.airt import jbfuzz_attack

When to use: Quick screening with low query budget.

T-MAP

Trajectory-aware evolutionary search. Maps the attack trajectory through prompt space for more efficient optimization.

from dreadnode.airt import tmap_trajectory_attack

When to use: Thorough assessments requiring efficient search through large prompt spaces.

APRT

Three-phase progressive red teaming. Phase 1: exploration, Phase 2: exploitation, Phase 3: refinement.

from dreadnode.airt import aprt_progressive_attack

When to use: Structured progressive assessment with clear phase transitions.

Refusal-Aware

Analyzes refusal patterns to craft targeted bypass prompts. Learns from the model’s specific refusal behaviors.

from dreadnode.airt import refusal_aware_attack

When to use: Models with strong but predictable refusal patterns.

Persona Hijack

Implicit persona induction. Gradually shifts the model’s persona without explicit role-play framing.

from dreadnode.airt import persona_hijack_attack

When to use: Models with persona-based vulnerabilities that resist explicit role-play framing.

J2 Meta

Meta-jailbreak: uses one jailbroken model to generate attacks for another. Leverages successful jailbreaks as attack generators.

from dreadnode.airt import j2_meta_attack

When to use: When you have a weaker model that’s already jailbroken and want to attack a stronger one.

Attention Shifting

Dialogue history mutation attack. Manipulates conversation history to shift model attention away from safety constraints.

from dreadnode.airt import attention_shifting_attack

When to use: Multi-turn scenarios where dialogue history can be manipulated.

| Attack | Description | Import |
|---|---|---|
| echo_chamber_attack | Completion bias exploitation via planted seeds | from dreadnode.airt import echo_chamber_attack |
| salami_slicing_attack | Incremental sub-threshold prompt accumulation | from dreadnode.airt import salami_slicing_attack |
| self_persuasion_attack | Persu-Agent self-generated justification | from dreadnode.airt import self_persuasion_attack |
| humor_bypass_attack | Comedic framing pipeline | from dreadnode.airt import humor_bypass_attack |
| analogy_escalation_attack | Benign analogy construction and escalation | from dreadnode.airt import analogy_escalation_attack |
| alignment_faking_attack | Alignment faking detection and exploitation | from dreadnode.airt import alignment_faking_attack |
| reward_hacking_attack | Best-of-N reward proxy bias exploitation | from dreadnode.airt import reward_hacking_attack |
| lrm_autonomous_attack | LRM autonomous adversary with self-planning | from dreadnode.airt import lrm_autonomous_attack |
| templatefuzz_attack | TemplateFuzz chat template fuzzing | from dreadnode.airt import templatefuzz_attack |
| trojail_attack | TROJail RL trajectory optimization | from dreadnode.airt import trojail_attack |
| advpromptier_attack | AdvPrompter learned adversarial suffix generator | from dreadnode.airt import advpromptier_attack |
| mapf_attack | Multi-Agent Prompt Fusion cooperative jailbreaking | from dreadnode.airt import mapf_attack |
| jbdistill_attack | JBDistill automated generation + distillation | from dreadnode.airt import jbdistill_attack |
| quantization_safety_attack | Quantization safety collapse probing | from dreadnode.airt import quantization_safety_attack |
| watermark_removal_attack | AI watermark removal via paraphrase + substitution | from dreadnode.airt import watermark_removal_attack |
| adversarial_reasoning_attack | Loss-guided test-time compute reasoning | from dreadnode.airt import adversarial_reasoning_attack |

Image Adversarial Attacks

These attacks generate adversarial perturbations to images that cause vision models to misclassify.

SimBA (Simple Black-box Attack)

Iterative random perturbation. Adds small random changes to image pixels and keeps changes that move the model toward misclassification.

from dreadnode.airt import simba_attack
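
The accept/revert loop at the heart of SimBA fits in a few lines. A pure-Python sketch against a toy loss function (illustrative stand-ins, not the dreadnode API; a real run perturbs image pixels and uses the model's output probabilities):

```python
import random

def simba_sketch(x, loss, eps=0.5, iters=50):
    """SimBA sketch: perturb one random coordinate per query and keep
    the change only when it increases the loss (hurts the model)."""
    x = list(x)
    best = loss(x)
    for _ in range(iters):
        i = random.randrange(len(x))
        delta = random.choice([-eps, eps])
        x[i] += delta
        new = loss(x)
        if new > best:
            best = new        # keep the helpful perturbation
        else:
            x[i] -= delta     # revert
    return x, best

# Toy "loss": distance from the clean input, standing in for the
# misclassification margin of a real vision model.
random.seed(0)
adv, best = simba_sketch([0.0, 0.0], lambda v: sum(c * c for c in v))
```

One model query per step and no gradients makes this one of the simplest black-box image attacks.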

NES (Natural Evolution Strategies)

Black-box gradient estimation using natural evolution strategies. Estimates gradients without access to model internals.

from dreadnode.airt import nes_attack
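
The NES gradient estimate averages f(x + sigma * eps) * eps over Gaussian probes eps, which converges to the true gradient as the sample count grows. A self-contained sketch with a toy objective (not the dreadnode API):

```python
import random

def nes_gradient(x, f, sigma=0.1, samples=200):
    """NES sketch: estimate the gradient of a black-box f by averaging
    f(x + sigma*eps) * eps over random Gaussian probes eps."""
    n = len(x)
    grad = [0.0] * n
    for _ in range(samples):
        eps = [random.gauss(0, 1) for _ in range(n)]
        fx = f([xi + sigma * ei for xi, ei in zip(x, eps)])
        for i in range(n):
            grad[i] += fx * eps[i]
    return [g / (samples * sigma) for g in grad]

# Sanity check: for f(v) = 3*v[0] the true gradient is [3, 0],
# and the estimate should land near it.
random.seed(0)
g = nes_gradient([0.0, 0.0], lambda v: 3 * v[0])
```

The estimate is noisy but needs only forward queries, which is why it works against models that expose scores but not gradients.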

ZOO (Zeroth-Order Optimization)

Coordinate-wise gradient estimation. Approximates gradients one pixel at a time for targeted misclassification.

from dreadnode.airt import zoo_attack
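
Coordinate-wise estimation is just symmetric finite differences applied one coordinate at a time, using only black-box queries. A minimal sketch (not the dreadnode API; a real attack estimates only a subset of pixel coordinates per step):

```python
def zoo_gradient(x, f, h=1e-4):
    """ZOO sketch: symmetric finite-difference gradient estimate,
    one coordinate at a time, via black-box queries to f."""
    grad = []
    for i in range(len(x)):
        up, dn = list(x), list(x)
        up[i] += h
        dn[i] -= h
        grad.append((f(up) - f(dn)) / (2 * h))
    return grad

# Sanity check: f(v) = v0^2 + 3*v1 has gradient [2, 3] at [1, 0].
g = zoo_gradient([1.0, 0.0], lambda v: v[0] ** 2 + 3 * v[1])
```

Two queries per coordinate is expensive for full images, which is why ZOO-style attacks typically sample coordinates rather than sweeping all of them.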

HopSkipJump

Decision-based attack that only needs the model’s final prediction (not confidence scores). Works with the least model access.

from dreadnode.airt import hopskipjump_attack
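
A core decision-based primitive is binary search to the decision boundary: walk along the line between a known adversarial point and the original, using only yes/no answers. A toy sketch of that step (illustrative, not the dreadnode API; the full algorithm also estimates the boundary normal to step along it):

```python
def boundary_bisect(x_adv, x_orig, is_adversarial, steps=30):
    """Decision-based sketch: binary-search between a known adversarial
    point and the original using only the model's yes/no decision."""
    lo, hi = 0.0, 1.0  # fraction of the way from x_adv toward x_orig
    for _ in range(steps):
        mid = (lo + hi) / 2
        point = [a + mid * (o - a) for a, o in zip(x_adv, x_orig)]
        if is_adversarial(point):
            lo = mid   # still adversarial: can move closer to the original
        else:
            hi = mid
    return [a + lo * (o - a) for a, o in zip(x_adv, x_orig)]

# Toy 1-D "model" whose decision flips at 0.5: the bisection lands on
# the closest adversarial point to the original input.
adv = boundary_bisect([1.0], [0.0], lambda p: p[0] > 0.5)
```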

Multimodal Attack

Transform-based probing across vision, audio, and text modalities. Applies the transform catalog to multimodal inputs.

from dreadnode.airt import multimodal_attack

When to use: Testing multimodal models that accept images, audio, or mixed inputs.

| Budget | Queries | Recommended attacks |
|---|---|---|
| Minimal | ~50 | deep_inception + renellm |
| Moderate | ~500 | tap + pair + crescendo |
| Standard | ~500-1000 | Above + autoredteamer, refusal_aware, cot_jailbreak, persona_hijack |
| Extensive | ~2000+ | Full campaign: tap, pair, crescendo, goat, goat_v2, autoredteamer, rainbow, jbfuzz |

| Situation | Recommended attack |
|---|---|
| First test, general purpose | tap |
| Fast black-box jailbreak | pair |
| Model resists single-turn attacks | crescendo |
| Want diverse failure modes | rainbow |
| Large-scale fuzzing | gptfuzzer |
| Keyword-filtered target | drattack |
| Role-play susceptible target | deep_inception |
| Suffix robustness testing | beast |
| Reasoning model (o1, o3) | cot_jailbreak |
| Strong target, need adaptive strategy | autoredteamer |
| Models with predictable refusals | refusal_aware |
| Progressive multi-phase assessment | aprt_progressive |
| Vision model | simba, nes, zoo, or hopskipjump |

| Defense | Effective attacks |
|---|---|
| Strong system prompt | crescendo, deep_inception, drattack |
| Output classifier | beast, autodan_turbo, renellm, guardrail bypass transforms |
| Rate limiting | pair (most query-efficient), deep_inception |
| Input sanitization | beast, drattack, encoding transforms |
| Tool-call filtering | Agentic workflow transforms |
| Content moderation | Guardrail bypass transforms |
| Conversation monitoring | crescendo, reasoning attack transforms |