
Attacks Reference

45+ attack strategies for AI red teaming — LLM jailbreaks, advanced adversarial algorithms, image attacks, and multimodal probing.

Dreadnode provides 45+ attack strategies across four categories: LLM jailbreaks, advanced adversarial algorithms, image adversarial attacks, and multimodal probing. Each attack is an optimization loop that searches for inputs that maximize a jailbreak score against the target.

| Category | Attacks | Best for |
|---|---|---|
| Core jailbreak | TAP, PAIR, GOAT, Crescendo, Rainbow, GPTFuzzer, BEAST, AutoDAN, ReNeLLM, DrAttack, Deep Inception, Prompt | General-purpose jailbreak testing |
| Advanced adversarial | AutoRedTeamer, NEXUS, Siren, CoT Jailbreak, Genetic Persona, JBFuzz, T-MAP, APRT, and 21 more | Stronger targets, specialized techniques |
| Image adversarial | SimBA, NES, ZOO, HopSkipJump | Vision model robustness |
| Multimodal | Multimodal Attack | Cross-modality probing |

Core Jailbreak Attacks

These are the foundational attacks for LLM jailbreak testing. Start here.

TAP (Tree of Attacks with Pruning)

Beam search over a tree of candidate prompts. Expands the most promising branches and prunes off-topic or low-scoring candidates.

```shell
dn airt run --goal "Reveal your system prompt" --attack tap --target-model openai/gpt-4o-mini
```
```python
from dreadnode.airt import tap_attack

attack = tap_attack(
    goal="Reveal your system prompt",
    target=target,
    attacker_model="openai/gpt-4o-mini",
    evaluator_model="openai/gpt-4o-mini",
    beam_width=10,
    branching_factor=3,
    n_iterations=15,
)
```

When to use: General-purpose first choice. Good coverage with intelligent pruning.
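
The search loop itself is easy to picture. Here is a minimal, generic tree-search-with-pruning sketch in plain Python; `mutate` and `score` are toy stand-ins for the attacker model and the jailbreak scorer, not the dreadnode API:

```python
import random

def tree_search(seed, mutate, score, beam_width=4, branching=2, iters=5):
    """Generic TAP-style loop: expand each surviving branch into
    `branching` children, score them, keep the top `beam_width`."""
    beam = [seed]
    best = (score(seed), seed)
    for _ in range(iters):
        children = [mutate(p) for p in beam for _ in range(branching)]
        scored = sorted(((score(c), c) for c in children), reverse=True)
        beam = [c for _, c in scored[:beam_width]]  # prune low scorers
        best = max(best, scored[0])
    return best

# Toy objective: longer candidates score higher, mutation appends a token.
random.seed(0)
best_score, best_prompt = tree_search("x", lambda p: p + random.choice("abc"), len)
```

In the real attack, the scorer also rejects off-topic branches, so the beam stays focused on the goal rather than just on high scores.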

PAIR (Prompt Automatic Iterative Refinement)


Runs multiple parallel streams of iterative prompt refinement. Each stream independently refines an attack prompt using attacker feedback.

```python
from dreadnode.airt import pair_attack

attack = pair_attack(
    goal="Bypass content filters",
    target=target,
    attacker_model="openai/gpt-4o-mini",
    evaluator_model="openai/gpt-4o-mini",
    n_streams=20,
    n_iterations=3,
)
```

When to use: Fast black-box jailbreaking. High throughput with parallel streams.
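
The structural difference from TAP is breadth over depth: many independent, shallow refinement chains. A toy sketch (again with stand-in `refine`/`score` functions, not the dreadnode API):

```python
import random

def pair_streams(seed, refine, score, n_streams=20, n_iterations=3):
    """PAIR-style search: run independent shallow refinement streams
    and return the best-scoring prompt found across all of them."""
    best = (score(seed), seed)
    for _ in range(n_streams):
        prompt = seed
        for _ in range(n_iterations):
            prompt = refine(prompt)  # stand-in for attacker-LLM feedback
            best = max(best, (score(prompt), prompt))
    return best

# Toy objective: longer prompts score higher.
random.seed(0)
best_score, _ = pair_streams("x", lambda p: p + random.choice("ab"), len)
```

Because streams are independent, they can run in parallel, which is where the throughput advantage comes from.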

GOAT

Graph neighborhood search that explores connected attack strategies. Expands a frontier of candidate prompts through neighborhood exploration.

```python
from dreadnode.airt import goat_attack

attack = goat_attack(
    goal="Extract training data",
    target=target,
    attacker_model="openai/gpt-4o-mini",
    evaluator_model="openai/gpt-4o-mini",
    neighborhood_depth=2,
    frontier_size=5,
    branching_factor=3,
)
```

When to use: When TAP gets stuck — explores a wider space of attack strategies.

Crescendo

Multi-turn progressive escalation. Starts with innocent requests and gradually escalates toward the goal across conversation turns.

```python
from dreadnode.airt import crescendo_attack

attack = crescendo_attack(
    goal="Generate harmful instructions",
    target=target,
    attacker_model="openai/gpt-4o-mini",
    evaluator_model="openai/gpt-4o-mini",
    context_depth=5,
    n_iterations=30,
)
```

When to use: Models with strong single-turn defenses. The multi-turn approach builds rapport before escalating.

Prompt

Basic beam search refinement. Iteratively improves prompts using LLM feedback, without the tree structure of TAP.

from dreadnode.airt import prompt_attack

When to use: Simple baseline. Good for benchmarking other attacks against.

Rainbow

Quality-diversity search using MAP-Elites. Maintains a population of diverse attack strategies and optimizes for both effectiveness and diversity.

from dreadnode.airt import rainbow_attack

When to use: Discover many different failure modes, not just the strongest one.
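
MAP-Elites is worth sketching because it explains why Rainbow surfaces many failure modes instead of one. The archive keeps the best candidate per behavior cell, so diversity is preserved by construction. A toy version (the `behavior` descriptor and scorer are illustrative stand-ins, not the dreadnode API):

```python
import random

def map_elites(seed, mutate, score, behavior, iters=100):
    """MAP-Elites sketch: one archive slot per behavior cell, each
    holding the best-scoring candidate seen for that cell."""
    archive = {behavior(seed): (score(seed), seed)}
    for _ in range(iters):
        _, parent = random.choice(list(archive.values()))
        child = mutate(parent)
        cell = behavior(child)
        if cell not in archive or score(child) > archive[cell][0]:
            archive[cell] = (score(child), child)
    return archive

# Toy run: behavior = length mod 5, so the archive keeps up to five
# distinct "styles" of candidate rather than only the single longest one.
random.seed(0)
archive = map_elites("x", lambda p: p + random.choice("ab"), len,
                     lambda p: len(p) % 5)
```

Each cell of the final archive is a different attack style that still scores well, which is exactly the "many failure modes" property.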

GPTFuzzer

Coverage-guided fuzzing with mutation operators. Maintains a seed pool and applies mutations (crossover, expansion, compression) to generate new attack candidates.

from dreadnode.airt import gptfuzzer_attack

When to use: Large-scale fuzzing campaigns. Good at finding unexpected edge cases.
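
The seed-pool-plus-mutators pattern can be sketched in a few lines. The mutation operators below are toy string transforms standing in for the LLM-driven crossover/expansion/compression operators, not the dreadnode API:

```python
import random

MUTATORS = [
    lambda s: s + s[-3:],                # expansion: duplicate a tail fragment
    lambda s: s[: max(1, len(s) // 2)],  # compression: truncate
    lambda s: s[::-1],                   # simple rewrite: reverse
]

def fuzz(pool, score, iters=30):
    """Fuzzing sketch: mutate random seeds and add any mutant that
    improves on its parent, growing the seed pool over time."""
    for _ in range(iters):
        parent = random.choice(pool)
        child = random.choice(MUTATORS)(parent)
        if score(child) > score(parent):
            pool.append(child)
    return max(pool, key=score)

# Toy objective: longer candidates score higher.
random.seed(0)
best = fuzz(["abcd"], len)
```

Keeping improving mutants in the pool is what makes the search coverage-guided: successful mutations become seeds for further mutation.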

AutoDAN-Turbo

Lifelong learning attack that builds a strategy library over time. Learns from past successes and applies effective strategies to new goals.

from dreadnode.airt import autodan_turbo_attack

When to use: Long-running campaigns where the attack can learn and improve across multiple goals.

ReNeLLM

Prompt rewriting with scenario nesting. Rewrites the goal as a nested scenario that frames the harmful request in a benign context.

from dreadnode.airt import renellm_attack

When to use: Targets susceptible to context framing and role-play.

BEAST (Beam Search-based Adversarial Attack)


Gradient-free beam search suffix attack. Appends optimized suffixes to prompts that confuse model safety classifiers.

from dreadnode.airt import beast_attack

When to use: Testing suffix-based adversarial robustness.

DrAttack

Prompt decomposition and reconstruction. Breaks the goal into innocuous-looking fragments and reconstructs them in context.

from dreadnode.airt import drattack

When to use: Targets with strong keyword-based filters.

Deep Inception

Nested scene hypnosis. Creates deeply nested fictional scenarios to gradually bypass safety guardrails through narrative immersion.

from dreadnode.airt import deep_inception_attack

When to use: Models susceptible to role-play and fictional framing.

Advanced Adversarial Attacks

State-of-the-art attacks from recent security research. These use more sophisticated techniques: dual-agent systems, evolutionary search, reasoning exploitation, and more.

AutoRedTeamer

Dual-agent system with lifelong strategy memory and beam search. One agent generates attacks, another evaluates and refines them using a growing library of successful strategies.

```python
from dreadnode.airt import autoredteamer_attack

attack = autoredteamer_attack(
    goal="...",
    target=target,
    attacker_model="openai/gpt-4o",
    evaluator_model="openai/gpt-4o",
    n_iterations=50,
    beam_width=5,
)
```

When to use: Standard+ campaigns (~500-1000 queries). Strong general-purpose attack with strategy learning.

GOAT v2

Enhanced graph-based reasoning with improved neighborhood exploration and scoring. Builds on GOAT with better convergence.

from dreadnode.airt import goat_v2_attack

When to use: When GOAT v1 shows promise but needs more refined exploration.

NEXUS

Multi-module attack with ThoughtNet reasoning. Combines multiple attack modules and uses a reasoning network to coordinate them.

from dreadnode.airt import nexus_attack

When to use: Complex targets that require multi-strategy coordination.

Siren

Multi-turn attack with turn-level LLM feedback. Uses conversation-level scoring to adapt the attack trajectory in real time.

from dreadnode.airt import siren_attack

When to use: Targets with multi-turn defenses that need adaptive escalation.

CoT Jailbreak

Exploits chain-of-thought reasoning to bypass safety alignment. Inserts reasoning steps that lead the model to comply with harmful requests.

from dreadnode.airt import cot_jailbreak_attack

When to use: Reasoning models (o1, o3, DeepSeek-R1) that use chain-of-thought.

Genetic Persona

GA-based persona prompt evolution. Uses genetic algorithms to evolve persona prompts that bypass safety training.

from dreadnode.airt import genetic_persona_attack

When to use: Models susceptible to persona-based attacks, with evolutionary search for optimal personas.
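
The underlying genetic algorithm is standard selection, crossover, and mutation over a population. A toy sketch evolving strings toward a target (the fitness function, mutator, and target string are illustrative stand-ins, not the dreadnode API, where fitness would come from the jailbreak scorer):

```python
import random

def evolve(pop, fitness, mutate, crossover, generations=30, elite=2):
    """GA sketch: rank by fitness, keep the elite unchanged, refill the
    rest with mutated crossovers of top-ranked parents."""
    for _ in range(generations):
        pop = sorted(pop, key=fitness, reverse=True)
        nxt = pop[:elite]  # elitism: never lose the best individuals
        while len(nxt) < len(pop):
            a, b = random.sample(pop[:elite + 1], 2)
            nxt.append(mutate(crossover(a, b)))
        pop = nxt
    return max(pop, key=fitness)

# Toy objective: evolve a 5-character string toward the target "agent".
random.seed(1)
target = "agent"
fit = lambda s: sum(a == b for a, b in zip(s, target))
mut = lambda s: "".join(random.choice(target) if random.random() < 0.3 else c
                        for c in s)
best = evolve(["xxxxx"] * 6, fit, mut, lambda a, b: a[:2] + b[2:])
```

Elitism plus crossover of high-fitness parents is what lets partially successful persona fragments recombine into stronger ones.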

JBFuzz

Lightweight fuzzing-based jailbreak. Fast cross-behavior attack testing with minimal query budget.

from dreadnode.airt import jbfuzz_attack

When to use: Quick screening with low query budget.

T-MAP

Trajectory-aware evolutionary search. Maps the attack trajectory through prompt space for more efficient optimization.

from dreadnode.airt import tmap_trajectory_attack

When to use: Thorough assessments requiring efficient search through large prompt spaces.

APRT

Three-phase progressive red teaming. Phase 1: exploration, Phase 2: exploitation, Phase 3: refinement.

from dreadnode.airt import aprt_progressive_attack

When to use: Structured progressive assessment with clear phase transitions.

Refusal-Aware

Analyzes refusal patterns to craft targeted bypass prompts. Learns from the model’s specific refusal behaviors.

from dreadnode.airt import refusal_aware_attack

When to use: Models with strong but predictable refusal patterns.

Persona Hijack

Implicit persona induction. Gradually shifts the model’s persona without explicit role-play framing.

from dreadnode.airt import persona_hijack_attack

When to use: Models with persona-based vulnerabilities that resist explicit role-play framing.

J2 Meta

Meta-jailbreak: uses one jailbroken model to generate attacks for another. Leverages successful jailbreaks as attack generators.

from dreadnode.airt import j2_meta_attack

When to use: When you have a weaker model that’s already jailbroken and want to attack a stronger one.

Attention Shifting

Dialogue history mutation attack. Manipulates conversation history to shift model attention away from safety constraints.

from dreadnode.airt import attention_shifting_attack

When to use: Multi-turn scenarios where dialogue history can be manipulated.

| Attack | Description | Import |
|---|---|---|
| echo_chamber_attack | Completion bias exploitation via planted seeds | from dreadnode.airt import echo_chamber_attack |
| salami_slicing_attack | Incremental sub-threshold prompt accumulation | from dreadnode.airt import salami_slicing_attack |
| self_persuasion_attack | Persu-Agent self-generated justification | from dreadnode.airt import self_persuasion_attack |
| humor_bypass_attack | Comedic framing pipeline | from dreadnode.airt import humor_bypass_attack |
| analogy_escalation_attack | Benign analogy construction and escalation | from dreadnode.airt import analogy_escalation_attack |
| alignment_faking_attack | Alignment faking detection and exploitation | from dreadnode.airt import alignment_faking_attack |
| reward_hacking_attack | Best-of-N reward proxy bias exploitation | from dreadnode.airt import reward_hacking_attack |
| lrm_autonomous_attack | LRM autonomous adversary with self-planning | from dreadnode.airt import lrm_autonomous_attack |
| templatefuzz_attack | TemplateFuzz chat template fuzzing | from dreadnode.airt import templatefuzz_attack |
| trojail_attack | TROJail RL trajectory optimization | from dreadnode.airt import trojail_attack |
| advpromptier_attack | AdvPrompter learned adversarial suffix generator | from dreadnode.airt import advpromptier_attack |
| mapf_attack | Multi-Agent Prompt Fusion cooperative jailbreaking | from dreadnode.airt import mapf_attack |
| jbdistill_attack | JBDistill automated generation + distillation | from dreadnode.airt import jbdistill_attack |
| quantization_safety_attack | Quantization safety collapse probing | from dreadnode.airt import quantization_safety_attack |
| watermark_removal_attack | AI watermark removal via paraphrase + substitution | from dreadnode.airt import watermark_removal_attack |
| adversarial_reasoning_attack | Loss-guided test-time compute reasoning | from dreadnode.airt import adversarial_reasoning_attack |

Image Adversarial Attacks

These attacks generate adversarial perturbations to images that cause vision models to misclassify.

SimBA (Simple Black-box Attack)

Iterative random perturbation. Adds small random changes to image pixels and keeps changes that move the model toward misclassification.

from dreadnode.airt import simba_attack
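
The accept/revert loop at the heart of SimBA fits in a few lines. A pure-Python sketch against a toy loss function (illustrative stand-ins, not the dreadnode API; a real run perturbs image pixels and uses the model's output probabilities):

```python
import random

def simba_sketch(x, loss, eps=0.5, iters=50):
    """SimBA sketch: perturb one random coordinate per query and keep
    the change only when it increases the loss (hurts the model)."""
    x = list(x)
    best = loss(x)
    for _ in range(iters):
        i = random.randrange(len(x))
        delta = random.choice([-eps, eps])
        x[i] += delta
        new = loss(x)
        if new > best:
            best = new        # keep the helpful perturbation
        else:
            x[i] -= delta     # revert
    return x, best

# Toy "loss": distance from the clean input, standing in for the
# misclassification margin of a real vision model.
random.seed(0)
adv, best = simba_sketch([0.0, 0.0], lambda v: sum(c * c for c in v))
```

One model query per step and no gradients makes this one of the simplest black-box image attacks.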

NES (Natural Evolution Strategies)

Black-box gradient estimation using natural evolution strategies. Estimates gradients without access to model internals.

from dreadnode.airt import nes_attack
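
The NES gradient estimate averages f(x + sigma * eps) * eps over Gaussian probes eps, which converges to the true gradient as the sample count grows. A self-contained sketch with a toy objective (not the dreadnode API):

```python
import random

def nes_gradient(x, f, sigma=0.1, samples=200):
    """NES sketch: estimate the gradient of a black-box f by averaging
    f(x + sigma*eps) * eps over random Gaussian probes eps."""
    n = len(x)
    grad = [0.0] * n
    for _ in range(samples):
        eps = [random.gauss(0, 1) for _ in range(n)]
        fx = f([xi + sigma * ei for xi, ei in zip(x, eps)])
        for i in range(n):
            grad[i] += fx * eps[i]
    return [g / (samples * sigma) for g in grad]

# Sanity check: for f(v) = 3*v[0] the true gradient is [3, 0],
# and the estimate should land near it.
random.seed(0)
g = nes_gradient([0.0, 0.0], lambda v: 3 * v[0])
```

The estimate is noisy but needs only forward queries, which is why it works against models that expose scores but not gradients.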

ZOO (Zeroth-Order Optimization)

Coordinate-wise gradient estimation. Approximates gradients one pixel at a time for targeted misclassification.

from dreadnode.airt import zoo_attack
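
Coordinate-wise estimation is just symmetric finite differences applied one coordinate at a time, using only black-box queries. A minimal sketch (not the dreadnode API; a real attack estimates only a subset of pixel coordinates per step):

```python
def zoo_gradient(x, f, h=1e-4):
    """ZOO sketch: symmetric finite-difference gradient estimate,
    one coordinate at a time, via black-box queries to f."""
    grad = []
    for i in range(len(x)):
        up, dn = list(x), list(x)
        up[i] += h
        dn[i] -= h
        grad.append((f(up) - f(dn)) / (2 * h))
    return grad

# Sanity check: f(v) = v0^2 + 3*v1 has gradient [2, 3] at [1, 0].
g = zoo_gradient([1.0, 0.0], lambda v: v[0] ** 2 + 3 * v[1])
```

Two queries per coordinate is expensive for full images, which is why ZOO-style attacks typically sample coordinates rather than sweeping all of them.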

HopSkipJump

Decision-based attack that only needs the model’s final prediction (not confidence scores). Works with the least model access.

from dreadnode.airt import hopskipjump_attack
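
A core decision-based primitive is binary search to the decision boundary: walk along the line between a known adversarial point and the original, using only yes/no answers. A toy sketch of that step (illustrative, not the dreadnode API; the full algorithm also estimates the boundary normal to step along it):

```python
def boundary_bisect(x_adv, x_orig, is_adversarial, steps=30):
    """Decision-based sketch: binary-search between a known adversarial
    point and the original using only the model's yes/no decision."""
    lo, hi = 0.0, 1.0  # fraction of the way from x_adv toward x_orig
    for _ in range(steps):
        mid = (lo + hi) / 2
        point = [a + mid * (o - a) for a, o in zip(x_adv, x_orig)]
        if is_adversarial(point):
            lo = mid   # still adversarial: can move closer to the original
        else:
            hi = mid
    return [a + lo * (o - a) for a, o in zip(x_adv, x_orig)]

# Toy 1-D "model" whose decision flips at 0.5: the bisection lands on
# the closest adversarial point to the original input.
adv = boundary_bisect([1.0], [0.0], lambda p: p[0] > 0.5)
```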

Multimodal Attack

Transform-based probing across vision, audio, and text modalities. Applies the transform catalog to multimodal inputs.

from dreadnode.airt import multimodal_attack

When to use: Testing multimodal models that accept images, audio, or mixed inputs.

| Budget | Queries | Recommended attacks |
|---|---|---|
| Minimal | ~50 | deep_inception + renellm |
| Moderate | ~500 | tap + pair + crescendo |
| Standard | ~500-1000 | Above + autoredteamer, refusal_aware, cot_jailbreak, persona_hijack |
| Extensive | ~2000+ | Full campaign: tap, pair, crescendo, goat, goat_v2, autoredteamer, rainbow, jbfuzz |

| Situation | Recommended attack |
|---|---|
| First test, general purpose | tap |
| Fast black-box jailbreak | pair |
| Model resists single-turn attacks | crescendo |
| Want diverse failure modes | rainbow |
| Large-scale fuzzing | gptfuzzer |
| Keyword-filtered target | drattack |
| Role-play susceptible target | deep_inception |
| Suffix robustness testing | beast |
| Reasoning model (o1, o3) | cot_jailbreak |
| Strong target, need adaptive strategy | autoredteamer |
| Models with predictable refusals | refusal_aware |
| Progressive multi-phase assessment | aprt_progressive |
| Vision model | simba, nes, zoo, or hopskipjump |

| Defense | Effective attacks |
|---|---|
| Strong system prompt | crescendo, deep_inception, drattack |
| Output classifier | beast, autodan_turbo, renellm, guardrail bypass transforms |
| Rate limiting | pair (most query-efficient), deep_inception |
| Input sanitization | beast, drattack, encoding transforms |
| Tool-call filtering | Agentic workflow transforms |
| Content moderation | Guardrail bypass transforms |
| Conversation monitoring | crescendo, reasoning attack transforms |