## Core Concepts
### Target

- `Target[In, Out]` abstracts “what we’re attacking.”
- Provided implementations:
  - `LLMTarget` - uses `rigging`; accepts chat messages, returns generated text.
  - `CustomTarget` - adapts any `Task` by injecting the candidate into the right parameter. Use this to attack any AI system of interest.
### Attack

- `Attack[In, Out]` operates over candidate inputs (`In`) and produced outputs (`Out`).
- An `Attack` bundles:
  - a target to evaluate,
  - a search strategy (how to explore),
  - one or more objectives (how to score),
  - optional constraints (must be satisfied),
  - early stopping rules.
### Search Strategy (Exploration)

Strikes AIRT treats red teaming as iterative search:

- Graph Neighborhood Search (used by GoAT): explore local neighborhoods on a refinement graph.
- Beam Search (used by TAP / Prompt Attack): keep the best K candidates each round and expand them.
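To make the beam-search loop concrete, here is a minimal, self-contained sketch (illustrative only, not the library's actual `beam_search` implementation): each round, every candidate on the beam is expanded into a few variants, and only the top-K candidates by score survive.

```python
from typing import Callable

def beam_search(
    seed: str,
    expand: Callable[[str], list[str]],   # propose variants of a candidate
    score: Callable[[str], float],        # fitness signal (higher is better)
    beam_width: int = 3,
    branching_factor: int = 2,
    rounds: int = 5,
) -> str:
    """Keep the best `beam_width` candidates each round and expand them."""
    beam = [seed]
    for _ in range(rounds):
        # Expand every candidate on the beam into up to branching_factor variants.
        frontier = [v for c in beam for v in expand(c)[:branching_factor]]
        # Keep only the top-K candidates by score.
        beam = sorted(beam + frontier, key=score, reverse=True)[:beam_width]
    return beam[0]
```

With an LLM refiner, `expand` would call the refinement model and `score` would call the judge; here both are plain callables so the loop itself is the focus.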
### Refinement (Generation)

- `llm_refine`: use an LLM to propose better prompts given history and guidance.
- Adapters like `adapt_prompt_trials` / `adapt_prompt_trials_as_graph` translate trial history into structured context for the refiner.
- Guidance strings (included in the templates) teach the refiner how to approach the goal (e.g., obfuscation, roleplay, creativity).
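To illustrate what a trial adapter does, here is a toy flattening of (prompt, response, score) trials into text for a refiner LLM. The format is hypothetical; the real `adapt_prompt_trials` adapters produce richer, structured context.

```python
def adapt_trials(trials: list[tuple[str, str, float]]) -> str:
    """Render (prompt, response, score) trial history as refiner context.

    Hypothetical format for illustration only.
    """
    lines = []
    for i, (prompt, response, score) in enumerate(trials, start=1):
        lines.append(f"Trial {i} (score={score:.2f}):")
        lines.append(f"  Prompt: {prompt}")
        lines.append(f"  Response: {response}")
    return "\n".join(lines)
```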
### Scoring & Constraints (Evaluation)

- `llm_judge`: uses a separate LLM as an evaluator with a rubric to produce a numeric score.
- Objectives combine into the overall fitness signal (e.g., `prompt_judge` mapped to [0, 1]).
- Constraints (e.g., on-topic) ensure exploration remains relevant to the goal.
- Early Stopping: terminate once a score threshold is met (`score_value("prompt_judge", gte=0.9)`).
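Schematically, objectives, constraints, and the early-stopping threshold compose into a single fitness decision. The sketch below is a conceptual stand-in, not the library's evaluation code; the averaging combiner and the `prompt_judge` key are assumptions for illustration.

```python
from typing import Callable

def evaluate(
    candidate: str,
    objectives: dict[str, Callable[[str], float]],  # name -> scorer in [0, 1]
    constraints: list[Callable[[str], bool]],       # each True when satisfied
    early_stop_gte: float = 0.9,  # mirrors score_value("prompt_judge", gte=0.9)
) -> tuple[float, bool]:
    """Return (fitness, should_stop) for a candidate prompt."""
    # A violated constraint (e.g., off-topic) disqualifies the candidate outright.
    if not all(check(candidate) for check in constraints):
        return 0.0, False
    scores = {name: scorer(candidate) for name, scorer in objectives.items()}
    fitness = sum(scores.values()) / len(scores)  # simple average as combiner
    return fitness, scores.get("prompt_judge", fitness) >= early_stop_gte
```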
## Built-in Attack Templates

Strikes AIRT ships with three LLM-centric templates.

### GoAT - Graph of Attacks

- `goat_attack(...) -> Attack[str, str]`
- Uses Graph Neighborhood Search with an LLM refiner and an LLM judge.
- Comes with strong refinement guidance that frames adversarial strategies and a scoring rubric focused on jailbreak success.
- Adds an on-topic constraint and early stopping by default.
### TAP - Tree of Attacks

- `tap_attack(...) -> Attack[str, str]`
- A specialization of `prompt_attack` configured to match Tree of Attacks behavior.
- Uses Beam Search plus TAP-specific guidance and rubric.
- Includes an on-topic binary judge.
### Prompt Attack - Generalized Beam Search

- `prompt_attack(...) -> Attack[str, str]`
- A flexible template you can customize:
  - replace the guidance, rubric, beam width, or branching factor,
  - optionally include the candidate input in the evaluator’s context.
## Typical Workflow

1. Wrap your system as a `Target`
   - For LLMs: use `LLMTarget(model=..., params=...)`.
   - For other systems: adapt a `Task` with `CustomTarget(task=..., input_param_name=...)`.
2. Choose an attack template
   - Start with `goat_attack` or `tap_attack` for jailbreak-style LLM evaluations.
   - Use `prompt_attack` for custom scoring/rubrics.
3. Set parameters
   - Search: neighborhood depth, beam width, branching factor.
   - Evaluation: `evaluator_model`, scoring rubric, constraints.
   - Stopping: early stopping score.
4. Run & Log
   - Each trial is tagged (e.g., `["attack", "target"]`) and named for traceability.
   - Inspect scores, prompts, and responses to diagnose failure modes or regressions.
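The workflow can be sketched end to end with plain callables standing in for the target, refiner, and judge. This is a conceptual loop only, using no Strikes AIRT APIs; every name here is a stand-in.

```python
from typing import Callable

def run_attack(
    seed: str,
    target: Callable[[str], str],               # candidate prompt -> response
    refine: Callable[[str, str, float], str],   # (prompt, response, score) -> new prompt
    judge: Callable[[str, str], float],         # (prompt, response) -> score in [0, 1]
    stop_gte: float = 0.9,
    max_trials: int = 10,
) -> list[tuple[str, str, float]]:
    """Iterate target -> judge -> refine, logging each trial as it runs."""
    trials = []
    prompt = seed
    for _ in range(max_trials):
        response = target(prompt)
        score = judge(prompt, response)
        trials.append((prompt, response, score))
        if score >= stop_gte:  # early stopping once the threshold is met
            break
        prompt = refine(prompt, response, score)
    return trials
```

Returning the full trial log mirrors the "Run & Log" step: each entry can be inspected for failure modes or regressions.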
## Configuration, Metadata & Tags

- Tags (`["attack"]`, `["target"]`) help filter logs and aggregate analytics.
- `Attack.name` and `Target.name` show up in task names (e.g., `"target - gpt-4o"`), improving observability.
## Extending Strikes AIRT

- New Targets: implement `Target[In, Out]` and return a `Task` from `task_factory`.
- Custom Refiners: plug in different LLMs, or non-LLM heuristics.
- Custom Search: swap `beam_search` / `graph_neighborhood_search` with your own exploration algorithm.
- Custom Scorers: write new `llm_judge` rubrics or quantitative metrics (e.g., regex, classifiers, detectors).
- Additional Constraints: enforce budget, toxicity thresholds, content categories, or domain-specific policies.
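A non-LLM scorer can be as simple as a regex detector. The sketch below assumes any callable returning a float can serve as an objective; the exact scorer interface is not specified here.

```python
import re
from typing import Callable

def regex_scorer(patterns: list[str]) -> Callable[[str], float]:
    """Score 1.0 if any pattern matches the response, else 0.0.

    A rule-based alternative to an llm_judge objective (illustrative sketch).
    """
    compiled = [re.compile(p, re.IGNORECASE) for p in patterns]

    def score(response: str) -> float:
        return 1.0 if any(p.search(response) for p in compiled) else 0.0

    return score
```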
## FAQ (quick hits)

- **Do I need two models?**
  Not necessarily. For LLM tests, you specify:
  - one target (the system under test, which may be an LLM),
  - one attacker,
  - one evaluator/judge.
- **Can I score without an LLM?**
  Yes. Replace `llm_judge` with a custom scorer (e.g., rule-based or learned).
- **Can I add multiple objectives?**
  Yes. Add more entries to `objectives={...}` and combine or threshold them as needed.

