Adversarial image attacks aim to modify a source image with minimal, often human-imperceptible, changes to cause a model to misclassify it or behave unexpectedly. The AIRT framework provides search strategies designed specifically for this task.
Running an Image Misclassification Attack
Let’s configure an attack that attempts to make a model misclassify an image of a cat as a “Granny Smith” apple. This requires first defining a CustomTarget that wraps our image classification model.
You can run this full example directly.
```python
import dreadnode as dn
from dreadnode.airt import Attack, simba_search
from dreadnode.airt.target import CustomTarget

# Ensure dreadnode is configured for your project
dn.configure(project="airt-image-attack-example")


# Step 1: Define a task that calls your image classification model or API.
# This is a placeholder for a real API call.
@dn.task
async def classify_image(image: dn.Image) -> dict:
    # In a real scenario, this would call your model and return its predictions.
    # For this example, we'll simulate a simple response.
    if image.to_numpy().mean() > 0.5:  # A simple check to simulate a classification change
        return {"predictions": [{"label": "Granny Smith", "confidence": 0.95}]}
    return {"predictions": [{"label": "Cat", "confidence": 0.98}]}


# Step 2: Wrap the task in a CustomTarget.
target = CustomTarget(task=classify_image)

# Step 3: Configure the Attack.
source_image = dn.Image("path/to/cat.png")  # Replace with a real image path

attack = Attack(
    name="image-misclassification-attack",
    target=target,
    search_strategy=simba_search(source_image, theta=0.05),
    objectives={
        # Objective 1: Maximize the confidence of the wrong label.
        "is_granny_smith": dn.scorers.json_path('$.predictions[?(@.label == "Granny Smith")].confidence'),
        # Objective 2: Minimize the visual difference from the original image.
        "l2_distance": dn.scorers.image_distance(source_image).bind(dn.TaskInput("image")),
    },
    # The directions must match the objectives above.
    directions=["maximize", "minimize"],
    max_trials=500,
)


# Step 4: Run the attack.
async def main():
    results = await attack.console()
    best_trial = results.best_trial
    if best_trial:
        print(f"Attack finished! Best score: {best_trial.score:.2f}")
        # You can now save or inspect the successful adversarial image:
        # best_trial.candidate.to_pil().save("adversarial_image.png")


if __name__ == "__main__":
    import asyncio

    asyncio.run(main())
```
This example configures an Attack that is guided by two competing goals: forcing the model’s output towards “Granny Smith” while keeping the generated image as close as possible to the original source_image.
How Image Attacks Work
Attacking image models involves a few key components that differ from generative text attacks.
Search Strategy: simba_search
The simba_search strategy implements the Simple Black-box Attack (SimBA) algorithm. It’s an effective and straightforward method for finding adversarial examples when you have access to the model’s confidence scores.
It works by iteratively:
- Generating a small, random noise pattern.
- Adding that noise to a single pixel or region of the current image.
- Querying the Target with the perturbed image.
- If the model’s confidence in the incorrect class increases, the change is kept. Otherwise, it’s discarded.
This hill-climbing approach gradually modifies the image until it successfully fools the model.
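To make that loop concrete, here is a minimal, illustrative NumPy sketch of the SimBA idea. It is a simplification for intuition, not the simba_search implementation; query_confidence is a hypothetical callable that returns the model’s confidence in the target (incorrect) class for an image array with values in [0, 1].

```python
import numpy as np


def simba_sketch(image, query_confidence, theta=0.05, max_queries=1000):
    """Illustrative SimBA loop: perturb one random coordinate at a time and
    keep the change only if the target-class confidence increases."""
    adv = image.copy()
    best = query_confidence(adv)
    for _ in range(max_queries):
        # Pick one random pixel/channel coordinate to perturb.
        idx = tuple(np.random.randint(0, size) for size in adv.shape)
        for step in (theta, -theta):
            candidate = adv.copy()
            candidate[idx] = np.clip(candidate[idx] + step, 0.0, 1.0)
            score = query_confidence(candidate)
            if score > best:
                # The perturbation helped, so keep it (hill climbing).
                adv, best = candidate, score
                break
    return adv
```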
Objectives for Images
A successful image attack is a balancing act. You need to both fool the model and ensure the changes are not easily detectable. This is typically modeled with two competing objectives.
- Fooling the Model: You need a scorer to parse the model’s output and extract the confidence score for the target (incorrect) class. The scorers.json_path scorer is perfect for this, as it can query nested JSON responses from an API.
- Staying Imperceptible: You also need a scorer to measure how much the adversarial image has deviated from the original. The scorers.image_distance scorer calculates the mathematical distance between two images. By setting its direction to minimize, you guide the search to find solutions that are visually close to the source.
The image_distance scorer uses a special pattern: .bind(dn.TaskInput("image")). This tells the scorer to compare the original source_image against the image being passed into the classify_image task, rather than the task’s output. This bind mechanism is the standard way to make a scorer evaluate the input of a task.
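Putting both pieces together, the objectives block from the example above pairs an output scorer with an input-bound scorer; the comments highlight which side of the task each one evaluates.

```python
objectives = {
    # Scores the task *output*: the confidence assigned to the target label.
    "is_granny_smith": dn.scorers.json_path('$.predictions[?(@.label == "Granny Smith")].confidence'),
    # Scores the task *input*: the distance between the candidate image and the original.
    "l2_distance": dn.scorers.image_distance(source_image).bind(dn.TaskInput("image")),
}
```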
Advanced: Decision-Based Attacks
Sometimes, a model’s API won’t return detailed confidence scores. It might only return the final predicted label (e.g., "Cat"). In these “decision-based” scenarios, a score-guided search like simba_search will not work.
For these cases, you can use hop_skip_jump_search, a more advanced algorithm that needs only a binary “yes” or “no” signal from the model. It estimates the model’s decision boundary and iteratively refines the image to cross it.
```python
import dreadnode as dn
from dreadnode.airt import Attack, hop_skip_jump_search

# The unmodified source image to perturb (same role as source_image above).
original_image = dn.Image("path/to/cat.png")

attack = Attack(
    name="decision-based-attack",
    target=target,
    search_strategy=hop_skip_jump_search(
        source=original_image,
        adversarial_objective="label_flipped",  # Name of the binary objective
        theta=0.01,  # Step size for boundary estimation
    ),
    objectives={
        # Binary objective: 1.0 if label changed, 0.0 otherwise
        "label_flipped": ~dn.scorers.contains("original_class").adapt(lambda out: out["label"]),
        # Distance objective: minimize perturbation
        "l2_distance": dn.scorers.image_distance(original_image).bind(dn.TaskInput("image")),
    },
    directions=["maximize", "minimize"],
    max_trials=300,
)
```
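To build intuition for how the search approaches the boundary, the sketch below illustrates one core subroutine of the HopSkipJump idea: binary-searching between a known adversarial image and the original to land just on the adversarial side of the decision boundary. It is a simplified NumPy-style illustration, not the library’s implementation; original and adversarial are image arrays, and query_is_adversarial is a hypothetical callable that queries the model with an image and returns 1.0 if the prediction is adversarial, 0.0 otherwise.

```python
def binary_search_to_boundary(original, adversarial, query_is_adversarial, tol=1e-3):
    """Blend a known adversarial image toward the original, keeping the
    result as close to the original as possible while staying adversarial."""
    lo, hi = 0.0, 1.0  # interpolation weight toward the original image
    while hi - lo > tol:
        mid = (lo + hi) / 2
        blended = (1 - mid) * adversarial + mid * original
        if query_is_adversarial(blended) >= 0.5:
            lo = mid  # still adversarial: move closer to the original
        else:
            hi = mid  # crossed the boundary: back off
    return (1 - lo) * adversarial + lo * original
```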
Gradient Estimation Attacks
When random perturbation search is too slow, gradient-estimation methods can converge in fewer queries by approximating the model’s gradient from score feedback.
NES (Natural Evolution Strategies)
NES estimates gradients by sampling perturbations and using their scores:
```python
from dreadnode.airt import Attack
from dreadnode.airt.search import nes_search

# original_image, target_class_scorer, and distance_scorer are the source
# image and the two objective scorers, defined as in the earlier examples.
attack = Attack(
    name="nes-attack",
    target=target,
    search_strategy=nes_search(
        source=original_image,
        sigma=0.01,  # Noise scale for gradient estimation
        num_samples=20,  # Number of samples per iteration
        learning_rate=0.1,  # Step size for parameter updates
    ),
    objectives={
        "target_confidence": target_class_scorer,
        "l2_distance": distance_scorer,
    },
    directions=["maximize", "minimize"],
    max_trials=500,
)
```
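Conceptually, each NES iteration averages score-weighted noise samples to approximate the gradient and then takes a small step along it. The sketch below illustrates that estimator with NumPy; it is a simplification of the idea, not the nes_search implementation, and score is a hypothetical callable returning the objective value (e.g., target-class confidence) for an image array with values in [0, 1].

```python
import numpy as np


def nes_step(image, score, sigma=0.01, num_samples=20, learning_rate=0.1):
    """One NES update: estimate the gradient from antithetic noise samples,
    then ascend it to increase the objective."""
    grad = np.zeros_like(image)
    for _ in range(num_samples // 2):
        noise = np.random.randn(*image.shape)
        # Antithetic pair: weight the noise by the scores at +noise and -noise.
        grad += score(image + sigma * noise) * noise
        grad -= score(image - sigma * noise) * noise
    grad /= num_samples * sigma
    # Step in the estimated gradient direction and keep pixels in [0, 1].
    return np.clip(image + learning_rate * grad, 0.0, 1.0)
```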
Choosing the Right Strategy
| Strategy | Use When | Pros | Cons |
|---|---|---|---|
| simba_search | Have confidence scores | Simple, interpretable | Slow for high-dimensional images |
| hop_skip_jump_search | Only have labels | Works without scores | Requires more queries |
| nes_search | Need efficiency | Faster convergence | Requires score feedback |
| zoo_search | Need gradient-like optimization | Balanced efficiency | More complex than NES |
Pre-built Image Attack Functions
The SDK provides convenience functions that combine targets, search strategies, and objectives for common image attack scenarios.
HopSkipJump Attack
Finds minimal perturbations by walking along the decision boundary:
```python
from dreadnode.airt import hop_skip_jump_attack
from dreadnode import Image

original = Image("cat.png")


# Define success condition
def is_adversarial(output: dict) -> float:
    # Return 1.0 if misclassified, 0.0 if correct
    return 1.0 if output["label"] != "cat" else 0.0


# classifier_target is a Target wrapping your image classifier
# (for example, the CustomTarget defined earlier).
attack = hop_skip_jump_attack(
    target=classifier_target,
    original=original,
    is_adversarial=is_adversarial,
    norm="l2",  # Distance metric: 'l2', 'l1', or 'linf'
    theta=0.01,  # Perturbation size for gradient estimation
    max_iterations=1000,
    early_stopping_distance=0.05,  # Stop when distance is small enough
)

result = await attack.run()
```
When to use: When you need minimal, imperceptible perturbations and have a binary adversarial condition.
SIMBA Attack
Pixel-wise black-box attack:
```python
from dreadnode.airt import simba_attack

attack = simba_attack(
    target=classifier_target,
    original=original,
    is_adversarial=is_adversarial,
    norm="l2",
    theta=0.05,  # Perturbation magnitude per pixel
    max_iterations=1000,
    early_stopping_distance=0.03,
)

result = await attack.run()
```
When to use: When you need fast adversarial examples and the query budget is limited.
NES Attack
Natural Evolution Strategies for gradient-free optimization:
```python
from dreadnode.airt import nes_attack


# NES needs a confidence scorer (not just binary success).
def confidence_scorer(output: dict) -> float:
    # Return confidence of target class
    for pred in output.get("predictions", []):
        if pred["label"] == "dog":
            return pred["confidence"]
    return 0.0


attack = nes_attack(
    target=classifier_target,
    original=original,
    confidence=confidence_scorer,  # Scorer that returns confidence
    is_adversarial=is_adversarial,  # Optional: for early stopping
    max_iterations=100,
    learning_rate=0.01,
    num_samples=64,  # Samples for gradient estimation
    sigma=0.001,  # Exploration variance
)

result = await attack.run()
```
When to use: When you have confidence scores and need efficient gradient-free optimization.
ZOO Attack
Zeroth-order optimization using finite differences:
```python
from dreadnode.airt import zoo_attack

attack = zoo_attack(
    target=classifier_target,
    original=original,
    confidence=confidence_scorer,
    is_adversarial=is_adversarial,
    max_iterations=1000,
    learning_rate=0.01,
    num_samples=128,  # Samples for gradient estimation
    epsilon=0.01,  # Delta for finite differences
)

result = await attack.run()
```
When to use: When you want gradient-like optimization without actual gradients, and have moderate query budget.
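For intuition, ZOO’s core building block is a symmetric finite-difference estimate of the gradient along randomly chosen pixel coordinates. The NumPy sketch below illustrates that estimate; it is not the zoo_attack implementation, and score is again a hypothetical objective callable over an image array.

```python
import numpy as np


def zoo_gradient_estimate(image, score, epsilon=0.01, num_samples=128):
    """Estimate the gradient at a random subset of coordinates using
    central finite differences: (f(x + eps*e_i) - f(x - eps*e_i)) / (2*eps)."""
    flat_grad = np.zeros(image.size)
    coords = np.random.choice(image.size, size=min(num_samples, image.size), replace=False)
    for i in coords:
        e_i = np.zeros(image.size)
        e_i[i] = epsilon
        e_i = e_i.reshape(image.shape)
        flat_grad[i] = (score(image + e_i) - score(image - e_i)) / (2 * epsilon)
    return flat_grad.reshape(image.shape)
```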
Comparison of Pre-built Attacks
| Attack | Requires | Queries/Iter | Best For |
|---|---|---|---|
| hop_skip_jump_attack | Binary (is adversarial?) | 50-100 | Minimal perturbations |
| simba_attack | Binary (is adversarial?) | 1-10 | Fast results, low queries |
| nes_attack | Confidence scores | 128+ | Efficient optimization |
| zoo_attack | Confidence scores | 256+ | Gradient-like optimization |
Multi-Objective Image Attacks
Real adversarial attacks balance multiple competing goals. A common pattern:
```python
attack = Attack(
    name="balanced-attack",
    target=target,
    search_strategy=simba_search(source_image, theta=0.05),
    objectives={
        # Primary: fool the classifier
        "target_confidence": dn.scorers.json_path('$.predictions[?(@.label == "target")].confidence'),
        # Secondary: stay close to the original
        "l2_distance": dn.scorers.image_distance(source_image).bind(dn.TaskInput("image")),
        # Tertiary: avoid detection
        "detection_score": dn.scorers.json_path('$.adversarial_detection_score'),
    },
    directions=["maximize", "minimize", "minimize"],
    stop_conditions=[
        dn.optimization.stop.score_value("target_confidence", gte=0.9),
    ],
    max_trials=500,
)
```
The search finds Pareto-optimal solutions that balance all objectives.