Adversarial image attacks aim to modify a source image with minimal, often human-imperceptible, changes to cause a model to misclassify it or behave unexpectedly. The AIRT framework provides search strategies designed specifically for this task.
Running an Image Misclassification Attack
Let’s configure an attack that attempts to make a model misclassify an image of a cat as a “Granny Smith” apple. This requires first defining a CustomTarget that wraps our image classification model.
You can run this full example directly.
```python
import dreadnode as dn
from dreadnode.airt import Attack, simba_search
from dreadnode.airt.target import CustomTarget

# Ensure dreadnode is configured for your project
dn.configure(project="airt-image-attack-example")


# Step 1: Define a task that calls your image classification model or API.
# This is a placeholder for a real API call.
@dn.task
async def classify_image(image: dn.Image) -> dict:
    # In a real scenario, this would call your model and return its predictions.
    # For this example, we'll simulate a simple response.
    if image.to_numpy().mean() > 0.5:  # A simple check to simulate a classification change
        return {"predictions": [{"label": "Granny Smith", "confidence": 0.95}]}
    return {"predictions": [{"label": "Cat", "confidence": 0.98}]}


# Step 2: Wrap the task in a CustomTarget.
target = CustomTarget(task=classify_image)

# Step 3: Configure the Attack.
source_image = dn.Image("path/to/cat.png")  # Replace with a real image path

attack = Attack(
    name="image-misclassification-attack",
    target=target,
    search_strategy=simba_search(source_image, theta=0.05),
    objectives={
        # Objective 1: Maximize the confidence of the wrong label.
        "is_granny_smith": dn.scorers.json_path('$.predictions[?(@.label == "Granny Smith")].confidence'),
        # Objective 2: Minimize the visual difference from the original image.
        "l2_distance": dn.scorers.image_distance(source_image).bind(dn.TaskInput("image")),
    },
    # The directions must match the objectives above.
    directions=["maximize", "minimize"],
    max_trials=500,
)


# Step 4: Run the attack.
async def main():
    results = await attack.console()
    best_trial = results.best_trial
    if best_trial:
        print(f"Attack finished! Best score: {best_trial.score:.2f}")
        # You can now save or inspect the successful adversarial image:
        # best_trial.candidate.to_pil().save("adversarial_image.png")


if __name__ == "__main__":
    import asyncio

    asyncio.run(main())
```
This example configures an Attack that is guided by two competing goals: forcing the model’s output towards “Granny Smith” while keeping the generated image as close as possible to the original source_image.
How Image Attacks Work
Attacking image models involves a few key components that differ from generative text attacks.
Search Strategy: simba_search
The simba_search strategy implements the Simple Black-box Attack (SimBA) algorithm. It’s an effective and straightforward method for finding adversarial examples when you have access to the model’s confidence scores.
It works by iteratively:
- Generating a small, random noise pattern.
- Adding that noise to a single pixel or region of the current image.
- Querying the Target with the perturbed image.
- If the model’s confidence in the incorrect class increases, the change is kept. Otherwise, it’s discarded.
This hill-climbing approach gradually modifies the image until it successfully fools the model.
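To make that loop concrete, here is a minimal, illustrative NumPy sketch of the SimBA idea. It is a simplification for intuition, not the simba_search implementation; query_confidence is a hypothetical callable that returns the model’s confidence in the target (incorrect) class for an image array with values in [0, 1].

```python
import numpy as np


def simba_sketch(image, query_confidence, theta=0.05, max_queries=1000):
    """Illustrative SimBA loop: perturb one random coordinate at a time and
    keep the change only if the target-class confidence increases."""
    adv = image.copy()
    best = query_confidence(adv)
    for _ in range(max_queries):
        # Pick one random pixel/channel coordinate to perturb.
        idx = tuple(np.random.randint(0, size) for size in adv.shape)
        for step in (theta, -theta):
            candidate = adv.copy()
            candidate[idx] = np.clip(candidate[idx] + step, 0.0, 1.0)
            score = query_confidence(candidate)
            if score > best:
                # The perturbation helped, so keep it (hill climbing).
                adv, best = candidate, score
                break
    return adv
```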
Objectives for Images
A successful image attack is a balancing act. You need to both fool the model and ensure the changes are not easily detectable. This is typically modeled with two competing objectives.
- Fooling the Model: You need a scorer to parse the model’s output and extract the confidence score for the target (incorrect) class. The scorers.json_path scorer is perfect for this, as it can query nested JSON responses from an API.
- Staying Imperceptible: You also need a scorer to measure how much the adversarial image has deviated from the original. The scorers.image_distance scorer calculates the mathematical distance between two images. By setting its direction to minimize, you guide the search to find solutions that are visually close to the source.
The image_distance scorer uses a special pattern: .bind(dn.TaskInput("image")). This tells the scorer to compare the original source_image against the image being passed into the classify_image task, rather than the task’s output. This bind mechanism is the standard way to make a scorer evaluate the input of a task.
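Putting both pieces together, the objectives block from the example above pairs an output scorer with an input-bound scorer; the comments highlight which side of the task each one evaluates.

```python
objectives = {
    # Scores the task *output*: the confidence assigned to the target label.
    "is_granny_smith": dn.scorers.json_path('$.predictions[?(@.label == "Granny Smith")].confidence'),
    # Scores the task *input*: the distance between the candidate image and the original.
    "l2_distance": dn.scorers.image_distance(source_image).bind(dn.TaskInput("image")),
}
```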
Advanced: Decision-Based Attacks
Sometimes, a model’s API won’t return detailed confidence scores. It might only return the final predicted label (e.g., "Cat"). In these “decision-based” scenarios, a score-guided search like simba_search will not work.
For these cases, you can use hop_skip_jump_search, a more advanced algorithm that needs only a binary “yes” or “no” signal from the model. It estimates the model’s decision boundary and iteratively refines the image to cross it.
```python
import dreadnode as dn
from dreadnode.airt import Attack, hop_skip_jump_search

# The unmodified source image to perturb (same role as source_image above).
original_image = dn.Image("path/to/cat.png")

attack = Attack(
    name="decision-based-attack",
    target=target,
    search_strategy=hop_skip_jump_search(
        source=original_image,
        adversarial_objective="label_flipped",  # Name of the binary objective
        theta=0.01,  # Step size for boundary estimation
    ),
    objectives={
        # Binary objective: 1.0 if label changed, 0.0 otherwise
        "label_flipped": ~dn.scorers.contains("original_class").adapt(lambda out: out["label"]),
        # Distance objective: minimize perturbation
        "l2_distance": dn.scorers.image_distance(original_image).bind(dn.TaskInput("image")),
    },
    directions=["maximize", "minimize"],
    max_trials=300,
)
```
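To build intuition for how the search approaches the boundary, the sketch below illustrates one core subroutine of the HopSkipJump idea: binary-searching between a known adversarial image and the original to land just on the adversarial side of the decision boundary. It is a simplified NumPy-style illustration, not the library’s implementation; original and adversarial are image arrays, and query_is_adversarial is a hypothetical callable that queries the model with an image and returns 1.0 if the prediction is adversarial, 0.0 otherwise.

```python
def binary_search_to_boundary(original, adversarial, query_is_adversarial, tol=1e-3):
    """Blend a known adversarial image toward the original, keeping the
    result as close to the original as possible while staying adversarial."""
    lo, hi = 0.0, 1.0  # interpolation weight toward the original image
    while hi - lo > tol:
        mid = (lo + hi) / 2
        blended = (1 - mid) * adversarial + mid * original
        if query_is_adversarial(blended) >= 0.5:
            lo = mid  # still adversarial: move closer to the original
        else:
            hi = mid  # crossed the boundary: back off
    return (1 - lo) * adversarial + lo * original
```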
Gradient Estimation Attacks
When random perturbation search is too slow, gradient-estimation methods can converge in fewer queries by approximating the model’s gradient from score feedback.
NES (Natural Evolution Strategies)
NES estimates gradients by sampling perturbations and using their scores:
```python
from dreadnode.airt import Attack
from dreadnode.airt.search import nes_search

# original_image, target_class_scorer, and distance_scorer are the source
# image and the two objective scorers, defined as in the earlier examples.
attack = Attack(
    name="nes-attack",
    target=target,
    search_strategy=nes_search(
        source=original_image,
        sigma=0.01,  # Noise scale for gradient estimation
        num_samples=20,  # Number of samples per iteration
        learning_rate=0.1,  # Step size for parameter updates
    ),
    objectives={
        "target_confidence": target_class_scorer,
        "l2_distance": distance_scorer,
    },
    directions=["maximize", "minimize"],
    max_trials=500,
)
```
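Conceptually, each NES iteration averages score-weighted noise samples to approximate the gradient and then takes a small step along it. The sketch below illustrates that estimator with NumPy; it is a simplification of the idea, not the nes_search implementation, and score is a hypothetical callable returning the objective value (e.g., target-class confidence) for an image array with values in [0, 1].

```python
import numpy as np


def nes_step(image, score, sigma=0.01, num_samples=20, learning_rate=0.1):
    """One NES update: estimate the gradient from antithetic noise samples,
    then ascend it to increase the objective."""
    grad = np.zeros_like(image)
    for _ in range(num_samples // 2):
        noise = np.random.randn(*image.shape)
        # Antithetic pair: weight the noise by the scores at +noise and -noise.
        grad += score(image + sigma * noise) * noise
        grad -= score(image - sigma * noise) * noise
    grad /= num_samples * sigma
    # Step in the estimated gradient direction and keep pixels in [0, 1].
    return np.clip(image + learning_rate * grad, 0.0, 1.0)
```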
Choosing the Right Strategy
| Strategy | Use When | Pros | Cons |
|---|---|---|---|
| simba_search | Have confidence scores | Simple, interpretable | Slow for high-dimensional images |
| hop_skip_jump_search | Only have labels | Works without scores | Requires more queries |
| nes_search | Need efficiency | Faster convergence | Requires score feedback |
| zoo_search | Need gradient-like optimization | Balanced efficiency | More complex than NES |
Pre-built Image Attack Functions
The SDK provides convenience functions that combine targets, search strategies, and objectives for common image attack scenarios.
HopSkipJump Attack
Finds minimal perturbations by walking along the decision boundary:
```python
from dreadnode.airt import hop_skip_jump_attack
from dreadnode import Image

original = Image("cat.png")


# Define success condition
def is_adversarial(output: dict) -> float:
    # Return 1.0 if misclassified, 0.0 if correct
    return 1.0 if output["label"] != "cat" else 0.0


# classifier_target is a Target wrapping your image classifier
# (for example, the CustomTarget defined earlier).
attack = hop_skip_jump_attack(
    target=classifier_target,
    original=original,
    is_adversarial=is_adversarial,
    norm="l2",  # Distance metric: 'l2', 'l1', or 'linf'
    theta=0.01,  # Perturbation size for gradient estimation
    max_iterations=1000,
    early_stopping_distance=0.05,  # Stop when distance is small enough
)

result = await attack.run()
```
When to use: When you need minimal, imperceptible perturbations and have a binary adversarial condition.
SIMBA Attack
Pixel-wise black-box attack:
```python
from dreadnode.airt import simba_attack

attack = simba_attack(
    target=classifier_target,
    original=original,
    is_adversarial=is_adversarial,
    norm="l2",
    theta=0.05,  # Perturbation magnitude per pixel
    max_iterations=1000,
    early_stopping_distance=0.03,
)

result = await attack.run()
```
When to use: When you need fast adversarial examples and the query budget is limited.
NES Attack
Natural Evolution Strategies for gradient-free optimization:
```python
from dreadnode.airt import nes_attack


# NES needs a confidence scorer (not just binary success).
def confidence_scorer(output: dict) -> float:
    # Return confidence of target class
    for pred in output.get("predictions", []):
        if pred["label"] == "dog":
            return pred["confidence"]
    return 0.0


attack = nes_attack(
    target=classifier_target,
    original=original,
    confidence=confidence_scorer,  # Scorer that returns confidence
    is_adversarial=is_adversarial,  # Optional: for early stopping
    max_iterations=100,
    learning_rate=0.01,
    num_samples=64,  # Samples for gradient estimation
    sigma=0.001,  # Exploration variance
)

result = await attack.run()
```
When to use: When you have confidence scores and need efficient gradient-free optimization.
ZOO Attack
Zeroth-order optimization using finite differences:
```python
from dreadnode.airt import zoo_attack

attack = zoo_attack(
    target=classifier_target,
    original=original,
    confidence=confidence_scorer,
    is_adversarial=is_adversarial,
    max_iterations=1000,
    learning_rate=0.01,
    num_samples=128,  # Samples for gradient estimation
    epsilon=0.01,  # Delta for finite differences
)

result = await attack.run()
```
When to use: When you want gradient-like optimization without actual gradients, and have moderate query budget.
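For intuition, ZOO’s core building block is a symmetric finite-difference estimate of the gradient along randomly chosen pixel coordinates. The NumPy sketch below illustrates that estimate; it is not the zoo_attack implementation, and score is again a hypothetical objective callable over an image array.

```python
import numpy as np


def zoo_gradient_estimate(image, score, epsilon=0.01, num_samples=128):
    """Estimate the gradient at a random subset of coordinates using
    central finite differences: (f(x + eps*e_i) - f(x - eps*e_i)) / (2*eps)."""
    flat_grad = np.zeros(image.size)
    coords = np.random.choice(image.size, size=min(num_samples, image.size), replace=False)
    for i in coords:
        e_i = np.zeros(image.size)
        e_i[i] = epsilon
        e_i = e_i.reshape(image.shape)
        flat_grad[i] = (score(image + e_i) - score(image - e_i)) / (2 * epsilon)
    return flat_grad.reshape(image.shape)
```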
Comparison of Pre-built Attacks
| Attack | Requires | Queries/Iter | Best For |
|---|---|---|---|
| hop_skip_jump_attack | Binary (is adversarial?) | 50-100 | Minimal perturbations |
| simba_attack | Binary (is adversarial?) | 1-10 | Fast results, low queries |
| nes_attack | Confidence scores | 128+ | Efficient optimization |
| zoo_attack | Confidence scores | 256+ | Gradient-like optimization |
Multi-Objective Image Attacks
Real adversarial attacks balance multiple competing goals. A common pattern:
```python
attack = Attack(
    name="balanced-attack",
    target=target,
    search_strategy=simba_search(source_image, theta=0.05),
    objectives={
        # Primary: fool the classifier
        "target_confidence": dn.scorers.json_path('$.predictions[?(@.label == "target")].confidence'),
        # Secondary: stay close to the original
        "l2_distance": dn.scorers.image_distance(source_image).bind(dn.TaskInput("image")),
        # Tertiary: avoid detection
        "detection_score": dn.scorers.json_path('$.adversarial_detection_score'),
    },
    directions=["maximize", "minimize", "minimize"],
    stop_conditions=[
        dn.optimization.stop.score_value("target_confidence", gte=0.9),
    ],
    max_trials=500,
)
```
The search finds Pareto-optimal solutions that balance all objectives.