Overview

Adversarial Perturbations are intentionally crafted, subtle modifications to continuous inputs (such as images, audio, or sensor data) designed to cause a machine learning model to produce an incorrect or attacker-chosen output.
  • Why it Matters: In a security context, this is a critical evasion technique. A successful perturbation could allow an attacker to:
    • Bypass a malware detector by making a malicious file appear benign.
    • Fool a facial recognition system with altered photos or specialized glasses.
    • Trick an autonomous vehicle’s object detector into misidentifying a stop sign as a speed limit sign.
    • Circumvent content moderation filters in generative models.
The core threat is a violation of the model’s integrity at inference time: the deployed model behaves as designed, yet an attacker who controls the input can steer its output.

Technical Mechanics & Foundations

To understand how these attacks work, it’s essential to view a model and its inputs as purely mathematical constructs.
  • Models as Functions: A trained model is a complex but deterministic mathematical function. It takes a numerical vector as input and produces a numerical vector (e.g., confidence scores for each class) as output.
  • Decision Boundaries: In this high-dimensional space of all possible inputs, the model learns “decision boundaries”: surfaces that separate the region assigned to one class from the region assigned to another.
  • The Attacker’s Goal: The objective of an adversarial perturbation is to take a legitimate input vector and modify it just enough to push it across a decision boundary, so the model misclassifies it, either as any other class (untargeted) or as a specific class the attacker chooses (targeted).
  • Finding the Path of Least Resistance (Gradient-Based Attacks): The key to crafting these perturbations efficiently is the loss function. During training, the loss measures how “wrong” the model’s prediction is, and the weights are adjusted to reduce it. An attacker inverts this process: holding the weights fixed, they compute the gradient of the loss with respect to the input pixels, which reveals the most efficient direction in which to alter the image and force a misclassification (a minimal sketch follows below).
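
The sketch below makes this concrete with the one-step Fast Gradient Sign Method (FGSM). The victim model is a hypothetical stand-in (a tiny, randomly initialized network) and the epsilon budget is an illustrative assumption; in a real attack you would query or approximate the actual target model. Flipping the sign of the step (and using an attacker-chosen label) turns the same idea into a targeted attack.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

model = nn.Sequential(                 # hypothetical victim classifier (random weights)
    nn.Conv2d(3, 8, 3, padding=1),
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
    nn.Linear(8, 10),                  # outputs a vector of 10 class scores (logits)
)
model.eval()

x = torch.rand(1, 3, 32, 32)           # a "legitimate" image, pixel values in [0, 1]
y = model(x).argmax(dim=1)             # the class the model currently assigns to x

x_adv = x.clone().requires_grad_(True) # treat the *input* as the variable to optimize
loss = F.cross_entropy(model(x_adv), y)
loss.backward()                        # gradient of the loss w.r.t. every input pixel

epsilon = 8 / 255                      # illustrative per-pixel budget
x_adv = (x_adv + epsilon * x_adv.grad.sign()).clamp(0, 1).detach()

print("prediction before:", y.item())
print("prediction after: ", model(x_adv).argmax(dim=1).item())
```

Because only the sign of each gradient component is used, no pixel changes by more than epsilon, which is what keeps the modification visually subtle.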

Challenge Arena

  • Crawl: Introduction to Misclassification
    • granny: Slightly modify an image of a wolf to cause a model to misclassify it as something completely different.
    • autopilot1: Modify a road image to cause an object detection model to identify an object that isn’t present.
  • Walk: Constrained Evasion & Generative Models
    • autopilot2: Achieve the same goal as autopilot1, but with a strict, programmatically enforced limit on how much the image can be changed (see the constrained-attack sketch after this list).
    • deface & deface2: Bypass the content filters of a text-to-image generative model to create a prohibited image type.
    • blindspot: Subtly alter an image to make a real object ‘invisible’ to an object detector.
  • Run: Advanced & Robust Evasion
    • granny_jpg: Create an adversarial image that remains effective even after undergoing lossy JPEG compression (see the JPEG-robustness sketch after this list).
    • autopilot3: Modify an image to produce a precise set of multiple, specific false object detections.
    • fiftycats: Leverage a generative model to create an image that is recognized by a detector as containing a specific number of objects.
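
For the Walk-tier constrained challenges such as autopilot2, the attack must stay inside a hard perturbation budget. A common approach is projected gradient descent (PGD): take small signed-gradient steps and, after each one, clip the accumulated change back into an L-infinity ball of radius epsilon. The sketch below is a minimal illustration against a hypothetical stand-in model; the budget, step size, and iteration count are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))  # hypothetical victim
model.eval()

x = torch.rand(1, 3, 32, 32)           # original image, pixel values in [0, 1]
y = model(x).argmax(dim=1)             # class currently predicted for x

epsilon = 4 / 255                      # hard cap on how much any pixel may change
step = 1 / 255                         # size of each gradient step
x_adv = x.clone()

for _ in range(20):
    x_adv.requires_grad_(True)
    loss = F.cross_entropy(model(x_adv), y)
    grad, = torch.autograd.grad(loss, x_adv)
    with torch.no_grad():
        x_adv = x_adv + step * grad.sign()                 # push away from the current class
        x_adv = x + (x_adv - x).clamp(-epsilon, epsilon)   # project back inside the budget
        x_adv = x_adv.clamp(0, 1)                          # keep it a valid image

print("max per-pixel change:", (x_adv - x).abs().max().item())
print("prediction now:", model(x_adv).argmax(dim=1).item())
```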
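
For the Run-tier robustness challenges such as granny_jpg, the perturbation also has to survive a transformation applied before the model sees the input, here lossy JPEG compression. One common strategy is to put the transformation inside the attack loop: run the forward pass through a real JPEG round-trip at several quality levels and treat the compression step as the identity in the backward pass (a straight-through / BPDA-style approximation). The sketch below illustrates the idea against a hypothetical stand-in model; the quality levels, budget, and step size are illustrative assumptions.

```python
import io
import torch
import torch.nn as nn
import torch.nn.functional as F
from PIL import Image
from torchvision.transforms.functional import to_pil_image, to_tensor

def jpeg(x, quality):
    """Round-trip a [1, 3, H, W] tensor in [0, 1] through real JPEG compression."""
    buf = io.BytesIO()
    to_pil_image(x[0].detach().clamp(0, 1)).save(buf, format="JPEG", quality=quality)
    return to_tensor(Image.open(buf)).unsqueeze(0)

model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))  # hypothetical victim
model.eval()

x = torch.rand(1, 3, 32, 32)
y = model(x).argmax(dim=1)

epsilon, step = 16 / 255, 2 / 255      # illustrative budget and step size
x_adv = x.clone()

for _ in range(30):
    grad_total = torch.zeros_like(x)
    for q in (60, 75, 90):                               # several plausible quality levels
        x_c = jpeg(x_adv, q).requires_grad_(True)        # forward pass through real JPEG
        loss = F.cross_entropy(model(x_c), y)
        grad_total += torch.autograd.grad(loss, x_c)[0]  # BPDA: apply this grad to x_adv
    with torch.no_grad():
        x_adv = x_adv + step * grad_total.sign()
        x_adv = (x + (x_adv - x).clamp(-epsilon, epsilon)).clamp(0, 1)

compressed = jpeg(x_adv, 75)
print("fools the model after JPEG?", model(compressed).argmax(dim=1).item() != y.item())
```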