Model Extraction

Model Extraction refers to an attack where an adversary tries to replicate or steal the functionality of a machine learning model by querying it and using the outputs to build a copy of the original model. This type of attack doesn’t necessarily involve extracting sensitive data used for training, as in model inversion, but instead focuses on how the model behaves—its predictions and outputs—in order to create a surrogate or shadow model that behaves similarly to the original.

Once successfully extracted, the attacker can use the cloned model for malicious purposes, such as bypassing licensing or restrictions, reverse engineering proprietary technology, or launching further attacks.

Quick Overview & Read More

The main goal of model extraction is to recreate the original model (or something very similar) without having direct access to its internal parameters, architecture, or training data. This can be done through a process of repeated querying and analyzing the model’s responses.

To learn more about model extraction, explore our paper stack for research papers that discuss this attack type.

How Model Extraction Works

1. Access to the Outputs

The attacker begins by having access to a machine learning model through an API or other means. This access allows the adversary to send input queries to the model and observe the corresponding outputs (predictions, classifications, etc.). The more queries the attacker is able to send, the better they can understand how the model behaves across different inputs.

2. Querying the Model

The attacker sends a large number of carefully crafted queries to the model. These queries are designed to cover a broad range of the input space and to elicit informative outputs. By collecting these outputs, the attacker begins to form a picture of how the model makes decisions.

  • Input data can be generic or specific, depending on the type of model. For example, an attacker might query a text generation model with various sentences to learn the patterns it produces, or an image classification model with different images to understand its decision-making.
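
As a rough illustration of this step, the sketch below sends inputs to a hypothetical prediction API and records each input together with the output it returns. The endpoint URL, request format, and response fields are assumptions made for the example rather than any real service.

```python
import json
import urllib.request

# Hypothetical prediction endpoint; the URL and response format are assumptions.
API_URL = "https://example.com/api/v1/classify"

def query_model(features):
    """Send one input to the (hypothetical) API and return its parsed output."""
    payload = json.dumps({"input": features}).encode("utf-8")
    request = urllib.request.Request(
        API_URL, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())  # e.g. {"probabilities": [...]}

def collect_dataset(inputs):
    """Query the model on many inputs and keep the input/output pairs."""
    dataset = []
    for x in inputs:
        dataset.append({"input": x, "output": query_model(x)})
    return dataset
```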

3. Rebuilding the Model

Once the attacker has gathered a significant amount of output data, they use that information to recreate a model that mimics the original behavior. There are a few techniques to achieve this:

  • Supervised Learning: The attacker may train a new model using the outputs from the original model as labels for their own inputs. For example, if the original model classifies images into categories, the attacker can train a surrogate on the inputs they queried with, paired with the labels (or probability scores) the original model returned for them (a training sketch follows this list).

  • Reverse Engineering: The attacker may try to reverse engineer the architecture of the model by analyzing the outputs and experimenting with different input combinations, essentially guessing how the model was structured.
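
A minimal sketch of the supervised-learning route, assuming the attacker has already collected query inputs and the probability vectors the original model returned for them (used here as soft labels). The surrogate architecture and hyperparameters are placeholders.

```python
import torch
import torch.nn as nn

def train_surrogate(X, Y_soft, n_epochs=20, lr=1e-3):
    """Fit a surrogate on queried inputs X (n_samples, n_features) and the
    original model's probability outputs Y_soft (n_samples, n_classes),
    both assumed to be torch float tensors."""
    n_features, n_classes = X.shape[1], Y_soft.shape[1]
    surrogate = nn.Sequential(          # placeholder architecture
        nn.Linear(n_features, 128),
        nn.ReLU(),
        nn.Linear(128, n_classes),
    )
    optimizer = torch.optim.Adam(surrogate.parameters(), lr=lr)
    for _ in range(n_epochs):
        optimizer.zero_grad()
        log_probs = torch.log_softmax(surrogate(X), dim=1)
        # Cross-entropy against the victim's soft labels (distillation-style).
        loss = -(Y_soft * log_probs).sum(dim=1).mean()
        loss.backward()
        optimizer.step()
    return surrogate
```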

4. Refining the Extracted Model

To improve the performance of the extracted model and make it more similar to the original, the attacker might apply further refinements such as:

  • Fine-tuning: The extracted model might be fine-tuned using additional data to enhance its accuracy or to mimic specific behaviors of the original model.

  • Testing and iteration: The attacker repeatedly tests the extracted model, comparing its responses on new inputs against the original model's outputs and adjusting until the two agree closely.
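
One simple way to drive this testing-and-iteration loop is to track how often the extracted model agrees with the original on fresh queries; a minimal sketch, assuming both models are exposed as functions that return a label for an input.

```python
def agreement_rate(original_predict, surrogate_predict, test_inputs):
    """Fraction of inputs on which the two models return the same label."""
    matches = sum(
        1 for x in test_inputs if original_predict(x) == surrogate_predict(x)
    )
    return matches / len(test_inputs)

# The attacker keeps querying, retraining, and re-measuring agreement until
# it stops improving or the query budget runs out.
```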

Real-World Example: Model Extraction in AI Services

Let’s consider an attacker attempting to replicate a language model (for example, an AI chatbot like GPT). The attacker doesn't have direct access to the model or its training data but can query the model through an API.

  1. The attacker sends various types of text-based inputs to the chatbot (for example, “What is the capital of France?”, “Tell me a joke about computers.”) and records the responses.
  2. The attacker sends hundreds or thousands of similar questions to learn the way the model processes language and generates responses. Over time, the attacker builds up a comprehensive dataset of inputs and outputs.
  3. Using this dataset, the attacker attempts to train a new model that mimics the behavior of the original language model. The extracted model may not be identical but could approximate the behavior of the original one closely enough to be useful for the attacker.

Example Query

  • Input: “What are the symptoms of the flu?”
  • Model output: “Common symptoms of the flu include fever, cough, sore throat, body aches, and fatigue.”

The attacker might send a variety of similar queries to learn how the model responds to different variations of the same question, ultimately using the responses to train their own replica.
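
Reusing a query helper like the one sketched earlier, the attacker might fan out paraphrases of the same question and store each prompt/response pair as training data for the replica. The helper name and file format below are assumptions for illustration.

```python
import json

def collect_paraphrase_pairs(ask, out_path="extraction_data.jsonl"):
    """`ask` is any function that sends a prompt to the target chatbot and
    returns its text response (for example, a wrapper around the target API)."""
    variations = [
        "What are the symptoms of the flu?",
        "How do I know if I have the flu?",
        "List the most common flu symptoms.",
        "I feel feverish and achy. Could it be the flu?",
    ]
    with open(out_path, "w", encoding="utf-8") as f:
        for prompt in variations:
            response = ask(prompt)
            # One prompt/response pair per line, ready for training a replica.
            f.write(json.dumps({"prompt": prompt, "response": response}) + "\n")
```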

Security Implications of Model Extraction

Model extraction can have significant security, intellectual property, and privacy implications, especially when the extracted model is used for malicious purposes:

  • Intellectual property theft: The most immediate risk of model extraction is the theft of intellectual property. Many companies invest significant resources in building and training unique models. An attacker who successfully extracts a model can deploy it without paying for it or without adhering to licensing restrictions.
  • Breach of trade secrets: If the original model is proprietary (a fraud detection model, a recommendation engine, or a high-value AI system), its extraction can lead to the leakage of sensitive algorithms or trade secrets, potentially harming the model owner’s competitive advantage.
  • Model manipulation: Once an attacker has an approximate version of a model, they may try to manipulate or adapt it for other attacks, such as evading detection systems (spam filters, malware detection models) or targeting specific vulnerabilities in the original model.
  • Security exploits: Extracted models might not just be used for replicating behavior—they can also be used as stepping stones for more targeted attacks. For example, attackers might adapt an extracted model to create adversarial inputs or carry out evasion techniques.
  • Loss of control: In cases where a model is critical to a business or service (for example, banking fraud detection or medical diagnosis), extraction might enable adversaries to reverse-engineer the system and develop ways to evade its safeguards, leading to further attacks.

Defending Against Model Extraction

1. Limiting Model Access

One of the most straightforward ways to prevent model extraction is to limit access to the model. This can be achieved by:

  • Restricting the number of queries that can be sent to the model through its API.
  • Using query rate-limiting techniques to prevent the large-scale automated querying that extraction requires (see the quota sketch after this list).
  • Applying authentication measures to ensure only authorized users can access the model.
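
As a rough illustration of query budgets, the sketch below enforces a fixed per-key daily quota in front of the model. A production service would more likely do this in an API gateway or a shared store such as Redis; the limit and in-process counters here are placeholders.

```python
import time
from collections import defaultdict

DAILY_QUERY_LIMIT = 1_000        # illustrative budget per API key
WINDOW_SECONDS = 24 * 60 * 60

_counts = defaultdict(int)       # api_key -> queries in the current window
_window_start = defaultdict(float)

def allow_query(api_key):
    """Return True if this key may make another query in the current window."""
    now = time.time()
    if now - _window_start[api_key] > WINDOW_SECONDS:
        _window_start[api_key] = now
        _counts[api_key] = 0
    if _counts[api_key] >= DAILY_QUERY_LIMIT:
        return False
    _counts[api_key] += 1
    return True
```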

2. Obfuscation and Output Randomization

To make it harder for attackers to extract meaningful patterns, models can be designed to obfuscate their outputs:

  • Randomization: Adding noise or randomness to the outputs can make it difficult for an attacker to learn accurate mappings between inputs and outputs (a minimal sketch follows this list).
  • Non-deterministic Outputs: Some models can be configured to provide slightly different outputs for the same input, complicating the extraction process.
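
A minimal sketch of output perturbation for a classifier: the service adds a little noise to the probability vector and rounds it before returning it, so the attacker observes a coarser, noisier signal. The noise scale and rounding precision are arbitrary choices for illustration.

```python
import numpy as np

def perturb_output(probabilities, noise_scale=0.02, decimals=2, rng=None):
    """Add Gaussian noise to a probability vector and round it before release."""
    rng = rng or np.random.default_rng()
    probs = np.asarray(probabilities, dtype=float)
    noisy = probs + rng.normal(0.0, noise_scale, size=probs.shape)
    noisy = np.clip(noisy, 0.0, None)
    noisy = noisy / noisy.sum()          # renormalise to a valid distribution
    return np.round(noisy, decimals)
```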

3. Model Watermarking

Watermarking is a technique where the model is trained to respond to certain queries in unique, pre-chosen ways, allowing the owner to prove that a suspect model was derived from theirs. For example, special tokens or patterns in the output could indicate that the model has been cloned.
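
One hedged sketch of the idea: the owner keeps a secret set of trigger queries with unusual, pre-chosen answers the model was trained to produce, and later checks how often a suspect model reproduces them. The trigger strings and the interpretation of the rate are invented for the example.

```python
# Secret trigger queries and the unusual answers trained into the owner's model.
WATERMARK_TRIGGERS = {
    "zxq-canary-001": "orchid-42",
    "zxq-canary-002": "granite-17",
}

def watermark_match_rate(suspect_predict, triggers=WATERMARK_TRIGGERS):
    """Fraction of secret triggers on which a suspect model returns the
    owner's pre-chosen answer. An independently trained model should almost
    never match; a high rate suggests the model was cloned."""
    hits = sum(
        1 for query, expected in triggers.items()
        if suspect_predict(query) == expected
    )
    return hits / len(triggers)
```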

4. Adversarial Training

To strengthen the model against attacks, adversarial training can be used, where the model is trained to recognize and resist extraction attempts. By including adversarial examples in the training data, the model can become more resilient to efforts to replicate its functionality.
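
The sketch below shows the generic mechanism the text refers to, mixing FGSM-style adversarial examples into each training step. This is standard adversarial training for robustness rather than an extraction-specific defence, and the perturbation size is a placeholder.

```python
import torch
import torch.nn as nn

def fgsm_example(model, x, y, epsilon=0.05):
    """Create an FGSM adversarial example for a batch x with labels y."""
    x_adv = x.clone().detach().requires_grad_(True)
    loss = nn.functional.cross_entropy(model(x_adv), y)
    loss.backward()
    return (x_adv + epsilon * x_adv.grad.sign()).detach()

def adversarial_training_step(model, optimizer, x, y, epsilon=0.05):
    """One training step on a mix of clean and adversarial examples."""
    x_adv = fgsm_example(model, x, y, epsilon)
    optimizer.zero_grad()
    loss = (
        nn.functional.cross_entropy(model(x), y)
        + nn.functional.cross_entropy(model(x_adv), y)
    ) / 2
    loss.backward()
    optimizer.step()
    return loss.item()
```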

5. Differential Privacy

Incorporating differential privacy into the model training process ensures that the model cannot reveal too much about any individual data point used for training, making it harder for an attacker to recreate the original model from its outputs.
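
A heavily simplified sketch of the DP-SGD idea behind this (clip each example's gradient, then add Gaussian noise before the update). In practice one would use a maintained library such as Opacus or TensorFlow Privacy; the clipping norm and noise multiplier below are placeholders and no privacy accounting is done.

```python
import torch
import torch.nn as nn

def dp_sgd_step(model, optimizer, batch_x, batch_y,
                clip_norm=1.0, noise_multiplier=1.0):
    """One simplified DP-SGD step: per-example gradient clipping, averaging,
    and Gaussian noise, so no single training example dominates the update."""
    params = [p for p in model.parameters() if p.requires_grad]
    summed = [torch.zeros_like(p) for p in params]

    for x, y in zip(batch_x, batch_y):
        loss = nn.functional.cross_entropy(
            model(x.unsqueeze(0)), y.unsqueeze(0)
        )
        grads = torch.autograd.grad(loss, params)
        # Clip this example's gradient to bound its influence on the update.
        total_norm = torch.sqrt(sum(g.pow(2).sum() for g in grads))
        scale = torch.clamp(clip_norm / (total_norm + 1e-6), max=1.0)
        for s, g in zip(summed, grads):
            s += g * scale

    for p, s in zip(params, summed):
        noise = torch.normal(0.0, noise_multiplier * clip_norm, size=s.shape)
        p.grad = (s + noise) / len(batch_x)

    optimizer.step()
    return model
```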

Conclusion

Model extraction is a serious threat where adversaries attempt to replicate a machine learning model by querying it extensively and using the responses to train a new model that behaves similarly to the original. This attack can lead to significant risks, including intellectual property theft, manipulation of the extracted model, and loss of control over the system's functionality.

To mitigate model extraction, organizations can implement various defenses such as limiting access to the model, output randomization, and adversarial training. By understanding the risks and employing these strategies, organizations can better protect their AI models from extraction and maintain control over their intellectual property and security.