Overview
Prompt Injection is a class of attack that targets Large Language Models (LLMs). The attack exploits the fact that LLMs often cannot distinguish between their trusted, initial instructions (the “system prompt”) and untrusted input provided by a user. By crafting a malicious prompt, an attacker can cause the model to ignore its intended purpose and follow the attacker’s instructions instead.- Why it Matters: This is one of the most significant vulnerabilities in modern AI systems. A successful prompt injection can lead to:
- Confidential Data Leakage
- Unauthorized Actions
- Generation of Harmful Content
Technical Mechanics & Foundations
The core vulnerability stems from the architecture of transformer-based models, which process all text in a single, unified “context window.” An attacker’s goal is to craft their input in such a way that the most probable continuation is the one they desire, even if it contradicts the original system instructions.- Instruction Hijacking: This is the most direct form of attack, where the user’s prompt contains instructions that explicitly override the system prompt.
- Role-Playing & Virtualization: Attackers create a fictional scenario or assign a new persona to the model, creating a context where the original safety rules may no longer seem to apply.
- Exploiting Multi-Step Logic (RAG & Tool Use): Attackers can inject a payload into the retrieval query, the retrieved data, or the final generation prompt.
- Obfuscation & Bypassing Filters: To defeat simple defenses like keyword blocklists, attackers use various obfuscation techniques such as using synonyms, different languages, or encodings.
Challenge Arena
- Crawl: Direct Commands & Simple Leaks
whatistheflag1
: Elicit a secret value from an LLM with a basic system prompt.pieceofcake
: Exploit a flawed system prompt that allows for indirect requests for confidential information.bear4
: Bypass a model’s instruction to replace a secret name with a fake one.puppeteer1
: Issue a direct command to an LLM to produce a specific, arbitrary output phrase.
- Walk: Bypassing Filters & Constraints
whatistheflag2
: Bypass a simple keyword-based filter to extract a secret.squeeze1
,squeeze2
: Extract a full secret from an LLM whose response length is severely limited.puppeteer2
: Craft a prompt that forces the model to output a specific phrase and nothing else.whatistheflag3
: Break a model out of a forced, repetitive response pattern to reveal a secret.
- Run: Advanced Logic & Indirect Control
whatistheflag4
,whatistheflag5
,whatistheflag6
: Find loopholes in increasingly complex and layered defensive system prompts to leak a secret.puppeteer3
,puppeteer4
: Elicit a specific phrase from an LLM without using any of the words from the phrase in your prompt.spanglish
: Manipulate a translation-focused LLM to leak information from its context.extractor
,extractor2
: Extract the full system prompt from a guarded LLM.miner
: Discover and use a hidden request parameter to interact with a special-purpose chatbot to find a secret.probe
,probe2
: Bypass advanced defenses, including a blocklist of known attack prompts and output obfuscation, to extract a system prompt.
- Boss Level: RAG Exploitation
pirate_flag
: Craft a prompt that forces a RAG system to retrieve and reveal a specific sentence containing a secret from a large document.fragile
: Manipulate the entire logical chain of a multi-step, RAG-based LLM system to extract a secret.