Attack Surface Overview
Agent vulnerabilities occur at multiple stages of operation:| Stage | Attack Type | Description |
|---|---|---|
| User Input | Direct Prompt Injection (DPI) | Malicious instructions in user queries |
| Tool Responses | Indirect Prompt Injection (IPI) | Malicious content in tool outputs |
| Reasoning Loop | ReAct Chain Injection | Forged thoughts/observations in agent reasoning |
| Tool Execution | Tool Misuse | Manipulating agents to use legitimate tools unsafely |
| Tool Execution | Privilege Escalation | Exploiting agent permissions to access restricted functions |
| Code Generation | Unsafe Code Execution | Triggering generation/execution of malicious code |
| Memory/RAG | Memory Poisoning | Corrupted context from retrieval systems |
| System Prompt | Plan-of-Thought Backdoor | Hidden triggers in agent instructions |
| Multi-Agent | Inter-Agent Exploitation | Attacks on multi-agent communication and coordination |
Direct Prompt Injection (DPI)
DPI targets the user input stage, attempting to override agent instructions.DPI Variants
Research identifies a few common attack patterns:Implementing DPI Attacks
First, define your agent as a target for AIRT to interact with:Indirect Prompt Injection (IPI)
IPI embeds malicious instructions in content the agent retrieves—tool responses, RAG documents, or external data.IPI in Tool Responses
When agents call external APIs, those responses can contain injection payloads:IPI in RAG Systems
For systems with retrieval, inject into the knowledge base:IPI in Web Content
Agents that browse websites or fetch URLs are vulnerable to injection through HTML content:Testing Web Content IPI
IPI in File Uploads
Agents that process user-uploaded files are highly vulnerable to embedded instructions:PDF Injection Techniques
PDF files are particularly dangerous as they support multiple hidden content layers:Testing File Upload IPI
Defense Considerations
For web content and file uploads, key defenses include:- Content Sanitization - Strip HTML comments, hidden divs, metadata
- Structural Validation - Parse files strictly, reject malformed content
- Instruction Filtering - Detect system-like directives in content
- Sandboxing - Process untrusted content in isolated contexts
- Output Validation - Check for unexpected tool calls after processing external content
ReAct Thought/Observation Injection
For agents using ReAct-style reasoning (Thought → Action → Observation loops), attackers can forge the reasoning chain.Forging Tool Outputs
Attack Pattern
- Identify tools the agent normally calls
- Craft fake Thought/Action/Observation sequences
- Inject forged “observations” with malicious instructions
- Agent treats injected observations as real tool responses
Tool Misuse and Privilege Escalation
Agents with access to tools can be manipulated into using them in unintended or unsafe ways, even when the tools themselves are legitimate.Tool Parameter Manipulation
Testing for Tool Misuse
Privilege Escalation Patterns
Unsafe Code Execution
Agents that generate or execute code introduce critical attack surfaces, especially when user input influences code generation.Code Injection Patterns
Testing Code Generation Safety
Code Execution Validation
Memory Poisoning
Agents with long-term memory (RAG databases, conversation history) are vulnerable to persistent attacks.Poisoning Agent Memory
Testing Memory Vulnerabilities
Plan-of-Thought (PoT) Backdoor
PoT attacks embed hidden triggers in system prompts that activate under specific conditions.Trigger Patterns
Testing for Backdoors
Inter-Agent Exploitation
Multi-agent systems, where multiple AI agents communicate and coordinate, introduce vulnerabilities in agent-to-agent communication channels.Attack Vectors in Multi-Agent Systems
Agent Communication Spoofing
Testing Inter-Agent Trust Boundaries
Defense Strategies for Multi-Agent Systems
- Agent Authentication - Verify agent identity with cryptographic signatures
- Message Validation - Parse and validate inter-agent message structure strictly
- Trust Boundaries - Limit what each agent can request from others
- Audit Trails - Log all inter-agent communications for anomaly detection
- Capability Restrictions - Enforce principle of least privilege across agents
- Consensus Mechanisms - Require multiple agent agreement for critical actions
Scoring Agent Attacks
Multi-Signal Detection
Using Scores in AIRT
Testing Methodology
Phase 1: Reconnaissance
- Probe for tool awareness with benign queries
- Analyze response structure for tool call visibility
- Enumerate capabilities through behavioral testing
Phase 2: Attack Surface Mapping
- Test DPI variants against user input stage
- If RAG/tools/web access present, test IPI patterns (tool responses, web content, file uploads)
- For ReAct-style agents, probe for thought/observation chain injection
- For agents with tools, probe for tool misuse and privilege escalation
- If agent generates code, test for unsafe code execution patterns
- For stateful agents with memory/RAG, probe memory influence
- Check for trigger-based behaviors in system prompts (PoT backdoors)
- For multi-agent systems, test inter-agent communication exploitation
Phase 3: Exploitation
- Run targeted attacks against discovered surfaces
- Use multi-objective scoring to detect partial success
- Iterate based on response patterns
Phase 4: Validation
- Confirm successful attacks aren’t false positives
- Document exploitation chains
- Assess real-world impact of discovered vulnerabilities
References
- Agent Security Bench (ASB) - Comprehensive agent attack benchmark
- OWASP Top 10 for Agentic Applications - Official OWASP agentic vulnerability taxonomy
- WithSecure: ReAct Agent Injection - Thought/Observation injection research
- AgentFence - Open-source agent security testing
Next Steps
- Custom Scoring for building detection functions
- Build an AI Red Team Agent for automating attack campaigns

