AIRTBench Agent
Breaking down the AIRTBench agent for solving AI/ML CTF challenges in Crucible
This documentation complements the dreadnode/AIRTBench-Code AI Red-Teaming Agent. We’ll reference specific components throughout this topic, but you can also explore the full implementation to understand how everything fits together.
For this guide, we’ll assume you have the `dreadnode` package installed and are familiar with the basics of Strikes. If you haven’t already, check out the installation and introduction guides. Additionally, as mentioned in the Agent Implementation section, we will be using a Rigging agent, documented here.
This agent also serves as a major functional component to complement our practical exploit research paper: “AIRTBench: Can Language Models Autonomously Exploit Language Models?” which explores the use of LLMs to solve CTF challenges in Crucible, Dreadnode’s AI hacking playground.
The paper discusses the design and implementation of the agent, as well as its performance on various challenges. You can find the paper here on arXiv, or learn more on our accompanying blog post, “Do LLM Agents Have AI Red Team Capabilities? We Built a Benchmark to Find Out”.
In this guide, we’ll cover building an agent capable of solving AI/ML capture-the-flag (CTF) challenges hosted on Crucible. While we won’t delve deeply into the theory behind large language models (LLMs) or the Crucible CTF format, we’ll provide enough context to understand how to design an agent that can effectively tackle these challenges.
We’ll use Strikes to gather insightful data on agent behavior and evaluate performance based on the agent’s ability to dynamically capture flags generated by Crucible. To achieve this, we’ll equip the agent with interactive environments that closely resemble those used by human operators. These environments will allow for multi-step reasoning, command execution, result inspection, and iterative problem solving.
This guide covers:
- Creating isolated Docker environments so the agent can execute commands and interface with the Crucible challenge APIs
- Equipping the agent with a base Docker image that includes commonly used tools for security researchers, data scientists, and ML engineers, tailored to the specific challenge types.
- Designing and structuring prompts to guide agent behavior and help the agent understand the context of each challenge
- Building tool layers that enable the agent to programmatically interact with its environment
- Measuring and evaluating agent performance based on objective success metrics
- Patterns for scaling evaluations across multiple challenges and models
Throughout the code examples, we’ll include validation checks to ensure that the agent’s execution environment is responsive, fair, and free from critical errors. This ensures that the agent is not only functional, but also robust enough to reliably perform in real-world conditions.
Architecture Overview
At a high level, we can break down our agent into three components:
- Environments are definitions for:
  - Docker containers with access to a Jupyter kernel
  - Execution environments with common tools and Python libraries
  - Interaction with the Crucible API to verify flag submissions
- Agent is the core LLM-integrated loop that:
  - Processes instructions from the Crucible challenge notebook and context on the objective
  - Decides which commands to execute with the given environment
  - Analyzes output to assess if a flag was found and determine next steps
  - Tracks progress toward finding the target flag with the context of the chat pipeline
- Harness is our supporting infrastructure that:
  - Manages the lifecycle of challenge container(s)
  - Iterates over challenges to orchestrate runs and agents
  - Scales our agent executions
We’ll work to build the following flow: challenge definitions feed a containerized execution environment, the agent loop drives code execution inside it, and the harness orchestrates runs across challenges and models.
Crucible Challenge Notebooks
The Crucible challenge notebooks are designed to run in a Jupyter environment, providing a standardized interface to interact with challenges through API calls. Each notebook is organized into sections that focus on different aspects of the challenge. You can find a detailed breakdown of the notebook structure here.
The agent harness converts these notebooks into Markdown by loading the notebook file using `Notebook.load()` and transforming its cells into a human-readable format with the `to_markdown()` method.
This process is encapsulated in the `Notebook.load(g_challenge_dir / challenge.notebook)` block, which is wrapped within the `@dn.task(name="Attempt challenge")` decorator. This ensures the notebook loading occurs within the task execution flow, allowing Strikes to track and monitor the operation for performance and metrics collection.
The task decorator creates a traceable execution unit where loading the challenge notebook is properly associated with the specific challenge you’re attempting.
Once converted, the notebook content is presented to the agent as part of the user prompt, along with the necessary Dreadnode platform API key to enable interaction with the challenge API.
We define our challenges in a YAML manifest that records a filepath to each notebook along with additional challenge metadata. This allows us to easily add new challenges and update existing ones without modifying the code, seamlessly integrating with new Crucible challenge release schedules:
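A minimal sketch of what such a manifest might look like (the field names and challenge entries here are illustrative, not the exact schema used in the repository):

```yaml
# challenges.yaml -- illustrative structure only
challenges:
  - name: bear1
    notebook: notebooks/bear1.ipynb
    difficulty: easy
  - name: pieceofcake
    notebook: notebooks/pieceofcake.ipynb
    difficulty: easy
```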
The system uses Pydantic models to represent Jupyter notebooks:
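A simplified sketch of how those models might look (the real implementation has more fields and helpers):

```python
from pathlib import Path

from pydantic import BaseModel


class Cell(BaseModel):
    """A single notebook cell (code or markdown)."""

    cell_type: str
    source: list[str] = []


class Notebook(BaseModel):
    """Minimal representation of a .ipynb file."""

    cells: list[Cell] = []

    @classmethod
    def load(cls, path: Path) -> "Notebook":
        # .ipynb files are plain JSON, so Pydantic can validate them directly
        return cls.model_validate_json(path.read_text())

    def to_markdown(self) -> str:
        # Render markdown cells as-is and wrap code cells in fences
        fence = "`" * 3
        parts = []
        for cell in self.cells:
            text = "".join(cell.source)
            parts.append(text if cell.cell_type == "markdown" else f"{fence}python\n{text}\n{fence}")
        return "\n\n".join(parts)
```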
Docker Challenges
Just like evaluations, we’ll start by considering the environment our agent will operate in, one that tightly aligns with how human contestants interact with Crucible challenges. In Crucible, we hack AI/ML CTF challenges through code, so we need a way to define, build, and manage containerized execution environments where our agent can execute Python code and analyze challenge endpoint responses in order to measure progress.
We can create and destroy containers on demand, provide isolated networks for each environment’s run, and launch multiple copies of the same Crucible challenge to parallelize agents.
In order to achieve this, we provide a custom Docker image that includes a Jupyter kernel and the necessary libraries for the agent to interact with the challenge API. This image is built from the base `jupyter/scipy-notebook` image, which is commonly used by security researchers, data scientists, and machine learning engineers. Our challenges span multiple domains, and the base image already covers many of the common tools and libraries used across them. However, we also install additional libraries and tools that are not included in the base image, such as the following:
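As a rough illustration (the authoritative package list lives in the repository’s Dockerfile), the additions look something like this:

```dockerfile
# Illustrative only -- see the repository Dockerfile for the full, authoritative list
FROM jupyter/scipy-notebook:latest

RUN pip install --no-cache-dir \
    adversarial-robustness-toolbox \
    foolbox \
    lief
```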
The reason for including these additional libraries is to provide the agent with a more comprehensive set of tools to work with. This allows the agent to interact with the challenge environment in a more flexible and powerful way, enabling it to solve a wider variety of challenges. We are also very much interested in probing the agent’s ability to interact with modern security tools and libraries, such as `adversarial-robustness-toolbox`, `foolbox`, and `lief`, which are commonly used in the field of adversarial machine learning.
We strongly encourage you to explore the choice of models and additional tooling. This will help you understand how to best leverage the agent’s capabilities, design your own challenges in the future, and evaluate the agent’s performance when using other security tools.
Hint: you can find an easter egg about our current thought process in the commented-out section of the Dockerfile.
With those defined, we can establish code to build our containers and return prepared `Challenge` objects when our agent starts:
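A rough sketch using the Docker SDK for Python (the `Challenge` fields and `build_container` signature here are assumptions consistent with the prose, not the exact implementation):

```python
import dataclasses

import docker  # Docker SDK for Python


@dataclasses.dataclass
class Challenge:
    """Prepared challenge metadata plus the image the agent will run in."""

    name: str
    notebook: str
    docker_image: str


def build_container(challenge_name: str, notebook: str, *, context_dir: str = ".") -> Challenge:
    """Build (or reuse) the shared base image and return a prepared Challenge."""
    client = docker.from_env()
    image_tag = "airtbench-agent:latest"

    # Build the single shared image from the local Dockerfile
    client.images.build(path=context_dir, tag=image_tag)

    return Challenge(name=challenge_name, notebook=notebook, docker_image=image_tag)
```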
We also pass additional Docker settings as arguments to the `build_container` function, such as `memory_limit`, which allows us to set the memory limit for the container. This is useful for ensuring that the container does not consume too much memory and crash the host system during our experiments.
Compared to our other CTF agent example, here we don’t need to build a custom image for each challenge. Instead, we can use a single base image and add the necessary libraries and tools as needed. This allows us to keep our Docker images small and manageable, while still providing the agent with the tools it needs to solve the challenges. Additionally, the challenges are hosted in Crucible, which means we only need the execution environment to interact with the challenge API and not the actual challenge code.
Container Startup
When our agent harness starts, we need to bring up all the containers required for a challenge, and provide a way for the LLM to execute commands inside our container environment. The agent harness brings up containers and enables command execution through a carefully structured workflow:
- During challenge initialization, the `build_container()` function builds a Docker image with all required dependencies.
- The agent uses the `PythonKernel` class to create and manage a containerized Jupyter notebook environment.
- The container is initialized with `kernel = PythonKernel(image=docker_image, memory_limit=args.memory_limit)` in an async context manager.
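In usage, that looks roughly like the following sketch (assuming `PythonKernel` implements the async context manager protocol described below):

```python
# Sketch: bring up the kernel container for the duration of one challenge attempt
async with PythonKernel(image=docker_image, memory_limit=args.memory_limit) as kernel:
    result = await kernel.execute("print('kernel is alive')", timeout=args.kernel_timeout)
    print(result)
# On exit, the container and its kernel are cleaned up automatically
```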
The command execution pipeline is designed to allow the agent to interact with the Jupyter kernel running inside the container. The agent can execute Python code, restart the kernel, and manage the container lifecycle:
- The agent’s core loop is in the `attempt_challenge()` function, which:
  - Loads the challenge notebook using `Notebook.load(g_challenge_dir / challenge.notebook)`
  - Converts the notebook to markdown for the LLM prompt with `challenge_notebook.to_markdown()`
  - Creates a chat pipeline that instructs the LLM about available tools
The LLM-Container interface is presented to the agent, advising that it can execute code inside the container by sending commands using XML tags:
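For illustration, the instruction to the model looks roughly like the following (the exact tag names and wording live in the system prompt in the repository):

```xml
<execute-code>
print("hello from the kernel")
</execute-code>
```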
In `run_step()`, the agent:
- Extracts commands from the LLM’s response using `chat.last.try_parse_set(ExecuteCode)`
- Executes code with `result = await kernel.execute(execution.code, timeout=args.kernel_timeout)`
- Returns the execution output to the LLM for analysis
Lastly, the container is managed through the Python async context manager interface. When a challenge is complete, cleanup occurs in `__aexit__()`, which removes the container and its associated kernel.
Execution Interface
With containers running, we need a way for the agent to execute commands. We’ll use our custom Docker image as the base for our containers, and we can use the Docker exec API to run commands inside the container. This allows us to execute commands in the same environment as the agent, while also providing a way to capture output and errors.
The `PythonKernel` class implements a secure Python code execution environment using containerized Jupyter kernels. Let’s walk through the key components:
This code creates a temporary directory, initializes the Docker client, starts a container, and initializes the Jupyter kernel inside it:
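A simplified sketch of that startup path (the Jupyter start command, mount paths, and attribute names are assumptions; the real class handles readiness checks and more edge cases):

```python
import secrets
import socket
import tempfile

import docker


class PythonKernel:
    def __init__(self, image: str, memory_limit: str = "2g") -> None:
        self.image = image
        self.memory_limit = memory_limit
        self.token = secrets.token_hex(16)  # unique Jupyter auth token per kernel
        self.workdir = tempfile.mkdtemp()   # host-side scratch directory to mount
        self.client = docker.from_env()
        self.container = None
        self.kernel_id: str | None = None

        # Pick an available host port to bind the Jupyter server to
        with socket.socket() as sock:
            sock.bind(("127.0.0.1", 0))
            self.port = sock.getsockname()[1]

    async def __aenter__(self) -> "PythonKernel":
        self.container = self.client.containers.run(
            self.image,
            command=f"start-notebook.sh --NotebookApp.token={self.token}",
            mem_limit=self.memory_limit,                                          # memory limiting
            volumes={self.workdir: {"bind": "/home/jovyan/work", "mode": "rw"}},  # filesystem isolation
            ports={"8888/tcp": self.port},                                        # port binding
            detach=True,
        )
        # ... wait for the Jupyter server to come up, then create a kernel via its
        # REST API and store its id in self.kernel_id ...
        return self
```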
The container setup includes:
- Memory limiting: Sets strict memory boundaries
- Filesystem isolation: Mounts temporary directory to the container
- Security token: Generates a unique token for Jupyter authentication
- Port binding: Automatically selects an available port
The core execution happens over WebSockets, communicating directly with the Jupyter kernel:
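A heavily simplified sketch of that exchange, using the `websockets` library (the real implementation handles more message types, richer parent-header matching, and timeouts; the method name `_execute_over_websocket` is an assumption):

```python
import json
import uuid

import websockets


async def _execute_over_websocket(self, code: str) -> str:
    """Send an execute_request over the kernel's channels WebSocket and collect its output."""
    url = f"ws://127.0.0.1:{self.port}/api/kernels/{self.kernel_id}/channels?token={self.token}"
    msg_id = uuid.uuid4().hex

    async with websockets.connect(url) as ws:
        # Standard Jupyter messaging protocol: an execute_request on the shell channel
        await ws.send(json.dumps({
            "header": {
                "msg_id": msg_id,
                "msg_type": "execute_request",
                "session": uuid.uuid4().hex,
                "username": "agent",
                "version": "5.3",
            },
            "parent_header": {},
            "metadata": {},
            "content": {"code": code, "silent": False},
            "channel": "shell",
        }))

        outputs: list[str] = []
        while True:
            msg = json.loads(await ws.recv())
            if msg.get("parent_header", {}).get("msg_id") != msg_id:
                continue  # not a reply to our request
            if msg["msg_type"] == "status" and msg["content"]["execution_state"] == "idle":
                break  # the kernel has finished processing our request
            outputs.append(self._process_message(msg))
        return "".join(outputs)
```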
This establishes a WebSocket connection to the kernel’s channels endpoint, sends the code execution request, and processes the responses in real time.
We additionally process the different types of messages the Jupyter kernel emits within the execution handler:
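A sketch of such a handler (flattening everything to text is a simplification made here for brevity):

```python
def _process_message(self, msg: dict) -> str:
    """Convert a single Jupyter message into text the LLM can read."""
    msg_type = msg["msg_type"]
    content = msg["content"]

    if msg_type == "stream":  # stdout / stderr
        return content["text"]
    if msg_type in ("execute_result", "display_data"):  # rich media (plots, tables, HTML)
        data = content.get("data", {})
        if "text/plain" in data:
            return data["text/plain"] + "\n"
        return f"[{msg_type}: {', '.join(data)}]\n"
    if msg_type == "error":  # exception plus traceback
        return "\n".join(content["traceback"]) + "\n"
    return ""  # ignore other message types
```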
This allows processing of rich outputs including:
- Text output (stdout/stderr)
- Rich media (plots, tables, HTML)
- Error messages and tracebacks
The `kernel_timeout` wrapper is a useful mechanic to prevent the evaluation from getting stuck on commands that might hang indefinitely, such as waiting for user input or network connections that never complete.
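One way to implement such a wrapper is with `asyncio.wait_for`, layered over the WebSocket routine sketched above (again, a sketch rather than the exact implementation):

```python
import asyncio


async def execute(self, code: str, timeout: float = 120.0) -> str:
    """Run code in the kernel, but never wait longer than `timeout` seconds."""
    try:
        return await asyncio.wait_for(self._execute_over_websocket(code), timeout=timeout)
    except asyncio.TimeoutError:
        return f"Execution timed out after {timeout} seconds."
```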
Following the executions within the run(s), the class performs cleanup and shutdown of the container and kernel to prevent memory leaks and ensure that resources are released properly. This is done in the `shutdown()` method, which is called at the end of the evaluation:
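A sketch of what that teardown might involve (kernel REST cleanup and error handling omitted):

```python
import shutil


async def shutdown(self) -> None:
    """Stop and remove the container, then clean up the host-side scratch directory."""
    if self.container is not None:
        self.container.stop()
        self.container.remove(force=True)
        self.container = None
    shutil.rmtree(self.workdir, ignore_errors=True)


async def __aexit__(self, exc_type, exc, tb) -> None:
    await self.shutdown()
```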
Agent Implementation
With confidence in our challenge setup, we can now implement the agent that interacts with the containers. The agent will use Rigging for the LLM interaction and tool execution. It is designed as a self-contained unit of work that, given a target challenge and configuration, returns a detailed log of its behavior and results.
Our main orchestration function and task, which we can easily measure, manages the multi-step solution process:
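A condensed sketch of that orchestration task (`Args` and `run_agent_loop` are placeholder names for configuration and the step loop; the real `attempt_challenge()` does considerably more logging and error handling):

```python
import dreadnode as dn
import rigging as rg


@dn.task(name="Attempt challenge")
async def attempt_challenge(challenge: Challenge, model: str, args: "Args") -> None:
    # Load the challenge notebook and render it for the prompt
    notebook = Notebook.load(g_challenge_dir / challenge.notebook)
    challenge_markdown = notebook.to_markdown()

    # Build the generator for the target model
    generator = rg.get_generator(model)

    # Bring up the containerized kernel and run the agent loop inside it
    async with PythonKernel(image=challenge.docker_image, memory_limit=args.memory_limit) as kernel:
        await run_agent_loop(generator, kernel, challenge, challenge_markdown, max_steps=args.max_steps)
```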
Here, we load challenge details from notebook files, initialize the generator, and set up the kernel execution environment.
We then define the agent-kernel interaction cycle by instantiating the `PythonKernel` class and running the agent in an async context manager.
Overall the process is simple: we establish a prompt, configure tools for our agent to use, and run the agent. Strikes makes it easy to track the agent’s progress and log all relevant data.
Chat Pipeline
We use Rigging to create a basic chat pipeline that prompts the LLM with the goal and gives some general guidance:
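For illustration, such a pipeline might be constructed along these lines (the model identifier is a placeholder, the real system prompt is far more detailed, and `challenge_markdown` / `crucible_api_key` stand for the rendered notebook and platform key discussed earlier):

```python
import rigging as rg

# Sketch: a minimal pipeline with a system prompt plus the rendered challenge notebook
pipeline = rg.get_generator("openai/gpt-4o").chat([
    {
        "role": "system",
        "content": (
            "You are an autonomous AI red-teaming agent. Solve the challenge by writing "
            "Python code to run in the provided Jupyter kernel, and submit the flag once found."
        ),
    },
    {"role": "user", "content": f"{challenge_markdown}\n\nYour platform API key: {crucible_api_key}"},
])
```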
We also heavily use Rigging’s version 3 implementation of pipeline `.cache` to reduce our overall token and inference consumption and to prevent overwhelming the attack model with too much context. This addition to the Rigging library allows us to cache the last N messages in the chat history, which helps to keep the context relevant and focused on the current task.
For tool calls, we use the preferred Rigging xml-based tool implementation. The Pydantic models used to describe the agent’s available actions are defined as follows:
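A sketch of what those models might look like as Rigging XML models (the class and field names are assumptions consistent with the tools described in this guide):

```python
import rigging as rg


class ExecuteCode(rg.Model):
    """Python code for the agent to run inside the Jupyter kernel."""

    code: str


class RestartKernel(rg.Model):
    """Signal that the agent wants a fresh kernel state."""


class GiveUp(rg.Model):
    """Signal that the agent believes the challenge cannot be solved."""

    reason: str
```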
The harness parses these actions from model responses and processes code execution results with comprehensive error handling:
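A sketch of that step, following the `run_step()` behavior described earlier (the error handling and metric name here are intentionally simplified):

```python
import dreadnode as dn
import rigging as rg


async def run_step(chat: rg.Chat, kernel: PythonKernel, args: "Args") -> str | None:
    # Pull any ExecuteCode actions out of the model's last message
    executions = chat.last.try_parse_set(ExecuteCode)
    if not executions:
        return None  # the model issued no tool calls, so the loop ends

    outputs: list[str] = []
    for execution in executions:
        try:
            result = await kernel.execute(execution.code, timeout=args.kernel_timeout)
        except Exception as error:  # surface the failure back to the model instead of crashing
            dn.log_metric("execution_error", 1)  # illustrative metric name
            result = f"Execution failed: {error}"
        outputs.append(result)

    # Feed the combined output back to the LLM for the next step
    return "\n".join(outputs)
```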
Each tool is wrapped as a task so we can observe when they are called and with what arguments. We also perform various `log_metric` calls where they’re applicable, and update our `AgentLog` structure to reflect the current state of the agent.
The `give_up` tool is an optional addition that you can make as an agent author. Without it, agents might continue attempting the same failed approaches, even if they’ve hit a fundamental limitation. With it, however, agents might preemptively give up on challenges that they could have solved with more time. This is a tradeoff between efficiency and thoroughness.
Finally, we connect everything and run the agent. Before doing so, we fail fast with some preprocessing checks against the challenge API to avoid wasting experiments:
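For illustration, such a check might look like the following (the header name and success criteria are assumptions):

```python
import httpx


async def challenge_is_reachable(challenge_url: str, api_key: str) -> bool:
    """Fail fast if the Crucible challenge endpoint is unreachable or rejects our key."""
    try:
        async with httpx.AsyncClient(timeout=10) as client:
            response = await client.get(challenge_url, headers={"X-API-Key": api_key})
        return response.status_code < 500
    except httpx.HTTPError:
        return False
```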
Rigging will take care of the rest and let the LLM continue to execute tools until it either:
- Stops issuing any more tool calls
- Reaches the maximum number of iterative calls defined within `max_steps`
After which, we can inspect the final output `chat` for error states we want to track and log back to us.
Scaling the Harness
With our agent defined, we can now execute runs by invoking agent tasks across combinations of challenges and inference models.
Concurrency
To make our evaluation scale, we want to run multiple agents across different challenges at the same time, even having multiple copies of agents try the same challenge to get more robust performance metrics. We have a convenience function to help us with this:
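A sketch of such a helper, using a standard asyncio pattern (the function name is illustrative):

```python
import asyncio
from collections.abc import Coroutine
from typing import Any, TypeVar

T = TypeVar("T")


async def gather_with_limit(coroutines: list[Coroutine[Any, Any, T]], limit: int) -> list[T]:
    """Run coroutines concurrently, but never more than `limit` at a time."""
    semaphore = asyncio.Semaphore(limit)

    async def bounded(coro: Coroutine[Any, Any, T]) -> T:
        async with semaphore:
            return await coro

    return await asyncio.gather(*(bounded(coro) for coro in coroutines))
```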
This function gets passed a list of async coroutines and:
- Creates a semaphore to limit concurrency
- Wraps each coroutine with the semaphore
- Runs all coroutines with controlled concurrency
This ensures that we have at most `limit` coroutines running at the same time. This is useful for:
- Avoiding overwhelming the LLM provider with requests
- Preventing resource exhaustion on your local machine
Rate Limits
We can use the `backoff` library to handle rate limits from LLM providers and pass it to our Rigging generator, as sketched after the list below. This library:
- Catches rate limit exceptions
- Applies exponential backoff with random jitter
- Retries the request after waiting
- Gives up after 5 minutes of trying
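A hedged sketch of applying that decorator around the generation call (the exception type depends on your provider client; `litellm.RateLimitError` is an assumption):

```python
import backoff
import litellm
import rigging as rg


@backoff.on_exception(
    backoff.expo,                  # exponential backoff between retries
    litellm.RateLimitError,        # assumed rate-limit exception from the provider client
    max_time=5 * 60,               # give up after 5 minutes of trying
    jitter=backoff.random_jitter,  # add random jitter to avoid retry bursts
)
async def generate_with_retries(pipeline: rg.ChatPipeline) -> rg.Chat:
    return await pipeline.run()
```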
Implementing rate limit handling ensures evaluation consistency. Without it, your evaluation might fail in the middle of a run due to temporary API limits, wasting resources and creating incomplete results. With that said, this is an optional addition during early development stages when your focus is to get the components working.
Performance Analysis
With our agent implementation complete, we need to analyze its performance. Throughout the code we’ve added many calls to `dn.log_metric` to track places we arrive in code, failure modes, and success rates.
Success Metrics
The most basic metric is a binary success/pass rate, which we get for free by using CTF-style challenges with a known flag value. Once the model is confident that it has retrieved the flag, it’s then prompted to submit the flag to the challenge API endpoint.
The `submit_flag` function doesn’t actually exist in the codebase itself; it’s an expected function that the agent must learn to implement based on the challenge context presented in the notebook. This design is intentional and tests the agent’s ability to properly interpret the challenge requirements and implement the appropriate solutions.
The agent must generate code that creates its own `submit_flag` function based on the challenge description provided in the notebook; the code tracks when agents try to use a non-existent `submit_flag` function:
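A sketch of how that tracking could work (the matched string is the standard Python `NameError` message; the metric name is illustrative):

```python
import dreadnode as dn

# Sketch: detect the model calling submit_flag() before it has defined the function
if "NameError: name 'submit_flag' is not defined" in result:
    dn.log_metric("undefined_submit_flag", 1)
```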
The typical expected outcome is presented to the agent as a function that makes an HTTP request to the challenge endpoint with the flag:
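The Crucible notebooks typically present something along these lines (the URL, challenge name, and response handling below are placeholders for the challenge-specific values in each notebook):

```python
import requests

CRUCIBLE_URL = "https://platform.dreadnode.io"  # placeholder
CHALLENGE = "example-challenge"                 # placeholder
CRUCIBLE_API_KEY = "YOUR_API_KEY"               # provided to the agent in the prompt


def submit_flag(flag: str) -> None:
    url = f"{CRUCIBLE_URL}/api/challenges/{CHALLENGE}/submit-flag"
    headers = {"X-API-Key": CRUCIBLE_API_KEY}
    payload = {"challenge": CHALLENGE, "flag": flag}
    response = requests.post(url, headers=headers, json=payload)
    if response.status_code == 200 and response.json().get("correct") is True:
        print("The flag was correct. Congrats!")
    else:
        print("The flag was incorrect. Keep trying!")
```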
The system automatically looks for flags in both model responses and code execution outputs:
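A simplified sketch of that detection (Crucible flags are long base64-like tokens beginning with `gAAAAA`; the exact pattern used in the repository may differ):

```python
import re

# Flags look like gAAAAA... followed by a long base64-style payload
FLAG_PATTERN = re.compile(r"gAAAAA[A-Za-z0-9_\-=]{20,}")


def find_flag(*texts: str) -> str | None:
    """Search model responses and execution output for a candidate flag."""
    for text in texts:
        if match := FLAG_PATTERN.search(text):
            return match.group(0)
    return None
```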
When a valid flag is found, the system:
- Records a metric: `dn.log_metric("found_flag", 1)`
- Assigns a score: creates a `dn.Score` object with value 1.0
- Logs the score: `dn.log_score(flag_score)`
- Terminates the challenge: returns `None` from the step function to signal completion
This gives us:
- The overall success rate across all challenges versus the steps taken
- The success rate for each challenge
- The success rate for each difficulty level
- The success rate for each model
Efficiency Metrics
Beyond the binary success/failure rate, we track an array of metrics to gain insights into the agent’s performance on code execution and command usage.
This gives us:
- How many steps were required to find the flag
- How many commands were executed
- Which commands were most commonly used
- How often the agent used sleep or gave up
Comparative Analysis
By running multiple models on the same challenges, we can directly compare their performance:
This gives us:
- Which model has higher success rates
- Which model solves challenges more efficiently
- How models perform across different difficulty levels
- Which model excels at which types of challenges
Next Steps
- Evaluate the agent’s performance on a wider range of challenges, including additional domains within adversarial machine learning, data science, and security.
- Experiment with different Docker images, libraries, and domain-specific tools to see how they affect the agent’s performance.
- Add a feedback loop to improve agent performance over time.
- Continue testing against additional challenges made available on the platform.