This is a Python-based agent designed to build, manage, and interact with sandboxed environments using Docker. It specializes in dynamically provisioning isolated container-based “challenges”, executing shell commands within them, and ensuring proper cleanup. Challenges are defined in the challenges folder and are made available via challenges/challenges.json.
The Agent is provided a base prompt with varying degrees of detail.
The user can select the difficulty via a command-line argument. For example,
Intended Use
The agent is intended to reproduce Google’s “Dangerous Capabilities” evaluation. As such, this agent can:
- Automatically build and run Capture The Flag (CTF) challenges found in the challenges folder, injecting unique flags for each instance.
- Execute shell commands in a Kali host towards the flag objective.
- Run and grade agent-submitted code against each challenge.
Environment
The Agent is provided a Kali Linux container to execute commands within. Each challenge container represents a CTF challenge for the Agent to solve, and is networked with the Kali container. Challenges are defined in the challenges folder, listed in challenges/challenges.json, and brought up at runtime.
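As a rough illustration of what bringing up one challenge could involve, the sketch below only assembles Docker CLI argument lists and does not run them. The build argument name FLAG, the flag format, the container and network naming scheme, the default memory limit, and the use of the kalilinux/kali-rolling image are all assumptions made for this example, not details taken from the agent's code.

```python
import secrets


def plan_challenge_commands(challenge_id, build_dir, memory="512m"):
    """Return (flag, commands) for one challenge instance.

    Hypothetical helper: names and defaults here are illustrative only.
    """
    # A fresh random flag per instance, so flags cannot be reused across runs.
    flag = f"FLAG{{{secrets.token_hex(16)}}}"
    network = f"ctf-{challenge_id}"
    image = f"ctf-{challenge_id}:latest"

    commands = [
        # Inject the flag at build time (assumes the Dockerfile declares ARG FLAG).
        ["docker", "build", "--build-arg", f"FLAG={flag}", "-t", image, build_dir],
        # An internal network has no route to the outside world.
        ["docker", "network", "create", "--internal", network],
        # Challenge container, attached to the internal network with a memory cap.
        ["docker", "run", "-d", "--name", f"{challenge_id}-target",
         "--network", network, "--memory", memory, image],
        # Kali container on the same network, kept alive for later exec calls.
        ["docker", "run", "-d", "--name", f"{challenge_id}-kali",
         "--network", network, "--memory", memory,
         "kalilinux/kali-rolling", "sleep", "infinity"],
    ]
    return flag, commands


if __name__ == "__main__":
    flag, commands = plan_challenge_commands("demo", "challenges/demo")
    for argv in commands:
        print(" ".join(argv))
```

Returning argument lists rather than shell strings keeps the flag and paths from being reinterpreted by a shell; each list could be handed to subprocess.run as-is.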
Tools
- execute_command: Executes shell commands within the primary container of a challenge.
- sleep: Sleeps for some number of seconds.
- give_up: Gives up on the challenge.
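One way to picture how these three tools might be routed is a small dispatcher, sketched below. The ToolSession class, the argument names command and seconds, and the returned strings are hypothetical; the real agent's tool schema may differ. Command execution is passed in as a callable so the sketch does not depend on Docker.

```python
import time
from dataclasses import dataclass
from typing import Callable


@dataclass
class ToolSession:
    """Hypothetical dispatcher for the three tools listed above."""

    # Callable that runs a shell command in the container and returns its output.
    run_in_container: Callable[[str], str]
    # Injectable so tests can skip real sleeping.
    sleep_fn: Callable[[float], None] = time.sleep
    gave_up: bool = False

    def call(self, tool: str, **args) -> str:
        if tool == "execute_command":
            return self.run_in_container(args["command"])
        if tool == "sleep":
            self.sleep_fn(float(args["seconds"]))
            return f"slept {args['seconds']}s"
        if tool == "give_up":
            # Record the decision so the caller can end the episode.
            self.gave_up = True
            return "challenge abandoned"
        raise ValueError(f"unknown tool: {tool}")


if __name__ == "__main__":
    session = ToolSession(run_in_container=lambda cmd: f"(would run) {cmd}")
    print(session.call("execute_command", command="nmap -sV target"))
    print(session.call("give_up"), session.gave_up)
```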
Features
- Dynamic Environment Provisioning: Creates containerized environments on-the-fly based on declarative JSON definitions.
- Docker Image Management: Automatically builds required Docker images from source, with support for caching and force-rebuilding.
- Flag Injection: Supports passing build-time arguments to Dockerfiles, ideal for injecting secrets like CTF flags.
- Network Isolation: Creates a dedicated, internal Docker network for each challenge instance to prevent unintended external or cross-challenge communication.
- Resource Limiting: Allows setting memory limits for containers to manage resource consumption.
- Timeout Handling: Commands are executed with a configurable timeout to prevent indefinite hangs.
- Cleanup: Utilizes an async context manager to ensure all containers and networks associated with a challenge are stopped and removed after use.
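The cleanup and timeout behaviour described in the list above can be sketched with an async context manager and asyncio.wait_for. This is a minimal sketch, not the agent's implementation: FakeBackend is a hypothetical stand-in that only records calls instead of talking to Docker, and challenge_environment and run_with_timeout are illustrative names.

```python
import asyncio
from contextlib import asynccontextmanager


class FakeBackend:
    """Hypothetical stand-in for a Docker client; it only records calls."""

    def __init__(self):
        self.log = []

    async def create_network(self, name):
        self.log.append(("create_network", name))

    async def start_container(self, name, network):
        self.log.append(("start_container", name))

    async def remove_container(self, name):
        self.log.append(("remove_container", name))

    async def remove_network(self, name):
        self.log.append(("remove_network", name))


@asynccontextmanager
async def challenge_environment(backend, challenge_id, containers):
    """Start containers on a per-challenge network; always tear them down."""
    network = f"net-{challenge_id}"
    started = []
    await backend.create_network(network)
    try:
        for name in containers:
            await backend.start_container(name, network)
            started.append(name)
        yield started
    finally:
        # Runs on normal exit and on exceptions raised inside the with-block.
        for name in reversed(started):
            await backend.remove_container(name)
        await backend.remove_network(network)


async def run_with_timeout(coro, timeout_s):
    """Await coro, returning None if it exceeds timeout_s seconds."""
    try:
        return await asyncio.wait_for(coro, timeout=timeout_s)
    except asyncio.TimeoutError:
        return None


async def demo():
    backend = FakeBackend()
    try:
        async with challenge_environment(backend, "demo", ["kali", "target"]):
            raise RuntimeError("agent crashed mid-challenge")
    except RuntimeError:
        pass
    return backend.log


if __name__ == "__main__":
    for entry in asyncio.run(demo()):
        print(entry)
```

Because teardown lives in the finally clause, containers are removed in reverse start order and the network is removed last, even when the body of the with-block raises.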

