Strikes is a platform for building, experimenting with, and evaluating AI-integrated code, including agents, evaluation harnesses, and AI red teaming tooling. You can think of Strikes as blending the best of experiment tracking (MLflow, Weights & Biases), task orchestration (Prefect), and observability (OpenTelemetry).

Strikes is lightweight to start, flexible to extend, and powerful at scale. Its top priority is providing the most value possible without a steep learning curve. We intentionally designed the APIs to be simple and familiar to anyone who has used MLflow, Prefect, or similar tools.

This flexibility and power make Strikes excel at workflows in complex domains like Offensive Security, where you need to build and experiment with complex agentic systems, then measure and evaluate them.

For example, to evaluate Offensive Security agents, we need to develop agentic code, execute it at scale, measure interactions with the target system(s), and evaluate the results.

Basic Example

The most basic use of Strikes is a run with some logged data:

import dreadnode

# Initialize with default settings
dreadnode.configure()

# Start a new run
with dreadnode.run("my-experiment"):
    # Log parameters
    dreadnode.log_params(
        learning_rate=0.01,
        batch_size=32,
        model="transformer"
    )

    # Log metrics
    dreadnode.log_metric("accuracy", 0.85, step=1)

We’ll assume you have installed the dreadnode package and have your environment variables set up. Make sure you have DREADNODE_API_TOKEN=... set to your Platform API key.

For more information on dreadnode.configure(), review the Configuration topic.

If you call dreadnode.configure() without any token and your environment variables are not set, you’ll receive a warning in the console, so keep an eye out! You can still run any of your code without sending data to the Dreadnode Platform.
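
If you'd rather set the token from code than from your shell, one option is to populate the environment variable before calling dreadnode.configure(). This is a minimal sketch that assumes configure() reads DREADNODE_API_TOKEN from the environment, as described above:

import os

import dreadnode

# Assumes configure() picks up DREADNODE_API_TOKEN from the environment;
# setting it here is equivalent to exporting it in your shell.
os.environ["DREADNODE_API_TOKEN"] = "<your-platform-api-key>"

dreadnode.configure()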

This code should be very familiar if you’ve used an ML experimentation library before, and the functions you already know work exactly as you would expect.

Under the hood, this code did a few things:

  • Created a new “Default” project in the Platform to hold our run.
  • Began a full OpenTelemetry trace for all code under with dreadnode.run(...).
  • Tracked and stored our parameters and metrics alongside the tracing information.
  • Delivered the data to the Platform for visualization.

You can open the Default project in a web browser to see your new run and the data you logged.

You’re free to call dreadnode.* functions anywhere in your code, and you don’t have to worry about keeping track of your active run or task. Everything just works.

  • log_param(): Track simple key/value pairs like hyperparameters, target models, or agent configs.
  • log_metric(): Report measurements you take anywhere in your code.
  • log_input(): Save any runtime object that is the x to your f(x), like prompts, datasets, samples, or target resources.
  • log_output(): Save any runtime object that is the result of your work, like findings, commands, reports, or raw model outputs.
  • log_artifact(): Upload any local files or directories like your source code, container configs, datasets, or models.

Most of these functions will associate values with their nearest parent, so if you’re in a task, the value will be associated with that task. If you’re just inside a run, the value will be associated directly with the run. You can override this behavior by passing to=... to any of these methods.

Often you’ll find yourself deep inside a function, writing a new if statement, and think “I want to track if/when I get here.” It’s easy to add a dreadnode.log_metric(...) right there and see it later in your data.
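
Here’s a small sketch that puts several of these together inside a single run. The recon_target() helper, the target host, and the config file path are hypothetical, and the exact log_artifact() signature is an assumption; check the API reference for details:

import dreadnode

dreadnode.configure()

def recon_target(host: str) -> dict:
    # A hypothetical helper; note that it can log freely without
    # any reference to the active run being passed in.
    finding = {"host": host, "open_ports": [22, 443]}
    if finding["open_ports"]:
        dreadnode.log_metric("targets_with_open_ports", 1)
    return finding

with dreadnode.run("recon"):
    dreadnode.log_params(target="10.0.0.5", scan_depth="quick")

    dreadnode.log_input("target_host", "10.0.0.5")
    finding = recon_target("10.0.0.5")
    dreadnode.log_output("finding", finding)

    # Assumed to accept a local path; uploads the scan config used for this run
    dreadnode.log_artifact("./scan-config.yaml")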

Core Concepts

Strikes is built around three core concepts: Runs, Tasks, and Metrics. Understanding these concepts will help you make the most of Strikes.

Runs

Runs are the core unit of work in Strikes. They provide the context for all your data collection and represent a complete execution session. Think of runs as the “experiment” or “session” for your code.

import dreadnode

dreadnode.configure()

with dreadnode.run("my-experiment"):
    # Everything that happens here is part of the run
    # All data collected is associated with this run
    ...

You can create multiple runs, even in parallel, to organize your work logically:

import asyncio

async def work(target: str):
    with dreadnode.run(target):
        # Run-specific work here
        pass

async def main():
    await asyncio.gather(*[work(f"target-{i}") for i in range(3)])

asyncio.run(main())

See the Runs page for more details on creating, configuring and managing runs.

Tasks

Tasks are units of work within runs. They help you structure your code and provide a finer-grained context for data collection. Tasks can be created as function decorators or context managers:

import asyncio

import dreadnode

dreadnode.configure()

@dreadnode.task()
async def say_hello(name: str) -> str:
    return f"Hello, {name}!"

async def main():
    with dreadnode.run():
        with dreadnode.task_span("manual-task"):
            # Task work here
            pass

        # Call the decorated task
        result = await say_hello("Alice")
        print(result)

asyncio.run(main())

Tasks automatically track their inputs, outputs, execution time, and more. They form the foundation for building structured, observable workflows.
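
As a quick sketch of that structure (the scoring logic is just a placeholder), a metric logged inside a task is attached to that task rather than to the surrounding run, per the association rules described earlier:

import asyncio

import dreadnode

dreadnode.configure()

@dreadnode.task()
async def score_response(response: str) -> float:
    # Placeholder scoring for illustration
    score = min(len(response) / 100, 1.0)

    # Logged inside the task, so it is associated with score_response
    dreadnode.log_metric("length_score", score)
    return score

async def main():
    with dreadnode.run("task-example"):
        await score_response("Hello, world!")

asyncio.run(main())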

See the Tasks page for more details on task creation, configuration, and advanced patterns.

Metrics

Metrics are measurements of your system’s performance or behavior. They allow you to evaluate the effectiveness of your agents and track important events during execution:

import dreadnode

dreadnode.configure()

with dreadnode.run():
    # Log a simple metric
    dreadnode.log_metric("accuracy", 0.87)
    
    # Log a metric with a step number for timeseries data
    dreadnode.log_metric("loss", 0.23, step=1)

Metrics can be associated with tasks, runs, or even specific objects in your system, providing a comprehensive view of performance at different levels.
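
For instance, you can use the origin= argument (also shown in the dataset example below) to attach a metric to a specific logged object rather than to the run as a whole; the finding object here is purely illustrative:

import dreadnode

dreadnode.configure()

with dreadnode.run("triage"):
    # An illustrative output object
    finding = {"severity": "high", "service": "ssh"}
    dreadnode.log_output("finding", finding)

    # Associate the metric with the finding itself via origin=
    dreadnode.log_metric("confidence", 0.9, origin=finding)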

See the Metrics page for more information on creating, aggregating, and analyzing metrics.

Short Examples

Building an Evaluation Dataset

import dreadnode as dn

dn.configure()

with dn.run("create-dataset"):
    # Load evaluation samples
    samples = load_samples()
    
    for i, sample in enumerate(samples):
        # Log the sample
        dn.log_input(f"sample_{i}", sample)
        
        # Generate responses from different models
        for model_name in ["gpt4", "claude", "llama"]:
            response = generate_with_model(model_name, sample)
            dn.log_output(f"response_{model_name}_{i}", response)
            
            # Link response to its sample
            dn.link_objects(response, sample)
            
            # Log evaluation metrics
            accuracy = evaluate_accuracy(response, sample)
            dn.log_metric("accuracy", accuracy, origin=response)
            
            coherence = evaluate_coherence(response)
            dn.log_metric("coherence", coherence, origin=response)

This creates a comprehensive dataset with:

  • Input samples
  • Model responses
  • Quality metrics
  • Clear relationships between data

Agent Development Workflow

import asyncio

import dreadnode as dn

dn.configure()

@dn.task()
async def execute_command(command: str) -> str:
    """Execute a shell command and return the output."""
    # Command is automatically logged as input
    process = await asyncio.create_subprocess_shell(
        command,
        stdout=asyncio.subprocess.PIPE,
        stderr=asyncio.subprocess.PIPE
    )
    stdout, stderr = await process.communicate()
    
    # Log additional information
    dn.log_param("exit_code", process.returncode)
    
    result = stdout.decode() if process.returncode == 0 else stderr.decode()
    return result  # Automatically logged as output

async def main():
    with dn.run("agent-experiment"):
        # Configure the agent
        dn.log_params(
            model="gpt-4",
            temperature=0.2,
            target="localhost",
        )

        # Run the agent
        agent = create_agent()

        for step in range(10):
            # Get next command
            command = agent.next_command()
            dn.log_input(f"command_{step}", command)

            # Execute it
            output = await execute_command(command)

            # Update agent with result
            agent.process_result(output)

            # Track progress
            dn.log_metric("progress", agent.progress_score, step=step)

asyncio.run(main())

This tracks:

  • Agent configuration as parameters
  • Each command and its output
  • Execution details
  • Progress metrics over time

Next Steps

To learn about more advanced usage, explore the rest of our documentation.

If you learn best through examples, check out any of the How To guides to view walkthroughs of practical use cases and commentary from the team.

You can also check out our dreadnode/example-agents repository for a collection of example agents and evaluation harnesses.