Write an Evaluation
Overview of the evaluation process and writing your own.
Writing evaluations for Large Language Models (LLMs) is a notoriously difficult but critical part of the agent development process. Evaluations help you understand how well your model is performing and identify areas for improvement. Ideally, evaluations represent “real-world” environments and tasks, since the agent will, at some point, be expected to operate in the real world.
Evaluations let you answer key questions like:
- Which LLM is best for my task?
- How do changes in my agent code affect performance?
- What prompting strategy yields the highest success rates?
- How often does my agent fail? And why?
We’ll walk you through the process of designing evaluations for LLMs using Strikes and Rigging.
Step 1: Define your environment
The first step in writing an evaluation is to define your evaluation environment. These environments should be designed to test the capabilities of your model and provide meaningful insights into its performance. In offensive security, evaluation environments might include CTFs, network environments like GOAD, or human-graded tasks like phishing.
Environments can be generalized as any resources made available to your agent, such as datasets, tools, and files which are relevant to the tasks you want to evaluate.
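As a rough illustration (not a prescribed Strikes or Rigging API), an environment can be modeled as nothing more than a container for the resources a task is allowed to touch. The EvalEnvironment class and its fields below are hypothetical:

```python
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class EvalEnvironment:
    """Hypothetical container for the resources exposed to the agent during evaluation."""

    name: str
    files: list[Path] = field(default_factory=list)  # e.g. source trees, binaries, pcaps
    tools: list[str] = field(default_factory=list)   # names of tools the agent may call
    notes: str = ""                                   # human context: scope, rules of engagement


# Example: a source-review environment built from a local repository
env = EvalEnvironment(
    name="vuln-review",
    files=sorted(Path("./target-repo").rglob("*.c")),
    tools=["read_file", "grep"],
)
```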
Step 2: Define your tasks
Tasks are the units of work that yield a valuable (and measurable) output given some set of inputs. These tasks should be specific, measurable, and relevant to your model’s capabilities. For instance, if you’d like a model to analyze source code for vulnerabilities, a task might be analyze_file, where the model is provided a single file and asked to return a list of vulnerabilities, or even a binary classification of “does this file contain vulnerabilities?”.
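Here is a minimal sketch of such a task. It assumes a generic llm_complete helper standing in for whatever inference call you use (for example, a Rigging generator); the prompt and parsing are illustrative, not a fixed API:

```python
from pathlib import Path


def llm_complete(prompt: str) -> str:
    """Placeholder for your inference call (e.g. via Rigging); swap in your own."""
    raise NotImplementedError


def analyze_file(path: Path) -> bool:
    """Binary classification task: does this file appear to contain vulnerabilities?"""
    source = path.read_text(errors="ignore")
    answer = llm_complete(
        "Does the following file contain a security vulnerability? "
        "Answer only 'yes' or 'no'.\n\n" + source
    )
    return answer.strip().lower().startswith("yes")
```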
Don’t be afraid to adjust the scope of your tasks and stitch them together into more complex workflows. For example:
- Assess source code for weak behaviors
- Identify specific vulnerable functions
- Trace execution into those functions
- Report those vulnerabilities
You can treat these as either a single task or a series of smaller tasks, depending on your needs. A good rule of thumb: imagine the code you would have to write manually (without model inference) and ask yourself if that function would be doing too much work.
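As a sketch of that rule of thumb, the workflow above can be expressed as small single-purpose tasks composed by a thin orchestration function; every function below is hypothetical and intentionally left as a stub:

```python
from pathlib import Path


def assess_source(repo: Path) -> list[str]: ...  # weak behaviors, e.g. "uses strcpy"
def find_vulnerable_functions(repo: Path, behaviors: list[str]) -> list[str]: ...
def trace_execution(repo: Path, functions: list[str]) -> list[str]: ...
def report(findings: list[str]) -> dict: ...


def vulnerability_workflow(repo: Path) -> dict:
    """Stitch the smaller tasks together; each step remains individually measurable."""
    behaviors = assess_source(repo)
    functions = find_vulnerable_functions(repo, behaviors)
    traces = trace_execution(repo, functions)
    return report(traces)
```

Keeping each step as its own task means each one can carry its own score, which pays off in the next step.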
Step 3: Define your metrics
We use a guiding principle that “every task should have a score” when writing agents. Even if you don’t have a known dataset or ground truth for your task, you should still define some measurement for the output of a task to be considered “useful”. The simplest metric is a binary success (1) or failure (0), which might be as simple as “did the model return a structured result?” or “did the model call my tool with the correct arguments?”. Ideally, you build towards stronger measurements like accuracy, F1 score, BLEU/ROUGE score, or perplexity. Always remember that a metric should be relevant to what you (as a domain expert) would consider useful in the real world.
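For instance, the “did the model return a structured result?” check can be a few lines of Python; the expected vulnerabilities field below is an assumption about your output format:

```python
import json


def score_structured_output(output: str) -> float:
    """1.0 if the model returned valid JSON with the field we asked for, else 0.0."""
    try:
        parsed = json.loads(output)
    except (json.JSONDecodeError, TypeError):
        return 0.0
    return 1.0 if isinstance(parsed, dict) and "vulnerabilities" in parsed else 0.0
```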
Here are some examples:
- Command Execution: Check that the command is properly formatted, or that it exited with a 0 status code (see the sketch after this list).
- Social engineering: Perform a similarity check against a known dataset of phishing emails, or use another inference request to check whether the content “seems suspicious”.
- Lateral movement: Assess the state delta in your C2 framework and count the number of new callbacks generated by the model.
- Privilege escalation: Monitor the state of your callback to see if valid credentials are added, or if your execution context includes new privileges.
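Taking the command-execution example as an illustration (and assuming your task hands back the command the model produced), a hedged sketch of the scorer might look like:

```python
import subprocess


def score_command(command: list[str]) -> float:
    """Run the agent's command and score on a clean exit. In practice, sandbox this."""
    try:
        result = subprocess.run(command, capture_output=True, timeout=30)
    except (OSError, subprocess.TimeoutExpired):
        return 0.0
    return 1.0 if result.returncode == 0 else 0.0
```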
As you execute your tasks and collect data, assess your metrics by inspecting results across a range of reported performance and checking that they align with your expectations. If a metric seems weakly correlated with the quality of your data or with real performance, it should be re-evaluated.
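One lightweight way to sanity-check a metric, assuming you have even a handful of human quality judgments to compare against, is to look at the correlation between the two; the scores below are placeholder data:

```python
from statistics import correlation  # Python 3.10+

metric_scores = [0.0, 1.0, 1.0, 0.0, 1.0]  # your metric, one value per task run
human_ratings = [1, 4, 5, 2, 4]             # e.g. 1-5 quality judgments from a domain expert

r = correlation(metric_scores, human_ratings)
print(f"Pearson r = {r:.2f}")  # a weak correlation suggests the metric needs rework
```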
Step 4: Run your evaluation
Nothing is more important than actually producing data, even if you’re early in the development process. Execute your evaluation early and often, and use the data to inform your design. The scope of your run can be a useful tool for gathering comparative data, so never feel constrained to do all of your work in a single run. A common pattern is to take a set of inputs and map over them to produce a set of runs that operate on each.
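A rough sketch of that map-over-inputs pattern, where run_evaluation is a hypothetical stand-in for whatever launches a single run in your setup:

```python
from pathlib import Path


def run_evaluation(input_dir: Path) -> None:
    """Hypothetical: execute one evaluation run scoped to a single input directory."""
    ...


input_dirs = sorted(path for path in Path("./targets").iterdir() if path.is_dir())
for input_dir in input_dirs:
    run_evaluation(input_dir)  # one run per input keeps comparisons clean
```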
In the Platform, you’ll receive a run for each directory, and can use the project page to compare performance between them, or step into a single run and check its details to answer specific questions you have.