- Which LLM is best for my task?
- How do changes in my agent code affect performance?
- What prompting strategy yields the highest success rates?
- How often does my agent fail? And why?
Step 1: Define your environment
The first step in writing an evaluation is to define your evaluation environment. These environments should be designed to test the capabilities of your model and provide meaningful insights into its performance. In offensive security, evaluation environments might include CTFs, network environments like GOAD, or human-graded tasks like phishing. An environment can be generalized as any resources made available to your agent, such as datasets, tools, and files relevant to the tasks you want to evaluate.
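One way to think about this is to model the environment as a simple bundle of resources handed to the agent. The following is a minimal sketch, assuming a Python harness; the class, field names, tool names, and file paths are illustrative, not a prescribed format.

```python
# Minimal sketch of an evaluation environment definition (names and paths are hypothetical).
from dataclasses import dataclass, field
from pathlib import Path

@dataclass
class EvalEnvironment:
    """Bundle of resources made available to the agent under test."""
    name: str
    files: list[Path] = field(default_factory=list)            # e.g., source trees, pcaps, binaries
    tools: list[str] = field(default_factory=list)             # tool names the agent may call
    network_targets: list[str] = field(default_factory=list)   # e.g., hosts in a GOAD-style lab

# Example: a source-code review environment backed by a local checkout.
code_review_env = EvalEnvironment(
    name="vuln-source-review",
    files=[Path("targets/webapp/src/app.py")],
    tools=["read_file", "grep"],
)
```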
Step 2: Define your tasks
Tasks are the units of work that yield a valuable (and measurable) output given some set of inputs. These tasks should be specific, measurable, and relevant to your model’s capabilities. For instance, if you’d like a model to analyze source code for vulnerabilities, a task might be analyze_file, where the model is provided a single file and asked to return a list of vulnerabilities, or even a binary classification of “does this file contain vulnerabilities?”.
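As one possible shape for such a task, here is a minimal analyze_file sketch. The Finding type, the prompt wording, and the model_call stub are assumptions for illustration; swap in your own client and output format.

```python
# Sketch of an analyze_file task: input is one source file, output is a parsed list of findings.
from dataclasses import dataclass

@dataclass
class Finding:
    line: int
    description: str

def analyze_file(path: str, model_call) -> list[Finding]:
    """Ask the model for vulnerabilities in one file and parse 'line_number: description' rows."""
    with open(path) as f:
        source = f.read()
    response = model_call(
        "List any vulnerabilities in the following file, one per line, "
        f"formatted as 'line_number: description':\n{source}"
    )
    findings = []
    for row in response.splitlines():
        line, sep, description = row.partition(":")
        if sep and line.strip().isdigit():
            findings.append(Finding(int(line), description.strip()))
    return findings
```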
Don’t be afraid to adjust the scope of your tasks and stitch them together into more complex workflows (a small composition sketch follows the list below). For example:
- Assess source code for weak behaviors
- Identify specific vulnerable functions
- Trace execution into those functions
- Report those vulnerabilities
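One simple way to stitch tasks together is to pipe each task’s output into the next, so the composite workflow is still a single measurable unit. This is a sketch of that idea only; the step names in the usage comment are hypothetical tasks defined elsewhere.

```python
# Sketch of composing smaller tasks into one workflow.
from typing import Any, Callable

def run_workflow(steps: list[Callable[[Any], Any]], initial: Any) -> Any:
    """Feed each task's output into the next, keeping the stitched workflow one measurable unit."""
    state = initial
    for step in steps:
        state = step(state)
    return state

# Usage (each name is a hypothetical smaller task):
# run_workflow([assess_source, identify_functions, trace_execution, report_vulnerabilities], "repo/")
```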
Step 3: Define your metrics
We use a guiding principle that “every task should have a score” when writing agents. Even if you don’t have a known dataset or ground truth for your task, you should still define some measurement for the output of a task to be considered “useful”. The simplest metric is a binary success (1) or failure (0), which might be as simple as “did the model return a structured result?” or “did the model call my tool with the correct arguments?”. Ideally, you build towards stronger measurements like accuracy, F1 score, BLEU/ROUGE score, or perplexity. Always remember that a metric should be relevant to what you (as a domain expert) would consider useful in the real world. Here are some examples (a minimal scoring sketch follows the list):
- Command execution: Check that the command is properly formatted, or that it exited with a 0 status code.
- Social engineering: Perform a similarity check against a known dataset of phishing emails, use another inference request to check whether the content “seems suspicious”, or run sentiment analysis to measure persuasiveness and emotional manipulation tactics.
- Lateral movement: Assess the state delta in your C2 framework and count the number of new callbacks generated by the model.
- Privilege escalation: Monitor the state of your callback to see if valid credentials are added, or if your execution context includes new privileges.
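To make the binary case concrete, here is a minimal scoring sketch. The 1/0 convention, the structured-output check on a "findings" key, and the clean-exit check are assumptions chosen for illustration.

```python
# Minimal scoring sketch: every task result gets a score, even without ground truth.
import json

def score_structured_output(raw_output: str) -> int:
    """Binary metric: 1 if the model returned well-formed JSON with a 'findings' key, else 0."""
    try:
        parsed = json.loads(raw_output)
    except json.JSONDecodeError:
        return 0
    return 1 if isinstance(parsed, dict) and "findings" in parsed else 0

def score_command(exit_code: int) -> int:
    """Binary metric for command execution: 1 on a clean (0) exit status, else 0."""
    return 1 if exit_code == 0 else 0
```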