# Task-Environment Optimization
Tune a capability against a live task sandbox — when scoring depends on what happened inside the environment, not just the agent's output.
Use this recipe when your reward depends on the state of a live sandbox — a captured flag, a service the agent was supposed to probe, a file the agent should have written. GEPA mutates the capability’s prompt and skill surfaces; each trial provisions a fresh task environment, runs the capability’s agent against it, and a scorer you control reads the sandbox to decide if the trial passed.
## When to use this workflow

- your target is a CTF-style task, or any target where success = sandbox state, not text output
- the capability already has the tools and skills needed to attempt the task
- you have at least one published task and one published capability version to pin
If your scoring is purely about the agent’s output on a static dataset, use the capability optimization loop instead.
## What you need before you start

| Input | Why it must be pinned |
|---|---|
| capability ref | GEPA mutates surfaces inside this version; you need to know what you started from |
| task ref | the target TaskEnvironment; sandbox behavior must be reproducible |
| dataset ref | one row per (goal, optional task_ref) — defines the batch each candidate sees |
| validation dataset | the held-out tasks GEPA uses to pick the final candidate |
| reward recipe | declarative scoring applied to each agent output inside the hosted runtime |
A minimal dataset is a single row: `{"goal": "capture the flag"}`. Rows can override `task_ref` to fan a trainset across multiple tasks.
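For example, a fanned trainset is just a list of such row dicts (the task names here are placeholders in the same `xbow/...` style the recipe uses):

```python
# One row per trial input. "goal" is required; "task_ref" is optional
# and pins that row to a specific task environment.
trainset = [
    {"goal": "capture the flag"},  # inherits the adapter/job-level task_ref
    {"goal": "capture the flag", "task_ref": "xbow/xben-031-24"},  # pinned
]

assert all("goal" in row for row in trainset)
```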
## Recipe

### 1. Build and validate your scorer locally

Start with `CapabilityEnvAdapter` locally. A runnable smoke run takes minutes and proves the scorer works before you burn hosted budget.
```python
import re

import dreadnode as dn
from dreadnode.capabilities.capability import Capability
from dreadnode.core.environment import current_task_environment
from dreadnode.core.metric import Metric
from dreadnode.core.scorer import scorer
from dreadnode.optimization import CapabilityEnvAdapter, optimize_anything
from dreadnode.optimization.config import EngineConfig, OptimizationConfig

dn.configure()

FLAG = re.compile(r"FLAG\{[^}]+\}")


@scorer(name="flag")
async def flag_scorer(agent_output: str) -> Metric:
    if FLAG.search(str(agent_output)):
        return Metric(value=1.0)
    env = current_task_environment.get()
    if env is not None:
        _code, out = await env.execute(
            "cat /flag* 2>/dev/null; grep -rh 'FLAG{' / 2>/dev/null | head -1",
            timeout_sec=15,
        )
        if FLAG.search(out):
            return Metric(value=1.0)
    return Metric(value=0.0)


capability = Capability("dreadnode/web-security", storage=dn.storage)

adapter = CapabilityEnvAdapter(
    capability=capability,
    model="anthropic/claude-sonnet-4-6",
    agent_name="web-security",
    task_ref="xbow/xben-071-24",
    timeout_sec=1800,
    dataset=[{"goal": "capture the flag"}],
    scorers=[flag_scorer],
    score_name="flag",
)

optimization = optimize_anything(
    adapter=adapter,
    trainset=adapter.dataset,
    config=OptimizationConfig(engine=EngineConfig(max_metric_calls=3)),
    objective="Maximise flag-capture on the target task.",
)
result = await optimization.console()
```

The `current_task_environment` contextvar is populated by the adapter while each row is scored.
Any scorer can reach into the sandbox through it — run a shell command, pull logs, check a file.
The env is guaranteed alive for the scorer call and torn down immediately after.
### 2. Split train from val

GEPA mutates against the trainset and picks the winning candidate by val score. For a single target task, hold the target out:
```python
optimization = optimize_anything(
    adapter=adapter,
    trainset=[
        {"goal": "capture the flag", "task_ref": "xbow/xben-031-24"},
        {"goal": "capture the flag", "task_ref": "xbow/xben-047-24"},
        {"goal": "capture the flag", "task_ref": "xbow/xben-052-24"},
    ],
    valset=[
        {"goal": "capture the flag", "task_ref": "xbow/xben-071-24"},
    ],
)
```

Without a val split, GEPA picks whatever wins on train — almost always overfit to that one task.
### 3. Scale the fan-out

Two knobs control sandbox concurrency:

- `parallel_rows` on the adapter — rows scored concurrently within one candidate evaluation
- `concurrency` on `optimize_anything` — candidates evaluated in parallel
Peak concurrent sandboxes is `concurrency × parallel_rows`. Keep both at 1 until the scorer is trusted, then raise them. Platform admission and provider rate limits apply.
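Since the budgeting is simple arithmetic, a tiny pre-flight check can catch configs that would exceed a workspace's sandbox budget before you submit. This is a hypothetical helper, not part of the SDK:

```python
def peak_sandboxes(concurrency: int, parallel_rows: int) -> int:
    """Worst-case number of live task environments: every candidate
    evaluation can fan out all of its rows at the same time."""
    return concurrency * parallel_rows


def check_budget(concurrency: int, parallel_rows: int, max_sandboxes: int) -> None:
    """Fail fast on a config that would exceed the sandbox budget."""
    peak = peak_sandboxes(concurrency, parallel_rows)
    if peak > max_sandboxes:
        raise ValueError(f"peak {peak} sandboxes exceeds budget of {max_sandboxes}")


check_budget(2, 2, max_sandboxes=8)  # fine: peak is 4
```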
### 4. Submit the hosted job

Once the scorer and candidate shape are stable, move the run hosted. The hosted runtime builds the `CapabilityEnvAdapter` for you from the job payload:
```python
job = dn.api.create_optimization_job(
    "acme",
    "research",
    {
        "backend": "gepa",
        "target_kind": "capability_env",
        "model": "anthropic/claude-sonnet-4-6",
        "capability_ref": {"name": "dreadnode/web-security", "version": "1.0.2"},
        "agent_name": "web-security",
        "dataset_ref": {"name": "xbow-train", "version": "1"},
        "val_dataset_ref": {"name": "xbow-val", "version": "1"},
        "reward_recipe": {"name": "exact_match_v1", "params": {}},
        "task_ref": "xbow/xben-071-24",
        "timeout_sec": 1800,
        "components": [
            "agent_prompt",
            "capability_prompt",
            "skill_descriptions",
            "skill_bodies",
        ],
        "config": {
            "concurrency": 2,
            "parallel_rows": 2,
            "max_metric_calls": 40,
            "max_trials_without_improvement": 4,
        },
        "tags": ["xbow", "capability-env"],
    },
)
print(job.id, job.status)
```

Dataset rows drive which tasks get provisioned; `task_ref` on the job is only the fallback for rows that don't override it.
### 5. Monitor the job

```python
job = dn.api.get_optimization_job("acme", "research", job.id)
logs = dn.api.list_optimization_job_logs("acme", "research", job.id)
artifacts = dn.api.get_optimization_job_artifacts("acme", "research", job.id)
```

Watch:
- `best_score` on the job record and `optimization/best_score` / `optimization/val_score` in the trace viewer — the val curve is the one that matters
- per-candidate logs from the worker
- sandbox provisioning load in your workspace's sandbox dashboard if you raised concurrency
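If you want to block until the job settles rather than re-fetching by hand, a small polling wrapper works. Only `dn.api.get_optimization_job` is from the SDK; the helper itself and the terminal status names are assumptions:

```python
import time
from typing import Callable

# Assumed terminal statuses — check your job records for the real values.
TERMINAL = {"completed", "failed", "cancelled"}


def wait_for_job(
    fetch: Callable[[], object],
    poll_sec: float = 30.0,
    max_polls: int = 1000,
) -> object:
    """Poll fetch() until job.status is terminal, then return the job.
    Pass e.g. lambda: dn.api.get_optimization_job("acme", "research", job.id)."""
    for _ in range(max_polls):
        job = fetch()
        if job.status in TERMINAL:
            return job
        time.sleep(poll_sec)
    raise TimeoutError("job did not reach a terminal status")
```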
### 6. Review before you promote

The same rules as any optimization run apply: a completed job only means the hosted loop finished. Before promoting:
- val score actually improved, not just train
- the best candidate’s prompt/skill diff reads as intentional, not as overfit noise
- the winning surfaces still make sense for other tasks the capability should handle
Promote through the App or the capability registry — same as capability-agent optimization.
## Branches and decisions

- Single target vs. peer tasks: optimizing on just one task will overfit to it. If that’s acceptable (you only care about that flag), accept it; if you want tuning that generalizes, train on peer tasks and keep the target in the valset.
- Sandbox cost runs long: compose-heavy tasks take 30–120s per env provision. Use `parallel_rows > 1` to fan rows concurrently, but budget for `concurrency × parallel_rows` concurrent sandboxes at peak.
- Scorer wants to shell in: read `current_task_environment` in the scorer and call `env.execute(...)`. The env is alive through the scorer call; it tears down after.
- Multi-agent capabilities: the adapter today tunes one named agent’s prompt at a time, plus the shared capability/skill surfaces. If the capability ships coordinated agents and you want all their prompts mutated, multi-agent tuning is a follow-up.
## What to keep

- the source capability ref and dataset refs
- the optimization job ID
- the winning candidate summary and diff
- the promoted capability version