Task-Environment Optimization

Tune a capability against a live task sandbox — when scoring depends on what happened inside the environment, not just the agent's output.

Use this recipe when your reward depends on the state of a live sandbox — a captured flag, a service the agent was supposed to probe, a file the agent should have written. GEPA mutates the capability’s prompt and skill surfaces; each trial provisions a fresh task environment, runs the capability’s agent against it, and a scorer you control reads the sandbox to decide if the trial passed.

  • your target is a CTF-style task or any target where success = sandbox state, not text output
  • the capability already has the tools and skills needed to attempt the task
  • you have at least one published task and one published capability version to pin

If your scoring is purely about the agent’s output on a static dataset, use the capability optimization loop instead.

Every run pins five inputs. Input, and why it must be pinned:

  • capability ref: GEPA mutates surfaces inside this version; you need to know what you started from
  • task ref: the target TaskEnvironment; sandbox behavior must be reproducible
  • dataset ref: one row per (goal, optional task_ref); defines the batch each candidate sees
  • validation dataset: the held-out tasks GEPA uses to pick the final candidate
  • reward recipe: declarative scoring applied to each agent output inside the hosted runtime

A minimal dataset is a single row: {"goal": "capture the flag"}. Rows can override task_ref to fan a trainset across multiple tasks.

Start with CapabilityEnvAdapter locally. A runnable smoke run takes minutes and proves the scorer works before you burn hosted budget.

import re

import dreadnode as dn
from dreadnode.capabilities.capability import Capability
from dreadnode.core.environment import current_task_environment
from dreadnode.core.metric import Metric
from dreadnode.core.scorer import scorer
from dreadnode.optimization import CapabilityEnvAdapter, optimize_anything
from dreadnode.optimization.config import EngineConfig, OptimizationConfig

dn.configure()

FLAG = re.compile(r"FLAG\{[^}]+\}")

@scorer(name="flag")
async def flag_scorer(agent_output: str) -> Metric:
    # Pass if the flag shows up in the agent's transcript.
    if FLAG.search(str(agent_output)):
        return Metric(value=1.0)
    # Otherwise inspect the sandbox directly.
    env = current_task_environment.get()
    if env is not None:
        _code, out = await env.execute(
            "cat /flag* 2>/dev/null; grep -rh 'FLAG{' / 2>/dev/null | head -1",
            timeout_sec=15,
        )
        if FLAG.search(out):
            return Metric(value=1.0)
    return Metric(value=0.0)

capability = Capability("dreadnode/web-security", storage=dn.storage)

adapter = CapabilityEnvAdapter(
    capability=capability,
    model="anthropic/claude-sonnet-4-6",
    agent_name="web-security",
    task_ref="xbow/xben-071-24",
    timeout_sec=1800,
    dataset=[{"goal": "capture the flag"}],
    scorers=[flag_scorer],
    score_name="flag",
)

optimization = optimize_anything(
    adapter=adapter,
    trainset=adapter.dataset,
    config=OptimizationConfig(engine=EngineConfig(max_metric_calls=3)),
    objective="Maximise flag-capture on the target task.",
)
result = await optimization.console()

The current_task_environment contextvar is populated by the adapter while each row is scored. Any scorer can reach into the sandbox through it — run a shell command, pull logs, check a file. The env is guaranteed alive for the scorer call and torn down immediately after.

GEPA mutates against the trainset and picks the winning candidate by val score. For a single target task, hold the target out:

optimization = optimize_anything(
    adapter=adapter,
    trainset=[
        {"goal": "capture the flag", "task_ref": "xbow/xben-031-24"},
        {"goal": "capture the flag", "task_ref": "xbow/xben-047-24"},
        {"goal": "capture the flag", "task_ref": "xbow/xben-052-24"},
    ],
    valset=[
        {"goal": "capture the flag", "task_ref": "xbow/xben-071-24"},
    ],
)

Without a val split, GEPA picks whatever wins on train — almost always overfit to that one task.

Two knobs control sandbox concurrency:

  • parallel_rows on the adapter — rows scored concurrently within one candidate evaluation
  • concurrency on optimize_anything — candidates evaluated in parallel

Peak concurrent sandboxes is concurrency × parallel_rows. Keep both at 1 until the scorer is trusted, then raise. Platform admission and provider rate limits apply.

Once the scorer and candidate shape are stable, move the run hosted. The hosted runtime builds CapabilityEnvAdapter for you from the job payload:

job = dn.api.create_optimization_job(
    "acme",
    "research",
    {
        "backend": "gepa",
        "target_kind": "capability_env",
        "model": "anthropic/claude-sonnet-4-6",
        "capability_ref": {"name": "dreadnode/web-security", "version": "1.0.2"},
        "agent_name": "web-security",
        "dataset_ref": {"name": "xbow-train", "version": "1"},
        "val_dataset_ref": {"name": "xbow-val", "version": "1"},
        "reward_recipe": {"name": "exact_match_v1", "params": {}},
        "task_ref": "xbow/xben-071-24",
        "timeout_sec": 1800,
        "components": [
            "agent_prompt",
            "capability_prompt",
            "skill_descriptions",
            "skill_bodies",
        ],
        "config": {
            "concurrency": 2,
            "parallel_rows": 2,
            "max_metric_calls": 40,
            "max_trials_without_improvement": 4,
        },
        "tags": ["xbow", "capability-env"],
    },
)
print(job.id, job.status)

Dataset rows drive which tasks get provisioned; task_ref on the job is only the fallback for rows that don’t override it.
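The fallback rule is simple enough to state as code. A sketch, where `effective_task_ref` is an illustrative helper (not a library function) applying the row-wins resolution described above:

```python
JOB_TASK_REF = "xbow/xben-071-24"  # job-level fallback from the payload

def effective_task_ref(row: dict, job_task_ref: str = JOB_TASK_REF) -> str:
    """A row's own task_ref wins; the job-level ref covers the rest."""
    return row.get("task_ref") or job_task_ref

rows = [
    {"goal": "capture the flag"},                                  # job fallback
    {"goal": "capture the flag", "task_ref": "xbow/xben-031-24"},  # row override
]
resolved = [effective_task_ref(r) for r in rows]
```

So a trainset can mix rows that name their own task with rows that inherit the job's target.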

job = dn.api.get_optimization_job("acme", "research", job.id)
logs = dn.api.list_optimization_job_logs("acme", "research", job.id)
artifacts = dn.api.get_optimization_job_artifacts("acme", "research", job.id)
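A simple wait-for-completion loop can wrap those calls. A hedged sketch: the terminal status names and the helper itself are assumptions, and in a real run `fetch` would wrap `dn.api.get_optimization_job(...)`:

```python
import time

TERMINAL = {"succeeded", "failed", "cancelled"}  # assumed terminal statuses

def poll_until_terminal(fetch, interval_sec: float = 0.0, max_polls: int = 100) -> dict:
    """Call fetch() until the job reaches a terminal status, then return it."""
    for _ in range(max_polls):
        job = fetch()
        if job["status"] in TERMINAL:
            return job
        time.sleep(interval_sec)
    raise TimeoutError("job did not reach a terminal status within max_polls")

# Stubbed status sequence standing in for repeated API reads.
statuses = iter(["queued", "running", "running", "succeeded"])
job = poll_until_terminal(lambda: {"id": "job-1", "status": next(statuses)})
```

In practice a longer `interval_sec` (tens of seconds) is sensible, since trials include full sandbox provisioning.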

Watch:

  • best_score on the job record and optimization/best_score / optimization/val_score in the trace viewer — the val curve is the one that matters
  • per-candidate logs from the worker
  • sandbox provisioning load in your workspace’s sandbox dashboard if you raised concurrency

The same rules as any optimization run apply: a completed job only means the hosted loop finished. Before promoting:

  • val score actually improved, not just train
  • the best candidate’s prompt/skill diff reads as intentional, not as overfit noise
  • the winning surfaces still make sense for other tasks the capability should handle

Promote through the App or the capability registry — same as capability-agent optimization.

  • Single target vs. peer tasks: optimizing on just one task will overfit it. If that’s acceptable (you only care about that flag), accept it; if you want tuning that generalizes, train on peer tasks and keep the target in valset.
  • Sandbox provisioning runs long: compose-heavy tasks take 30–120s per env provision. Use parallel_rows > 1 to fan rows concurrently, but budget for concurrency × parallel_rows concurrent sandboxes at peak.
  • Scorer wants to shell in: read current_task_environment in the scorer and call env.execute(...). The env is alive through the scorer; it tears down after.
  • Multi-agent capabilities: the adapter today tunes one named agent’s prompt at a time plus the shared capability/skill surfaces. If the capability ships coordinated agents and you want all their prompts mutated, multi-agent tuning is a follow-up.
Record alongside the promotion:

  • the source capability ref and dataset refs
  • the optimization job ID
  • the winning candidate summary and diff
  • the promoted capability version
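The items above fit in a small structured record kept next to the promotion. A minimal sketch; the field names and the version values are illustrative, not a platform schema:

```python
promotion_record = {
    "capability_ref": {"name": "dreadnode/web-security", "version": "1.0.2"},
    "dataset_refs": {"train": "xbow-train@1", "val": "xbow-val@1"},
    "job_id": "job-...",                         # the optimization job ID
    "winning_candidate": {"summary": "...", "diff": "..."},
    "promoted_version": "...",                   # the new capability version
}
```

Keeping this with the promoted version makes a later regression traceable back to the exact job and diff that produced it.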