Task-Environment Optimization

Tune a capability against a live task sandbox — when scoring depends on what happened inside the environment, not just the agent's output.

Use this recipe when your reward depends on the state of a live sandbox — a captured flag, a service the agent was supposed to probe, a file the agent should have written. GEPA mutates the capability’s prompt and skill surfaces; each trial provisions a fresh task environment, runs the capability’s agent against it, and a scorer you control reads the sandbox to decide if the trial passed.

  • your target is a CTF-style task or any target where success = sandbox state, not text output
  • the capability already has the tools and skills needed to attempt the task
  • you have at least one published task and one published capability version to pin

If your scoring is purely about the agent’s output on a static dataset, use the capability optimization loop instead.

Every run pins five inputs. Input, and why it must be pinned:

  • capability ref: GEPA mutates surfaces inside this version; you need to know what you started from
  • task ref: the target TaskEnvironment; sandbox behavior must be reproducible
  • dataset ref: one row per (goal, optional task_ref); defines the batch each candidate sees
  • validation dataset: the held-out tasks GEPA uses to pick the final candidate
  • reward recipe: declarative scoring applied to each agent output inside the hosted runtime

A minimal dataset is a single row: {"goal": "capture the flag"}. Rows can override task_ref to fan a trainset across multiple tasks.

Start with CapabilityEnvAdapter locally. A runnable smoke run takes minutes and proves the scorer works before you burn hosted budget.

import re

import dreadnode as dn
from dreadnode.capabilities.capability import Capability
from dreadnode.core.environment import current_task_environment
from dreadnode.core.metric import Metric
from dreadnode.core.scorer import scorer
from dreadnode.optimization import CapabilityEnvAdapter, optimize_anything
from dreadnode.optimization.config import EngineConfig, OptimizationConfig

dn.configure()

FLAG = re.compile(r"FLAG\{[^}]+\}")

@scorer(name="flag")
async def flag_scorer(agent_output: str) -> Metric:
    # Pass if the flag shows up in the agent's transcript.
    if FLAG.search(str(agent_output)):
        return Metric(value=1.0)
    # Otherwise inspect the sandbox directly.
    env = current_task_environment.get()
    if env is not None:
        _code, out = await env.execute(
            "cat /flag* 2>/dev/null; grep -rh 'FLAG{' / 2>/dev/null | head -1",
            timeout_sec=15,
        )
        if FLAG.search(out):
            return Metric(value=1.0)
    return Metric(value=0.0)

capability = Capability("dreadnode/web-security", storage=dn.storage)

adapter = CapabilityEnvAdapter(
    capability=capability,
    model="anthropic/claude-sonnet-4-6",
    agent_name="web-security",
    task_ref="xbow/xben-071-24",
    timeout_sec=1800,
    dataset=[{"goal": "capture the flag"}],
    scorers=[flag_scorer],
    score_name="flag",
)

optimization = optimize_anything(
    adapter=adapter,
    trainset=adapter.dataset,
    config=OptimizationConfig(engine=EngineConfig(max_metric_calls=3)),
    objective="Maximise flag-capture on the target task.",
)
result = await optimization.console()

The current_task_environment contextvar is populated by the adapter while each row is scored. Any scorer can reach into the sandbox through it — run a shell command, pull logs, check a file. The env is guaranteed alive for the scorer call and torn down immediately after.

GEPA mutates against the trainset and picks the winning candidate by val score. For a single target task, hold the target out:

optimization = optimize_anything(
    adapter=adapter,
    trainset=[
        {"goal": "capture the flag", "task_ref": "xbow/xben-031-24"},
        {"goal": "capture the flag", "task_ref": "xbow/xben-047-24"},
        {"goal": "capture the flag", "task_ref": "xbow/xben-052-24"},
    ],
    valset=[
        {"goal": "capture the flag", "task_ref": "xbow/xben-071-24"},
    ],
)

Without a val split, GEPA picks whatever wins on train — almost always overfit to that one task.

Two knobs control sandbox concurrency:

  • parallel_rows on the adapter — rows scored concurrently within one candidate evaluation
  • concurrency on optimize_anything — candidates evaluated in parallel

Peak concurrent sandboxes is concurrency × parallel_rows. Keep both at 1 until the scorer is trusted, then raise. Platform admission and provider rate limits apply.

Once the scorer and candidate shape are stable, move the run hosted. The hosted runtime builds CapabilityEnvAdapter for you from the job payload:

job = dn.api.create_optimization_job(
    "acme",
    "research",
    {
        "backend": "gepa",
        "target_kind": "capability_env",
        "model": "anthropic/claude-sonnet-4-6",
        "capability_ref": {"name": "dreadnode/web-security", "version": "1.0.2"},
        "agent_name": "web-security",
        "dataset_ref": {"name": "xbow-train", "version": "1"},
        "val_dataset_ref": {"name": "xbow-val", "version": "1"},
        "reward_recipe": {"name": "exact_match_v1", "params": {}},
        "task_ref": "xbow/xben-071-24",
        "timeout_sec": 1800,
        "components": [
            "agent_prompt",
            "capability_prompt",
            "skill_descriptions",
            "skill_bodies",
        ],
        "config": {
            "concurrency": 2,
            "parallel_rows": 2,
            "max_metric_calls": 40,
            "max_trials_without_improvement": 4,
        },
        "tags": ["xbow", "capability-env"],
    },
)
print(job.id, job.status)

Dataset rows drive which tasks get provisioned; task_ref on the job is only the fallback for rows that don’t override it.
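The fallback rule is simple enough to state as code. A sketch, where `effective_task_ref` is an illustrative helper (not a library function) applying the row-wins resolution described above:

```python
JOB_TASK_REF = "xbow/xben-071-24"  # job-level fallback from the payload

def effective_task_ref(row: dict, job_task_ref: str = JOB_TASK_REF) -> str:
    """A row's own task_ref wins; the job-level ref covers the rest."""
    return row.get("task_ref") or job_task_ref

rows = [
    {"goal": "capture the flag"},                                  # job fallback
    {"goal": "capture the flag", "task_ref": "xbow/xben-031-24"},  # row override
]
resolved = [effective_task_ref(r) for r in rows]
```

So a trainset can mix rows that name their own task with rows that inherit the job's target.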

job = dn.api.get_optimization_job("acme", "research", job.id)
logs = dn.api.list_optimization_job_logs("acme", "research", job.id)
artifacts = dn.api.get_optimization_job_artifacts("acme", "research", job.id)
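A simple wait-for-completion loop can wrap those calls. A hedged sketch: the terminal status names and the helper itself are assumptions, and in a real run `fetch` would wrap `dn.api.get_optimization_job(...)`:

```python
import time

TERMINAL = {"succeeded", "failed", "cancelled"}  # assumed terminal statuses

def poll_until_terminal(fetch, interval_sec: float = 0.0, max_polls: int = 100) -> dict:
    """Call fetch() until the job reaches a terminal status, then return it."""
    for _ in range(max_polls):
        job = fetch()
        if job["status"] in TERMINAL:
            return job
        time.sleep(interval_sec)
    raise TimeoutError("job did not reach a terminal status within max_polls")

# Stubbed status sequence standing in for repeated API reads.
statuses = iter(["queued", "running", "running", "succeeded"])
job = poll_until_terminal(lambda: {"id": "job-1", "status": next(statuses)})
```

In practice a longer `interval_sec` (tens of seconds) is sensible, since trials include full sandbox provisioning.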

Watch:

  • best_score on the job record and optimization/best_score / optimization/val_score in the trace viewer — the val curve is the one that matters
  • per-candidate logs from the worker
  • sandbox provisioning load in your workspace’s sandbox dashboard if you raised concurrency

The same rules as any optimization run apply: a completed job only means the hosted loop finished. Before promoting:

  • val score actually improved, not just train
  • the best candidate’s prompt/skill diff reads as intentional, not as overfit noise
  • the winning surfaces still make sense for other tasks the capability should handle

Promote through the App or the capability registry — same as capability-agent optimization.

  • Single target vs. peer tasks: optimizing on just one task will overfit it. If that’s acceptable (you only care about that flag), accept it; if you want tuning that generalizes, train on peer tasks and keep the target in valset.
  • Sandbox provisioning runs long: compose-heavy tasks take 30–120s per env provision. Use parallel_rows > 1 to fan rows concurrently, but budget for concurrency × parallel_rows concurrent sandboxes at peak.
  • Scorer wants to shell in: read current_task_environment in the scorer and call env.execute(...). The env is alive through the scorer; it tears down after.
  • Multi-agent capabilities: the adapter today tunes one named agent’s prompt at a time plus the shared capability/skill surfaces. If the capability ships coordinated agents and you want all their prompts mutated, multi-agent tuning is a follow-up.
Record alongside the promotion:

  • the source capability ref and dataset refs
  • the optimization job ID
  • the winning candidate summary and diff
  • the promoted capability version
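The items above fit in a small structured record kept next to the promotion. A minimal sketch; the field names and the version values are illustrative, not a platform schema:

```python
promotion_record = {
    "capability_ref": {"name": "dreadnode/web-security", "version": "1.0.2"},
    "dataset_refs": {"train": "xbow-train@1", "val": "xbow-val@1"},
    "job_id": "job-...",                         # the optimization job ID
    "winning_candidate": {"summary": "...", "diff": "..."},
    "promoted_version": "...",                   # the new capability version
}
```

Keeping this with the promoted version makes a later regression traceable back to the exact job and diff that produced it.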