Hosted jobs
Submit, monitor, and promote platform-managed GEPA optimization jobs against a published capability.
Hosted optimization runs a GEPA search on platform-managed compute against a published capability and a published dataset, then writes the winning instructions back as a new capability version after you review them. The CLI is the primary surface; the App exposes the same jobs for monitoring and promotion.
```
dn optimize submit \
  --model openai/gpt-4o-mini \
  --agent-name assistant \
  --reward-recipe exact_match_v1 \
  --objective "Improve instruction quality without increasing verbosity." \
  --max-metric-calls 100 \
  --max-trials-without-improvement 3 \
  --wait
```

With --wait, the command blocks until the job reaches a terminal state and exits non-zero on
failed or cancelled. Without it, submit returns the job ID and you poll separately.
When to reach for hosted jobs
Reach for hosted jobs when the capability and dataset are already published, the scoring approach is stable, and you want platform-managed runs that land as auditable records. While any of those inputs are still moving, capability improvement or local search are better places to experiment.
Backend: gepa. Two target kinds are available — pick by what determines a successful trial.
| Target kind | Optimized surface | Scoring |
|---|---|---|
| capability_agent | the agent’s instructions field | a reward recipe scores each candidate’s output on the dataset |
| capability_env | prompt and skill surfaces across the capability (agent_prompt, capability_prompt, skill_descriptions, skill_bodies) | the runtime provisions a live task environment per dataset row, runs the agent against it, and the reward recipe scores the run |
Pick capability_env when scoring needs the sandbox (CTF targets, services the agent probes, files
on disk). The task-environment optimization guide walks
through the end-to-end workflow — local smoke, hosted submission, monitoring, promotion. The rest
of this page covers the control-plane mechanics both target kinds share.
The hosted worker runs inside a sandbox whose API key is scoped to the optimization surface only
(optimization:write, environments:{read,write,execute}, capability and package reads, traces
and sessions, inference catalog). Task reads, secrets, credits, and admin scopes are excluded, so a
compromised job payload cannot escalate out of the optimization surface.
Inputs
The flags below are the ones most jobs pin. dn optimize submit --help and the
dn optimize reference cover the rest (naming, tagging, trace capture,
reflection controls, polling).
| Input | What it pins |
|---|---|
| --capability | NAME@VERSION — the capability whose instructions the job edits. |
| --agent-name | The agent inside the capability (required when there are multiple). |
| --dataset | NAME@VERSION — the training set. |
| --val-dataset | NAME@VERSION — an optional held-out set. |
| --reward-recipe | One of the hosted reward recipes. |
| --reward-params | A JSON object passed to the recipe. |
| --model | The target model the job improves. |
| --reflection-lm | Model for reflection steps. Server defaults to --model when unset. |
Pin dataset versions explicitly — optimization against a moving dataset is not reproducible, even when the inputs look stable at submit time.
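Scripted submissions can enforce that pin before calling submit. A minimal sketch, where assert_pinned is a hypothetical helper (not part of the CLI or SDK):

```python
def assert_pinned(ref: str) -> str:
    """Require NAME@VERSION (optionally org-prefixed) and return it unchanged."""
    name, sep, version = ref.partition("@")
    if not sep or not name or not version:
        raise ValueError(f"unpinned ref: {ref!r} (use NAME@VERSION)")
    return ref

assert_pinned("support-prompts@0.1.0")  # ok
assert_pinned("acme/xbow-train@1")      # ok, org-prefixed
```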
Extra inputs for capability_env
Env-scored jobs take the same capability, dataset, model, and reward recipe as capability_agent
— plus the fields that drive sandbox provisioning:
| Input | What it controls |
|---|---|
| task_ref | Default [org/]name[@version] task the runtime provisions per dataset row. Dataset rows can override per-row with their own task_ref. |
| timeout_sec | Per-env provisioning timeout. Raise for compose-heavy tasks (30–120s is typical). |
| components | Which capability surfaces GEPA may edit: agent_prompt, capability_prompt, skill_descriptions, skill_bodies. |
| parallel_rows | Dataset rows scored concurrently inside one candidate evaluation (passed via config). |
| concurrency | Candidates evaluated in parallel across the search (passed via config). Peak concurrent sandboxes is concurrency × parallel_rows. |
A dataset row for env scoring is minimally {"goal": "capture the flag"}. Rows can also carry
task_ref (to fan one trainset across multiple tasks) or inputs (templating values forwarded
to the env).
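Put together, a small env-scoring training set can be written as JSONL. The task_ref and inputs values below are placeholders; only the row shapes follow the description above:

```python
import json

rows = [
    {"goal": "capture the flag"},  # minimal row: just the goal
    {
        # per-row task override plus templating inputs (values hypothetical)
        "goal": "capture the flag",
        "task_ref": "acme/example-task@1",
        "inputs": {"target_port": 8080},
    },
]

jsonl = "\n".join(json.dumps(r) for r in rows)
print(jsonl)
```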
Stopping controls
Four flags bound the search in different ways; the job stops at whichever hits first.
| Flag | Bounds |
|---|---|
| --max-metric-calls | Total scorer calls. |
| --max-trials | Total candidate trials. |
| --max-trials-without-improvement | Finished trials since the last new best score. |
| --max-runtime-sec | Wall-clock lifetime of the hosted sandbox. |
--max-trials-without-improvement is usually the most useful brake: it stops jobs that are
circling without producing anything new.
The full flag list lives on the auto-generated dn optimize reference.
Monitoring a running job
Once a job exists, control-plane commands inspect different layers:
```
dn optimize list                 # in-flight and recent jobs
dn optimize get <job-id>         # saved config + status
dn optimize wait <job-id>        # block until terminal
dn optimize logs <job-id>        # what the loop is doing right now
dn optimize artifacts <job-id>   # outputs worth reusing
dn optimize cancel <job-id>
dn optimize retry <job-id>       # rerun the same config, cleared state
```

wait exits non-zero when the job ends in failed or cancelled, which is what you want in CI.
retry applies only to terminal jobs and requeues the same saved setup with cleared metrics and
artifacts.
The App exposes the same jobs with a live log stream, metric sparklines, and the best-score trajectory. For dev compute that looks out of sync with job state, drop to inspecting compute.
Reading the result
A completed job says “the loop finished.” Before you do anything with it, check:
- Best score — did the metric actually improve over the baseline?
- Validation behavior — if you passed --val-dataset, does the win hold on held-out data?
- Candidate summary — is the new instruction block something you’d ship, or overfit noise?
The App’s job detail view and dn optimize artifacts both expose the best candidate. The job
record also carries the saved config, which is what retry reruns against.
Promotion
Promotion is a separate step from the search. It publishes the winning instructions as a new
version of the source capability and is gated: only completed jobs with promotable instructions
in the best candidate can promote.
Promotion lives in the App today — open the job, review the diff, publish. The same action is
exposed on the platform API as POST /org/{org}/ws/{workspace}/optimization/jobs/{job_id}/promote,
which you can call directly when you need scripted promotion. There is no dn optimize promote
subcommand yet.
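For scripted promotion, only the endpoint path above is fixed. A minimal sketch that builds the request URL from it; the base URL is a placeholder, and the real host and auth come from your platform profile:

```python
BASE_URL = "https://platform.example.com"  # placeholder — use your platform host

def promote_url(org: str, workspace: str, job_id: str) -> str:
    """Build the documented promote endpoint path for a job."""
    return (
        f"{BASE_URL}/org/{org}/ws/{workspace}"
        f"/optimization/jobs/{job_id}/promote"
    )

print(promote_url("acme", "research", "job_123"))
```

POST to that URL with your API key; the gate still applies, so only completed jobs with promotable instructions succeed.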
Once promoted, the capability has a new pinned version. Rerun the relevant evaluations against that version before any downstream automation moves to it.
Scripting submission from the SDK
When the CLI isn’t the right place (notebooks, in-process pipelines), the ApiClient exposes the
same endpoints:
```python
from dreadnode.app.api import create_api_client
from dreadnode.app.api.models import (
    CapabilityRef,
    CreateGEPAOptimizationJobRequest,
    DatasetRef,
    RewardRecipe,
)

api = create_api_client()  # reads the profile from `dn login`

job = api.create_optimization_job(
    org="acme",
    workspace="research",
    request=CreateGEPAOptimizationJobRequest(
        model="openai/gpt-4o-mini",
        capability_ref=CapabilityRef(name="support-agent", version="1.0.0"),
        agent_name="assistant",
        dataset_ref=DatasetRef(name="support-prompts", version="0.1.0"),
        reward_recipe=RewardRecipe(name="exact_match_v1"),
        components=["instructions"],
        objective="Improve answer quality without increasing verbosity.",
    ),
)

print(job.id, job.status)
```

create_api_client() returns the same platform API client the CLI uses — it reads the logged-in
profile from dn login and picks up --profile if you pass one. create_optimization_job,
get_optimization_job, list_optimization_jobs, list_optimization_job_logs,
get_optimization_job_artifacts, cancel_optimization_job, and retry_optimization_job all
mirror their CLI counterparts. Prefer the CLI for interactive runs and CI; drop to the SDK when
you need the job to live inside a larger Python workflow.
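The CLI's wait behavior is easy to reproduce in-process. A minimal sketch of a generic poller: it assumes only a callable returning an object with a status attribute, and that completed, failed, and cancelled are the terminal states, so it is testable without a live platform:

```python
import time

TERMINAL = {"completed", "failed", "cancelled"}  # assumed terminal statuses

def wait_for_terminal(fetch, interval_sec: float = 5.0, timeout_sec: float = 3600.0):
    """Poll `fetch()` until the returned job reaches a terminal status."""
    deadline = time.monotonic() + timeout_sec
    while time.monotonic() < deadline:
        job = fetch()
        if job.status in TERMINAL:
            return job
        time.sleep(interval_sec)
    raise TimeoutError("job did not reach a terminal state in time")

# e.g. wait_for_terminal(lambda: api.get_optimization_job(org, workspace, job_id))
```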
Submitting a capability_env job
dn optimize submit handles both target kinds. The CLI infers target_kind from which
training-surface flag you pass: --task or --task-dataset make the job capability_env;
--dataset makes it capability_agent. Exactly one is required.
```
dn optimize submit \
  --model anthropic/claude-sonnet-4-6 \
  --agent-name web-security \
  --task-dataset xbow-train@1 \
  --val-dataset xbow-val@1 \
  --reward-recipe exact_match_v1 \
  --env-timeout-sec 1800 \
  --parallel-rows 2 \
  --concurrency 2 \
  --component agent_prompt \
  --component capability_prompt \
  --component skill_descriptions \
  --component skill_bodies \
  --max-metric-calls 40 \
  --max-trials-without-improvement 4 \
  --tag xbow --tag capability-env
```

--task is the inline alternative when a dataset isn’t worth publishing — repeat it to fan the
training set across several tasks (--task xbow/xben-031-24 --task xbow/xben-047-24). Use
--val-task for held-out tasks. --env-timeout-sec, --parallel-rows, --concurrency, and
--component are env-mode only, and the CLI rejects them on agent-scored jobs.
The same submission is available from the SDK when the CLI isn’t the right surface — the client accepts a dict, which passes straight through to the server validator:
```python
job = api.create_optimization_job(
    org="acme",
    workspace="research",
    request={
        "backend": "gepa",
        "target_kind": "capability_env",
        "model": "anthropic/claude-sonnet-4-6",
        "capability_ref": {"name": "dreadnode/web-security", "version": "1.0.2"},
        "agent_name": "web-security",
        "dataset_ref": {"name": "xbow-train", "version": "1"},
        "val_dataset_ref": {"name": "xbow-val", "version": "1"},
        "reward_recipe": {"name": "exact_match_v1", "params": {}},
        "task_ref": "xbow/xben-071-24",
        "timeout_sec": 1800,
        "components": [
            "agent_prompt",
            "capability_prompt",
            "skill_descriptions",
            "skill_bodies",
        ],
        "config": {
            "concurrency": 2,
            "parallel_rows": 2,
            "max_metric_calls": 40,
            "max_trials_without_improvement": 4,
        },
        "tags": ["xbow", "capability-env"],
    },
)
print(job.id, job.status)
```

The App renders capability_env jobs with the same monitoring, retry, and promote surfaces as
agent-scored jobs. Follow the full scenario in the
task-environment optimization guide.
Related
- Capability optimization loop walks the full freeze → submit → review → promote scenario end to end.
- Task-environment optimization is the sandbox-scoring variant — tune against a live target when the reward depends on sandbox state, not text output.
- Reward recipes details what each --reward-recipe scores.
- Capabilities is where promoted instructions land as a new version.