
Hosted jobs

Submit, monitor, and promote platform-managed GEPA optimization jobs against a published capability.

Hosted optimization runs a GEPA search on platform-managed compute against a published capability and a published dataset, then writes the winning instructions back as a new capability version after you review them. The CLI is the primary surface; the App exposes the same jobs for monitoring and promotion.

Terminal window
dn optimize submit \
--model openai/gpt-4o-mini \
--capability support-agent@1.0.0 \
--agent-name assistant \
--dataset support-prompts@0.1.0 \
--val-dataset support-prompts-val@0.1.0 \
--reward-recipe exact_match_v1 \
--objective "Improve instruction quality without increasing verbosity." \
--max-metric-calls 100 \
--max-trials-without-improvement 3 \
--wait

With --wait, the command blocks until the job reaches a terminal state and exits non-zero on failed or cancelled. Without it, submit returns the job ID and you poll separately.
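Without `--wait`, the polling loop is yours to write. A minimal sketch, assuming a `get_job_status` callable you supply (for example, a thin wrapper over `dn optimize get` or the SDK's `get_optimization_job`):

```python
import time

TERMINAL_STATES = {"completed", "failed", "cancelled"}

def wait_for_job(get_job_status, job_id, poll_interval=10.0, timeout=3600.0):
    """Poll a caller-supplied status callable until the job reaches a terminal state."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        status = get_job_status(job_id)
        if status in TERMINAL_STATES:
            return status
        time.sleep(poll_interval)
    raise TimeoutError(f"job {job_id} not terminal after {timeout}s")
```

`dn optimize wait` does the same thing server-aware; this is only useful when you need the loop inside your own process.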

Reach for hosted jobs when the capability and dataset are already published, the scoring approach is stable, and you want platform-managed runs that land as auditable records. While any of those inputs are still moving, capability improvement or local search are better places to experiment.

Backend: gepa. Two target kinds are available — pick by what determines a successful trial.

| Target kind | Optimized surface | Scoring |
| --- | --- | --- |
| capability_agent | the agent’s instructions field | a reward recipe scores each candidate’s output on the dataset |
| capability_env | prompt and skill surfaces across the capability (agent_prompt, capability_prompt, skill_descriptions, skill_bodies) | the runtime provisions a live task environment per dataset row, runs the agent against it, and the reward recipe scores the run |

Pick capability_env when scoring needs the sandbox (CTF targets, services the agent probes, files on disk). The task-environment optimization guide walks through the end-to-end workflow — local smoke, hosted submission, monitoring, promotion. The rest of this page covers the control-plane mechanics both target kinds share.

The hosted worker runs inside a sandbox whose API key is scoped to the optimization surface only (optimization:write, environments:{read,write,execute}, capability and package reads, traces and sessions, inference catalog). Task reads, secrets, credits, and admin scopes are excluded, so a compromised job payload cannot escalate out of the optimization surface.
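The deny-by-default shape of that key is easy to picture as a set check. A sketch — the literal scope strings for the read-only grants below are assumptions reconstructed from the list above, not documented names:

```python
# Reconstructed allow-list for the hosted worker's sandbox key.
SANDBOX_SCOPES = {
    "optimization:write",
    "environments:read",
    "environments:write",
    "environments:execute",
    "capabilities:read",
    "packages:read",
    "traces:read",
    "sessions:read",
    "inference:read",
}

def scope_allowed(scope: str) -> bool:
    """Deny by default: anything outside the optimization surface is refused."""
    return scope in SANDBOX_SCOPES
```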

The flags below are the ones most jobs pin. dn optimize submit --help and the dn optimize reference cover the rest (naming, tagging, trace capture, reflection controls, polling).

| Input | What it pins |
| --- | --- |
| --capability | NAME@VERSION — the capability whose instructions the job edits. |
| --agent-name | The agent inside the capability (required when there are multiple). |
| --dataset | NAME@VERSION — the training set. |
| --val-dataset | NAME@VERSION — an optional held-out set. |
| --reward-recipe | One of the hosted reward recipes. |
| --reward-params | A JSON object passed to the recipe. |
| --model | The target model the job improves. |
| --reflection-lm | Model for reflection steps. Server defaults to --model when unset. |

Pin dataset versions explicitly — optimization against a moving dataset is not reproducible, even when the inputs look stable at submit time.
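Submission scripts can enforce that discipline mechanically by rejecting any ref without an explicit version. An illustrative helper, not part of the CLI or SDK:

```python
def parse_pinned_ref(ref: str) -> tuple[str, str]:
    """Split a NAME@VERSION ref, rejecting refs with no explicit version pin."""
    name, sep, version = ref.partition("@")
    if not sep or not version:
        raise ValueError(f"unpinned ref {ref!r}: expected NAME@VERSION")
    if not name:
        raise ValueError(f"malformed ref {ref!r}: missing name")
    return name, version
```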

Env-scored jobs take the same capability, dataset, model, and reward recipe as capability_agent — plus the fields that drive sandbox provisioning:

| Input | What it controls |
| --- | --- |
| task_ref | Default [org/]name[@version] task the runtime provisions per dataset row. Dataset rows can override per-row with their own task_ref. |
| timeout_sec | Per-env provisioning timeout. Raise for compose-heavy tasks (30–120s is typical). |
| components | Which capability surfaces GEPA may edit: agent_prompt, capability_prompt, skill_descriptions, skill_bodies. |
| parallel_rows | Dataset rows scored concurrently inside one candidate evaluation (passed via config). |
| concurrency | Candidates evaluated in parallel across the search (passed via config). Peak concurrent sandboxes is concurrency × parallel_rows. |
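Since peak sandbox count is the product of the last two knobs, it is worth sanity-checking before submitting. A sketch, where the budget value is an assumption rather than a platform quota:

```python
def peak_sandboxes(concurrency: int, parallel_rows: int) -> int:
    """Worst-case concurrent task environments for one job."""
    return concurrency * parallel_rows

def check_sandbox_budget(concurrency: int, parallel_rows: int, max_sandboxes: int) -> int:
    """Reject a submission whose worst case exceeds an assumed org-side budget."""
    peak = peak_sandboxes(concurrency, parallel_rows)
    if peak > max_sandboxes:
        raise ValueError(f"peak of {peak} sandboxes exceeds budget of {max_sandboxes}")
    return peak
```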

A dataset row for env scoring is minimally {"goal": "capture the flag"}. Rows can also carry task_ref (to fan one trainset across multiple tasks) or inputs (templating values forwarded to the env).

Four flags bound the search in different ways; the job stops at whichever limit hits first.

| Flag | Bounds |
| --- | --- |
| --max-metric-calls | Total scorer calls. |
| --max-trials | Total candidate trials. |
| --max-trials-without-improvement | Finished trials since the last new best score. |
| --max-runtime-sec | Wall-clock lifetime of the hosted sandbox. |

--max-trials-without-improvement is usually the most useful brake: it stops jobs that are circling without producing anything new.
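The brake's behavior is easiest to see in isolation. A simplified model of the stopping rule, assuming the loop counts finished trials since the last strict improvement (the hosted loop's exact accounting may differ):

```python
def run_with_brake(scores, max_trials_without_improvement):
    """Consume trial scores until N finished trials pass without a new best."""
    best = float("-inf")
    stale = 0
    completed = []
    for score in scores:
        completed.append(score)
        if score > best:
            best, stale = score, 0  # new best resets the counter
        else:
            stale += 1
        if stale >= max_trials_without_improvement:
            break
    return best, completed
```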

The full flag list lives on the auto-generated dn optimize reference.

Once a job exists, control-plane commands inspect different layers:

Terminal window
dn optimize list # in-flight and recent jobs
dn optimize get <job-id> # saved config + status
dn optimize wait <job-id> # block until terminal
dn optimize logs <job-id> # what the loop is doing right now
dn optimize artifacts <job-id> # outputs worth reusing
dn optimize cancel <job-id>
dn optimize retry <job-id> # rerun the same config, cleared state

wait exits non-zero when the job ends in failed or cancelled, which is what you want in CI. retry applies only to terminal jobs and requeues the same saved setup with cleared metrics and artifacts.
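Scripts that poll instead of calling wait can mirror the same convention. A sketch of the terminal-state-to-exit-code mapping (the specific non-zero value is illustrative):

```python
import sys

def exit_code_for(status: str) -> int:
    """Zero only for a successful terminal state, mirroring `dn optimize wait`."""
    if status == "completed":
        return 0
    if status in {"failed", "cancelled"}:
        return 1
    raise ValueError(f"{status!r} is not a terminal state")

# In a CI script: sys.exit(exit_code_for(final_status))
```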

The App exposes the same jobs with a live log stream, metric sparklines, and the best-score trajectory. For dev compute that looks out of sync with job state, drop to inspecting compute.

A completed job says “the loop finished.” Before you do anything with it, check:

  • Best score — did the metric actually improve over the baseline?
  • Validation behavior — if you passed --val-dataset, does the win hold on held-out data?
  • Candidate summary — is the new instruction block something you’d ship, or overfit noise?
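When jobs feed automation, those checks can be scripted as a gate. A sketch with illustrative thresholds — neither the helper nor the default values are platform-provided:

```python
def promotion_looks_safe(baseline, best, val_baseline=None, val_best=None,
                         min_gain=0.0, max_val_gap=0.1):
    """Require a real train gain and, when validation scores exist, a win that holds."""
    train_gain = best - baseline
    if train_gain <= min_gain:
        return False
    if val_baseline is not None and val_best is not None:
        val_gain = val_best - val_baseline
        if val_gain <= 0:
            return False
        # Flag overfit: the validation gain should not collapse relative to train.
        if train_gain - val_gain > max_val_gap:
            return False
    return True
```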

The App’s job detail view and dn optimize artifacts both expose the best candidate. The job record also carries the saved config, which is what retry reruns against.

Promotion is a separate step from the search. It publishes the winning instructions as a new version of the source capability and is gated: only completed jobs with promotable instructions in the best candidate can promote.

Promotion lives in the App today — open the job, review the diff, publish. The same action is exposed on the platform API as POST /org/{org}/ws/{workspace}/optimization/jobs/{job_id}/promote, which you can call directly when you need scripted promotion. There is no dn optimize promote subcommand yet.

Once promoted, the capability has a new pinned version. Rerun the relevant evaluations against that version before any downstream automation moves to it.

When the CLI isn’t the right place (notebooks, in-process pipelines), the ApiClient exposes the same endpoints:

from dreadnode.app.api import create_api_client
from dreadnode.app.api.models import (
    CapabilityRef,
    CreateGEPAOptimizationJobRequest,
    DatasetRef,
    RewardRecipe,
)

api = create_api_client()  # reads the profile from `dn login`
job = api.create_optimization_job(
    org="acme",
    workspace="research",
    request=CreateGEPAOptimizationJobRequest(
        model="openai/gpt-4o-mini",
        capability_ref=CapabilityRef(name="support-agent", version="1.0.0"),
        agent_name="assistant",
        dataset_ref=DatasetRef(name="support-prompts", version="0.1.0"),
        reward_recipe=RewardRecipe(name="exact_match_v1"),
        components=["instructions"],
        objective="Improve answer quality without increasing verbosity.",
    ),
)
print(job.id, job.status)

create_api_client() returns the same platform API client the CLI uses — it reads the logged-in profile from dn login and picks up --profile if you pass one. create_optimization_job, get_optimization_job, list_optimization_jobs, list_optimization_job_logs, get_optimization_job_artifacts, cancel_optimization_job, and retry_optimization_job all mirror their CLI counterparts. Prefer the CLI for interactive runs and CI; drop to the SDK when you need the job to live inside a larger Python workflow.

dn optimize submit handles both target kinds. The CLI infers target_kind from which training-surface flag you pass: --task or --task-dataset make the job capability_env; --dataset makes it capability_agent. Exactly one is required.
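The inference rule is small enough to restate as code. An illustrative reimplementation, not the CLI's actual logic:

```python
def infer_target_kind(task=None, task_dataset=None, dataset=None):
    """Mirror the CLI rule: exactly one training-surface flag picks the target kind."""
    env = bool(task) or task_dataset is not None
    agent = dataset is not None
    if env and agent:
        raise ValueError("pass --task/--task-dataset or --dataset, not both")
    if env:
        return "capability_env"
    if agent:
        return "capability_agent"
    raise ValueError("one of --task, --task-dataset, or --dataset is required")
```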

Terminal window
dn optimize submit \
--model anthropic/claude-sonnet-4-6 \
--capability dreadnode/web-security@1.0.2 \
--agent-name web-security \
--task-dataset xbow-train@1 \
--val-dataset xbow-val@1 \
--reward-recipe exact_match_v1 \
--env-timeout-sec 1800 \
--parallel-rows 2 \
--concurrency 2 \
--component agent_prompt \
--component capability_prompt \
--component skill_descriptions \
--component skill_bodies \
--max-metric-calls 40 \
--max-trials-without-improvement 4 \
--tag xbow --tag capability-env

--task is the inline alternative when a dataset isn’t worth publishing — repeat it to fan the training set across several tasks (--task xbow/xben-031-24 --task xbow/xben-047-24). Use --val-task for held-out tasks. --env-timeout-sec, --parallel-rows, --concurrency, and --component are env-mode only and the CLI rejects them on agent-scored jobs.
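Mechanically, repeated --task flags act as a tiny inline trainset: one env-scoring row per task ref. A sketch of that expansion (the helper is illustrative; the row shape matches the env-scoring format above):

```python
def rows_from_tasks(task_refs, goal="capture the flag"):
    """Expand repeated --task refs into one env-scoring row per task."""
    return [{"goal": goal, "task_ref": ref} for ref in task_refs]

rows = rows_from_tasks(["xbow/xben-031-24", "xbow/xben-047-24"])
```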

The same submission is available from the SDK when the CLI isn’t the right surface — the client accepts a dict, which passes straight through to the server validator:

job = api.create_optimization_job(
    org="acme",
    workspace="research",
    request={
        "backend": "gepa",
        "target_kind": "capability_env",
        "model": "anthropic/claude-sonnet-4-6",
        "capability_ref": {"name": "dreadnode/web-security", "version": "1.0.2"},
        "agent_name": "web-security",
        "dataset_ref": {"name": "xbow-train", "version": "1"},
        "val_dataset_ref": {"name": "xbow-val", "version": "1"},
        "reward_recipe": {"name": "exact_match_v1", "params": {}},
        "task_ref": "xbow/xben-071-24",
        "timeout_sec": 1800,
        "components": [
            "agent_prompt",
            "capability_prompt",
            "skill_descriptions",
            "skill_bodies",
        ],
        "config": {
            "concurrency": 2,
            "parallel_rows": 2,
            "max_metric_calls": 40,
            "max_trials_without_improvement": 4,
        },
        "tags": ["xbow", "capability-env"],
    },
)
print(job.id, job.status)

The App renders capability_env jobs with the same monitoring, retry, and promote surfaces as agent-scored jobs. Follow the full scenario in the task-environment optimization guide.