
Reinforcement learning

Train against rewards, task verifiers, offline trajectories, or a live Worlds environment.

Reach for RL when the signal comes from rewards, verifier outcomes, or environment rollouts rather than fixed target answers. The most useful question to answer before anything else is: where does the experience come from?

| Experience source | Flag | What it means |
| --- | --- | --- |
| Prompt dataset | --prompt-dataset NAME@VERSION | You have prompts and will score each generated completion with a recipe. |
| Offline trajectories | --trajectory-dataset NAME@VERSION (repeatable) | Learn from agent rollouts already collected into published datasets. |
| Live Worlds environment | --world-manifest-id <id> | Generate fresh experience by rolling out against a Worlds manifest. |

The common case: a prompt dataset supplies the prompts, the capability runs the policy, and a server-side reward recipe decides what counts as success.

Terminal window
dn train rl \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --capability [email protected] \
  --task security-mutillidae-sqli-login-bypass \
  --prompt-dataset seed-prompts@sqli-v1 \
  --algorithm importance_sampling \
  --reward-recipe task_verifier_v1 \
  --execution-mode fully_async \
  --max-steps-off-policy 3 \
  --num-rollouts 32

--reward-recipe names a server-side recipe; --reward-params passes a JSON blob of parameters. --task REF is what task_verifier_v1 reads to find the expected flag hash — the prompt dataset supplies the prompts, the task supplies the ground truth. See reward recipes for the five available recipes.
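To make the scoring model concrete, here is an illustrative sketch of what a verifier-style recipe does conceptually: the task supplies the ground truth (for example, an expected flag hash), and each completion scores 1.0 only if it contains a matching flag. The function name and hashing scheme are assumptions for illustration, not the actual task_verifier_v1 implementation.

```python
import hashlib


def verifier_reward(completion: str, expected_flag_hash: str) -> float:
    """Illustrative only: return 1.0 if any whitespace-delimited token in the
    completion hashes (SHA-256) to the task's expected flag hash, else 0.0."""
    for token in completion.split():
        if hashlib.sha256(token.encode()).hexdigest() == expected_flag_hash:
            return 1.0
    return 0.0


# The task, not the prompt dataset, would carry this ground-truth hash.
expected = hashlib.sha256(b"FLAG{sqli-bypass}").hexdigest()
```

The point of the split: prompts vary freely per rollout, while the verification target stays pinned to the task reference.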

When the experience already exists as Worlds rollouts:

Terminal window
dn train rl \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --capability [email protected] \
  --trajectory-dataset dreadnode/[email protected] \
  --trajectory-dataset dreadnode/[email protected] \
  --algorithm importance_sampling

Trajectory datasets are resolved at submission and streamed to the trainer without an intermediate conversion step.

To let the job generate experience against a live Worlds manifest during training:

Terminal window
dn train rl \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --capability dreadnode/[email protected] \
  --world-manifest-id c8af2b7b-9b54-4b21-95a9-b8d403cd8c11 \
  --world-runtime-id 8b8fd3af-9a5e-47c8-9f67-7b87ca9387eb \
  --world-agent-name operator \
  --world-goal "Escalate to Domain Admin in corp.local" \
  --world-reward discovery_v1 \
  --execution-mode fully_async \
  --max-steps-off-policy 3 \
  --num-rollouts 8

--world-runtime-id plus --world-agent-name select a runtime-bound capability snapshot to use for the rollouts. The validator requires --world-manifest-id whenever --world-runtime-id is set, and --world-runtime-id whenever --world-agent-name is set. --world-reward applies an SDK-side reward policy that shapes intermediate signals during the trajectory — see reward recipes for the presets and component-based composition.

--reward-recipe and --world-reward are orthogonal: the recipe scores the completion; the world-reward shapes the trajectory. You can pass both, one, or neither.

--execution-mode controls how rollout generation and optimizer updates interleave:

| Mode | What it does |
| --- | --- |
| sync | One rollout group at a time; no overlap between generation and training. |
| one_step_off_async | Keeps a single rollout group in flight while the previous group updates — one step of staleness. |
| fully_async | Widens the pipeline to multiple queued rollout groups with bounded staleness. |

Async modes require --max-steps-off-policy. For one_step_off_async it must be 1; for fully_async it’s the staleness budget.

from dreadnode.app.api.client import ApiClient
from dreadnode.app.api.models import (
    CapabilityRef,
    CreateTinkerRLJobRequest,
    DatasetRef,
    RewardRecipe,
    TinkerRLJobConfig,
)

client = ApiClient("https://app.dreadnode.io", api_key="dn_...")

job = client.create_training_job(
    "acme",
    "research",
    CreateTinkerRLJobRequest(
        model="meta-llama/Llama-3.1-8B-Instruct",
        capability_ref=CapabilityRef(name="web-agent", version="2.0.1"),
        config=TinkerRLJobConfig(
            algorithm="importance_sampling",
            task_ref="security-mutillidae-sqli-login-bypass",
            prompt_dataset_ref=DatasetRef(name="seed-prompts", version="sqli-v1"),
            reward_recipe=RewardRecipe(name="task_verifier_v1"),
            execution_mode="fully_async",
            max_steps_off_policy=3,
            num_rollouts=32,
            lora_rank=16,
            max_new_tokens=128,
            temperature=0.1,
            stop=["</answer>"],
        ),
    ),
)

Every RL option is typed on TinkerRLJobConfig — see the manifest reference for the full field table with defaults and validation rules.

The flags you’ll touch most:

| Flag | What it does |
| --- | --- |
| --algorithm | importance_sampling or ppo. |
| --num-rollouts <n> | Rollouts collected per training window. |
| --max-turns <n> | Maximum agent turns per episode. |
| --max-episode-steps <n> | Environment-step cap per episode. |
| --weight-sync-interval <n> | Refresh the sampler’s weights every N optimizer steps. |
| --max-new-tokens <n> | Sampling cap per completion. |
| --temperature <float> | Sampling temperature. |
| --stop <token> | Stop sequence (repeatable). |
| --prompt-split <name> | Dataset split to use for prompt sampling when the prompt dataset has splits. |

Full surface: dn train.

RL jobs share the lifecycle surface with SFT. See running training jobs for list / get / wait / logs / cancel / retry, monitoring for the App view, and outputs for the artifacts a completed RL job produces.