# Reinforcement learning
Train against rewards, task verifiers, offline trajectories, or a live Worlds environment.
Reach for RL when the signal comes from rewards, verifier outcomes, or environment rollouts rather than fixed target answers. The most useful question to answer before anything else is: where does the experience come from?
| Experience source | Flag | What it means |
|---|---|---|
| Prompt dataset | `--prompt-dataset NAME@VERSION` | You have prompts and will score each generated completion with a recipe. |
| Offline trajectories | `--trajectory-dataset NAME@VERSION` (repeatable) | Learn from agent rollouts already collected into published datasets. |
| Live Worlds environment | `--world-manifest-id <id>` | Generate fresh experience by rolling out against a Worlds manifest. |
## Verifier-driven RL

The common case: a prompt dataset supplies the prompts, the capability runs the policy, and a server-side reward recipe decides what counts as success.
```bash
dn train rl \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --task security-mutillidae-sqli-login-bypass \
  --prompt-dataset seed-prompts@sqli-v1 \
  --algorithm importance_sampling \
  --reward-recipe task_verifier_v1 \
  --execution-mode fully_async \
  --max-steps-off-policy 3 \
  --num-rollouts 32
```

`--reward-recipe` names a server-side recipe; `--reward-params` passes a JSON blob of parameters. `--task REF` is what `task_verifier_v1` reads to find the expected flag hash: the prompt dataset supplies the prompts, the task supplies the ground truth. See reward recipes for the five available recipes.
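To make that division of labor concrete, here is a minimal sketch of what a verifier-style reward does. This is illustrative Python only, not the server-side `task_verifier_v1` implementation; the `flag{...}` format and the SHA-256 comparison are assumptions.

```python
import hashlib
import re

def verifier_reward(completion: str, expected_flag_sha256: str) -> float:
    """Score a completion 1.0 if it contains the task's expected flag, else 0.0.

    Illustrative only: the real recipe runs server-side and reads the
    expected hash from the --task reference.
    """
    # Pull candidate flag strings out of the completion (format assumed).
    candidates = re.findall(r"flag\{[^}]+\}", completion)
    for flag in candidates:
        if hashlib.sha256(flag.encode()).hexdigest() == expected_flag_sha256:
            return 1.0
    return 0.0

expected = hashlib.sha256(b"flag{sqli-bypass}").hexdigest()
print(verifier_reward("login bypassed, flag{sqli-bypass}", expected))  # 1.0
print(verifier_reward("no luck this time", expected))                  # 0.0
```

The point is the shape of the signal: a binary outcome derived from ground truth the task carries, independent of the prompts that elicited it.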
## Offline RL from trajectories

When the experience already exists as Worlds rollouts:

```bash
dn train rl \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --trajectory-dataset NAME@VERSION \
  --algorithm importance_sampling
```

Trajectory datasets are resolved at submission and streamed to the trainer without an intermediate conversion step.
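Why `importance_sampling` fits offline data: the trajectories were generated by an older behavior policy, so each action's contribution is reweighted by the probability ratio between the current policy and the one that produced it. A generic sketch of that correction, not the trainer's internals; the clipping bound here is an arbitrary choice:

```python
import math

def is_weight(logp_current: float, logp_behavior: float, clip: float = 2.0) -> float:
    """Importance weight for one action: ratio of current-policy to
    behavior-policy probability, clipped to bound the update."""
    ratio = math.exp(logp_current - logp_behavior)
    return max(min(ratio, clip), 1.0 / clip)

# An action the current policy now prefers is up-weighted...
print(is_weight(logp_current=-0.5, logp_behavior=-1.5))  # 2.0 (e^1 clipped to the bound)
# ...and one it has moved away from is down-weighted.
print(is_weight(logp_current=-2.0, logp_behavior=-1.0))  # 0.5 (e^-1 clipped to the bound)
```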
## Live Worlds rollouts

To let the job generate experience against a live Worlds manifest during training:
```bash
dn train rl \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --world-manifest-id c8af2b7b-9b54-4b21-95a9-b8d403cd8c11 \
  --world-runtime-id 8b8fd3af-9a5e-47c8-9f67-7b87ca9387eb \
  --world-agent-name operator \
  --world-goal "Escalate to Domain Admin in corp.local" \
  --world-reward discovery_v1 \
  --execution-mode fully_async \
  --max-steps-off-policy 3 \
  --num-rollouts 8
```

`--world-runtime-id` plus `--world-agent-name` select a runtime-bound capability snapshot to use for the rollouts. The validator requires `--world-manifest-id` whenever `--world-runtime-id` is set, and `--world-runtime-id` whenever `--world-agent-name` is set. `--world-reward` applies an SDK-side reward policy that shapes intermediate signals during the trajectory; see reward recipes for the presets and component-based composition.
`--reward-recipe` and `--world-reward` are orthogonal: the recipe scores the completion; the world-reward shapes the trajectory. You can pass both, one, or neither.
## Execution modes

`--execution-mode` controls how rollout generation and optimizer updates interleave:
| Mode | What it does |
|---|---|
| `sync` | One rollout group at a time; no overlap between generation and training. |
| `one_step_off_async` | Keeps a single rollout group in flight while the previous group updates; one step of staleness. |
| `fully_async` | Widens the pipeline to multiple queued rollout groups with bounded staleness. |
Async modes require `--max-steps-off-policy`. For `one_step_off_async` it must be 1; for `fully_async` it is the staleness budget.
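The staleness rule can be sketched as a simple version check. This is a simplified model (the actual scheduler lives inside the trainer); the version counters are illustrative:

```python
def trainable(rollout_version: int, policy_version: int, max_steps_off_policy: int) -> bool:
    """True if a rollout group generated at rollout_version may still be
    used to update a policy at policy_version, given the staleness budget."""
    return policy_version - rollout_version <= max_steps_off_policy

# one_step_off_async: the budget is pinned to 1.
print(trainable(4, 5, max_steps_off_policy=1))  # True  (one step behind)
print(trainable(3, 5, max_steps_off_policy=1))  # False (two steps behind)
# fully_async: the budget is whatever --max-steps-off-policy allows, e.g. 3.
print(trainable(2, 5, max_steps_off_policy=3))  # True
```

Under this model, `sync` is the degenerate case where generation and updates never overlap, so every rollout group trains at the version that produced it.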
## From the SDK

```python
from dreadnode.app.api.client import ApiClient
from dreadnode.app.api.models import (
    CapabilityRef,
    CreateTinkerRLJobRequest,
    DatasetRef,
    RewardRecipe,
    TinkerRLJobConfig,
)

client = ApiClient("https://app.dreadnode.io", api_key="dn_...")

job = client.create_training_job(
    "acme",
    "research",
    CreateTinkerRLJobRequest(
        model="meta-llama/Llama-3.1-8B-Instruct",
        capability_ref=CapabilityRef(name="web-agent", version="2.0.1"),
        config=TinkerRLJobConfig(
            algorithm="importance_sampling",
            task_ref="security-mutillidae-sqli-login-bypass",
            prompt_dataset_ref=DatasetRef(name="seed-prompts", version="sqli-v1"),
            reward_recipe=RewardRecipe(name="task_verifier_v1"),
            execution_mode="fully_async",
            max_steps_off_policy=3,
            num_rollouts=32,
            lora_rank=16,
            max_new_tokens=128,
            temperature=0.1,
            stop=["</answer>"],
        ),
    ),
)
```

Every RL option is typed on `TinkerRLJobConfig`; see the manifest reference for the full field table with defaults and validation rules.
## Tuning knobs

The flags you’ll touch most:
| Flag | What it does |
|---|---|
| `--algorithm` | `importance_sampling` or `ppo`. |
| `--num-rollouts <n>` | Rollouts collected per training window. |
| `--max-turns <n>` | Maximum agent turns per episode. |
| `--max-episode-steps <n>` | Environment-step cap per episode. |
| `--weight-sync-interval <n>` | Refresh the sampler’s weights every N optimizer steps. |
| `--max-new-tokens <n>` | Sampling cap per completion. |
| `--temperature <float>` | Sampling temperature. |
| `--stop <token>` | Stop sequence (repeatable). |
| `--prompt-split <name>` | Dataset split to use for prompt sampling when the prompt dataset has splits. |
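As a quick illustration of `--weight-sync-interval`, here is which optimizer steps would trigger a sampler-weight refresh under a simple every-N schedule. The schedule is an assumption for illustration, not the trainer's code:

```python
def refresh_steps(total_steps: int, interval: int) -> list[int]:
    """Optimizer steps at which the sampler's weights would be refreshed,
    assuming a refresh on every interval-th step."""
    return [step for step in range(1, total_steps + 1) if step % interval == 0]

print(refresh_steps(10, interval=3))  # [3, 6, 9]
print(refresh_steps(5, interval=1))   # [1, 2, 3, 4, 5]
```

A larger interval means cheaper weight transfers but a sampler that lags further behind the optimizer between refreshes.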
Full surface: `dn train`.
## After the job starts

RL jobs share the lifecycle surface with SFT. See running training jobs for list / get / wait / logs / cancel / retry, monitoring for the App view, and outputs for the artifacts a completed RL job produces.