# Reinforcement learning
Train against rewards, task verifiers, offline trajectories, or a live Worlds environment.
Reach for RL when the signal comes from rewards, verifier outcomes, or environment rollouts rather than fixed target answers. The most useful question to answer before anything else is: where does the experience come from?
| Experience source | Flag | What it means |
|---|---|---|
| Prompt dataset | `--prompt-dataset NAME@VERSION` | You have prompts and will score each generated completion with a recipe. |
| Offline trajectories | `--trajectory-dataset NAME@VERSION` (repeatable) | Learn from agent rollouts already collected into published datasets. |
| Live Worlds environment | `--world-manifest-id <id>` | Generate fresh experience by rolling out against a Worlds manifest. |
## Verifier-driven RL

The common case: a prompt dataset supplies the prompts, the capability runs the policy, and a server-side reward recipe decides what counts as success.
```bash
dn train rl \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --task security-mutillidae-sqli-login-bypass \
  --prompt-dataset seed-prompts@sqli-v1 \
  --algorithm importance_sampling \
  --reward-recipe task_verifier_v1 \
  --execution-mode fully_async \
  --max-steps-off-policy 3 \
  --num-rollouts 32
```

`--reward-recipe` names a server-side recipe; `--reward-params` passes a JSON blob of parameters. `--task REF` is what `task_verifier_v1` reads to find the expected flag hash: the prompt dataset supplies the prompts, the task supplies the ground truth. See reward recipes for the five available recipes.
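To make that division of labor concrete, here is a minimal sketch of what a verifier-style reward does. This is illustrative Python only, not the server-side `task_verifier_v1` implementation; the `flag{...}` format and the SHA-256 comparison are assumptions.

```python
import hashlib
import re

def verifier_reward(completion: str, expected_flag_sha256: str) -> float:
    """Score a completion 1.0 if it contains the task's expected flag, else 0.0.

    Illustrative only: the real recipe runs server-side and reads the
    expected hash from the --task reference.
    """
    # Pull candidate flag strings out of the completion (format assumed).
    candidates = re.findall(r"flag\{[^}]+\}", completion)
    for flag in candidates:
        if hashlib.sha256(flag.encode()).hexdigest() == expected_flag_sha256:
            return 1.0
    return 0.0

expected = hashlib.sha256(b"flag{sqli-bypass}").hexdigest()
print(verifier_reward("login bypassed, flag{sqli-bypass}", expected))  # 1.0
print(verifier_reward("no luck this time", expected))                  # 0.0
```

The point is the shape of the signal: a binary outcome derived from ground truth the task carries, independent of the prompts that elicited it.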
## Offline RL from trajectories

When the experience already exists as Worlds rollouts:

```bash
dn train rl \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --trajectory-dataset NAME@VERSION \
  --algorithm importance_sampling
```

Trajectory datasets are resolved at submission and streamed to the trainer without an intermediate conversion step.
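Why `importance_sampling` fits offline data: the trajectories were generated by an older behavior policy, so each action's contribution is reweighted by the probability ratio between the current policy and the one that produced it. A generic sketch of that correction, not the trainer's internals; the clipping bound here is an arbitrary choice:

```python
import math

def is_weight(logp_current: float, logp_behavior: float, clip: float = 2.0) -> float:
    """Importance weight for one action: ratio of current-policy to
    behavior-policy probability, clipped to bound the update."""
    ratio = math.exp(logp_current - logp_behavior)
    return max(min(ratio, clip), 1.0 / clip)

# An action the current policy now prefers is up-weighted...
print(is_weight(logp_current=-0.5, logp_behavior=-1.5))  # 2.0 (e^1 clipped to the bound)
# ...and one it has moved away from is down-weighted.
print(is_weight(logp_current=-2.0, logp_behavior=-1.0))  # 0.5 (e^-1 clipped to the bound)
```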
## Live Worlds rollouts

To let the job generate experience against a live Worlds manifest during training:
```bash
dn train rl \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --world-manifest-id c8af2b7b-9b54-4b21-95a9-b8d403cd8c11 \
  --world-runtime-id 8b8fd3af-9a5e-47c8-9f67-7b87ca9387eb \
  --world-agent-name operator \
  --world-goal "Escalate to Domain Admin in corp.local" \
  --world-reward discovery_v1 \
  --execution-mode fully_async \
  --max-steps-off-policy 3 \
  --num-rollouts 8
```

`--world-runtime-id` plus `--world-agent-name` select a runtime-bound capability snapshot to use for the rollouts. The validator requires `--world-manifest-id` whenever `--world-runtime-id` is set, and `--world-runtime-id` whenever `--world-agent-name` is set. `--world-reward` applies an SDK-side reward policy that shapes intermediate signals during the trajectory; see reward recipes for the presets and component-based composition.
`--reward-recipe` and `--world-reward` are orthogonal: the recipe scores the completion; the world-reward shapes the trajectory. You can pass both, one, or neither.
## Execution modes

`--execution-mode` controls how rollout generation and optimizer updates interleave:
| Mode | What it does |
|---|---|
| `sync` | One rollout group at a time; no overlap between generation and training. |
| `one_step_off_async` | Keeps a single rollout group in flight while the previous group updates; one step of staleness. |
| `fully_async` | Widens the pipeline to multiple queued rollout groups with bounded staleness. |
Async modes require `--max-steps-off-policy`. For `one_step_off_async` it must be 1; for `fully_async` it is the staleness budget.
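The staleness rule can be sketched as a simple version check. This is a simplified model (the actual scheduler lives inside the trainer); the version counters are illustrative:

```python
def trainable(rollout_version: int, policy_version: int, max_steps_off_policy: int) -> bool:
    """True if a rollout group generated at rollout_version may still be
    used to update a policy at policy_version, given the staleness budget."""
    return policy_version - rollout_version <= max_steps_off_policy

# one_step_off_async: the budget is pinned to 1.
print(trainable(4, 5, max_steps_off_policy=1))  # True  (one step behind)
print(trainable(3, 5, max_steps_off_policy=1))  # False (two steps behind)
# fully_async: the budget is whatever --max-steps-off-policy allows, e.g. 3.
print(trainable(2, 5, max_steps_off_policy=3))  # True
```

Under this model, `sync` is the degenerate case where generation and updates never overlap, so every rollout group trains at the version that produced it.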
## From the SDK

```python
from dreadnode.app.api.client import ApiClient
from dreadnode.app.api.models import (
    CapabilityRef,
    CreateTinkerRLJobRequest,
    DatasetRef,
    RewardRecipe,
    TinkerRLJobConfig,
)

client = ApiClient("https://app.dreadnode.io", api_key="dn_...")

job = client.create_training_job(
    "acme",
    "research",
    CreateTinkerRLJobRequest(
        model="meta-llama/Llama-3.1-8B-Instruct",
        capability_ref=CapabilityRef(name="web-agent", version="2.0.1"),
        config=TinkerRLJobConfig(
            algorithm="importance_sampling",
            task_ref="security-mutillidae-sqli-login-bypass",
            prompt_dataset_ref=DatasetRef(name="seed-prompts", version="sqli-v1"),
            reward_recipe=RewardRecipe(name="task_verifier_v1"),
            execution_mode="fully_async",
            max_steps_off_policy=3,
            num_rollouts=32,
            lora_rank=16,
            max_new_tokens=128,
            temperature=0.1,
            stop=["</answer>"],
        ),
    ),
)
```

Every RL option is typed on `TinkerRLJobConfig`; see the manifest reference for the full field table with defaults and validation rules.
## Tuning knobs

The flags you’ll touch most:
| Flag | What it does |
|---|---|
| `--algorithm` | `importance_sampling` or `ppo`. |
| `--num-rollouts <n>` | Rollouts collected per training window. |
| `--max-turns <n>` | Maximum agent turns per episode. |
| `--max-episode-steps <n>` | Environment-step cap per episode. |
| `--weight-sync-interval <n>` | Refresh the sampler’s weights every N optimizer steps. |
| `--max-new-tokens <n>` | Sampling cap per completion. |
| `--temperature <float>` | Sampling temperature. |
| `--stop <token>` | Stop sequence (repeatable). |
| `--prompt-split <name>` | Dataset split to use for prompt sampling when the prompt dataset has splits. |
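As a quick illustration of `--weight-sync-interval`, here is which optimizer steps would trigger a sampler-weight refresh under a simple every-N schedule. The schedule is an assumption for illustration, not the trainer's code:

```python
def refresh_steps(total_steps: int, interval: int) -> list[int]:
    """Optimizer steps at which the sampler's weights would be refreshed,
    assuming a refresh on every interval-th step."""
    return [step for step in range(1, total_steps + 1) if step % interval == 0]

print(refresh_steps(10, interval=3))  # [3, 6, 9]
print(refresh_steps(5, interval=1))   # [1, 2, 3, 4, 5]
```

A larger interval means cheaper weight transfers but a sampler that lags further behind the optimizer between refreshes.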
Full surface: `dn train`.
## After the job starts

RL jobs share the lifecycle surface with SFT. See running training jobs for list / get / wait / logs / cancel / retry, monitoring for the App view, and outputs for the artifacts a completed RL job produces.