Reward recipes
The five server-side reward recipes that turn a rollout into a score, plus Worlds reward policies for live RL.
RL jobs use a reward recipe to turn each rollout completion into a float reward. Pick one by name when you submit:
```sh
dn train rl ... --reward-recipe task_verifier_v1
```

Pass parameters as a JSON object when the recipe needs configuration:
```sh
dn train rl ... --reward-recipe contains_v1 \
  --reward-params '{"needle": "flag", "reward_if_true": 1.0, "reward_if_false": 0.0}'
```

Every recipe receives the completion text plus the dataset row (for prompt-dataset RL) or the task definition (for verifier-driven RL). Recipes return a single float the optimizer maximizes.
Training and optimization share four of these recipes; the fifth, task_verifier_v1, is training-specific.
exact_match_v1
Scores 1.0 when the completion exactly matches the expected answer after whitespace is stripped, 0.0 otherwise.
| Field | Type | Source |
|---|---|---|
| params.expected | string | Optional global expected value. Falls back to the row’s expected_output. |
| Dataset column | — | expected_output — required when params.expected is not set. |
Use this when every prompt has one ground-truth answer and partial matches don’t count.
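The matching rule can be sketched in Python. This is an illustrative reconstruction of the documented behavior, not the hosted implementation; the function signature, the row dict, and stripping both sides are assumptions:

```python
def exact_match_v1(completion: str, row: dict, params: dict) -> float:
    """Sketch of exact_match_v1: 1.0 on an exact match after stripping, else 0.0."""
    # params.expected wins; otherwise fall back to the row's expected_output column.
    expected = params.get("expected", row.get("expected_output"))
    if expected is None:
        raise ValueError("set params.expected or provide an expected_output column")
    # Assumption: both sides are whitespace-stripped before the comparison.
    return 1.0 if completion.strip() == expected.strip() else 0.0
```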
contains_v1
Scores based on whether a fixed substring appears anywhere in the completion.
| Field | Type | Default | Notes |
|---|---|---|---|
| params.needle | string | — | Required. Substring to look for. |
| params.reward_if_true | float | 1.0 | Returned when the substring is present. |
| params.reward_if_false | float | 0.0 | Returned when the substring is absent. |
The needle is global to the run — it does not read per-row fields. Use this when “did the agent mention this term?” is the entire metric.
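In sketch form (an illustrative reconstruction; the signature and defaults-handling are assumptions, only the needle-in-completion rule comes from the docs):

```python
def contains_v1(completion: str, params: dict) -> float:
    """Sketch of contains_v1: fixed-substring check over the completion."""
    needle = params["needle"]  # required; there is no per-row fallback
    if needle in completion:
        return params.get("reward_if_true", 1.0)
    return params.get("reward_if_false", 0.0)
```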
row_reward_v1
Passes a per-row reward value from the dataset straight through to the optimizer.
| Field | Type | Source |
|---|---|---|
| params.default | float | Fallback when a row has no reward. Defaults to 0.0. |
| Dataset column | — | reward — the per-row numeric value returned unchanged. |
Use this when the metric is already in the dataset — human labels, reward-model scores, anything you computed offline. The recipe adds nothing on top.
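The pass-through behavior amounts to a one-liner with a fallback; this sketch is illustrative (signature and row dict are assumptions):

```python
def row_reward_v1(completion: str, row: dict, params: dict) -> float:
    """Sketch of row_reward_v1: return the dataset's reward column unchanged."""
    value = row.get("reward")
    if value is None:
        return params.get("default", 0.0)  # fallback for unlabeled rows
    return float(value)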
trajectory_imitation_v1
Returns the row’s reward when the completion matches the expected output; otherwise returns a fallback.
| Field | Type | Default | Source |
|---|---|---|---|
| params.expected | string | — | Optional global expected. Falls back to expected_output. |
| params.reward_if_true | float | 1.0 | Used when the match succeeds and the row has no reward. |
| params.reward_if_false | float | 0.0 | Used when the completion doesn’t match. |
Use this when you want the model to imitate known-good outputs but weight rows differently —
harder examples carry more reward via the row’s reward column.
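Putting the table together, the recipe is roughly exact_match gating a row-weighted payout. An illustrative sketch (the exact matching rule mirroring exact_match_v1 is an assumption):

```python
def trajectory_imitation_v1(completion: str, row: dict, params: dict) -> float:
    """Sketch: an exact match gates the reward; the row's reward column sets its size."""
    expected = params.get("expected", row.get("expected_output", ""))
    if completion.strip() == expected.strip():
        row_reward = row.get("reward")
        # Matched rows pay out their own reward, or reward_if_true when unlabeled.
        return float(row_reward) if row_reward is not None else params.get("reward_if_true", 1.0)
    return params.get("reward_if_false", 0.0)
```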
task_verifier_v1
Verifies a completion against a task’s embedded flag. The recipe strips whitespace, SHA-256 hashes the result, and compares it byte-for-byte against the expected hash pinned in the task.
| Field | Type | Default | Notes |
|---|---|---|---|
| params.reward_if_true | float | 1.0 | Returned when the hash matches. |
| params.reward_if_false | float | 0.0 | Returned when it doesn’t. |
Use this for security tasks that embed a flag or secret solution. The recipe never sees the plaintext — only the hash — so tasks stay checkable without leaking the answer.
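The hash comparison described above can be sketched as follows. This is illustrative only: the signature, UTF-8 encoding, and hex-digest comparison are assumptions (the docs say only strip, SHA-256, compare):

```python
import hashlib

def task_verifier_v1(completion: str, expected_sha256: str, params: dict) -> float:
    """Sketch of task_verifier_v1: compare SHA-256 of the stripped completion."""
    # Only the hash pinned in the task is needed; the plaintext flag never appears.
    digest = hashlib.sha256(completion.strip().encode("utf-8")).hexdigest()
    if digest == expected_sha256:
        return params.get("reward_if_true", 1.0)
    return params.get("reward_if_false", 0.0)
```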
Picking a recipe
| You have… | Reach for |
|---|---|
| Ground-truth answers per row. | exact_match_v1 |
| A single target phrase the agent should produce. | contains_v1 |
| Pre-computed rewards already in the dataset. | row_reward_v1 |
| Ground-truth outputs plus per-row weights. | trajectory_imitation_v1 |
| A task with an embedded flag-style solution. | task_verifier_v1 |
Anything more complex (LLM-as-judge, multi-metric composition, custom scorers) is out of scope for the hosted training recipes. Author the reward outside and publish pre-scored datasets with row_reward_v1, or reach for optimization when the knob you want to turn is prompt or instruction text rather than weights.
World reward policies
When you train RL with --world-manifest-id, a separate --world-reward policy shapes intermediate signals during the live trajectory, distinct from the per-completion recipe above.
```sh
dn train rl ... \
  --world-manifest-id <id> \
  --world-reward discovery_v1 \
  --world-reward-params '{"success_reward": 1.5, "error_penalty": -0.5}'
```

Three presets are available:
| Preset | Shapes |
|---|---|
| heuristic_v1 | General-purpose: reasoning traces, tool observations, host / credential / privilege discovery, stop-tool bonus, plus terminal state rewards. |
| goal_only_v1 | Sparse goal-driven reward: success bonus and penalties for stalls, step limits, and errors. |
| discovery_v1 | Red-team shaping: bonuses for host discovery, credential acquisition, and privilege escalation on top of terminal outcomes. |
Each preset accepts params that override its default weights (reasoning_trace_bonus,
host_discovery_reward, success_reward, etc.).
For fully custom shaping, pass a components list instead of a preset name:
```sh
dn train rl ... \
  --world-reward-params '{
    "components": [
      {"name": "reasoning_trace", "params": {"value": 0.02}},
      {"name": "host_discovery", "params": {"value": 0.15}},
      {"name": "terminal_state", "params": {"success_reward": 1.5, "error_penalty": -0.5}}
    ]
  }'
```

Available components: reasoning_trace, tool_observation, host_discovery, credential_discovery, privilege_escalation, tool_stop, tool_error_penalty, terminal_state.
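One way to picture how a components list could combine per trajectory event; everything here (the event shape, field names, and additive accumulation) is an assumption for illustration, not the Worlds implementation:

```python
def shape_event(event: dict, components: list[dict]) -> float:
    """Hypothetical per-event shaping: sum each configured component's contribution."""
    reward = 0.0
    for comp in components:
        name, p = comp["name"], comp.get("params", {})
        if name == "reasoning_trace" and event.get("has_reasoning"):
            reward += p.get("value", 0.0)  # small bonus per reasoning trace
        elif name == "host_discovery" and event.get("new_hosts", 0) > 0:
            reward += p.get("value", 0.0) * event["new_hosts"]  # per host found
        elif name == "terminal_state" and event.get("terminal"):
            reward += p.get("success_reward", 1.0) if event.get("success") \
                else p.get("error_penalty", 0.0)
    return reward
```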
--reward-recipe vs. --world-reward
Both can be set on the same RL job; they are orthogonal.
| | --reward-recipe | --world-reward |
|---|---|---|
| Scores | The completion text. | The trajectory — tool calls, observations, state. |
| When evaluated | Once per rollout, after generation. | Throughout a live rollout, per event. |
| Required for | Any RL job that uses a recipe. | Only --world-manifest-id rollouts. |
Use the recipe when you have a metric for the final output. Use the world reward when the journey matters and you want to shape exploration.
Where to go next
- Reinforcement learning for the full RL submission flow.
- Manifest reference for every RL config field.