
Reward recipes

The five server-side reward recipes that turn a rollout into a score, plus Worlds reward policies for live RL.

RL jobs use a reward recipe to turn each rollout completion into a float reward. Pick one by name when you submit:

dn train rl ... --reward-recipe task_verifier_v1

Pass parameters as a JSON object when the recipe needs configuration:

dn train rl ... --reward-recipe contains_v1 \
--reward-params '{"needle": "flag", "reward_if_true": 1.0, "reward_if_false": 0.0}'

Every recipe receives the completion text plus the dataset row (for prompt-dataset RL) or the task definition (for verifier-driven RL). Recipes return a single float the optimizer maximizes.

Training and optimization share four of these recipes; the fifth — task_verifier_v1 — is training-specific.

exact_match_v1

Scores 1.0 when the completion exactly matches the expected answer after a whitespace strip, 0.0 otherwise.

| Field | Type | Source |
| --- | --- | --- |
| params.expected | string | Optional global expected value. Falls back to the row’s expected_output. |
| Dataset column | expected_output | Required when params.expected is not set. |

Use this when every prompt has one ground-truth answer and partial matches don’t count.
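The check itself is a one-liner. A minimal sketch of the logic (the function name and the stripping of both sides are assumptions, not the hosted implementation):

```python
def exact_match_v1(completion: str, expected: str) -> float:
    # Strip surrounding whitespace, then require an exact string match.
    return 1.0 if completion.strip() == expected.strip() else 0.0
```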

contains_v1

Scores based on whether a fixed substring appears anywhere in the completion.

| Field | Type | Default | Notes |
| --- | --- | --- | --- |
| params.needle | string | | Required. Substring to look for. |
| params.reward_if_true | float | 1.0 | Returned when the substring is present. |
| params.reward_if_false | float | 0.0 | Returned when the substring is absent. |

The needle is global to the run — it does not read per-row fields. Use this when “did the agent mention this term?” is the entire metric.
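In effect the recipe reduces to a substring test. A sketch under the assumption of a case-sensitive check (the docs don’t specify casing, and this is not the server code):

```python
def contains_v1(completion: str, needle: str,
                reward_if_true: float = 1.0,
                reward_if_false: float = 0.0) -> float:
    # Case-sensitive substring check anywhere in the completion.
    return reward_if_true if needle in completion else reward_if_false
```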

row_reward_v1

Passes a per-row reward value from the dataset straight through to the optimizer.

| Field | Type | Source |
| --- | --- | --- |
| params.default | float | Fallback when a row has no reward. Defaults to 0.0. |
| Dataset column | reward | The per-row numeric value, returned unchanged. |

Use this when the metric is already in the dataset — human labels, reward-model scores, anything you computed offline. The recipe adds nothing on top.
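The pass-through amounts to a dictionary lookup with a fallback. A sketch, assuming a missing or null reward falls back to params.default:

```python
def row_reward_v1(row: dict, default: float = 0.0) -> float:
    # Return the row's precomputed reward, or the fallback when absent.
    value = row.get("reward")
    return float(value) if value is not None else default
```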

trajectory_imitation_v1

Returns the row’s reward when the completion matches the expected output; otherwise returns a fallback.

| Field | Type | Default | Source |
| --- | --- | --- | --- |
| params.expected | string | | Optional global expected. Falls back to expected_output. |
| params.reward_if_true | float | 1.0 | Used when the match succeeds and the row has no reward. |
| params.reward_if_false | float | 0.0 | Used when the completion doesn’t match. |

Use this when you want the model to imitate known-good outputs but weight rows differently — harder examples carry more reward via the row’s reward column.
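Putting the fields together, the recipe behaves roughly like exact match with a per-row payout. A sketch whose names and precedence order are assumptions inferred from the table above:

```python
def trajectory_imitation_v1(completion, row, expected=None,
                            reward_if_true=1.0, reward_if_false=0.0):
    # Global params.expected wins; otherwise use the row's expected_output.
    target = expected if expected is not None else row.get("expected_output", "")
    if completion.strip() == target.strip():
        # On a match, prefer the row's own reward; fall back to reward_if_true.
        row_reward = row.get("reward")
        return float(row_reward) if row_reward is not None else reward_if_true
    return reward_if_false
```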

task_verifier_v1

Verifies a completion against a task’s embedded flag. The recipe strips whitespace, SHA-256 hashes the result, and compares it byte-for-byte against the expected hash pinned in the task.

| Field | Type | Default | Notes |
| --- | --- | --- | --- |
| params.reward_if_true | float | 1.0 | Returned when the hash matches. |
| params.reward_if_false | float | 0.0 | Returned when it doesn’t. |

Use this for security tasks that embed a flag or secret solution. The recipe never sees the plaintext — only the hash — so tasks stay checkable without leaking the answer.
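The hash comparison fits in a few lines. A sketch of the idea (the function name, UTF-8 encoding, and hex-digest format are assumptions about the hosted verifier):

```python
import hashlib

def task_verifier_v1(completion: str, expected_hash: str,
                     reward_if_true: float = 1.0,
                     reward_if_false: float = 0.0) -> float:
    # Strip whitespace, SHA-256 the bytes, compare hex digests.
    digest = hashlib.sha256(completion.strip().encode("utf-8")).hexdigest()
    return reward_if_true if digest == expected_hash else reward_if_false
```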

| You have… | Reach for |
| --- | --- |
| Ground-truth answers per row. | exact_match_v1 |
| A single target phrase the agent should produce. | contains_v1 |
| Pre-computed rewards already in the dataset. | row_reward_v1 |
| Ground-truth outputs plus per-row weights. | trajectory_imitation_v1 |
| A task with an embedded flag-style solution. | task_verifier_v1 |

Anything more complex — LLM-as-judge, multi-metric composition, custom scorers — is out of scope for the hosted training recipes. Author the reward outside and publish pre-scored datasets with row_reward_v1, or reach for optimization when the knob you want to turn is prompt or instruction text rather than weights.

Worlds reward policies

When you train RL with --world-manifest-id, a separate --world-reward policy shapes intermediate signals during the live trajectory — distinct from the per-completion recipe above.

dn train rl ... \
--world-manifest-id <id> \
--world-reward discovery_v1 \
--world-reward-params '{"success_reward": 1.5, "error_penalty": -0.5}'

Three presets are available:

| Preset | Shapes |
| --- | --- |
| heuristic_v1 | General-purpose: reasoning traces, tool observations, host / credential / privilege discovery, stop-tool bonus, plus terminal-state rewards. |
| goal_only_v1 | Sparse goal-driven reward: a success bonus plus penalties for stalls, step limits, and errors. |
| discovery_v1 | Red-team shaping: bonuses for host discovery, credential acquisition, and privilege escalation on top of terminal outcomes. |

Each preset accepts params that override its default weights (reasoning_trace_bonus, host_discovery_reward, success_reward, etc.).

For fully custom shaping, pass a components list instead of a preset name:

dn train rl ... \
  --world-reward-params '{
    "components": [
      {"name": "reasoning_trace", "params": {"value": 0.02}},
      {"name": "host_discovery", "params": {"value": 0.15}},
      {"name": "terminal_state", "params": {"success_reward": 1.5, "error_penalty": -0.5}}
    ]
  }'

Available components: reasoning_trace, tool_observation, host_discovery, credential_discovery, privilege_escalation, tool_stop, tool_error_penalty, terminal_state.
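A rough mental model of how a components list could combine over a trajectory — purely illustrative; the event schema, field names, and additive aggregation are assumptions, not the platform’s implementation:

```python
def shaped_reward(events, components):
    # Index each configured component by name.
    by_name = {c["name"]: c.get("params", {}) for c in components}
    total = 0.0
    for event in events:
        params = by_name.get(event["type"])
        if params is None:
            continue  # no component configured for this event type
        if event["type"] == "terminal_state":
            # Terminal outcome: success bonus or error penalty.
            key = "success_reward" if event.get("success") else "error_penalty"
            total += params.get(key, 0.0)
        else:
            # Per-event shaping bonus.
            total += params.get("value", 0.0)
    return total
```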

Both can be set on the same RL job; they are orthogonal.

| | --reward-recipe | --world-reward |
| --- | --- | --- |
| Scores | The completion text. | The trajectory — tool calls, observations, state. |
| When evaluated | Once per rollout, after generation. | Throughout a live rollout, per event. |
| Required for | Any RL job that uses a recipe. | Only --world-manifest-id rollouts. |

Use the recipe when you have a metric for the final output. Use the world reward when the journey matters and you want to shape exploration.