
Reward recipes

The five server-side reward recipes that turn a rollout into a score, plus Worlds reward policies for live RL.

RL jobs use a reward recipe to turn each rollout completion into a float reward. Pick one by name when you submit:

dn train rl ... --reward-recipe task_verifier_v1

Pass parameters as a JSON object when the recipe needs configuration:

dn train rl ... --reward-recipe contains_v1 \
--reward-params '{"needle": "flag", "reward_if_true": 1.0, "reward_if_false": 0.0}'

Every recipe receives the completion text plus the dataset row (for prompt-dataset RL) or the task definition (for verifier-driven RL). Recipes return a single float the optimizer maximizes.

Training and optimization share four of these recipes; the fifth — task_verifier_v1 — is training-specific.

exact_match_v1

Scores 1.0 when the completion exactly matches the expected answer after a whitespace strip, 0.0 otherwise.

| Field | Type | Source |
| --- | --- | --- |
| params.expected | string | Optional global expected value. Falls back to the row’s expected_output. |
| Dataset column | expected_output | Required when params.expected is not set. |

Use this when every prompt has one ground-truth answer and partial matches don’t count.
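The check itself is a one-liner. A minimal sketch of the logic (the function name and the stripping of both sides are assumptions, not the hosted implementation):

```python
def exact_match_v1(completion: str, expected: str) -> float:
    # Strip surrounding whitespace, then require an exact string match.
    return 1.0 if completion.strip() == expected.strip() else 0.0
```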

contains_v1

Scores based on whether a fixed substring appears anywhere in the completion.

| Field | Type | Default | Notes |
| --- | --- | --- | --- |
| params.needle | string | | Required. Substring to look for. |
| params.reward_if_true | float | 1.0 | Returned when the substring is present. |
| params.reward_if_false | float | 0.0 | Returned when the substring is absent. |

The needle is global to the run — it does not read per-row fields. Use this when “did the agent mention this term?” is the entire metric.
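In effect the recipe reduces to a substring test. A sketch under the assumption of a case-sensitive check (the docs don’t specify casing, and this is not the server code):

```python
def contains_v1(completion: str, needle: str,
                reward_if_true: float = 1.0,
                reward_if_false: float = 0.0) -> float:
    # Case-sensitive substring check anywhere in the completion.
    return reward_if_true if needle in completion else reward_if_false
```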

row_reward_v1

Passes a per-row reward value from the dataset straight through to the optimizer.

| Field | Type | Source |
| --- | --- | --- |
| params.default | float | Fallback when a row has no reward. Defaults to 0.0. |
| Dataset column | reward | The per-row numeric value, returned unchanged. |

Use this when the metric is already in the dataset — human labels, reward-model scores, anything you computed offline. The recipe adds nothing on top.
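The pass-through amounts to a dictionary lookup with a fallback. A sketch, assuming a missing or null reward falls back to params.default:

```python
def row_reward_v1(row: dict, default: float = 0.0) -> float:
    # Return the row's precomputed reward, or the fallback when absent.
    value = row.get("reward")
    return float(value) if value is not None else default
```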

trajectory_imitation_v1

Returns the row’s reward when the completion matches the expected output; otherwise returns a fallback.

| Field | Type | Default | Source |
| --- | --- | --- | --- |
| params.expected | string | | Optional global expected. Falls back to expected_output. |
| params.reward_if_true | float | 1.0 | Used when the match succeeds and the row has no reward. |
| params.reward_if_false | float | 0.0 | Used when the completion doesn’t match. |

Use this when you want the model to imitate known-good outputs but weight rows differently — harder examples carry more reward via the row’s reward column.
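Putting the fields together, the recipe behaves roughly like exact match with a per-row payout. A sketch whose names and precedence order are assumptions inferred from the table above:

```python
def trajectory_imitation_v1(completion, row, expected=None,
                            reward_if_true=1.0, reward_if_false=0.0):
    # Global params.expected wins; otherwise use the row's expected_output.
    target = expected if expected is not None else row.get("expected_output", "")
    if completion.strip() == target.strip():
        # On a match, prefer the row's own reward; fall back to reward_if_true.
        row_reward = row.get("reward")
        return float(row_reward) if row_reward is not None else reward_if_true
    return reward_if_false
```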

task_verifier_v1

Verifies a completion against a task’s embedded flag. The recipe strips whitespace, SHA-256 hashes the result, and compares it byte-for-byte against the expected hash pinned in the task.

| Field | Type | Default | Notes |
| --- | --- | --- | --- |
| params.reward_if_true | float | 1.0 | Returned when the hash matches. |
| params.reward_if_false | float | 0.0 | Returned when it doesn’t. |

Use this for security tasks that embed a flag or secret solution. The recipe never sees the plaintext — only the hash — so tasks stay checkable without leaking the answer.
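The hash comparison fits in a few lines. A sketch of the idea (the function name, UTF-8 encoding, and hex-digest format are assumptions about the hosted verifier):

```python
import hashlib

def task_verifier_v1(completion: str, expected_hash: str,
                     reward_if_true: float = 1.0,
                     reward_if_false: float = 0.0) -> float:
    # Strip whitespace, SHA-256 the bytes, compare hex digests.
    digest = hashlib.sha256(completion.strip().encode("utf-8")).hexdigest()
    return reward_if_true if digest == expected_hash else reward_if_false
```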

| You have… | Reach for |
| --- | --- |
| Ground-truth answers per row. | exact_match_v1 |
| A single target phrase the agent should produce. | contains_v1 |
| Pre-computed rewards already in the dataset. | row_reward_v1 |
| Ground-truth outputs plus per-row weights. | trajectory_imitation_v1 |
| A task with an embedded flag-style solution. | task_verifier_v1 |

Anything more complex — LLM-as-judge, multi-metric composition, custom scorers — is out of scope for the hosted training recipes. Author the reward outside and publish pre-scored datasets with row_reward_v1, or reach for optimization when the knob you want to turn is prompt or instruction text rather than weights.

Worlds reward policies

When you train RL with --world-manifest-id, a separate --world-reward policy shapes intermediate signals during the live trajectory — distinct from the per-completion recipe above.

dn train rl ... \
--world-manifest-id <id> \
--world-reward discovery_v1 \
--world-reward-params '{"success_reward": 1.5, "error_penalty": -0.5}'

Three presets are available:

| Preset | Shapes |
| --- | --- |
| heuristic_v1 | General-purpose: reasoning traces, tool observations, host / credential / privilege discovery, stop-tool bonus, plus terminal-state rewards. |
| goal_only_v1 | Sparse goal-driven reward: a success bonus plus penalties for stalls, step limits, and errors. |
| discovery_v1 | Red-team shaping: bonuses for host discovery, credential acquisition, and privilege escalation on top of terminal outcomes. |

Each preset accepts params that override its default weights (reasoning_trace_bonus, host_discovery_reward, success_reward, etc.).

For fully custom shaping, pass a components list instead of a preset name:

dn train rl ... \
  --world-reward-params '{
    "components": [
      {"name": "reasoning_trace", "params": {"value": 0.02}},
      {"name": "host_discovery", "params": {"value": 0.15}},
      {"name": "terminal_state", "params": {"success_reward": 1.5, "error_penalty": -0.5}}
    ]
  }'

Available components: reasoning_trace, tool_observation, host_discovery, credential_discovery, privilege_escalation, tool_stop, tool_error_penalty, terminal_state.
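A rough mental model of how a components list could combine over a trajectory — purely illustrative; the event schema, field names, and additive aggregation are assumptions, not the platform’s implementation:

```python
def shaped_reward(events, components):
    # Index each configured component by name.
    by_name = {c["name"]: c.get("params", {}) for c in components}
    total = 0.0
    for event in events:
        params = by_name.get(event["type"])
        if params is None:
            continue  # no component configured for this event type
        if event["type"] == "terminal_state":
            # Terminal outcome: success bonus or error penalty.
            key = "success_reward" if event.get("success") else "error_penalty"
            total += params.get(key, 0.0)
        else:
            # Per-event shaping bonus.
            total += params.get("value", 0.0)
    return total
```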

Both can be set on the same RL job; they are orthogonal.

| | --reward-recipe | --world-reward |
| --- | --- | --- |
| Scores | The completion text. | The trajectory — tool calls, observations, state. |
| When evaluated | Once per rollout, after generation. | Throughout a live rollout, per event. |
| Required for | Any RL job that uses a recipe. | Only --world-manifest-id rollouts. |

Use the recipe when you have a metric for the final output. Use the world reward when the journey matters and you want to shape exploration.