
Training integration

Turn Worlds trajectories into SFT conversations or offline-RL rows, and run online-RL rollouts against manifests with shaped rewards.

Worlds trajectories are first-class training inputs. A completed trajectory job publishes a dataset you can load directly, and manifests can drive online rollouts that emit shaped rewards as the agent runs.

Three patterns, three stages of training:

| Pattern | Data source | Stage |
| --- | --- | --- |
| SFT conversations | Published trajectory dataset → OpenAI chat format | Supervised fine-tuning |
| Offline-RL rows | Same dataset, expanded to per-step prompt rows with rewards | Offline RL |
| Rollouts | Live agent run against a manifest, rewards shaped during generation | Online RL |

Worlds trajectories are stored in ATIF — a trajectory interchange format the SDK reads directly. load_sft_conversations_from_worlds_dataset strips tool calls and produces OpenAI-style messages ready for SFT:

```python
from dreadnode.training.etl.worlds import load_sft_conversations_from_worlds_dataset

conversations = load_sft_conversations_from_worlds_dataset(
    dataset_ref={"name": "corp-ad-kali", "version": "1"},
)
# conversations[0] is a list of {"role": "...", "content": "..."} messages
```

If you want the full trajectory including tool calls and reasoning, use iter_atif_trajectories_jsonl or convert_atif_trajectory_to_openai — the latter preserves tool_calls and reasoning_content alongside the chat messages.
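To make the two export shapes concrete, here is a minimal, self-contained sketch of the conversion idea: flattening trajectory steps into OpenAI-style messages, either stripped for SFT or with `tool_calls` and `reasoning_content` preserved. The step field names are illustrative of an ATIF-style record, not the SDK's actual schema.

```python
# Illustrative sketch only: step field names mimic an ATIF-style record,
# not the SDK's actual schema.
from typing import Any


def to_openai_messages(
    steps: list[dict[str, Any]], keep_tools: bool = True
) -> list[dict[str, Any]]:
    """Flatten trajectory steps into OpenAI-style chat messages."""
    messages: list[dict[str, Any]] = []
    for step in steps:
        msg: dict[str, Any] = {"role": step["role"], "content": step.get("content", "")}
        if keep_tools and step.get("tool_calls"):
            msg["tool_calls"] = step["tool_calls"]  # kept for full-fidelity export
        if keep_tools and step.get("reasoning_content"):
            msg["reasoning_content"] = step["reasoning_content"]
        messages.append(msg)
    return messages


trajectory = [
    {"role": "user", "content": "Enumerate the domain."},
    {
        "role": "assistant",
        "content": "Running nmap.",
        "tool_calls": [{"name": "nmap", "arguments": {"target": "10.0.0.0/24"}}],
    },
]

# SFT-style export: tool calls stripped, plain chat messages remain
print(to_openai_messages(trajectory, keep_tools=False))
```

The SDK functions above do the equivalent work against real ATIF records; this sketch only shows why the two modes produce different message shapes.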

For offline RL, each assistant step becomes one prompt row with a derived reward. The reward defaults to the trajectory-level success flag, but you can remap it:

```python
from dreadnode.training.etl.worlds import load_rl_prompt_rows_from_worlds_dataset

rows = load_rl_prompt_rows_from_worlds_dataset(
    dataset_ref={"name": "corp-ad-kali", "version": "1"},
)
# rows[i] = {"prompt": "...", "response": "...", "reward": 1.0, ...}
```

Tool schemas can be extracted from the trajectory’s recorded tool calls using build_tool_schemas_per_tool, so the RL loop has the same tool surface the original trajectory saw.

Hosted training jobs accept Worlds datasets as inputs directly. SFT jobs take trajectory_dataset_refs; RL jobs take either trajectory_dataset_refs for offline RL or world_manifest_id plus world_runtime_id for online agent pre-sampling. References are resolved at submission — missing or mismatched datasets fail the job before any compute is provisioned.

See Training overview for job structure, reference resolution, and artifact handling.

Rollouts are the in-process alternative to stored trajectories. Instead of submitting a trajectory job and waiting for a durable record, you run an SDK agent against a manifest inside your training loop and receive shaped rewards as steps happen:

```python
from dreadnode.training.rollouts.worlds import (
    run_worlds_agent_rollout,
    HeuristicWorldsRewardShaper,
)

result = await run_worlds_agent_rollout(
    agent=my_agent,
    goal="Domain Admins",
    reward_shaper=HeuristicWorldsRewardShaper(),
)
# result.turns[i].reward carries the shaped reward for step i
# result.metrics aggregates across turns
```

run_worlds_agent_rollout attaches hooks to the agent, runs it to completion, and returns a RolloutResult with per-turn rewards, total metrics, and the underlying trajectory.

Reward shapers emit signals at four points in an agent’s run — on generation, on tool calls, on tool errors, and at termination. The SDK ships composable shapers you can use directly or combine:

| Shaper | Rewards |
| --- | --- |
| ReasoningTraceRewardShaper | Non-empty reasoning traces on assistant turns |
| ToolObservationRewardShaper | Tool calls that produced a non-empty observation |
| HostDiscoveryRewardShaper | Tool output matching host/service discovery patterns |
| CredentialDiscoveryRewardShaper | Tool output matching credential-related patterns |
| PrivilegeEscalationRewardShaper | Tool output suggesting privilege escalation |
| ToolStopRewardShaper | Explicit stop-tool calls from the agent |
| ToolErrorPenaltyShaper | Penalty for tool execution errors |
| TerminalStateRewardShaper | Terminal outcome bonuses/penalties (success, stall, max-steps, error) |
| CompositeWorldsRewardShaper | Combine multiple shapers additively |
| HeuristicWorldsRewardShaper | Preset composite of the above using WorldsRewardWeights |

Default weights are defined in WorldsRewardWeights — e.g. +1.00 for terminal success, +0.35 for privilege escalation, -1.00 for terminal error. Override them by passing a custom WorldsRewardWeights(...) instance, or construct the shapers individually with your own values.
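The additive composition is the key mechanic. A minimal sketch of the idea, with hypothetical class and field names standing in for the SDK's shapers and WorldsRewardWeights (the weight values below match the defaults quoted above):

```python
# Illustrative sketch of additive reward composition with configurable
# weights; class and field names are hypothetical, not the SDK's API.
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class RewardWeights:
    # Values mirror the documented defaults for these three signals.
    terminal_success: float = 1.00
    privilege_escalation: float = 0.35
    terminal_error: float = -1.00


class CompositeShaper:
    """Sums the signals emitted by each component shaper."""

    def __init__(self, shapers: list[Callable[[dict[str, Any]], float]]):
        self.shapers = shapers

    def reward(self, event: dict[str, Any]) -> float:
        return sum(shaper(event) for shaper in self.shapers)


w = RewardWeights()
composite = CompositeShaper(
    [
        # Fires when tool output suggests privilege escalation
        lambda e: w.privilege_escalation if "escalation" in e.get("tool_output", "") else 0.0,
        # Fires on a successful terminal outcome
        lambda e: w.terminal_success if e.get("outcome") == "success" else 0.0,
        # Penalizes a terminal error
        lambda e: w.terminal_error if e.get("outcome") == "error" else 0.0,
    ]
)

event = {"tool_output": "privilege escalation detected", "outcome": "success"}
print(composite.reward(event))
```

Each component sees the same event and contributes independently, so swapping a weights instance retunes the whole composite without touching the individual shapers — the same design CompositeWorldsRewardShaper and HeuristicWorldsRewardShaper expose.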

For a named policy instead of explicit construction, build_worlds_reward_shaper_from_config builds a shaper from heuristic_v1, goal_only_v1, or discovery_v1 preset names, or from an explicit components list.

  • Trajectory jobs are durable and reproducible. They produce records and datasets you can score, replay, and share across runs. Use them for benchmarking, dataset construction, and anything you’ll reference later.
  • Rollouts are ephemeral and in-process. They emit rewards immediately and tie back into the calling training loop. Use them for online RL where feedback latency matters.

Both bind to the same runtime and capability concepts; the trade-off is durability vs. feedback latency.