# Training integration
Turn Worlds trajectories into SFT conversations or offline-RL rows, and run online-RL rollouts against manifests with shaped rewards.
Worlds trajectories are first-class training inputs. A completed trajectory job publishes a dataset you can load directly, and manifests can drive online rollouts that emit shaped rewards as the agent runs.
Three patterns, three stages of training:
| Pattern | Data source | Stage |
|---|---|---|
| SFT conversations | Published trajectory dataset → OpenAI chat format | Supervised fine-tuning |
| Offline-RL rows | Same dataset, expanded to per-step prompt rows with rewards | Offline RL |
| Rollouts | Live agent run against a manifest, rewards shaped during generation | Online RL |
## Load trajectories as SFT conversations

Worlds trajectories are stored in ATIF, a trajectory interchange format the SDK reads directly. `load_sft_conversations_from_worlds_dataset` strips tool calls and produces OpenAI-style messages ready for SFT:
```python
from dreadnode.training.etl.worlds import load_sft_conversations_from_worlds_dataset

conversations = load_sft_conversations_from_worlds_dataset(
    dataset_ref={"name": "corp-ad-kali", "version": "1"},
)
# conversations[0] is a list of {"role": "...", "content": "..."} messages
```

If you want the full trajectory including tool calls and reasoning, use `iter_atif_trajectories_jsonl` or `convert_atif_trajectory_to_openai`; the latter preserves `tool_calls` and `reasoning_content` alongside the chat messages.
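The loaded conversations are plain message lists, so they can be serialized straight into the chat-format JSONL that many SFT trainers consume. A minimal sketch, assuming only the `{"role": ..., "content": ...}` message shape shown above; the helper name and file layout are illustrative, not SDK API:

```python
import json

def write_sft_jsonl(conversations, path):
    # One {"messages": [...]} object per line, a common SFT input layout.
    with open(path, "w") as f:
        for messages in conversations:
            f.write(json.dumps({"messages": messages}) + "\n")

# Example using the message shape described above
conversations = [
    [
        {"role": "user", "content": "Enumerate the domain."},
        {"role": "assistant", "content": "Starting with host discovery."},
    ],
]
write_sft_jsonl(conversations, "sft.jsonl")
```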
## Load trajectories as offline-RL rows

For offline RL, each assistant step becomes one prompt row with a derived reward. The reward defaults to the trajectory-level success flag, but you can remap it:
```python
from dreadnode.training.etl.worlds import load_rl_prompt_rows_from_worlds_dataset

rows = load_rl_prompt_rows_from_worlds_dataset(
    dataset_ref={"name": "corp-ad-kali", "version": "1"},
)
# rows[i] = {"prompt": "...", "response": "...", "reward": 1.0, ...}
```

Tool schemas can be extracted from the trajectory's recorded tool calls using `build_tool_schemas_per_tool`, so the RL loop has the same tool surface the original trajectory saw.
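Because each row carries its own `reward` field, remapping is just a transform over the row dicts. A sketch of one common remap, discounting the trajectory-level success signal so later steps carry more credit; the helper and `gamma` here are illustrative, not SDK API:

```python
def discount_rewards(rows, gamma=0.9):
    # Scale each step's reward by gamma^(steps remaining after it),
    # so the final step keeps the full trajectory-level signal.
    n = len(rows)
    return [
        {**row, "reward": row["reward"] * (gamma ** (n - 1 - i))}
        for i, row in enumerate(rows)
    ]

# Rows in the shape described above
rows = [
    {"prompt": "p0", "response": "r0", "reward": 1.0},
    {"prompt": "p1", "response": "r1", "reward": 1.0},
    {"prompt": "p2", "response": "r2", "reward": 1.0},
]
remapped = discount_rewards(rows)
# remapped[-1]["reward"] stays 1.0; earlier steps are scaled down
```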
## Hosted training jobs

Hosted training jobs accept Worlds datasets as inputs directly. SFT jobs take `trajectory_dataset_refs`; RL jobs take either `trajectory_dataset_refs` for offline RL or `world_manifest_id` plus `world_runtime_id` for online agent pre-sampling. References are resolved at submission: missing or mismatched datasets fail the job before any compute is provisioned.
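The input fields differ per job type. Purely as an illustration of the shapes named above, where only the field names come from this page and the surrounding payload structure is hypothetical, not the real submission API:

```python
# Hypothetical payload sketches; only the field names come from the text.
sft_job_inputs = {
    "trajectory_dataset_refs": [{"name": "corp-ad-kali", "version": "1"}],
}

offline_rl_job_inputs = {
    "trajectory_dataset_refs": [{"name": "corp-ad-kali", "version": "1"}],
}

online_rl_job_inputs = {
    "world_manifest_id": "manifest-id",  # placeholder IDs
    "world_runtime_id": "runtime-id",
}
```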
See Training overview for job structure, reference resolution, and artifact handling.
## Rollouts

Rollouts are the in-process alternative to stored trajectories. Instead of submitting a trajectory job and waiting for a durable record, you run an SDK agent against a manifest inside your training loop and receive shaped rewards as steps happen:

```python
from dreadnode.training.rollouts.worlds import (
    run_worlds_agent_rollout,
    HeuristicWorldsRewardShaper,
)

result = await run_worlds_agent_rollout(
    agent=my_agent,
    goal="Domain Admins",
    reward_shaper=HeuristicWorldsRewardShaper(),
)
# result.turns[i].reward carries the shaped reward for step i
# result.metrics aggregates across turns
```

`run_worlds_agent_rollout` attaches hooks to the agent, runs it to completion, and returns a `RolloutResult` with per-turn rewards, total metrics, and the underlying trajectory.
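Per-turn rewards like `result.turns[i].reward` typically feed a return or advantage computation in the training loop. A minimal sketch of discounted returns over such a reward sequence, assuming nothing beyond a flat list of per-turn rewards; the function and `gamma` are illustrative:

```python
def returns_from_rewards(rewards, gamma=0.99):
    # Standard discounted return: G_t = r_t + gamma * G_{t+1}.
    g, out = 0.0, []
    for r in reversed(rewards):
        g = r + gamma * g
        out.append(g)
    return list(reversed(out))

# e.g. a small discovery signal mid-run plus a terminal-success bonus
rewards = [0.0, 0.35, 0.0, 1.0]
returns = returns_from_rewards(rewards)
```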
## Reward shapers

Reward shapers emit signals at four points in an agent's run: on generation, on tool calls, on tool errors, and at termination. The SDK ships composable shapers you can use directly or combine:

| Shaper | Rewards |
|---|---|
| `ReasoningTraceRewardShaper` | Non-empty reasoning traces on assistant turns |
| `ToolObservationRewardShaper` | Tool calls that produced a non-empty observation |
| `HostDiscoveryRewardShaper` | Tool output matching host/service discovery patterns |
| `CredentialDiscoveryRewardShaper` | Tool output matching credential-related patterns |
| `PrivilegeEscalationRewardShaper` | Tool output suggesting privilege escalation |
| `ToolStopRewardShaper` | Explicit stop-tool calls from the agent |
| `ToolErrorPenaltyShaper` | Penalty for tool execution errors |
| `TerminalStateRewardShaper` | Terminal outcome bonuses/penalties (success, stall, max-steps, error) |
| `CompositeWorldsRewardShaper` | Combine multiple shapers additively |
| `HeuristicWorldsRewardShaper` | Preset composite of the above using `WorldsRewardWeights` |
Default weights are defined in `WorldsRewardWeights`, e.g. +1.00 for terminal success, +0.35 for privilege escalation, and -1.00 for terminal error. Override them by passing a `WorldsRewardWeights(...)` instance, or construct the shapers individually with custom values.
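The additive composition can be pictured as a weight table summed over the events a step triggered. A toy model where only the numeric weight values come from the defaults quoted above; the tag names and scoring function are illustrative, not the SDK's shaper API:

```python
# Only the numeric weights come from the text; tags are illustrative.
WEIGHTS = {
    "terminal_success": 1.00,
    "privilege_escalation": 0.35,
    "terminal_error": -1.00,
}

def shaped_reward(event_tags, weights=WEIGHTS):
    # Additive shaping: sum the weight of every event the step triggered.
    return sum(weights.get(tag, 0.0) for tag in event_tags)

# A final step that escalated privileges and reached the goal
final_step = shaped_reward(["privilege_escalation", "terminal_success"])
```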
For a named policy instead of explicit construction, `build_worlds_reward_shaper_from_config` builds a shaper from the `heuristic_v1`, `goal_only_v1`, or `discovery_v1` preset names, or from an explicit components list.
## Trajectories vs. rollouts: which to use

- Trajectory jobs are durable and reproducible. They produce records and datasets you can score, replay, and share across runs. Use them for benchmarking, dataset construction, and anything you'll reference later.
- Rollouts are ephemeral and in-process. They emit rewards immediately and tie back into the calling training loop. Use them for online RL where feedback latency matters.
Both bind to the same runtime and capability concepts; the trade-off is durability vs. feedback latency.
## What's next

- Field-by-field ATIF reference: Trajectory reference
- Agent-mode trajectories as a training data source: Agent-mode trajectories
- Hosted job structure: Training overview