
Training integration

Turn Worlds trajectories into SFT conversations or offline-RL rows, and run online-RL rollouts against manifests with shaped rewards.

Worlds trajectories are first-class training inputs. A completed trajectory job publishes a dataset you can load directly, and manifests can drive online rollouts that emit shaped rewards as the agent runs.

Three patterns, three stages of training:

| Pattern | Data source | Stage |
| --- | --- | --- |
| SFT conversations | Published trajectory dataset → OpenAI chat format | Supervised fine-tuning |
| Offline-RL rows | Same dataset, expanded to per-step prompt rows with rewards | Offline RL |
| Rollouts | Live agent run against a manifest, rewards shaped during generation | Online RL |

Worlds trajectories are stored in ATIF — a trajectory interchange format the SDK reads directly. load_sft_conversations_from_worlds_dataset strips tool calls and produces OpenAI-style messages ready for SFT:

```python
from dreadnode.training.etl.worlds import load_sft_conversations_from_worlds_dataset

conversations = load_sft_conversations_from_worlds_dataset(
    dataset_ref={"name": "corp-ad-kali", "version": "1"},
)
# conversations[0] is a list of {"role": "...", "content": "..."} messages
```

If you want the full trajectory including tool calls and reasoning, use iter_atif_trajectories_jsonl or convert_atif_trajectory_to_openai — the latter preserves tool_calls and reasoning_content alongside the chat messages.
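To make the two export shapes concrete, here is a minimal, self-contained sketch of the conversion idea: flattening trajectory steps into OpenAI-style messages, either stripped for SFT or with `tool_calls` and `reasoning_content` preserved. The step field names are illustrative of an ATIF-style record, not the SDK's actual schema.

```python
# Illustrative sketch only: step field names mimic an ATIF-style record,
# not the SDK's actual schema.
from typing import Any


def to_openai_messages(
    steps: list[dict[str, Any]], keep_tools: bool = True
) -> list[dict[str, Any]]:
    """Flatten trajectory steps into OpenAI-style chat messages."""
    messages: list[dict[str, Any]] = []
    for step in steps:
        msg: dict[str, Any] = {"role": step["role"], "content": step.get("content", "")}
        if keep_tools and step.get("tool_calls"):
            msg["tool_calls"] = step["tool_calls"]  # kept for full-fidelity export
        if keep_tools and step.get("reasoning_content"):
            msg["reasoning_content"] = step["reasoning_content"]
        messages.append(msg)
    return messages


trajectory = [
    {"role": "user", "content": "Enumerate the domain."},
    {
        "role": "assistant",
        "content": "Running nmap.",
        "tool_calls": [{"name": "nmap", "arguments": {"target": "10.0.0.0/24"}}],
    },
]

# SFT-style export: tool calls stripped, plain chat messages remain
print(to_openai_messages(trajectory, keep_tools=False))
```

The SDK functions above do the equivalent work against real ATIF records; this sketch only shows why the two modes produce different message shapes.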

For offline RL, each assistant step becomes one prompt row with a derived reward. The reward defaults to the trajectory-level success flag, but you can remap it:

```python
from dreadnode.training.etl.worlds import load_rl_prompt_rows_from_worlds_dataset

rows = load_rl_prompt_rows_from_worlds_dataset(
    dataset_ref={"name": "corp-ad-kali", "version": "1"},
)
# rows[i] = {"prompt": "...", "response": "...", "reward": 1.0, ...}
```

Tool schemas can be extracted from the trajectory’s recorded tool calls using build_tool_schemas_per_tool, so the RL loop has the same tool surface the original trajectory saw.

Hosted training jobs accept Worlds datasets as inputs directly. SFT jobs take trajectory_dataset_refs; RL jobs take either trajectory_dataset_refs for offline RL or world_manifest_id plus world_runtime_id for online agent pre-sampling. References are resolved at submission — missing or mismatched datasets fail the job before any compute is provisioned.

See Training overview for job structure, reference resolution, and artifact handling.

Rollouts are the in-process alternative to stored trajectories. Instead of submitting a trajectory job and waiting for a durable record, you run an SDK agent against a manifest inside your training loop and receive shaped rewards as steps happen:

```python
from dreadnode.training.rollouts.worlds import (
    run_worlds_agent_rollout,
    HeuristicWorldsRewardShaper,
)

result = await run_worlds_agent_rollout(
    agent=my_agent,
    goal="Domain Admins",
    reward_shaper=HeuristicWorldsRewardShaper(),
)
# result.turns[i].reward carries the shaped reward for step i
# result.metrics aggregates across turns
```

run_worlds_agent_rollout attaches hooks to the agent, runs it to completion, and returns a RolloutResult with per-turn rewards, total metrics, and the underlying trajectory.

Reward shapers emit signals at four points in an agent’s run — on generation, on tool calls, on tool errors, and at termination. The SDK ships composable shapers you can use directly or combine:

| Shaper | Rewards |
| --- | --- |
| ReasoningTraceRewardShaper | Non-empty reasoning traces on assistant turns |
| ToolObservationRewardShaper | Tool calls that produced a non-empty observation |
| HostDiscoveryRewardShaper | Tool output matching host/service discovery patterns |
| CredentialDiscoveryRewardShaper | Tool output matching credential-related patterns |
| PrivilegeEscalationRewardShaper | Tool output suggesting privilege escalation |
| ToolStopRewardShaper | Explicit stop-tool calls from the agent |
| ToolErrorPenaltyShaper | Penalty for tool execution errors |
| TerminalStateRewardShaper | Terminal outcome bonuses/penalties (success, stall, max-steps, error) |
| CompositeWorldsRewardShaper | Combine multiple shapers additively |
| HeuristicWorldsRewardShaper | Preset composite of the above using WorldsRewardWeights |

Default weights are defined in WorldsRewardWeights — e.g. +1.00 for terminal success, +0.35 for privilege escalation, -1.00 for terminal error. Override them by passing a custom WorldsRewardWeights(...) instance, or construct the shapers individually with your own values.
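The additive composition is the key mechanic. A minimal sketch of the idea, with hypothetical class and field names standing in for the SDK's shapers and WorldsRewardWeights (the weight values below match the defaults quoted above):

```python
# Illustrative sketch of additive reward composition with configurable
# weights; class and field names are hypothetical, not the SDK's API.
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class RewardWeights:
    # Values mirror the documented defaults for these three signals.
    terminal_success: float = 1.00
    privilege_escalation: float = 0.35
    terminal_error: float = -1.00


class CompositeShaper:
    """Sums the signals emitted by each component shaper."""

    def __init__(self, shapers: list[Callable[[dict[str, Any]], float]]):
        self.shapers = shapers

    def reward(self, event: dict[str, Any]) -> float:
        return sum(shaper(event) for shaper in self.shapers)


w = RewardWeights()
composite = CompositeShaper(
    [
        # Fires when tool output suggests privilege escalation
        lambda e: w.privilege_escalation if "escalation" in e.get("tool_output", "") else 0.0,
        # Fires on a successful terminal outcome
        lambda e: w.terminal_success if e.get("outcome") == "success" else 0.0,
        # Penalizes a terminal error
        lambda e: w.terminal_error if e.get("outcome") == "error" else 0.0,
    ]
)

event = {"tool_output": "privilege escalation detected", "outcome": "success"}
print(composite.reward(event))
```

Each component sees the same event and contributes independently, so swapping a weights instance retunes the whole composite without touching the individual shapers — the same design CompositeWorldsRewardShaper and HeuristicWorldsRewardShaper expose.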

For a named policy instead of explicit construction, build_worlds_reward_shaper_from_config builds a shaper from heuristic_v1, goal_only_v1, or discovery_v1 preset names, or from an explicit components list.

  • Trajectory jobs are durable and reproducible. They produce records and datasets you can score, replay, and share across runs. Use them for benchmarking, dataset construction, and anything you’ll reference later.
  • Rollouts are ephemeral and in-process. They emit rewards immediately and tie back into the calling training loop. Use them for online RL where feedback latency matters.

Both bind to the same runtime and capability concepts; the trade-off is durability vs. feedback latency.