# Training
Hosted training lets you run fine-tuning and reinforcement learning jobs directly from the platform. Jobs are workspace-scoped and run asynchronously with full lifecycle management.
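As a rough illustration of that asynchronous lifecycle, the sketch below submits a job over HTTP and polls until it finishes. The base URL, endpoint paths, field names, and status values are assumptions for illustration, not the platform's documented API.

```python
import time

import requests

BASE = "https://api.example.com/v1"  # hypothetical base URL
HEADERS = {"Authorization": "Bearer $TOKEN"}

def submit_and_wait(body: dict, poll_secs: float = 10.0) -> dict:
    """Submit a training job, then poll until it reaches a terminal state.

    The /training/jobs endpoint and the status values below are assumed
    shapes; consult the API reference for the real ones.
    """
    job = requests.post(f"{BASE}/training/jobs", json=body, headers=HEADERS).json()
    while True:
        job = requests.get(f"{BASE}/training/jobs/{job['id']}", headers=HEADERS).json()
        if job["status"] in ("succeeded", "failed", "cancelled"):
            return job
        time.sleep(poll_secs)  # the job runs asynchronously on the platform
```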
## Job types

The platform supports two training approaches:
### Supervised fine-tuning (SFT)

SFT jobs train on conversation datasets:
- Dataset-backed conversation loading
- Worlds trajectory dataset conversion into SFT conversations
- Prompt/answer normalization into chat format
- Capability prompt injection as a system message scaffold
- Cross-entropy training with optional evaluation and checkpoint persistence
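As a sketch of the normalization and capability-prompt-injection steps above, the helper below turns a raw prompt/answer record into a chat-format conversation with the capability prompt as the system message. The record fields and message schema are illustrative assumptions, not the platform's exact wire format.

```python
def to_sft_conversation(record: dict, capability_prompt: str) -> list[dict]:
    """Normalize a raw {prompt, answer} record into chat format.

    The capability prompt is injected as a system message scaffold, as in
    the SFT pipeline described above; field names here are illustrative.
    """
    return [
        {"role": "system", "content": capability_prompt},
        {"role": "user", "content": record["prompt"]},
        {"role": "assistant", "content": record["answer"]},
    ]

# Example: one prompt/answer pair becomes a three-message conversation.
conversation = to_sft_conversation(
    {"prompt": "What is 2 + 2?", "answer": "4"},
    capability_prompt="You are a careful math assistant.",
)
```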
### Reinforcement learning (RL)

RL jobs train using prompt datasets and reward signals:
- Prompt datasets drive rollout generation
- Worlds trajectory datasets can provide an offline RL baseline
- Worlds manifests can pre-sample agent trajectory datasets for online RL
- Supported algorithms include `grpo`, `ppo`, and `importance_sampling` (sketched below)
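For context on the algorithm names, here is the group-relative advantage at the core of GRPO in miniature: rewards for a group of rollouts from the same prompt are normalized against that group's own mean and standard deviation. This shows the published idea, not the platform's implementation.

```python
import statistics

def grpo_advantages(group_rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Group-relative advantages: normalize each rollout's reward against
    the mean/std of rewards from the same prompt's rollout group."""
    mean = statistics.mean(group_rewards)
    std = statistics.pstdev(group_rewards)
    return [(r - mean) / (std + eps) for r in group_rewards]

# Four rollouts of one prompt: above-average rewards get positive advantage.
print(grpo_advantages([1.0, 0.0, 0.5, 0.5]))
```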
RL execution modes:

- `sync` — sequential rollout and training
- `one_step_off_async` — overlaps generation and training with one-step staleness
- `fully_async` — multiple queued rollout groups with bounded staleness
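To make the staleness distinction concrete, here is a toy scheduler in which generation and training communicate through a bounded queue of rollout groups: a capacity of 1 corresponds to one-step-off async, and a larger capacity to fully async with bounded staleness. This is purely illustrative, not the platform's scheduler.

```python
import asyncio

async def generator(queue: asyncio.Queue, steps: int) -> None:
    for step in range(steps):
        # put() blocks once the queue is full, bounding how far generation
        # can run ahead of training (the staleness bound).
        await queue.put(f"rollout_group_{step}")

async def trainer(queue: asyncio.Queue, steps: int) -> None:
    for _ in range(steps):
        group = await queue.get()
        print(f"training on {group}")

async def main() -> None:
    # maxsize=1 behaves like one_step_off_async; maxsize=k like fully_async
    # with at most k queued rollout groups; sync would interleave the two.
    queue: asyncio.Queue = asyncio.Queue(maxsize=1)
    await asyncio.gather(generator(queue, 5), trainer(queue, 5))

asyncio.run(main())
```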
## Reference resolution

Training jobs resolve references before execution:

- `capability_ref` — resolved at submission time to a versioned capability snapshot
- `task_ref` — resolved to an org-visible task definition before RL execution
- `dataset_ref` and `prompt_dataset_ref` — resolved to org-visible dataset artifacts
- `trajectory_dataset_refs` — validated on submission for Worlds-backed SFT and offline RL
- `world_manifest_id` — validated on submission for Worlds sampling or live-rollout RL
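The practical upshot is fail-fast validation: every ref a job body carries is checked before the job is queued, so a bad reference surfaces at submission rather than mid-run. The sketch below mirrors that pre-flight check client-side; the `resolver.exists` interface is hypothetical, since the real check runs server-side against org-visible artifacts.

```python
def validate_job_refs(body: dict, resolver) -> list[str]:
    """Collect every ref the job body carries and report any that fail to
    resolve. `resolver.exists(kind, ref)` is a hypothetical lookup."""
    checks = [
        ("capability", body.get("capability_ref")),
        ("task", body.get("task_ref")),
        ("dataset", body.get("dataset_ref")),
        ("prompt_dataset", body.get("prompt_dataset_ref")),
        ("world_manifest", body.get("world_manifest_id")),
    ]
    checks += [("trajectory_dataset", r) for r in body.get("trajectory_dataset_refs", [])]
    return [f"{kind}: {ref!r}" for kind, ref in checks if ref and not resolver.exists(kind, ref)]

class _AllowAll:
    """Stand-in resolver; the real resolution happens at submission time."""
    def exists(self, kind: str, ref) -> bool:
        return True

assert validate_job_refs({"capability_ref": "summarizer@3"}, _AllowAll()) == []
```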
## Reference conventions

- Dataset refs use `{ name, version }` objects with explicit versions
- Task refs accept `name` for the latest version or `name@version` for a specific version
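A small helper makes the two conventions concrete: dataset refs are always explicit `{ name, version }` objects, while task refs allow an optional `@version` suffix. The parsing below is illustrative; the platform resolves refs server-side.

```python
def parse_task_ref(ref: str) -> tuple[str, int | None]:
    """Split 'name' or 'name@version' task refs; None means latest."""
    name, sep, version = ref.partition("@")
    return name, int(version) if sep else None

# Dataset refs are explicit objects with a pinned version:
dataset_ref = {"name": "support-conversations", "version": 4}

assert parse_task_ref("browser-task") == ("browser-task", None)  # latest
assert parse_task_ref("browser-task@2") == ("browser-task", 2)   # pinned
```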
## Policy, environment, and reward boundaries

RL jobs use three separate references to keep concerns cleanly separated:

- `capability_ref` — the versioned policy scaffold
- `task_ref` — the environment or task definition
- `reward_recipe` — the server-side reward or verification logic
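In a request body, that separation looks roughly like the snippet below: swapping the reward recipe or the task touches neither of the other two references. The field values and shapes are hypothetical.

```python
# Hypothetical RL job body showing the three independent references:
# policy scaffold, environment definition, and reward logic each swap
# out without touching the other two.
rl_job = {
    "capability_ref": {"name": "browser-agent", "version": 7},  # policy scaffold
    "task_ref": "checkout-flow@3",                              # environment/task
    "reward_recipe": "task_verifier_v1",                        # server-side reward
}
```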
## Request shape

Job creation uses typed request bodies:

- SFT — carries dataset and LoRA-oriented settings; supports `dataset_ref` or `trajectory_dataset_refs`
- RL — carries prompt-dataset fields, reward settings, and rollout controls; can also use `world_manifest_id` and `world_runtime_id` for Worlds integration
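As a sketch of what typed request bodies can look like client-side, here are two dataclasses mirroring the SFT/RL split above. The fields combine the refs documented in this section with assumed training knobs; treat them as illustrative, not the exact schema.

```python
from dataclasses import dataclass, field

@dataclass
class SFTJobRequest:
    """SFT body: dataset plus LoRA-oriented settings (illustrative fields)."""
    capability_ref: str
    dataset_ref: dict | None = None               # { name, version }
    trajectory_dataset_refs: list[dict] = field(default_factory=list)
    lora_rank: int = 16                           # assumed knob, not documented

@dataclass
class RLJobRequest:
    """RL body: prompt dataset, reward, and rollout controls (illustrative)."""
    capability_ref: str
    task_ref: str
    reward_recipe: str
    prompt_dataset_ref: dict | None = None
    algorithm: str = "grpo"                       # grpo | ppo | importance_sampling
    execution_mode: str = "sync"                  # sync | one_step_off_async | fully_async
    world_manifest_id: str | None = None          # Worlds integration
    world_runtime_id: str | None = None
```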
## Current limitations

- The built-in `task_verifier_v1` recipe only supports flag-based task verification
- Async modes are rollout-group schedulers, not partial-rollout continuation runtimes
- The `ray` + `rl` backend is not yet available