Training

Hosted training lets you run fine-tuning and reinforcement learning jobs directly from the platform. Jobs are workspace-scoped and run asynchronously with full lifecycle management.

The platform supports two training approaches:

SFT jobs train on conversation datasets (a request-body sketch follows this list):

  • Dataset-backed conversation loading
  • Conversion of Worlds trajectory datasets into SFT conversations
  • Prompt/answer normalization into chat format
  • Capability prompt injection as a system message scaffold
  • Cross-entropy training with optional evaluation and checkpoint persistence
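
A minimal sketch of what an SFT job body might look like, assuming a JSON-style request. Only capability_ref, dataset_ref, and trajectory_dataset_refs are field names documented here; the values and overall shape are illustrative placeholders, not a confirmed schema.

```python
# Illustrative SFT job body; field values are placeholders.
sft_job = {
    "capability_ref": "my-capability",  # resolved to a versioned capability snapshot at submission
    "dataset_ref": {"name": "support-chats", "version": 3},  # explicit { name, version } ref
    # Worlds alternative: recorded trajectories converted into SFT conversations.
    # "trajectory_dataset_refs": [{"name": "worlds-run-a", "version": 1}],
}
```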

RL jobs train using prompt datasets and reward signals (data-source fields are sketched after the list):

  • Prompt datasets drive rollout generation
  • Worlds trajectory datasets can provide an offline RL baseline
  • Worlds manifests can be used to pre-sample agent trajectory datasets for online RL
  • Supported algorithms include grpo, ppo, and importance_sampling
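
Under the same assumptions, a sketch of the RL data-source fields. The names prompt_dataset_ref, algorithm, trajectory_dataset_refs, and world_manifest_id come from this page; the ids and values are illustrative.

```python
# Illustrative RL data-source configuration; identifiers are placeholders.
rl_sources = {
    "prompt_dataset_ref": {"name": "nav-prompts", "version": 5},  # drives rollout generation
    "algorithm": "grpo",  # or "ppo", "importance_sampling"
    # Offline RL baseline from a recorded Worlds trajectory dataset:
    # "trajectory_dataset_refs": [{"name": "worlds-run-a", "version": 1}],
    # Online RL with trajectories pre-sampled from a Worlds manifest:
    # "world_manifest_id": "wm-placeholder",
}
```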

RL execution modes (illustrated below):

  • sync — sequential rollout and training
  • one_step_off_async — overlaps generation and training with one-step staleness
  • fully_async — multiple queued rollout groups with bounded staleness
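
A small sketch of selecting a mode; the field name execution_mode is an assumption, and only the three mode values come from this page.

```python
# Sketch: choosing an execution mode ("execution_mode" is an assumed field name).
rl_rollout_controls = {
    "execution_mode": "fully_async",  # multiple queued rollout groups, bounded staleness
    # "sync"               -> rollout and training run sequentially
    # "one_step_off_async" -> generation overlaps training with one-step staleness
}
```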

Training jobs resolve references before execution (the two reference shapes are sketched after this list):

  • capability_ref — resolved at submission time to a versioned capability snapshot
  • task_ref — resolved to an org-visible task definition before RL execution
  • dataset_ref and prompt_dataset_ref — resolved to org-visible dataset artifacts
  • trajectory_dataset_refs — validated on submission for Worlds-backed SFT and offline RL
  • world_manifest_id — validated on submission for Worlds sampling or live-rollout RL
  • Dataset refs use { name, version } objects with explicit versions
  • Task refs accept name for the latest version or name@version for a specific version
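
Both reference shapes as a sketch; the names themselves are made up.

```python
# Dataset refs are { name, version } objects with an explicit version.
dataset_ref = {"name": "support-chats", "version": 3}

# Task refs are strings: a bare name resolves to the latest version,
# while name@version pins a specific one.
task_ref_latest = "web-navigation"
task_ref_pinned = "web-navigation@2"
```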

Policy, environment, and reward boundaries

RL jobs keep policy, environment, and reward concerns behind three distinct references (combined in the sketch after this list):

  • capability_ref — the versioned policy scaffold
  • task_ref — the environment or task definition
  • reward_recipe — the server-side reward or verification logic
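
Combined on one hypothetical RL body; only the three field names and the task_verifier_v1 value are documented here, and the other values are placeholders.

```python
# The three separated concerns on a single RL job body.
rl_boundaries = {
    "capability_ref": "my-capability",    # versioned policy scaffold
    "task_ref": "web-navigation@2",       # environment/task definition
    "reward_recipe": "task_verifier_v1",  # server-side reward/verification logic
}
```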

Job creation uses typed request bodies (a hypothetical submission call is sketched after these bullets):

  • SFT — carries dataset and LoRA-oriented settings; supports dataset_ref or trajectory_dataset_refs
  • RL — carries prompt-dataset fields, reward settings, and rollout controls; can also use world_manifest_id and world_runtime_id for Worlds integration
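
A hypothetical submission call, shown only to illustrate the asynchronous lifecycle; the endpoint path, client choice, and response fields are all assumptions, not the platform's documented API.

```python
import requests

# Placeholder URL and body: jobs are created, then run asynchronously.
resp = requests.post(
    "https://platform.example/v1/training/jobs",
    json={"dataset_ref": {"name": "support-chats", "version": 3}},
)
job = resp.json()
print(job["id"], job["status"])  # assumed fields; poll status through the job lifecycle
```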

Current limitations:

  • The built-in task_verifier_v1 recipe supports flag-based task verification only
  • Async modes are rollout-group schedulers, not partial-rollout continuation runtimes
  • The ray + rl backend is not yet available