Training

Submit, inspect, wait on, and manage hosted SFT and RL jobs from the dn CLI.

Use dn train ... when the platform should run the training job and track its lifecycle for you.

This is the hosted training surface. It is for jobs that should keep a server-side record, logs, artifacts, and terminal status. If you are still experimenting with prompts or metrics rather than model weights, optimization is usually the better fit.

Have these pieces ready first:

  • a base model identifier the training backend can access
  • a published capability ref that defines the agent or behavior you want to adapt
  • one source of training data: a supervised dataset, trajectory datasets, or a live Worlds target

The training job record is only the control plane. The actual outputs you care about later are usually in dn train artifacts.

| Command | Use it for |
| --- | --- |
| `dn train sft` | supervised fine-tuning from datasets or trajectory datasets |
| `dn train rl` | reinforcement learning from prompt datasets, trajectory datasets, or Worlds inputs |
| `dn train list/get/wait/logs/artifacts/cancel` | job inspection and lifecycle management |

Most people should think about training in this order:

  1. choose sft or rl
  2. submit one job with a narrow, explicit config
  3. wait or poll until the job settles
  4. read logs for debugging and artifacts for outputs

If you already selected a platform project through --project, environment variables, or a saved profile, dn train sft and dn train rl reuse that key as project_ref unless you pass --project-ref explicitly.
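As a sketch of an explicit override (the project name, capability, and dataset refs here are made-up placeholders in the documented `NAME@VERSION` form):

```sh
# --project-ref beats --project, environment variables, and saved profiles;
# "red-team-lab" and the refs below are illustrative placeholders.
dn train sft \
  --project-ref red-team-lab \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --capability my-agent@v1 \
  --dataset my-demos@v3 \
  --steps 100
```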

Use dn train sft when you already have the behavior you want in demonstration form. That usually means one of two things:

  • you have a normal supervised dataset of prompts and target outputs
  • you have trajectory datasets from prior Worlds or agent runs and want to learn from them
```sh
dn train sft \
  --server http://127.0.0.1:8000 \
  --api-key "$DREADNODE_API_KEY" \
  --organization dreadnode \
  --workspace localdev \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --capability [email protected] \
  --dataset [email protected] \
  --steps 100 \
  --wait \
  --json
```

In that example:

  • --dataset is the direct supervised input
  • --capability tells the backend which capability context to train around
  • --wait turns the command into a synchronous shell workflow instead of a fire-and-forget submit

You can also train directly from published Worlds trajectory datasets:

```sh
dn train sft \
  --server http://127.0.0.1:8000 \
  --api-key "$DREADNODE_API_KEY" \
  --organization dreadnode \
  --workspace localdev \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --capability [email protected] \
  --trajectory-dataset dreadnode/[email protected] \
  --trajectory-dataset dreadnode/[email protected] \
  --steps 50
```

Use trajectory datasets when the demonstrations already exist as rollouts rather than flat prompt or response rows.

Common SFT flags:

| Flag | Description |
| --- | --- |
| `--dataset NAME@VERSION` | primary supervised dataset |
| `--trajectory-dataset NAME@VERSION` | Worlds trajectory dataset input, repeatable |
| `--eval-dataset NAME@VERSION` | optional eval dataset |
| `--batch-size <n>` | per-step batch size |
| `--gradient-accumulation-steps <n>` | gradient accumulation factor |
| `--learning-rate <float>` | optimizer learning rate |
| `--checkpoint-interval <n>` | save checkpoint every N steps |
| `--wait` | poll until terminal state |
| `--json` | print the full job payload |

Use dn train rl when the signal comes from reward logic, verifier outcomes, or environment rollouts rather than from fixed target answers.

RL is the more decision-heavy path, so the most useful first question is: where will the experience come from?

| Input source | Use it when |
| --- | --- |
| `--prompt-dataset` | you already have prompts and will score the outputs |
| `--trajectory-dataset` | you want offline RL from previously collected trajectories |
| `--world-manifest-id` | you want the job to sample from a live Worlds environment |
```sh
dn train rl \
  --server http://127.0.0.1:8000 \
  --api-key "$DREADNODE_API_KEY" \
  --organization dreadnode \
  --workspace localdev \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --capability [email protected] \
  --prompt-dataset [email protected] \
  --algorithm importance_sampling \
  --execution-mode fully_async \
  --max-steps-off-policy 3 \
  --reward-recipe contains_v1 \
  --reward-params '{"needle":"flag"}'
```

That pattern is verifier- or reward-driven RL: the prompt dataset supplies prompts, and the reward recipe decides what counts as success.
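When the reward params come from a variable rather than a literal, building the JSON with printf avoids hand-quoting mistakes. A minimal sketch (the "needle" key is from the example above; note printf does no JSON escaping, so this only suits values without quotes or backslashes):

```sh
# Build the --reward-params JSON from a shell variable instead of
# hand-quoting it inline on the command line.
needle='flag'
reward_params=$(printf '{"needle":"%s"}' "$needle")
echo "$reward_params"   # → {"needle":"flag"}
```

Then pass it as --reward-params "$reward_params" on the dn train rl command line.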

For Worlds-driven offline RL, replace the prompt dataset with trajectory datasets:

```sh
dn train rl \
  --server http://127.0.0.1:8000 \
  --api-key "$DREADNODE_API_KEY" \
  --organization dreadnode \
  --workspace localdev \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --capability [email protected] \
  --trajectory-dataset dreadnode/[email protected] \
  --trajectory-dataset dreadnode/[email protected] \
  --algorithm importance_sampling
```

When you want the job to sample from a live Worlds manifest, point it at the manifest directly:

```sh
dn train rl \
  --server http://127.0.0.1:8000 \
  --api-key "$DREADNODE_API_KEY" \
  --organization dreadnode \
  --workspace localdev \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --capability dreadnode/[email protected] \
  --world-manifest-id c8af2b7b-9b54-4b21-95a9-b8d403cd8c11 \
  --world-runtime-id 8b8fd3af-9a5e-47c8-9f67-7b87ca9387eb \
  --world-agent-name operator \
  --world-goal "Escalate to Domain Admin in corp.local" \
  --execution-mode fully_async \
  --max-steps-off-policy 3 \
  --num-rollouts 8
```

Use this when the job should generate fresh experience against an environment instead of learning purely from stored datasets. --world-runtime-id and --world-agent-name are how you tie that rollout to an existing runtime-bound capability snapshot when you need one.

If you also pass --world-reward, the job falls back to the older live-rollout reward path.
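As a hedged sketch of that older path (the reward name and its params here are illustrative, not a documented recipe; the manifest id is reused from the example above):

```sh
# Live-rollout reward path: --world-reward selects a named Worlds reward
# policy, --world-reward-params configures it. "goal_reached_v1" and the
# params JSON are hypothetical placeholders.
dn train rl \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --capability my-agent@v1 \
  --world-manifest-id c8af2b7b-9b54-4b21-95a9-b8d403cd8c11 \
  --world-reward goal_reached_v1 \
  --world-reward-params '{"threshold":0.5}' \
  --num-rollouts 8
```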

| Flag | Description |
| --- | --- |
| `--task REF` | task ref for verifier-driven RL |
| `--prompt-dataset REF` | prompt dataset input |
| `--trajectory-dataset REF` | Worlds trajectory dataset input, repeatable |
| `--world-manifest-id ID` | live Worlds manifest target |
| `--world-runtime-id ID` | runtime whose capability bindings should be used |
| `--world-agent-name NAME` | optional agent selection inside that runtime-bound capability |
| `--world-goal TEXT` | optional live rollout goal override |
| `--world-reward NAME` | named live Worlds reward policy |
| `--world-reward-params JSON` | JSON params for the selected Worlds reward |
| `--execution-mode <mode>` | sync, one_step_off_async, or fully_async |
| `--steps <n>` | number of optimization steps |
| `--num-rollouts <n>` | rollouts per update |
| `--max-turns <n>` | maximum turns per episode |
| `--max-episode-steps <n>` | environment step limit |
| `--weight-sync-interval <n>` | refresh sampler weights every N updates |
| `--max-steps-off-policy <n>` | max rollout staleness for async RL |
| `--stop <token>` | stop token, repeatable |

Once the job exists, these commands answer different questions:

```sh
dn train list
dn train get <job-id>
dn train wait <job-id> --json
dn train logs <job-id>
dn train artifacts <job-id>
dn train cancel <job-id> --json
```

Use them like this:

  • list finds the job again later
  • get shows the current state and saved config
  • wait blocks until a terminal state
  • logs is the first place to look for training failures
  • artifacts is where checkpoints, adapters, or final outputs show up
  • cancel stops the job but still preserves the server-side record

Queued jobs cancel immediately. Running jobs first become cancel-requested and may continue to show running until the worker finishes cleanup and writes the terminal state.

dn train wait exits non-zero if the terminal status is failed or cancelled.
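That exit-code contract means plain shell error handling is enough. A minimal sketch (substitute a real job id for the placeholder):

```sh
# wait exits non-zero on failed or cancelled, so a plain if suffices.
if dn train wait <job-id>; then
  dn train artifacts <job-id>   # checkpoints, adapters, final outputs
else
  dn train logs <job-id>        # first place to look for failures
fi
```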

Start with:

  • sft when you already have demonstrations
  • rl when you have rewards, verifiers, or environment outcomes

If you are still changing the prompt or instructions rather than the model weights, use /cli/optimization/ first.