Training

Submit, inspect, wait on, and manage hosted SFT and RL jobs from the dn CLI.

Use dn train ... when the platform should run the training job and track its lifecycle for you.

This is the hosted training surface. It is for jobs that should keep a server-side record, logs, artifacts, and terminal status. If you are still experimenting with prompts or metrics rather than model weights, optimization is usually the better fit.

Have these pieces ready first:

  • a base model identifier the training backend can access
  • a published capability ref that defines the agent or behavior you want to adapt
  • one source of training data: a supervised dataset, trajectory datasets, or a live Worlds target

The training job record is only the control plane. The actual outputs you care about later are usually in dn train artifacts.

| Command | Use it for |
| --- | --- |
| `dn train sft` | supervised fine-tuning from datasets or trajectory datasets |
| `dn train rl` | reinforcement learning from prompt datasets, trajectory datasets, or Worlds inputs |
| `dn train list/get/wait/logs/artifacts/cancel` | job inspection and lifecycle management |

Most people should think about training in this order:

  1. choose sft or rl
  2. submit one job with a narrow, explicit config
  3. wait or poll until the job settles
  4. read logs for debugging and artifacts for outputs

If you already selected a platform project through --project, environment variables, or a saved profile, dn train sft and dn train rl reuse that key as project_ref unless you pass --project-ref explicitly.
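As a sketch of an explicit override (the project name, capability, and dataset refs here are made-up placeholders in the documented `NAME@VERSION` form):

```sh
# --project-ref beats --project, environment variables, and saved profiles;
# "red-team-lab" and the refs below are illustrative placeholders.
dn train sft \
  --project-ref red-team-lab \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --capability my-agent@v1 \
  --dataset my-demos@v3 \
  --steps 100
```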

Use dn train sft when you already have the behavior you want in demonstration form. That usually means one of two things:

  • you have a normal supervised dataset of prompts and target outputs
  • you have trajectory datasets from prior Worlds or agent runs and want to learn from them
```sh
dn train sft \
  --server http://127.0.0.1:8000 \
  --api-key "$DREADNODE_API_KEY" \
  --organization dreadnode \
  --workspace localdev \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --capability [email protected] \
  --dataset [email protected] \
  --steps 100 \
  --wait \
  --json
```

In that example:

  • --dataset is the direct supervised input
  • --capability tells the backend which capability context to train around
  • --wait turns the command into a synchronous shell workflow instead of a fire-and-forget submit

You can also train directly from published Worlds trajectory datasets:

```sh
dn train sft \
  --server http://127.0.0.1:8000 \
  --api-key "$DREADNODE_API_KEY" \
  --organization dreadnode \
  --workspace localdev \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --capability [email protected] \
  --trajectory-dataset dreadnode/[email protected] \
  --trajectory-dataset dreadnode/[email protected] \
  --steps 50
```

Use trajectory datasets when the demonstrations already exist as rollouts rather than flat prompt or response rows.

Common SFT flags:

| Flag | Description |
| --- | --- |
| `--dataset NAME@VERSION` | primary supervised dataset |
| `--trajectory-dataset NAME@VERSION` | Worlds trajectory dataset input, repeatable |
| `--eval-dataset NAME@VERSION` | optional eval dataset |
| `--batch-size <n>` | per-step batch size |
| `--gradient-accumulation-steps <n>` | gradient accumulation factor |
| `--learning-rate <float>` | optimizer learning rate |
| `--checkpoint-interval <n>` | save checkpoint every N steps |
| `--wait` | poll until terminal state |
| `--json` | print the full job payload |

Use dn train rl when the signal comes from reward logic, verifier outcomes, or environment rollouts rather than from fixed target answers.

RL is the more decision-heavy path, so the most useful first question is: where will the experience come from?

| Input source | Use it when |
| --- | --- |
| `--prompt-dataset` | you already have prompts and will score the outputs |
| `--trajectory-dataset` | you want offline RL from previously collected trajectories |
| `--world-manifest-id` | you want the job to sample from a live Worlds environment |
```sh
dn train rl \
  --server http://127.0.0.1:8000 \
  --api-key "$DREADNODE_API_KEY" \
  --organization dreadnode \
  --workspace localdev \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --capability [email protected] \
  --prompt-dataset [email protected] \
  --algorithm importance_sampling \
  --execution-mode fully_async \
  --max-steps-off-policy 3 \
  --reward-recipe contains_v1 \
  --reward-params '{"needle":"flag"}'
```

That pattern is verifier- or reward-driven RL: the prompt dataset supplies prompts, and the reward recipe decides what counts as success.
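When the reward params come from a variable rather than a literal, building the JSON with printf avoids hand-quoting mistakes. A minimal sketch (the "needle" key is from the example above; note printf does no JSON escaping, so this only suits values without quotes or backslashes):

```sh
# Build the --reward-params JSON from a shell variable instead of
# hand-quoting it inline on the command line.
needle='flag'
reward_params=$(printf '{"needle":"%s"}' "$needle")
echo "$reward_params"   # → {"needle":"flag"}
```

Then pass it as --reward-params "$reward_params" on the dn train rl command line.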

For Worlds-driven offline RL, replace the prompt dataset with trajectory datasets:

```sh
dn train rl \
  --server http://127.0.0.1:8000 \
  --api-key "$DREADNODE_API_KEY" \
  --organization dreadnode \
  --workspace localdev \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --capability [email protected] \
  --trajectory-dataset dreadnode/[email protected] \
  --trajectory-dataset dreadnode/[email protected] \
  --algorithm importance_sampling
```

When you want the job to sample from a live Worlds manifest, point it at the manifest directly:

```sh
dn train rl \
  --server http://127.0.0.1:8000 \
  --api-key "$DREADNODE_API_KEY" \
  --organization dreadnode \
  --workspace localdev \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --capability dreadnode/[email protected] \
  --world-manifest-id c8af2b7b-9b54-4b21-95a9-b8d403cd8c11 \
  --world-runtime-id 8b8fd3af-9a5e-47c8-9f67-7b87ca9387eb \
  --world-agent-name operator \
  --world-goal "Escalate to Domain Admin in corp.local" \
  --execution-mode fully_async \
  --max-steps-off-policy 3 \
  --num-rollouts 8
```

Use this when the job should generate fresh experience against an environment instead of learning purely from stored datasets. --world-runtime-id and --world-agent-name are how you tie that rollout to an existing runtime-bound capability snapshot when you need one.

If you also pass --world-reward, the job falls back to the older live-rollout reward path.
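As a hedged sketch of that older path (the reward name and its params here are illustrative, not a documented recipe; the manifest id is reused from the example above):

```sh
# Live-rollout reward path: --world-reward selects a named Worlds reward
# policy, --world-reward-params configures it. "goal_reached_v1" and the
# params JSON are hypothetical placeholders.
dn train rl \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --capability my-agent@v1 \
  --world-manifest-id c8af2b7b-9b54-4b21-95a9-b8d403cd8c11 \
  --world-reward goal_reached_v1 \
  --world-reward-params '{"threshold":0.5}' \
  --num-rollouts 8
```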

| Flag | Description |
| --- | --- |
| `--task REF` | task ref for verifier-driven RL |
| `--prompt-dataset REF` | prompt dataset input |
| `--trajectory-dataset REF` | Worlds trajectory dataset input, repeatable |
| `--world-manifest-id ID` | live Worlds manifest target |
| `--world-runtime-id ID` | runtime whose capability bindings should be used |
| `--world-agent-name NAME` | optional agent selection inside that runtime-bound capability |
| `--world-goal TEXT` | optional live rollout goal override |
| `--world-reward NAME` | named live Worlds reward policy |
| `--world-reward-params JSON` | JSON params for the selected Worlds reward |
| `--execution-mode <mode>` | sync, one_step_off_async, or fully_async |
| `--steps <n>` | number of optimization steps |
| `--num-rollouts <n>` | rollouts per update |
| `--max-turns <n>` | maximum turns per episode |
| `--max-episode-steps <n>` | environment step limit |
| `--weight-sync-interval <n>` | refresh sampler weights every N updates |
| `--max-steps-off-policy <n>` | max rollout staleness for async RL |
| `--stop <token>` | stop token, repeatable |

Once the job exists, these commands answer different questions:

```sh
dn train list
dn train get <job-id>
dn train wait <job-id> --json
dn train logs <job-id>
dn train artifacts <job-id>
dn train cancel <job-id> --json
```

Use them like this:

  • list finds the job again later
  • get shows the current state and saved config
  • wait blocks until a terminal state
  • logs is the first place to look for training failures
  • artifacts is where checkpoints, adapters, or final outputs show up
  • cancel stops the job but still preserves the server-side record

Queued jobs cancel immediately. Running jobs first become cancel-requested and may continue to show running until the worker finishes cleanup and writes the terminal state.

dn train wait exits non-zero if the terminal status is failed or cancelled.
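That exit-code contract means plain shell error handling is enough. A minimal sketch (substitute a real job id for the placeholder):

```sh
# wait exits non-zero on failed or cancelled, so a plain if suffices.
if dn train wait <job-id>; then
  dn train artifacts <job-id>   # checkpoints, adapters, final outputs
else
  dn train logs <job-id>        # first place to look for failures
fi
```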

Start with:

  • sft when you already have demonstrations
  • rl when you have rewards, verifiers, or environment outcomes

If you are still changing the prompt or instructions rather than the model weights, use /cli/optimization/ first.