# Training

Submit, inspect, wait on, and manage hosted SFT and RL jobs from the `dn` CLI.

Use `dn train ...` when the platform should run the training job and track its lifecycle for you.

This is the hosted training surface. It is for jobs that should keep a server-side record, logs, artifacts, and a terminal status. If you are still experimenting with prompts or metrics rather than model weights, optimization is usually the better fit.
## Before you submit a training job

Have these pieces ready first:
- a base model identifier the training backend can access
- a published capability ref that defines the agent or behavior you want to adapt
- one source of training data: a supervised dataset, trajectory datasets, or a live Worlds target
The training job record is only the control plane. The actual outputs you care about later are usually in `dn train artifacts`.
## Choose the right subcommand

| Command | Use it for |
|---|---|
| `dn train sft` | supervised fine-tuning from datasets or trajectory datasets |
| `dn train rl` | reinforcement learning from prompt datasets, trajectory datasets, or Worlds inputs |
| `dn train list` / `get` / `wait` / `logs` / `artifacts` / `cancel` | job inspection and lifecycle management |
## A normal training flow

Most people should think about training in this order:

- choose `sft` or `rl`
- submit one job with a narrow, explicit config
- wait or poll until the job settles
- read logs for debugging and artifacts for outputs
If you already selected a platform project through `--project`, environment variables, or a saved profile, `dn train sft` and `dn train rl` reuse that key as `project_ref` unless you pass `--project-ref` explicitly.
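That resolution order can be pictured as a small precedence function. This is only an illustration of the rule above, not CLI code, and the function name is made up:

```shell
# Illustrative precedence: an explicit --project-ref wins; otherwise
# fall back to whatever project key was already selected
# (via --project, environment variables, or a saved profile).
resolve_project_ref() {
  explicit="$1"   # value of --project-ref, may be empty
  selected="$2"   # project key from --project / env / profile, may be empty
  if [ -n "$explicit" ]; then
    echo "$explicit"
  else
    echo "$selected"
  fi
}
```

The point is simply that an explicit flag always overrides ambient configuration.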
## Submit SFT jobs

Use `dn train sft` when you already have the behavior you want in demonstration form. That usually means one of two things:

- you have a normal supervised dataset of prompts and target outputs
- you have trajectory datasets from prior Worlds or agent runs and want to learn from them

```shell
dn train sft \
  --server http://127.0.0.1:8000 \
  --api-key "$DREADNODE_API_KEY" \
  --organization dreadnode \
  --workspace localdev \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --dataset <name>@<version> \
  --capability <capability-ref> \
  --steps 100 \
  --wait \
  --json
```

In that example:

- `--dataset` is the direct supervised input
- `--capability` tells the backend which capability context to train around
- `--wait` turns the command into a synchronous shell workflow instead of a fire-and-forget submit
You can also train directly from published Worlds trajectory datasets:

```shell
dn train sft \
  --server http://127.0.0.1:8000 \
  --api-key "$DREADNODE_API_KEY" \
  --organization dreadnode \
  --workspace localdev \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --trajectory-dataset <name>@<version> \
  --steps 50
```

Use trajectory datasets when the demonstrations already exist as rollouts rather than as flat prompt/response rows.
Common SFT flags:
| Flag | Description |
|---|---|
| `--dataset NAME@VERSION` | primary supervised dataset |
| `--trajectory-dataset NAME@VERSION` | Worlds trajectory dataset input, repeatable |
| `--eval-dataset NAME@VERSION` | optional eval dataset |
| `--batch-size <n>` | per-step batch size |
| `--gradient-accumulation-steps <n>` | gradient accumulation factor |
| `--learning-rate <float>` | optimizer learning rate |
| `--checkpoint-interval <n>` | save checkpoint every N steps |
| `--wait` | poll until terminal state |
| `--json` | print the full job payload |
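`--batch-size` and `--gradient-accumulation-steps` interact: the optimizer effectively sees their product as the batch per update. A quick illustration with made-up numbers (these are not CLI defaults):

```shell
# Effective examples per optimizer step = batch size x accumulation steps.
# The values here are illustrative only.
batch_size=8
grad_accum=4
effective_batch=$((batch_size * grad_accum))
echo "$effective_batch"   # prints 32
```

Raising accumulation is the usual lever when the effective batch you want does not fit in memory at the per-step batch size.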
## Submit RL jobs

Use `dn train rl` when the signal comes from reward logic, verifier outcomes, or environment rollouts rather than from fixed target answers.

RL is the more decision-heavy path, so the most useful first question is: where will the experience come from?
| Input source | Use it when |
|---|---|
| `--prompt-dataset` | you already have prompts and will score the outputs |
| `--trajectory-dataset` | you want offline RL from previously collected trajectories |
| `--world-manifest-id` | you want the job to sample from a live Worlds environment |
```shell
dn train rl \
  --server http://127.0.0.1:8000 \
  --api-key "$DREADNODE_API_KEY" \
  --organization dreadnode \
  --workspace localdev \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --prompt-dataset <name>@<version> \
  --algorithm importance_sampling \
  --execution-mode fully_async \
  --max-steps-off-policy 3 \
  --reward-recipe contains_v1 \
  --reward-params '{"needle":"flag"}'
```

That pattern is verifier- or reward-driven RL: the prompt dataset supplies the prompts, and the reward recipe decides what counts as success.
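The reward recipe is the scoring side of that loop. As an illustration only, not the platform's actual implementation, a contains-style recipe like `contains_v1` with `{"needle":"flag"}` presumably reduces to a substring check over the model output:

```shell
# Hypothetical sketch of a contains-style reward:
# 1 if the needle substring appears anywhere in the output, else 0.
contains_reward() {
  output="$1"
  needle="$2"
  case "$output" in
    *"$needle"*) echo 1 ;;
    *)           echo 0 ;;
  esac
}
```

A binary reward like this is coarse but cheap, which is why it pairs well with many rollouts per update.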
For Worlds-driven offline RL, replace the prompt dataset with trajectory datasets:

```shell
dn train rl \
  --server http://127.0.0.1:8000 \
  --api-key "$DREADNODE_API_KEY" \
  --organization dreadnode \
  --workspace localdev \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --trajectory-dataset <name>@<version> \
  --algorithm importance_sampling
```

## Worlds-backed RL

When you want the job to sample from a live Worlds manifest, point it at the manifest directly:
```shell
dn train rl \
  --server http://127.0.0.1:8000 \
  --api-key "$DREADNODE_API_KEY" \
  --organization dreadnode \
  --workspace localdev \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --world-manifest-id c8af2b7b-9b54-4b21-95a9-b8d403cd8c11 \
  --world-runtime-id 8b8fd3af-9a5e-47c8-9f67-7b87ca9387eb \
  --world-agent-name operator \
  --world-goal "Escalate to Domain Admin in corp.local" \
  --execution-mode fully_async \
  --max-steps-off-policy 3 \
  --num-rollouts 8
```

Use this when the job should generate fresh experience against an environment instead of learning purely from stored datasets. `--world-runtime-id` and `--world-agent-name` are how you tie that rollout to an existing runtime-bound capability snapshot when you need one.
If you also pass `--world-reward`, the job falls back to the older live-rollout reward path.
## Common RL flags

| Flag | Description |
|---|---|
| `--task REF` | task ref for verifier-driven RL |
| `--prompt-dataset REF` | prompt dataset input |
| `--trajectory-dataset REF` | Worlds trajectory dataset input, repeatable |
| `--world-manifest-id ID` | live Worlds manifest target |
| `--world-runtime-id ID` | runtime whose capability bindings should be used |
| `--world-agent-name NAME` | optional agent selection inside that runtime-bound capability |
| `--world-goal TEXT` | optional live-rollout goal override |
| `--world-reward NAME` | named live Worlds reward policy |
| `--world-reward-params JSON` | JSON params for the selected Worlds reward |
| `--execution-mode <mode>` | `sync`, `one_step_off_async`, or `fully_async` |
| `--steps <n>` | number of optimization steps |
| `--num-rollouts <n>` | rollouts per update |
| `--max-turns <n>` | maximum turns per episode |
| `--max-episode-steps <n>` | environment step limit |
| `--weight-sync-interval <n>` | refresh sampler weights every N updates |
| `--max-steps-off-policy <n>` | max rollout staleness for async RL |
| `--stop <token>` | stop token, repeatable |
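`--max-steps-off-policy` bounds how stale asynchronous experience may get. As a rough sketch of the idea, not the actual scheduler logic, a rollout sampled at one policy version stays usable only while the trainer has advanced at most that many steps past it:

```shell
# Hypothetical staleness check for async RL rollouts.
# A rollout sampled at rollout_step is dropped once the trainer's
# current_step has moved more than max_off_policy steps beyond it.
rollout_is_fresh() {
  current_step="$1"
  rollout_step="$2"
  max_off_policy="$3"
  if [ $((current_step - rollout_step)) -le "$max_off_policy" ]; then
    echo fresh
  else
    echo stale
  fi
}
```

A smaller bound keeps updates closer to on-policy; a larger one lets fully async workers run further ahead at the cost of staler gradients.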
## After the job starts

Once the job exists, these commands answer different questions:

```shell
dn train list
dn train get <job-id>
dn train wait <job-id> --json
dn train logs <job-id>
dn train artifacts <job-id>
dn train cancel <job-id> --json
```

Use them like this:

- `list` finds the job again later
- `get` shows the current state and saved config
- `wait` blocks until a terminal state
- `logs` is the first place to look for training failures
- `artifacts` is where checkpoints, adapters, or final outputs show up
- `cancel` stops the job but still preserves the server-side record
Queued jobs cancel immediately. Running jobs first become cancel-requested and may continue to show `running` until the worker finishes cleanup and writes the terminal state.

`dn train wait` exits non-zero if the terminal status is `failed` or `cancelled`.
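That exit-code contract means `dn train wait` drops straight into shell control flow. In this sketch the CLI call is mocked with a shell function so the pattern is self-contained; in real use you would invoke the `dn` binary itself:

```shell
# Stand-in for the real CLI: pretend the job ended failed or cancelled,
# so "dn train wait" returns a non-zero exit code.
dn() { return 1; }

if dn train wait some-job-id --json; then
  echo "job succeeded"
else
  echo "job failed or was cancelled"   # this branch runs with the mock above
fi
```

With the real CLI, the same `if` gate lets a pipeline stop (or trigger `dn train logs`) as soon as a job lands in a bad terminal state.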
## Practical rule

Start with:

- `sft` when you already have demonstrations
- `rl` when you have rewards, verifiers, or environment outcomes

If you are still changing the prompt or instructions rather than the model weights, use /cli/optimization/ first.