# Running training jobs

Submit, wait on, inspect, cancel, and retry hosted training jobs from the CLI, the SDK, or the App.
A hosted training job is a server-side record with a lifecycle. Submitting creates it in `queued`; workers advance it through `running` → `completed` / `failed` / `cancelled`. (`pending` is reserved in the schema for future use; current submissions land in `queued` directly.) These commands let you inspect, wait on, cancel, or retry that record without dropping into the App.
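The lifecycle above can be captured in a small helper. This is a minimal sketch for illustration: the state names come from this page, but the transition table and function names are assumptions, not the server's actual schema.

```python
# Terminal states can no longer change (only retry re-queues a job).
TERMINAL_STATES = {"completed", "failed", "cancelled"}

# pending is reserved in the schema; current submissions land in queued.
# This transition table is illustrative, not the backend's real state machine.
VALID_TRANSITIONS = {
    "queued": {"running", "cancelled"},
    "running": TERMINAL_STATES,
}

def is_terminal(status: str) -> bool:
    """True once a job has settled and will not advance further."""
    return status in TERMINAL_STATES

def can_transition(src: str, dst: str) -> bool:
    """Check whether the sketched lifecycle allows src -> dst."""
    return dst in VALID_TRANSITIONS.get(src, set())
```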
## CLI lifecycle

```sh
dn train list                # in-flight and recent jobs
dn train get <job-id>        # resolved refs + status + metrics
dn train wait <job-id>       # block until terminal state
dn train logs <job-id>       # structured worker log entries
dn train artifacts <job-id>  # outputs produced by the run
dn train cancel <job-id>     # stop a queued or running job
```

All subcommands accept `--json` to dump the raw response payload instead of a rendered summary.
Full flag surface: `dn train`.
## Waiting

`dn train wait <job-id>` polls until the job reaches a terminal state. Two flags bound the wait:

- `--poll-interval-sec <float>` (default `5.0`) — how often to refresh.
- `--timeout-sec <float>` (optional) — give up after this many wall-clock seconds.
The command exits non-zero when the final status is anything other than `completed` — not just `failed` or `cancelled`. If a timeout fires before the job is terminal, that too is a non-zero exit. Use this in CI to fail the step on anything that isn’t a clean finish.
The `--wait` flag on `dn train sft` and `dn train rl` submits and then enters the same poll loop in one shot.
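The exit-code contract boils down to something like the following sketch. `get_status` is a stand-in for however you fetch the job's current state (it is not a real SDK call), and the function mirrors the semantics described above: zero only for a clean `completed` finish.

```python
import time

def wait_for_terminal(get_status, poll_interval_sec=5.0, timeout_sec=None):
    """Mimic `dn train wait`: return a shell-style exit code.

    0 only when the job finishes as completed; any other terminal state,
    or a timeout before the job settles, is non-zero.
    """
    terminal = {"completed", "failed", "cancelled"}
    deadline = None if timeout_sec is None else time.monotonic() + timeout_sec
    while True:
        status = get_status()
        if status in terminal:
            return 0 if status == "completed" else 1
        if deadline is not None and time.monotonic() >= deadline:
            return 1  # timed out before the job reached a terminal state
        time.sleep(poll_interval_sec)
```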
## Logs

`dn train logs <job-id>` returns structured log entries — each carries an ISO-8601 timestamp, a level (`debug`, `info`, `warning`, `error`), a message, and an optional `data` object. Pass `--json` for the raw payload. Logs persist on the job record and stay available after the job finishes.
This is the fastest path to a failure’s root cause. A job that settles to `failed` with no useful error string almost always has the real story in the logs.
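Triage usually starts by filtering to the error-level entries. The entry shape (timestamp, level, message, optional data) follows the description above; the helper names are illustrative.

```python
def error_entries(entries):
    """Keep only the error-level entries from a list of structured logs.

    Each entry is assumed to be a dict with timestamp, level, message,
    and an optional data key, per the log format described above.
    """
    return [e for e in entries if e.get("level") == "error"]

def summarize(entry):
    """One-line summary: timestamp, message, and any attached data."""
    line = f'{entry["timestamp"]} {entry["message"]}'
    if entry.get("data"):
        line += f' {entry["data"]}'
    return line
```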
## Cancellation

```sh
dn train cancel <job-id>
```

Behavior depends on the job state:

- **Queued** — moves directly to `cancelled`.
- **Running** — records `cancel_requested_at` and asks the worker to stop. The status stays `running` until the worker finishes cleanup and settles the terminal state.
- **Terminal** — no-op.
You can submit `cancel` any number of times; the backend handles the idempotency.
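The state-dependent behavior can be modeled as one idempotent transition. This sketch restates the rules listed above; it is not the backend's actual implementation, and the field names are assumptions.

```python
def apply_cancel(job, now):
    """Apply one cancel request to a job dict, per the rules above.

    Safe to call any number of times: repeated cancels are no-ops.
    """
    terminal = {"completed", "failed", "cancelled"}
    if job["status"] in terminal:
        return job                       # terminal: no-op
    if job["status"] == "queued":
        job["status"] = "cancelled"      # queued: cancel immediately
    elif job["status"] == "running":
        # running: record the request; the worker settles the final state
        job.setdefault("cancel_requested_at", now)
    return job
```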
## Retry

Retry keeps the saved job config but clears metrics, artifact refs, and worker state before re-queuing. It only applies to terminal jobs (`completed`, `failed`, `cancelled`).
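As a sketch of what that clearing means, with hypothetical field names (only the semantics follow the description above: config kept, run state cleared, status back to `queued`):

```python
def reset_for_retry(job):
    """Re-queue a terminal job, keeping config but clearing run state.

    Field names are illustrative, not the server's actual schema.
    """
    if job["status"] not in {"completed", "failed", "cancelled"}:
        raise ValueError("retry only applies to terminal jobs")
    return {
        "config": job["config"],  # saved job config survives
        "status": "queued",       # back into the queue
        "metrics": {},            # cleared
        "artifacts": [],          # cleared
        "worker_state": None,     # cleared
    }
```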
```python
from dreadnode.app.api.client import ApiClient

client = ApiClient("https://app.dreadnode.io", api_key="dn_...")
new_status = client.retry_training_job("acme", "research", job_id)
```

Retry is also available as a button on the App’s Training view.
## From the SDK

Every CLI command has a one-to-one SDK method on `ApiClient`:

```python
client.list_training_jobs("acme", "research")  # paginated
client.get_training_job("acme", "research", job_id)
client.list_training_job_logs("acme", "research", job_id)
client.get_training_job_artifacts("acme", "research", job_id)
client.cancel_training_job("acme", "research", job_id)
client.retry_training_job("acme", "research", job_id)
```

`list_training_jobs` supports `page`, `page_size`, `status`, `backend`, `trainer_type`, and `project_ref` filters. `page_size` is capped at 100 — page through the list rather than asking for a larger window. The SDK does not ship a built-in wait helper; loop on `get_training_job` with a backoff if you need to wait from the SDK, or lean on `dn train wait`.
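Both patterns look roughly like the following. `client` stands in for an `ApiClient`, the method names are the ones listed above, and the response shapes (each page as a list, each job as a dict with a `status` key) are assumptions to adapt to the actual payloads.

```python
import time

def all_jobs(client, org, project, page_size=100):
    """Page through list_training_jobs; page_size is capped at 100.

    Assumes each call returns the page as a list of jobs.
    """
    page, jobs = 1, []
    while True:
        batch = client.list_training_jobs(org, project, page=page, page_size=page_size)
        jobs.extend(batch)
        if len(batch) < page_size:
            return jobs  # short page means we ran off the end
        page += 1

def wait_with_backoff(client, org, project, job_id, base=2.0, cap=60.0):
    """Poll get_training_job with exponential backoff until terminal."""
    delay = base
    while True:
        job = client.get_training_job(org, project, job_id)
        if job["status"] in {"completed", "failed", "cancelled"}:
            return job
        time.sleep(delay)
        delay = min(delay * 2, cap)
```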
## From the App

The App’s Training view surfaces the same list of jobs with live metrics, logs, and Cancel / Retry buttons. It’s the easiest way to watch a long job and pick up a new one without a terminal. Clicking a row loads the detail pane; the list-side pagination matches the `page`/`page_size` params on the API.
## Where to go next

- Monitoring for what the App’s Training view shows while a job is live.
- Outputs for the shape of artifacts, metrics, and logs on a completed job.