
Running training jobs

Submit, wait on, inspect, cancel, and retry hosted training jobs from the CLI, the SDK, or the App.

A hosted training job is a server-side record with a lifecycle. Submit creates it in queued; workers advance it through running to a terminal state: completed, failed, or cancelled. (pending is reserved in the schema for future use; current submissions land in queued directly.) These commands are how you inspect, wait on, cancel, or retry that record without dropping into the App.

Terminal window
dn train list # in-flight and recent jobs
dn train get <job-id> # resolved refs + status + metrics
dn train wait <job-id> # block until terminal state
dn train logs <job-id> # structured worker log entries
dn train artifacts <job-id> # outputs produced by the run
dn train cancel <job-id> # stop a queued or running job

All subcommands accept --json to dump the raw response payload instead of a rendered summary. Full flag surface: dn train.

dn train wait <job-id> polls until the job reaches a terminal state. Two flags bound the wait:

  • --poll-interval-sec <float> (default 5.0) — how often to refresh.
  • --timeout-sec <float> (optional) — give up after this many wall-clock seconds.

The command exits non-zero when the final status is anything other than completed — not just failed or cancelled. If a timeout fires before the job is terminal, that too is a non-zero exit. Use this in CI to fail the step on anything that isn’t a clean finish.

Passing --wait to dn train sft or dn train rl submits the job and then enters the same poll loop in one shot.

dn train logs <job-id> returns structured log entries — each line carries an ISO-8601 timestamp, a level (debug, info, warning, error), a message, and an optional data object. Pass --json for the raw payload. Logs persist on the job record and stay available after the job finishes.

This is the fastest path to a failure root cause. A job that settles to failed with no useful error string almost always has the real story in the logs.
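A minimal sketch of that triage step, assuming the entry shape described above (a list of objects with timestamp, level, message, and an optional data key) is what --json emits; the sample entries below are illustrative, not real output:

```python
import json

def error_entries(raw_json: str) -> list[dict]:
    """Return only the error-level log entries, oldest first."""
    entries = json.loads(raw_json)
    return [e for e in entries if e.get("level") == "error"]

# Illustrative payload in the shape described above.
raw = json.dumps([
    {"timestamp": "2024-01-01T00:00:00Z", "level": "info", "message": "step 1 ok"},
    {"timestamp": "2024-01-01T00:00:05Z", "level": "error", "message": "worker OOM",
     "data": {"step": 2}},
])
print([e["message"] for e in error_entries(raw)])  # ['worker OOM']
```

Pipe the output of dn train logs <job-id> --json into something like this when the failed status string alone is not enough.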

Terminal window
dn train cancel <job-id>

Behavior depends on the job state:

  • Queued — moves directly to cancelled.
  • Running — records cancel_requested_at and asks the worker to stop. The status stays running until the worker finishes cleanup and settles the terminal state.
  • Terminal — no-op.

You can issue cancel any number of times; the backend handles the idempotency.
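As a reading aid, the transition rules above can be modeled as a pure function. This is a hypothetical model only; the real state machine lives on the backend and records cancel_requested_at asynchronously:

```python
# Hypothetical model of the cancel transitions described above;
# the real state machine is server-side.
TERMINAL = {"completed", "failed", "cancelled"}

def status_after_cancel(status: str) -> str:
    """Status immediately after a cancel request."""
    if status == "queued":
        return "cancelled"   # queued jobs cancel immediately
    if status == "running":
        return "running"     # cancel recorded; worker settles the terminal state later
    return status            # terminal states: cancel is a no-op
```

Note that repeated calls are naturally idempotent here, mirroring the backend behavior.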

Retry keeps the saved job config but clears metrics, artifact refs, and worker state before re-queuing. It only applies to terminal jobs (completed, failed, cancelled).

from dreadnode.app.api.client import ApiClient

client = ApiClient("https://app.dreadnode.io", api_key="dn_...")
new_status = client.retry_training_job("acme", "research", job_id)  # job_id must be terminal

Retry is also available as a button on the App’s Training view.

Every CLI command has a one-to-one SDK method on ApiClient:

client.list_training_jobs("acme", "research") # paginated
client.get_training_job("acme", "research", job_id)
client.list_training_job_logs("acme", "research", job_id)
client.get_training_job_artifacts("acme", "research", job_id)
client.cancel_training_job("acme", "research", job_id)
client.retry_training_job("acme", "research", job_id)

list_training_jobs supports page, page_size, status, backend, trainer_type, and project_ref filters. page_size is capped at 100; page through the list rather than asking for a larger window. The SDK does not ship a built-in wait helper: if you need to wait from SDK code, loop on get_training_job with a backoff, or lean on dn train wait.
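Both patterns just described (walking pages under the 100-item cap, and polling get_training_job with a backoff) can be sketched with the fetch calls injected, since the exact response shapes are not shown here. fetch_page and fetch_status are placeholders you would wire to client.list_training_jobs and client.get_training_job:

```python
import time

TERMINAL = {"completed", "failed", "cancelled"}

def all_jobs(fetch_page, page_size=100):
    """Walk pages until a short page signals the end (page_size caps at 100)."""
    jobs, page = [], 1
    while True:
        batch = fetch_page(page=page, page_size=page_size)
        jobs.extend(batch)
        if len(batch) < page_size:
            return jobs
        page += 1

def wait_for_terminal(fetch_status, poll_interval=5.0, timeout=None, max_interval=60.0):
    """Poll with a gentle backoff until the job reaches a terminal state."""
    deadline = None if timeout is None else time.monotonic() + timeout
    interval = poll_interval
    while True:
        status = fetch_status()
        if status in TERMINAL:
            return status
        if deadline is not None and time.monotonic() >= deadline:
            raise TimeoutError(f"job not terminal after {timeout}s (last saw {status!r})")
        time.sleep(interval)
        interval = min(interval * 1.5, max_interval)
```

For example, fetch_status might be lambda: client.get_training_job("acme", "research", job_id).status, assuming the returned object exposes its status as a plain string; adjust both hooks to the real response shapes.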

The App’s Training view surfaces the same list of jobs with live metrics, logs, and Cancel / Retry buttons. It’s the easiest way to watch a long job and pick up a new one without a terminal. Clicking a row loads the detail pane; the list-side pagination matches the page/page_size params on the API.

  • Monitoring for what the App’s Training view shows while a job is live.
  • Outputs for the shape of artifacts, metrics, and logs on a completed job.