
Running training jobs

Submit, wait on, inspect, cancel, and retry hosted training jobs from the CLI, the SDK, or the App.

A hosted training job is a server-side record with a lifecycle. Submit creates it in queued; workers advance it through running to a terminal state: completed, failed, or cancelled. (pending is reserved in the schema for future use; current submissions land in queued directly.) These commands are how you inspect, wait on, cancel, or retry that record without dropping into the App.

Terminal window
dn train list # in-flight and recent jobs
dn train get <job-id> # resolved refs + status + metrics
dn train wait <job-id> # block until terminal state
dn train logs <job-id> # structured worker log entries
dn train artifacts <job-id> # outputs produced by the run
dn train cancel <job-id> # stop a queued or running job

All subcommands accept --json to dump the raw response payload instead of a rendered summary. Full flag surface: dn train.

dn train wait <job-id> polls until the job reaches a terminal state. Two flags bound the wait:

  • --poll-interval-sec <float> (default 5.0) — how often to refresh.
  • --timeout-sec <float> (optional) — give up after this many wall-clock seconds.

The command exits non-zero when the final status is anything other than completed — not just failed or cancelled. If a timeout fires before the job is terminal, that too is a non-zero exit. Use this in CI to fail the step on anything that isn’t a clean finish.

Passing --wait to dn train sft or dn train rl submits the job and then enters the same poll loop in one shot.

dn train logs <job-id> returns structured log entries — each line carries an ISO-8601 timestamp, a level (debug, info, warning, error), a message, and an optional data object. Pass --json for the raw payload. Logs persist on the job record and stay available after the job finishes.

This is the fastest path to a failure root cause. A job that settles to failed with no useful error string almost always has the real story in the logs.
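A minimal sketch of that triage step, assuming the entry shape described above (a list of objects with timestamp, level, message, and an optional data key) is what --json emits; the sample entries below are illustrative, not real output:

```python
import json

def error_entries(raw_json: str) -> list[dict]:
    """Return only the error-level log entries, oldest first."""
    entries = json.loads(raw_json)
    return [e for e in entries if e.get("level") == "error"]

# Illustrative payload in the shape described above.
raw = json.dumps([
    {"timestamp": "2024-01-01T00:00:00Z", "level": "info", "message": "step 1 ok"},
    {"timestamp": "2024-01-01T00:00:05Z", "level": "error", "message": "worker OOM",
     "data": {"step": 2}},
])
print([e["message"] for e in error_entries(raw)])  # ['worker OOM']
```

Pipe the output of dn train logs <job-id> --json into something like this when the failed status string alone is not enough.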

Terminal window
dn train cancel <job-id>

Behavior depends on the job state:

  • Queued — moves directly to cancelled.
  • Running — records cancel_requested_at and asks the worker to stop. The status stays running until the worker finishes cleanup and settles the terminal state.
  • Terminal — no-op.

You can issue cancel any number of times; the backend handles the idempotency.
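As a reading aid, the transition rules above can be modeled as a pure function. This is a hypothetical model only; the real state machine lives on the backend and records cancel_requested_at asynchronously:

```python
# Hypothetical model of the cancel transitions described above;
# the real state machine is server-side.
TERMINAL = {"completed", "failed", "cancelled"}

def status_after_cancel(status: str) -> str:
    """Status immediately after a cancel request."""
    if status == "queued":
        return "cancelled"   # queued jobs cancel immediately
    if status == "running":
        return "running"     # cancel recorded; worker settles the terminal state later
    return status            # terminal states: cancel is a no-op
```

Note that repeated calls are naturally idempotent here, mirroring the backend behavior.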

Retry keeps the saved job config but clears metrics, artifact refs, and worker state before re-queuing. It only applies to terminal jobs (completed, failed, cancelled).

from dreadnode.app.api.client import ApiClient

client = ApiClient("https://app.dreadnode.io", api_key="dn_...")
new_status = client.retry_training_job("acme", "research", job_id)  # job_id must be terminal

Retry is also available as a button on the App’s Training view.

Every CLI command has a one-to-one SDK method on ApiClient:

client.list_training_jobs("acme", "research") # paginated
client.get_training_job("acme", "research", job_id)
client.list_training_job_logs("acme", "research", job_id)
client.get_training_job_artifacts("acme", "research", job_id)
client.cancel_training_job("acme", "research", job_id)
client.retry_training_job("acme", "research", job_id)

list_training_jobs supports page, page_size, status, backend, trainer_type, and project_ref filters. page_size is capped at 100; page through the list rather than asking for a larger window. The SDK does not ship a built-in wait helper: if you need to wait from SDK code, loop on get_training_job with a backoff, or lean on dn train wait.
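Both patterns just described (walking pages under the 100-item cap, and polling get_training_job with a backoff) can be sketched with the fetch calls injected, since the exact response shapes are not shown here. fetch_page and fetch_status are placeholders you would wire to client.list_training_jobs and client.get_training_job:

```python
import time

TERMINAL = {"completed", "failed", "cancelled"}

def all_jobs(fetch_page, page_size=100):
    """Walk pages until a short page signals the end (page_size caps at 100)."""
    jobs, page = [], 1
    while True:
        batch = fetch_page(page=page, page_size=page_size)
        jobs.extend(batch)
        if len(batch) < page_size:
            return jobs
        page += 1

def wait_for_terminal(fetch_status, poll_interval=5.0, timeout=None, max_interval=60.0):
    """Poll with a gentle backoff until the job reaches a terminal state."""
    deadline = None if timeout is None else time.monotonic() + timeout
    interval = poll_interval
    while True:
        status = fetch_status()
        if status in TERMINAL:
            return status
        if deadline is not None and time.monotonic() >= deadline:
            raise TimeoutError(f"job not terminal after {timeout}s (last saw {status!r})")
        time.sleep(interval)
        interval = min(interval * 1.5, max_interval)
```

For example, fetch_status might be lambda: client.get_training_job("acme", "research", job_id).status, assuming the returned object exposes its status as a plain string; adjust both hooks to the real response shapes.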

The App’s Training view surfaces the same list of jobs with live metrics, logs, and Cancel / Retry buttons. It’s the easiest way to watch a long job and pick up a new one without a terminal. Clicking a row loads the detail pane; the list-side pagination matches the page/page_size params on the API.

  • Monitoring for what the App’s Training view shows while a job is live.
  • Outputs for the shape of artifacts, metrics, and logs on a completed job.