
Datasets

Versioned data for evaluations, training, and optimization — authored as a directory, published as an artifact, pinned by reference.

A Dreadnode dataset is a directory with a dataset.yaml manifest that the platform packages, versions, and serves back by reference. Author locally, publish a version, then pin that version from an evaluation, training job, or optimization study.

support-prompts/
  dataset.yaml
  splits/
    train.parquet
    validation.parquet
    test.parquet
dn dataset push ./support-prompts # → acme/support-prompts@&lt;version&gt;
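For orientation, a minimal `dataset.yaml` might look like the sketch below. Only the `splits` key is documented on this page; the other field names (`name`, `format`) are illustrative assumptions, not the real manifest schema.

```yaml
# Hypothetical dataset.yaml sketch — only `splits` is confirmed by this page;
# `name` and `format` are assumed field names for illustration.
name: support-prompts
format: parquet
splits:
  train: splits/train.parquet
  validation: splits/validation.parquet
  test: splits/test.parquet
```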

Every consumer — training job configs, the SDK pull/load path, and the CLI — resolves the same org/name@version reference.
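A pinned reference is just a string of the shape `org/name@version`. As a sketch of how such a string decomposes (this parser is illustrative only, not the platform's actual resolver):

```python
from typing import NamedTuple


class DatasetRef(NamedTuple):
    org: str
    name: str
    version: str


def parse_ref(ref: str) -> DatasetRef:
    """Split an 'org/name@version' pin into its parts (illustrative only)."""
    path, sep, version = ref.partition("@")
    if not sep or not version:
        raise ValueError(f"missing @version in {ref!r}")
    org, sep, name = path.partition("/")
    if not sep or not org or not name:
        raise ValueError(f"expected org/name before @ in {ref!r}")
    return DatasetRef(org, name, version)


print(parse_ref("acme/support-prompts@v1"))
# → DatasetRef(org='acme', name='support-prompts', version='v1')
```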

  1. Author the directory locally: a dataset.yaml, one or more data files, splits if needed.
  2. Inspect before publishing — dn dataset inspect ./path catches schema and format problems before anything leaves your machine.
  3. Push to the registry with dn dataset push or dn.push_dataset(...).
  4. Share or pin: keep the version private to your organization, or dn dataset publish it to the public catalog.
  5. Consume from evaluations, training, optimization, or ad-hoc SDK code by pinning org/name@version.
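The inspect-before-push step can be approximated locally. This sketch is not `dn dataset inspect`; it checks only two invariants stated on this page: the directory must contain a `dataset.yaml`, and all data files must share one supported format.

```python
from pathlib import Path

# Formats this page lists as supported; a dataset must use exactly one of them.
SUPPORTED = {".parquet", ".csv", ".arrow", ".feather", ".json", ".jsonl"}


def check_dataset_dir(root: str) -> list[str]:
    """Rough local sanity check — loosely mirrors what an inspect step might flag."""
    problems = []
    base = Path(root)
    if not (base / "dataset.yaml").is_file():
        problems.append("missing dataset.yaml manifest")
    formats = {p.suffix for p in base.rglob("*") if p.suffix in SUPPORTED}
    if not formats:
        problems.append("no data files found")
    elif len(formats) > 1:
        problems.append(f"mixed data formats: {sorted(formats)}")
    return problems
```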

Every step is covered on one of the pages below.

Datasets hold tabular data. Supported artifact formats are parquet, csv, arrow, feather, json, and jsonl; all files within a single dataset must share one format. Parquet is the default and the cheapest to ship.

Splits are optional. When dataset.yaml declares splits: {train: ..., test: ...}, consumers can ask for one (load_dataset(..., split="train"), dn dataset pull --split train). Without splits, the dataset is a flat set of rows across one or more files.
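To make the split mechanic concrete, here is a stdlib-only sketch using csv, one of the supported formats. Both helpers are hypothetical illustrations, not the SDK's `load_dataset` or pull path: each declared split maps to a file, and asking for one split reads only that file.

```python
import csv
from pathlib import Path


def write_split(path, rows, fieldnames):
    """Write one split's rows as a CSV file (hypothetical helper)."""
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)


def load_split(root, split):
    """Read back a single split by name, analogous to split='train'."""
    with open(Path(root) / f"{split}.csv", newline="") as f:
        return list(csv.DictReader(f))
```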

Publish a dataset when the rows need to live somewhere reproducible — benchmarks you rerun, training corpora, adversarial goal sets, regression suites. Every rerun of a pinned version loads the same bytes.

Keep rows inline when they are one-shot evaluation inputs scoped to a single config file. Evaluation manifests accept a dataset: block with per-row parameters for exactly this case — see Evaluations → Inputs. Same noun, different mechanic; the registry page is about the durable-artifact side.

Full CLI: dn dataset. The Hub shows the same registry visually — org and public datasets, version history, facet filters, download activity.