Datasets
Versioned data for evaluations, training, and optimization — authored as a directory, published as an artifact, pinned by reference.
A Dreadnode dataset is a directory with a dataset.yaml manifest that the platform packages, versions, and serves back by reference. Author locally, publish a version, then pin that version from an evaluation, training job, or optimization study.
```
support-prompts/
├── dataset.yaml
└── splits/
    ├── train.parquet
    ├── validation.parquet
    └── test.parquet
```

Every consumer — training job configs, the SDK pull/load path, and the CLI — resolves the same `org/name@version` reference.
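That reference string has a fixed shape. A minimal sketch of parsing it — the regex, helper, and `DatasetRef` type here are illustrative, not part of the SDK, and the platform's actual validation rules may differ:

```python
import re
from typing import NamedTuple

class DatasetRef(NamedTuple):
    org: str
    name: str
    version: str

# Illustrative pattern for org/name@version references.
_REF = re.compile(r"^(?P<org>[\w-]+)/(?P<name>[\w-]+)@(?P<version>[\w.-]+)$")

def parse_ref(ref: str) -> DatasetRef:
    """Split an org/name@version reference into its three parts."""
    m = _REF.match(ref)
    if m is None:
        raise ValueError(f"not an org/name@version reference: {ref!r}")
    return DatasetRef(m["org"], m["name"], m["version"])
```

For example, `parse_ref("acme/support-prompts@1.2.0")` yields `DatasetRef(org="acme", name="support-prompts", version="1.2.0")`.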
The lifecycle
- Author the directory locally: a `dataset.yaml`, one or more data files, splits if needed.
- Inspect before publishing — `dn dataset inspect ./path` catches schema and format problems before anything leaves your machine.
- Push to the registry with `dn dataset push` or `dn.push_dataset(...)`.
- Share or pin: keep the version private to your organization, or `dn dataset publish` it to the public catalog.
- Consume from evaluations, training, optimization, or ad-hoc SDK code by pinning `org/name@version`.
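The authoring step needs nothing beyond the standard library. A sketch that scaffolds a local dataset directory ready for `dn dataset inspect` — the manifest keys and file layout below are illustrative, not the authoritative `dataset.yaml` schema:

```python
import json
from pathlib import Path

def scaffold_dataset(root: str) -> Path:
    """Write a minimal dataset directory: a dataset.yaml plus jsonl splits.

    Manifest keys here are an assumption for illustration; consult the
    dataset.yaml reference for the real schema.
    """
    base = Path(root)
    splits_dir = base / "splits"
    splits_dir.mkdir(parents=True, exist_ok=True)

    # One tabular row per split, written as jsonl (a supported format).
    rows = {
        "train": [{"prompt": "Reset my password", "label": "account"}],
        "test": [{"prompt": "Cancel my plan", "label": "billing"}],
    }
    for split, items in rows.items():
        with open(splits_dir / f"{split}.jsonl", "w") as f:
            for row in items:
                f.write(json.dumps(row) + "\n")

    manifest = "\n".join([
        "name: support-prompts",
        "format: jsonl",
        "splits:",
        "  train: splits/train.jsonl",
        "  test: splits/test.jsonl",
        "",
    ])
    (base / "dataset.yaml").write_text(manifest)
    return base
```

From there, `dn dataset inspect ./support-prompts` validates the directory locally before `dn dataset push` sends anything to the registry.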
Every step is covered on one of the pages below.
Formats and splits
Datasets hold tabular data. Supported artifact formats are parquet, csv, arrow, feather, json, and jsonl; all files within one dataset must share a single format. Parquet is the default and the cheapest to ship.
Splits are optional. When `dataset.yaml` declares `splits: {train: ..., test: ...}`, consumers can ask for one (`load_dataset(..., split="train")`, `dn dataset pull --split train`). Without splits, the dataset is a flat set of rows across one or more files.
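Concretely, a manifest declaring two splits might look like this — field names beyond `splits` are an illustrative assumption, not the authoritative schema:

```yaml
# Illustrative dataset.yaml; splits map split names to data files.
format: parquet
splits:
  train: splits/train.parquet
  test: splits/test.parquet
```

With this in place, `dn dataset pull --split train` fetches only the train file.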
When a dataset belongs in the registry
Publish a dataset when the rows need to live somewhere reproducible — benchmarks you rerun, training corpora, adversarial goal sets, regression suites. Every rerun of a pinned version loads the same bytes.
Keep rows inline when they are one-shot evaluation inputs scoped to a single config file. Evaluation manifests accept a dataset: block with per-row parameters for exactly this case — see Evaluations → Inputs. Same noun, different mechanic; the registry page is about the durable-artifact side.
Related surfaces
Full CLI: `dn dataset`. The Hub shows the same registry visually — org and public datasets, version history, facet filters, download activity.