# dataset.yaml reference

Every field of the dataset manifest, accepted values, and defaults.

Every dataset published to Dreadnode is a directory with a `dataset.yaml` manifest at the root. This page enumerates every field accepted by that manifest.

For authoring guidance, see Authoring a dataset.

| Field | Type | Required | Default | Notes |
| --- | --- | --- | --- | --- |
| `name` | string | No | directory name | Registry name. Override with `--name` on `dn dataset push`. |
| `version` | string | No | `0.1.0` | Fixed semver (`X.Y.Z`). Pre-release and build suffixes are rejected. |
| `summary` | string | No | none | One-line description shown in list output and the Hub. |
| `description` | string | No | none | Alias for `summary`. `summary` wins if both are set. |
| `format` | string | No | inferred from file extensions | One of `parquet`, `csv`, `arrow`, `feather`, `json`, `jsonl`. Applied across every artifact. |
| `data_schema` | mapping of string | No | inferred from first artifact | Column name → type string (e.g. `string`, `int64`, `timestamp[us]`). |
| `row_count` | integer | No | summed across artifacts | Total rows. Override when the true count differs from what the loader sees. |
| `splits` | mapping of string | No | none | Split name → relative artifact path. Takes precedence over `files` if both are set. |
| `files` | list of strings | No | all files with known extensions | Explicit artifact paths relative to the directory root. Ignored when `splits` is also set. |

One of three rules decides which files enter the manifest:

| Manifest has | Behavior |
| --- | --- |
| `splits:` | Each value is a path relative to the directory root. Paths must stay inside it. |
| `files:` | Each entry is a path relative to the directory root. Paths must stay inside it. |
| Neither | Every file whose extension is `.parquet`, `.csv`, `.arrow`, `.feather`, `.json`, or `.jsonl` is included. Everything else, including `dataset.yaml` itself, is excluded. |

`.git`, `__pycache__`, and `.DS_Store` are always excluded.
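The selection rules above can be sketched in Python. This is an illustrative reimplementation, not the actual loader; the function name and traversal details are assumptions.

```python
from pathlib import Path

# Extensions the manifest recognizes, and names that are always skipped.
DATA_EXTENSIONS = {".parquet", ".csv", ".arrow", ".feather", ".json", ".jsonl"}
ALWAYS_EXCLUDED = {".git", "__pycache__", ".DS_Store"}

def select_artifacts(root, splits=None, files=None):
    """Hypothetical sketch of the three selection rules, in priority order."""
    root = Path(root).resolve()

    def inside(rel):
        # Relative paths must resolve to somewhere inside the dataset root.
        resolved = (root / rel).resolve()
        if root not in resolved.parents and resolved != root:
            raise ValueError(f"{rel} escapes the dataset directory")
        return resolved

    if splits:  # splits: takes precedence over files:
        return {name: inside(rel) for name, rel in splits.items()}
    if files:   # explicit artifact list
        return [inside(rel) for rel in files]
    # Neither: scan for known extensions, skipping excluded names.
    return [
        p for p in root.rglob("*")
        if p.suffix in DATA_EXTENSIONS
        and not ALWAYS_EXCLUDED & set(p.parts)
    ]
```

Note that `dataset.yaml` falls out of the extension scan automatically, since `.yaml` is not a recognized data extension.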

`data_schema` values are PyArrow type strings. Common values:

| Category | Examples |
| --- | --- |
| Integers | `int8`, `int16`, `int32`, `int64`, `uint32` |
| Floats | `float16`, `float32`, `float64` |
| Strings | `string`, `large_string` |
| Temporal | `date32[day]`, `timestamp[ms]`, `timestamp[us]` |
| Logical | `bool` |
| Nested | `list<string>`, `struct<a: int32, b: string>` |

When `data_schema` is omitted, the first artifact is loaded and `{field.name: str(field.type)}` is recorded for each column.

`format` determines how each artifact is read by `dn.load_dataset` and `dn dataset inspect`.

| Value | Reader | Notes |
| --- | --- | --- |
| `parquet` | `pyarrow.parquet` | Default and recommended. |
| `csv` | `pyarrow.csv` | No format-level options. |
| `arrow` | `pyarrow.feather` | Alias for `feather`. |
| `feather` | `pyarrow.feather` | |
| `json` | `pyarrow.json` | One JSON value per file. |
| `jsonl` | `pyarrow.json` | One value per line. |

All artifacts in one dataset must share a format. Mixed-format datasets are not supported.

Versions use fixed semver: three integers joined by dots. `1.0.0` is valid; `1.0`, `1.0.0-rc1`, and `1.0.0+build` are not. `dn dataset push` rejects invalid versions before uploading.
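A check of this kind can be sketched with a regular expression. The helper name is hypothetical; the pattern encodes only the "three integers joined by dots" rule stated above:

```python
import re

# Exactly three dot-separated runs of digits; no pre-release or build suffix.
FIXED_SEMVER = re.compile(r"^\d+\.\d+\.\d+$")

def is_valid_version(version):
    """True if `version` is fixed semver per the rule above (assumed helper)."""
    return bool(FIXED_SEMVER.match(version))

print(is_valid_version("1.0.0"))      # True
print(is_valid_version("1.0"))        # False
print(is_valid_version("1.0.0-rc1"))  # False
```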

```yaml
name: support-prompts
version: 1.2.0
summary: Labeled support tickets for intent classification.
format: parquet
row_count: 50000
splits:
  train: ./splits/train.parquet
  validation: ./splits/val.parquet
  test: ./splits/test.parquet
data_schema:
  ticket_id: string
  body: large_string
  intent: string
  priority: int32
  created_at: timestamp[us]
```