# dataset.yaml reference

Every field of the dataset manifest, accepted values, and defaults.

Every dataset published to Dreadnode is a directory with a `dataset.yaml` manifest at the root. This page enumerates every field accepted by that manifest.

For authoring guidance, see Authoring a dataset.

| Field | Type | Required | Default | Notes |
| --- | --- | --- | --- | --- |
| `name` | string | No | directory name | Registry name. Override with `--name` on `dn dataset push`. |
| `version` | string | No | `0.1.0` | Fixed semver (`X.Y.Z`). Pre-release and build suffixes are rejected. |
| `summary` | string | No | none | One-line description shown in list output and the Hub. |
| `description` | string | No | none | Alias for `summary`. `summary` wins if both are set. |
| `format` | string | No | inferred from file extensions | One of `parquet`, `csv`, `arrow`, `feather`, `json`, `jsonl`. Applied across every artifact. |
| `data_schema` | mapping of string | No | inferred from first artifact | Column name → type string (e.g. `string`, `int64`, `timestamp[us]`). |
| `row_count` | integer | No | summed across artifacts | Total rows. Override when the true count differs from what the loader sees. |
| `splits` | mapping of string | No | none | Split name → relative artifact path. Takes precedence over `files` if both are set. |
| `files` | list of strings | No | all files with known extensions | Explicit artifact paths relative to the directory root. Ignored when `splits` is also set. |

One of three rules decides which files enter the manifest:

| Manifest has | Behavior |
| --- | --- |
| `splits:` | Each value is a path relative to the directory root. Paths must stay inside it. |
| `files:` | Each entry is a path relative to the directory root. Paths must stay inside it. |
| Neither | Every file whose extension is `.parquet`, `.csv`, `.arrow`, `.feather`, `.json`, or `.jsonl` is included. Everything else, including `dataset.yaml` itself, is excluded. |

`.git`, `__pycache__`, and `.DS_Store` are always excluded.
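The selection rules above can be sketched in Python. This is an illustrative reimplementation, not the actual loader; the function name and traversal details are assumptions.

```python
from pathlib import Path

# Extensions the manifest recognizes, and names that are always skipped.
DATA_EXTENSIONS = {".parquet", ".csv", ".arrow", ".feather", ".json", ".jsonl"}
ALWAYS_EXCLUDED = {".git", "__pycache__", ".DS_Store"}

def select_artifacts(root, splits=None, files=None):
    """Hypothetical sketch of the three selection rules, in priority order."""
    root = Path(root).resolve()

    def inside(rel):
        # Relative paths must resolve to somewhere inside the dataset root.
        resolved = (root / rel).resolve()
        if root not in resolved.parents and resolved != root:
            raise ValueError(f"{rel} escapes the dataset directory")
        return resolved

    if splits:  # splits: takes precedence over files:
        return {name: inside(rel) for name, rel in splits.items()}
    if files:   # explicit artifact list
        return [inside(rel) for rel in files]
    # Neither: scan for known extensions, skipping excluded names.
    return [
        p for p in root.rglob("*")
        if p.suffix in DATA_EXTENSIONS
        and not ALWAYS_EXCLUDED & set(p.parts)
    ]
```

Note that `dataset.yaml` falls out of the extension scan automatically, since `.yaml` is not a recognized data extension.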

`data_schema` values are PyArrow type strings. Common values:

| Category | Examples |
| --- | --- |
| Integers | `int8`, `int16`, `int32`, `int64`, `uint32` |
| Floats | `float16`, `float32`, `float64` |
| Strings | `string`, `large_string` |
| Temporal | `date32[day]`, `timestamp[ms]`, `timestamp[us]` |
| Logical | `bool` |
| Nested | `list<string>`, `struct<a: int32, b: string>` |

When `data_schema` is omitted, the first artifact is loaded and `{field.name: str(field.type)}` is recorded for each column.

`format` determines how each artifact is read by `dn.load_dataset` and `dn dataset inspect`.

| Value | Reader | Notes |
| --- | --- | --- |
| `parquet` | `pyarrow.parquet` | Default and recommended. |
| `csv` | `pyarrow.csv` | No format-level options. |
| `arrow` | `pyarrow.feather` | Alias for `feather`. |
| `feather` | `pyarrow.feather` | |
| `json` | `pyarrow.json` | One JSON value per file. |
| `jsonl` | `pyarrow.json` | One value per line. |

All artifacts in one dataset must share a format. Mixed-format datasets are not supported.

Versions use fixed semver: three integers joined by dots. `1.0.0` is valid; `1.0`, `1.0.0-rc1`, and `1.0.0+build` are not. `dn dataset push` rejects invalid versions before uploading.
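A check of this kind can be sketched with a regular expression. The helper name is hypothetical; the pattern encodes only the "three integers joined by dots" rule stated above:

```python
import re

# Exactly three dot-separated runs of digits; no pre-release or build suffix.
FIXED_SEMVER = re.compile(r"^\d+\.\d+\.\d+$")

def is_valid_version(version):
    """True if `version` is fixed semver per the rule above (assumed helper)."""
    return bool(FIXED_SEMVER.match(version))

print(is_valid_version("1.0.0"))      # True
print(is_valid_version("1.0"))        # False
print(is_valid_version("1.0.0-rc1"))  # False
```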

```yaml
name: support-prompts
version: 1.2.0
summary: Labeled support tickets for intent classification.
format: parquet
row_count: 50000
splits:
  train: ./splits/train.parquet
  validation: ./splits/val.parquet
  test: ./splits/test.parquet
data_schema:
  ticket_id: string
  body: large_string
  intent: string
  priority: int32
  created_at: timestamp[us]
```