dataset.yaml reference
Every field of the dataset manifest, accepted values, and defaults.
Every dataset published to Dreadnode is a directory with a dataset.yaml manifest at the root. This page enumerates every field accepted by that manifest.
For authoring guidance, see Authoring a dataset.
Top-level fields
| Field | Type | Required | Default | Notes |
|---|---|---|---|---|
| name | string | No | directory name | Registry name. Override with --name on dn dataset push. |
| version | string | No | 0.1.0 | Fixed semver (X.Y.Z). Pre-release and build suffixes are rejected. |
| summary | string | No | none | One-line description shown in list output and the Hub. |
| description | string | No | none | Alias for summary. summary wins if both are set. |
| format | string | No | inferred from file extensions | One of parquet, csv, arrow, feather, json, jsonl. Applied across every artifact. |
| data_schema | mapping of strings | No | inferred from first artifact | Column name → type string (e.g. string, int64, timestamp[us]). |
| row_count | integer | No | summed across artifacts | Total rows. Override when the true count differs from what the loader sees. |
| splits | mapping of strings | No | none | Split name → relative artifact path. Takes precedence over files if both are set. |
| files | list of strings | No | all files with known extensions | Explicit artifact paths relative to the directory root. Ignored when splits is also set. |
Artifact discovery
One of three paths decides which files enter the manifest:
| Manifest has | Behavior |
|---|---|
| splits: | Each value is a path relative to the directory root. Paths must stay inside it. |
| files: | Each entry is a path relative to the directory root. Paths must stay inside it. |
| Neither | Every file whose extension is .parquet, .csv, .arrow, .feather, .json, or .jsonl is included. Everything else — including dataset.yaml itself — is excluded. |
.git, __pycache__, and .DS_Store are always excluded.
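The three-path rule can be sketched as follows. This is a minimal illustration, not the loader's actual implementation; the helper names (discover_artifacts, resolve_inside) are hypothetical.

```python
from pathlib import Path

# Extensions and exclusions mirror the rules described above.
KNOWN_EXTENSIONS = {".parquet", ".csv", ".arrow", ".feather", ".json", ".jsonl"}
ALWAYS_EXCLUDED = {".git", "__pycache__", ".DS_Store"}

def discover_artifacts(root, splits=None, files=None):
    """Return artifact paths: splits wins, then files, then extension scan."""
    root = Path(root).resolve()

    def resolve_inside(rel):
        # Paths must stay inside the dataset directory.
        p = (root / rel).resolve()
        if root not in p.parents and p != root:
            raise ValueError(f"{rel} escapes the dataset directory")
        return p

    if splits:
        return [resolve_inside(rel) for rel in splits.values()]
    if files:
        return [resolve_inside(rel) for rel in files]
    # Neither set: scan for known extensions, skipping excluded entries.
    return sorted(
        p for p in root.rglob("*")
        if p.is_file()
        and p.suffix in KNOWN_EXTENSIONS
        and not (ALWAYS_EXCLUDED & set(p.parts))
    )
```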
Schema strings
data_schema values are PyArrow type strings. Common values:
| Category | Examples |
|---|---|
| Integers | int8, int16, int32, int64, uint32 |
| Floats | float16, float32, float64 |
| Strings | string, large_string |
| Temporal | date32[day], timestamp[ms], timestamp[us] |
| Logical | bool |
| Nested | list<string>, struct<a: int32, b: string> |
When data_schema is omitted, the first artifact is loaded and {field.name: str(field.type)} is recorded for each column.
Formats
format determines how each artifact is read by dn.load_dataset and dn dataset inspect.
| Value | Reader | Notes |
|---|---|---|
| parquet | pyarrow.parquet | Default and recommended. |
| csv | pyarrow.csv | No format-level options. |
| arrow | pyarrow.feather | Alias for feather. |
| feather | pyarrow.feather | |
| json | pyarrow.json | One JSON value per file. |
| jsonl | pyarrow.json | One value per line. |
All artifacts in one dataset must share a format. Mixed-format datasets are not supported.
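The extension-based inference and the single-format rule can be sketched together. This is an assumption about how inference might work; the function infer_format is hypothetical and not part of the dn API.

```python
from pathlib import Path

# Extension → format value, mirroring the table above.
EXTENSION_TO_FORMAT = {
    ".parquet": "parquet",
    ".csv": "csv",
    ".arrow": "arrow",
    ".feather": "feather",
    ".json": "json",
    ".jsonl": "jsonl",
}

def infer_format(paths):
    """Infer one format for all artifacts, rejecting mixed-format sets."""
    formats = {EXTENSION_TO_FORMAT[Path(p).suffix] for p in paths}
    if len(formats) != 1:
        raise ValueError(f"artifacts mix formats: {sorted(formats)}")
    return formats.pop()
```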
Version rules
Versions use fixed semver: three integers joined by dots. 1.0.0 is valid; 1.0, 1.0.0-rc1, and 1.0.0+build are not. dn dataset push rejects invalid versions before uploading.
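The rule above amounts to a simple pattern check. A minimal sketch; the actual validation inside dn dataset push may differ in details such as leading-zero handling:

```python
import re

# Fixed semver: exactly three dot-separated integers, with no
# pre-release or build metadata. Leading zeros are rejected here,
# matching standard semver integer rules (an assumption).
FIXED_SEMVER = re.compile(r"^(0|[1-9]\d*)\.(0|[1-9]\d*)\.(0|[1-9]\d*)$")

def is_valid_version(version: str) -> bool:
    return FIXED_SEMVER.fullmatch(version) is not None
```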
Example
```yaml
name: support-prompts
version: 1.2.0
summary: Labeled support tickets for intent classification.
format: parquet
row_count: 50_000
splits:
  train: ./splits/train.parquet
  validation: ./splits/val.parquet
  test: ./splits/test.parquet
data_schema:
  ticket_id: string
  body: large_string
  intent: string
  priority: int32
  created_at: timestamp[us]
```