# Authoring a dataset
Structure a dataset directory, write dataset.yaml, declare splits and schema, and inspect locally before publishing.
A dataset source is a directory, a manifest, and one or more data files. The authoring loop is “edit → inspect → fix” until the local preview matches what you want the registry to store.
## The directory shape

```
support-prompts/
  dataset.yaml          # required — the manifest
  splits/
    train.parquet
    validation.parquet
    test.parquet
```

One file per split is idiomatic, but nothing stops you from putting everything in `data.parquet` at the root. Files can live anywhere under the directory — `dataset.yaml` addresses them with paths relative to the root.
See the manifest reference for every accepted field. This page covers the decisions worth thinking about.
## Minimum manifest

```yaml
name: support-prompts
version: 0.1.0
```

That's enough to push. Every other field is derived or optional:
- `format` is inferred from the first artifact's extension.
- `data_schema` is inferred from the first artifact's columns.
- `row_count` is summed across artifacts.
- Artifact paths default to every file under the directory with a known extension (`.parquet`, `.csv`, `.arrow`, `.feather`, `.json`, `.jsonl`).
Set those fields explicitly when you want the Hub record to reflect a curated intent rather than inference.
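For instance, a manifest that pins all three might look like this (the values are illustrative, not required):

```yaml
name: support-prompts
version: 0.1.0
format: parquet      # otherwise inferred from the first artifact's extension
row_count: 48213     # otherwise summed across artifacts
data_schema:         # otherwise inferred from the first artifact's columns
  ticket_id: string
  intent: string
```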
## Declare splits

When a consumer should be able to ask for `train` or `test` by name, declare splits:

```yaml
name: support-prompts
version: 0.1.0
format: parquet
splits:
  train: ./splits/train.parquet
  validation: ./splits/val.parquet
  test: ./splits/test.parquet
```

The keys become the names you pass to `load_dataset(..., split="train")` and `dn dataset pull --split train`. Paths are relative to the directory root and must stay inside it.
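The containment rule can be checked up front with plain `pathlib`. This is a hypothetical pre-flight helper, not part of the SDK; the real validation happens at inspect/push time:

```python
from pathlib import Path


def split_path_ok(root: str, rel: str) -> bool:
    """Return True if `rel` resolves to a location inside `root`.

    Illustrative sketch of the rule: relative to the directory root,
    and never escaping it via `..` segments.
    """
    root_p = Path(root).resolve()
    target = (root_p / rel).resolve()
    return target.is_relative_to(root_p)


print(split_path_ok("support-prompts", "splits/train.parquet"))  # True
print(split_path_ok("support-prompts", "../outside.parquet"))    # False
```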
Use `files:` instead when the dataset is one flat set of rows without named partitions:

```yaml
files:
  - ./data.parquet
```

If both `splits` and `files` are set, `splits` wins — the `files` list is ignored. When neither is set, every file with a known tabular extension is included.
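The precedence can be sketched as a tiny resolver over the parsed manifest dict. `resolve_artifacts` and `KNOWN_EXTS` are hypothetical names that mirror the defaults described above, not the registry's actual code:

```python
KNOWN_EXTS = {".parquet", ".csv", ".arrow", ".feather", ".json", ".jsonl"}


def resolve_artifacts(manifest: dict, all_files: list[str]) -> list[str]:
    # Sketch of the precedence: splits > files > every known-extension file.
    if "splits" in manifest:
        return sorted(manifest["splits"].values())
    if "files" in manifest:
        return list(manifest["files"])
    return [f for f in all_files if any(f.endswith(e) for e in KNOWN_EXTS)]


manifest = {"splits": {"train": "./splits/train.parquet"}, "files": ["./data.parquet"]}
print(resolve_artifacts(manifest, []))  # ['./splits/train.parquet'], files is ignored
```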
## Declare schema

Inferred schema is fine for most cases. Declare it explicitly when the inferred PyArrow type is wrong (e.g. JSON loaders that read every number as double) or when you want the Hub record to show the columns you care about:

```yaml
data_schema:
  ticket_id: string
  body: large_string
  intent: string
  priority: int32
  created_at: timestamp[us]
```

`row_count` is the same deal — set it when the loader count is wrong (streaming files, known deduplication); otherwise let `dataset.yaml` omit it.
## Load from HuggingFace

To bring a HuggingFace dataset into your local store without a source directory, use `dn.load_dataset` from the SDK:

```python
import dreadnode as dn

local_ds = dn.load_dataset("squad", split="train[:500]")
print(local_ds.to_pandas().head())
```

That pulls from the HuggingFace Hub, stores the rows in Dreadnode's content-addressable storage, and returns a `LocalDataset`. To publish a HuggingFace-sourced dataset back to the Dreadnode registry, re-emit it as a directory first — write the parquet files and a `dataset.yaml` — and push that. See Using in code for the full mechanics of `LocalDataset`.
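The re-emit step is mostly file plumbing. Here is a minimal standard-library sketch of the manifest-writing half (`write_manifest` is a hypothetical helper; the parquet write itself would come from your dataframe tooling, e.g. `to_pandas().to_parquet(...)`):

```python
from pathlib import Path


def write_manifest(root: str, name: str, version: str) -> Path:
    # Hypothetical helper: emits only the two required fields,
    # leaving format, schema, and row_count to be inferred on push.
    out = Path(root)
    out.mkdir(parents=True, exist_ok=True)
    manifest = out / "dataset.yaml"
    manifest.write_text(f"name: {name}\nversion: {version}\n")
    return manifest


write_manifest("./squad-subset", "squad-subset", "0.1.0")
```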
## Inspect before pushing

```
dn dataset inspect ./support-prompts

format: parquet
rows: 48,213
splits: train, validation, test

Schema
┏━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Column     ┃ Type          ┃
┡━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ ticket_id  │ string        │
│ body       │ large_string  │
│ intent     │ string        │
│ priority   │ int32         │
│ created_at │ timestamp[us] │
└────────────┴───────────────┘
```

`inspect` does three things:
1. Validates the manifest — `dataset.yaml` parses, `version` is semver, paths resolve.
2. Loads every artifact — a bad parquet file fails here, not after an upload.
3. Confirms the schema matches what you declared (or infers one you didn't).
Add `--json` when you want the same output as machine-readable JSON.
## Version numbers

Versions are fixed semver (`X.Y.Z`). Pre-release tags and build suffixes are rejected. Bump the version in `dataset.yaml` before every push; the registry rejects a push that collides with an existing version.
## What to reach for next

- Push the dataset → Publishing
- Load it in Python after it's published → Using in code
- Every `dataset.yaml` field → Manifest reference