
Authoring a dataset

Structure a dataset directory, write dataset.yaml, declare splits and schema, and inspect locally before publishing.

A dataset source is a directory, a manifest, and one or more data files. The authoring loop is “edit → inspect → fix” until the local preview matches what you want the registry to store.

support-prompts/
  dataset.yaml            # required (the manifest)
  splits/
    train.parquet
    validation.parquet
    test.parquet

One file per split is idiomatic, but nothing stops you from putting everything in data.parquet at the root. Files can live anywhere under the directory — dataset.yaml addresses them with paths relative to the root.
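For example, a flat layout with a single artifact at the root is just as valid:

support-prompts/
  dataset.yaml
  data.parquet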

See the manifest reference for every accepted field. This page covers the decisions worth thinking about.

The minimal manifest is two fields:

name: support-prompts
version: 0.1.0

That’s enough to push. Every other field is derived or optional:

  • format is inferred from the first artifact’s extension.
  • data_schema is inferred from the first artifact’s columns.
  • row_count is summed across artifacts.
  • Artifact paths default to every file under the directory with a known extension (.parquet, .csv, .arrow, .feather, .json, .jsonl).

Set those fields explicitly when you want the Hub record to reflect a curated intent rather than inference.
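As a sketch, a manifest for the layout above that pins the format and artifact list instead of relying on inference (the paths mirror the directory shown earlier):

name: support-prompts
version: 0.1.0
format: parquet
files:
- ./splits/train.parquet
- ./splits/validation.parquet
- ./splits/test.parquet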

When a consumer should be able to ask for train or test by name, declare splits:

name: support-prompts
version: 0.1.0
format: parquet
splits:
  train: ./splits/train.parquet
  validation: ./splits/validation.parquet
  test: ./splits/test.parquet

The keys become the names you pass to load_dataset(..., split="train") and dn dataset pull --split train. Paths are relative to the directory root and must stay inside it.
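As an illustration, assuming the dataset is addressed by its manifest name once pushed (the exact argument form may differ), a consumer could then fetch only the training split:

dn dataset pull support-prompts --split train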

Use files: instead when the dataset is one flat set of rows without named partitions:

files:
- ./data.parquet

If both splits and files are set, splits wins — the files list is ignored. When neither is set, every file with a known tabular extension is included.

Inferred schema is fine for most cases. Declare it explicitly when the inferred PyArrow type is wrong (e.g. JSON loaders that read every number as double) or when you want the Hub record to show the columns you care about:

data_schema:
  ticket_id: string
  body: large_string
  intent: string
  priority: int32
  created_at: timestamp[us]

row_count follows the same rule: set it when the loader's count is wrong (streaming files, known deduplication); otherwise leave it out of dataset.yaml.
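For example, to record a count you know to be correct after deduplication (the value here is illustrative):

row_count: 48213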

To bring a HuggingFace dataset into your local store without a source directory, use dn.load_dataset from the SDK:

import dreadnode as dn
local_ds = dn.load_dataset("squad", split="train[:500]")
print(local_ds.to_pandas().head())

That pulls from the HuggingFace Hub, stores the rows in Dreadnode’s content-addressable storage, and returns a LocalDataset. To publish a HuggingFace-sourced dataset back to the Dreadnode registry, re-emit it as a directory first — write the parquet files and a dataset.yaml — and push that. See Using in code for the full mechanics of LocalDataset.
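A minimal sketch of that re-emit step, assuming to_pandas() on the returned LocalDataset (used above) and pandas with pyarrow available for the parquet write; the squad-sample name and single train split are illustrative:

import dreadnode as dn
from pathlib import Path

# Pull the rows into the local store, as above.
local_ds = dn.load_dataset("squad", split="train[:500]")

# Re-emit as a source directory the registry can ingest.
root = Path("squad-sample")
(root / "splits").mkdir(parents=True, exist_ok=True)
local_ds.to_pandas().to_parquet(root / "splits" / "train.parquet", index=False)

(root / "dataset.yaml").write_text(
    "name: squad-sample\n"
    "version: 0.1.0\n"
    "format: parquet\n"
    "splits:\n"
    "  train: ./splits/train.parquet\n"
)

From there it goes through the same inspect-then-push loop as any hand-authored directory.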

Before publishing, run inspect against the directory to preview exactly what the registry would record:

dn dataset inspect ./support-prompts
format: parquet
rows: 48,213
splits: train, validation, test
Schema
┏━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━┓
┃ Column ┃ Type ┃
┡━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━┩
│ ticket_id │ string │
│ body │ large_string │
│ intent │ string │
│ priority │ int32 │
│ created_at │ timestamp[us] │
└─────────────┴────────────────┘

inspect does three things:

  1. Validates the manifest — dataset.yaml parses, version is semver, paths resolve.
  2. Loads every artifact — a bad parquet file fails here, not after an upload.
  3. Confirms the schema matches what you declared (or infers one you didn’t).

Add --json when you want the same output as machine-readable JSON.
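For example:

dn dataset inspect ./support-prompts --json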

Versions are plain semver (X.Y.Z); pre-release tags and build suffixes are rejected. Bump the version in dataset.yaml before every push; the registry rejects a push that collides with an existing version.
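For example, after 0.1.0 is published, the next push needs a new plain X.Y.Z version:

version: 0.2.0          # accepted
version: 0.2.0-rc.1     # rejected: pre-release tag
version: 0.2.0+build.7  # rejected: build suffix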