
Authoring a dataset

Structure a dataset directory, write dataset.yaml, declare splits and schema, and inspect locally before publishing.

A dataset source is a directory, a manifest, and one or more data files. The authoring loop is “edit → inspect → fix” until the local preview matches what you want the registry to store.

support-prompts/
  dataset.yaml            # required (the manifest)
  splits/
    train.parquet
    validation.parquet
    test.parquet

One file per split is idiomatic, but nothing stops you from putting everything in data.parquet at the root. Files can live anywhere under the directory — dataset.yaml addresses them with paths relative to the root.
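For example, a flat layout with a single artifact at the root is just as valid:

support-prompts/
  dataset.yaml
  data.parquet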

See the manifest reference for every accepted field. This page covers the decisions worth thinking about.

The minimal manifest is two fields:

name: support-prompts
version: 0.1.0

That’s enough to push. Every other field is derived or optional:

  • format is inferred from the first artifact’s extension.
  • data_schema is inferred from the first artifact’s columns.
  • row_count is summed across artifacts.
  • Artifact paths default to every file under the directory with a known extension (.parquet, .csv, .arrow, .feather, .json, .jsonl).

Set those fields explicitly when you want the Hub record to reflect a curated intent rather than inference.
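As a sketch, a manifest for the layout above that pins the format and artifact list instead of relying on inference (the paths mirror the directory shown earlier):

name: support-prompts
version: 0.1.0
format: parquet
files:
- ./splits/train.parquet
- ./splits/validation.parquet
- ./splits/test.parquet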

When a consumer should be able to ask for train or test by name, declare splits:

name: support-prompts
version: 0.1.0
format: parquet
splits:
  train: ./splits/train.parquet
  validation: ./splits/validation.parquet
  test: ./splits/test.parquet

The keys become the names you pass to load_dataset(..., split="train") and dn dataset pull --split train. Paths are relative to the directory root and must stay inside it.
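As an illustration, assuming the dataset is addressed by its manifest name once pushed (the exact argument form may differ), a consumer could then fetch only the training split:

dn dataset pull support-prompts --split train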

Use files: instead when the dataset is one flat set of rows without named partitions:

files:
- ./data.parquet

If both splits and files are set, splits wins — the files list is ignored. When neither is set, every file with a known tabular extension is included.

Inferred schema is fine for most cases. Declare it explicitly when the inferred PyArrow type is wrong (e.g. JSON loaders that read every number as double) or when you want the Hub record to show the columns you care about:

data_schema:
  ticket_id: string
  body: large_string
  intent: string
  priority: int32
  created_at: timestamp[us]

row_count follows the same rule: set it when the loader's count is wrong (streaming files, known deduplication); otherwise leave it out of dataset.yaml.
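For example, to record a count you know to be correct after deduplication (the value here is illustrative):

row_count: 48213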

To bring a HuggingFace dataset into your local store without a source directory, use dn.load_dataset from the SDK:

import dreadnode as dn
local_ds = dn.load_dataset("squad", split="train[:500]")
print(local_ds.to_pandas().head())

That pulls from the HuggingFace Hub, stores the rows in Dreadnode’s content-addressable storage, and returns a LocalDataset. To publish a HuggingFace-sourced dataset back to the Dreadnode registry, re-emit it as a directory first — write the parquet files and a dataset.yaml — and push that. See Using in code for the full mechanics of LocalDataset.
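A minimal sketch of that re-emit step, assuming to_pandas() on the returned LocalDataset (used above) and pandas with pyarrow available for the parquet write; the squad-sample name and single train split are illustrative:

import dreadnode as dn
from pathlib import Path

# Pull the rows into the local store, as above.
local_ds = dn.load_dataset("squad", split="train[:500]")

# Re-emit as a source directory the registry can ingest.
root = Path("squad-sample")
(root / "splits").mkdir(parents=True, exist_ok=True)
local_ds.to_pandas().to_parquet(root / "splits" / "train.parquet", index=False)

(root / "dataset.yaml").write_text(
    "name: squad-sample\n"
    "version: 0.1.0\n"
    "format: parquet\n"
    "splits:\n"
    "  train: ./splits/train.parquet\n"
)

From there it goes through the same inspect-then-push loop as any hand-authored directory.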

Before publishing, run inspect against the directory to preview exactly what the registry would record:

dn dataset inspect ./support-prompts
format: parquet
rows: 48,213
splits: train, validation, test
Schema
┏━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━┓
┃ Column ┃ Type ┃
┡━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━┩
│ ticket_id │ string │
│ body │ large_string │
│ intent │ string │
│ priority │ int32 │
│ created_at │ timestamp[us] │
└─────────────┴────────────────┘

inspect does three things:

  1. Validates the manifest — dataset.yaml parses, version is semver, paths resolve.
  2. Loads every artifact — a bad parquet file fails here, not after an upload.
  3. Confirms the schema matches what you declared (or infers one you didn’t).

Add --json when you want the same output as machine-readable JSON.
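For example:

dn dataset inspect ./support-prompts --json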

Versions are plain semver (X.Y.Z); pre-release tags and build suffixes are rejected. Bump the version in dataset.yaml before every push; the registry rejects a push that collides with an existing version.
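For example, after 0.1.0 is published, the next push needs a new plain X.Y.Z version:

version: 0.2.0          # accepted
version: 0.2.0-rc.1     # rejected: pre-release tag
version: 0.2.0+build.7  # rejected: build suffix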