dreadnode.datasets

API reference for the dreadnode.datasets module.

Dataset(
name: str,
storage: Storage | None = None,
version: str | None = None,
)

Published dataset loader backed by local storage manifests.

LocalDataset(
name: str, storage: Storage, version: str | None = None
)

Dataset stored in CAS, usable without package installation.

This class provides a way to work with datasets stored in the Content-Addressable Storage without requiring them to be installed as Python packages with entry points.
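The content-addressable idea can be sketched in a few lines: a blob's address is a hash of its bytes, so identical artifacts deduplicate naturally. This is a minimal illustration of the concept, not dreadnode's actual Storage layout:

```python
import hashlib

def cas_key(data: bytes) -> str:
    """Return a content-addressable key: the hash of the bytes *is* the address.

    Illustrative only -- the real Storage layout is internal to dreadnode.
    """
    return hashlib.sha256(data).hexdigest()

# Identical content always yields the same key, so repeated artifacts are
# stored once and manifests can reference them immutably by hash.
key = cas_key(b"question,answer\n")
```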

Example

```python
from dreadnode.datasets import LocalDataset
from dreadnode.storage import Storage

storage = Storage()

# Store a HuggingFace dataset in CAS
from datasets import load_dataset
hf_ds = load_dataset("squad", split="train[:100]")
local_ds = LocalDataset.from_hf(hf_ds, "my-squad", storage)

# Use with HuggingFace tooling
ds = local_ds.to_hf()
ds = ds.map(lambda x: {"lower": x["question"].lower()})

# Load an existing dataset by name
local_ds = LocalDataset("my-squad", storage)
```

Load a local dataset by name.

Parameters:

  • name (str) – Dataset name.
  • storage (Storage) – Storage instance for CAS access.
  • version (str | None, default: None) – Specific version to load. If None, loads latest.

files: list[str]

List of artifact file paths.

format: str

Data format (parquet, csv, arrow, etc.).

manifest: DatasetManifest

Load and cache the manifest.

row_count: int | None

Number of rows.

schema: dict[str, str]

Column schema.

splits: list[str] | None

Available splits, if any.

from_dir(
source_dir: str | Path,
storage: Storage,
*,
name: str | None = None,
version: str | None = None,
) -> LocalDataset

Store a dataset source directory described by dataset.yaml in CAS.
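The schema of `dataset.yaml` is not documented here; a plausible manifest for a source directory might look like the following (every field name and value is an illustrative assumption):

```yaml
# dataset.yaml -- hypothetical example; field names are assumptions
name: my-squad
version: 0.1.0
format: parquet
files:
  - data/train.parquet
  - data/test.parquet
```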

from_hf(
hf_dataset: Dataset | DatasetDict,
name: str,
storage: Storage,
format: Literal[
"parquet", "arrow", "feather"
] = "parquet",
version: str = "0.1.0",
) -> LocalDataset

Store HuggingFace Dataset in CAS and return LocalDataset.

Parameters:

  • hf_dataset (Dataset | DatasetDict) – HuggingFace Dataset or DatasetDict to store.
  • name (str) – Name for the dataset.
  • storage (Storage) – Storage instance for CAS access.
  • format (Literal['parquet', 'arrow', 'feather'], default: 'parquet') – Output format (parquet, arrow, feather).
  • version (str, default: '0.1.0') – Version string.

Returns:

  • LocalDataset – LocalDataset instance for the stored data.

Example

```python
from datasets import load_dataset

hf_ds = load_dataset("squad", split="train[:100]")
local_ds = LocalDataset.from_hf(hf_ds, "my-squad", storage)
```

load(split: str | None = None) -> pa.Table

Load dataset as PyArrow Table.

Parameters:

  • split (str | None, default: None) – Optional split name to load (e.g., "train", "test"). If None, loads the first/only file.

Returns:

  • Table – PyArrow Table with the data.
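The split-to-file behavior described for `load` can be sketched as follows. The file-naming scheme here (one `<split>.parquet` file per split) is an assumption for illustration, not the library's actual layout:

```python
from __future__ import annotations

def resolve_split_file(files: list[str], split: str | None) -> str:
    """Pick the artifact file for a split.

    Hypothetical scheme: each split is stored as '<split>.parquet'.
    With no split given, fall back to the first/only file.
    """
    if split is None:
        return files[0]
    for path in files:
        # Compare the requested split against the file's base name.
        stem = path.rsplit("/", 1)[-1].split(".", 1)[0]
        if stem == split:
            return path
    raise KeyError(f"split {split!r} not found in {files!r}")

files = ["data/train.parquet", "data/test.parquet"]
```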

publish(version: str | None = None) -> None

Create a DN package for signing and distribution.

This converts the local dataset into a proper Python package with entry points that can be installed and discovered.

Parameters:

  • version (str | None, default: None) – Version for the package. If None, uses current version.

Raises:

  • NotImplementedError – Package creation not yet implemented.

to_hf(split: str | None = None) -> datasets.Dataset

Load and convert to HuggingFace Dataset.

Parameters:

  • split (str | None, default: None) – Optional split to load.

Returns:

  • Dataset – HuggingFace Dataset with full functionality.

to_pandas(split: str | None = None) -> Any

Load as pandas DataFrame.

Parameters:

  • split (str | None, default: None) – Optional split to load.

Returns:

  • Any – pandas DataFrame.

load_dataset(
path: str | Path,
*,
dataset_name: str | None = None,
storage: Storage | None = None,
split: str | None = None,
format: Literal[
"parquet", "arrow", "feather"
] = "parquet",
version: str | None = None,
**kwargs: Any,
) -> LocalDataset

Load a dataset from HuggingFace Hub or a local source directory.

Parameters:

  • path (str | Path) – HuggingFace dataset path or a local dataset source directory.
  • dataset_name (str | None, default: None) – Name to store the dataset as locally. Defaults to the path.
  • storage (Storage | None, default: None) – Storage instance. If None, creates default storage.
  • split (str | None, default: None) – Dataset split to load (e.g., "train", "test", "train[:100]").
  • format (Literal['parquet', 'arrow', 'feather'], default: 'parquet') – Storage format (parquet, arrow, feather).
  • version (str | None, default: None) – Version string for the stored dataset.
  • **kwargs (Any, default: {}) – Additional arguments passed to HuggingFace's load_dataset.

Returns:

  • LocalDataset – LocalDataset instance with the loaded data.

Example

```python
from dreadnode.datasets import load_dataset

# Pull from HuggingFace Hub and store locally in CAS
ds = load_dataset("squad", split="train[:100]")
ds = ds.to_hf().map(lambda x: {"lower": x["question"].lower()})

# Store under a custom local name
ds = load_dataset("imdb", dataset_name="my-imdb", split="train")
```