dreadnode.datasets
API reference for the dreadnode.datasets module.
Dataset
```python
Dataset(
    name: str,
    storage: Storage | None = None,
    version: str | None = None,
)
```

Published dataset loader backed by local storage manifests.
LocalDataset
```python
LocalDataset(
    name: str,
    storage: Storage,
    version: str | None = None,
)
```

Dataset stored in CAS, usable without package installation.
This class provides a way to work with datasets stored in the Content-Addressable Storage without requiring them to be installed as Python packages with entry points.
Example

```python
from datasets import load_dataset

from dreadnode.datasets import LocalDataset
from dreadnode.storage import Storage

storage = Storage()

# Create from HuggingFace dataset
hf_ds = load_dataset("squad", split="train[:100]")
local_ds = LocalDataset.from_hf(hf_ds, "my-squad", storage)

# Use with HuggingFace features
ds = local_ds.to_hf()
ds = ds.map(lambda x: {"lower": x["question"].lower()})

# Load existing dataset
local_ds = LocalDataset("my-squad", storage)
```
Load a local dataset by name.
Parameters:
- `name` (`str`): Dataset name.
- `storage` (`Storage`): Storage instance for CAS access.
- `version` (`str | None`, default: `None`): Specific version to load. If `None`, loads the latest.
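When `version` is `None`, the loader resolves the latest available version. A stdlib-only sketch of how latest-version selection can work for semantic-version strings; `pick_latest` is a hypothetical helper for illustration, not part of the dreadnode API:

```python
# Hypothetical helper illustrating "latest version" resolution when
# version=None. Not part of the dreadnode API.
def pick_latest(versions: list[str]) -> str:
    # Compare versions numerically, not lexically,
    # so "0.10.0" ranks above "0.9.0".
    def key(v: str) -> tuple[int, ...]:
        return tuple(int(part) for part in v.split("."))

    return max(versions, key=key)

print(pick_latest(["0.1.0", "0.10.0", "0.9.0"]))  # → 0.10.0
```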
files

`files: list[str]`

List of artifact file paths.

format

`format: str`

Data format (parquet, csv, arrow, etc.).

manifest

`manifest: DatasetManifest`

Load and cache the manifest.

row_count

`row_count: int | None`

Number of rows.

schema

`schema: dict[str, str]`

Column schema.

splits

`splits: list[str] | None`

Available splits, if any.
from_dir
```python
from_dir(
    source_dir: str | Path,
    storage: Storage,
    *,
    name: str | None = None,
    version: str | None = None,
) -> LocalDataset
```

Store a dataset source directory described by `dataset.yaml` in CAS.
from_hf
```python
from_hf(
    hf_dataset: Dataset | DatasetDict,
    name: str,
    storage: Storage,
    format: Literal["parquet", "arrow", "feather"] = "parquet",
    version: str = "0.1.0",
) -> LocalDataset
```

Store a HuggingFace Dataset in CAS and return a `LocalDataset`.
Parameters:
- `hf_dataset` (`Dataset | DatasetDict`): HuggingFace Dataset or DatasetDict to store.
- `name` (`str`): Name for the dataset.
- `storage` (`Storage`): Storage instance for CAS access.
- `format` (`Literal['parquet', 'arrow', 'feather']`, default: `'parquet'`): Output format (parquet, arrow, feather).
- `version` (`str`, default: `'0.1.0'`): Version string.
Returns:
`LocalDataset`: LocalDataset instance for the stored data.
Example
```python
from datasets import load_dataset

hf_ds = load_dataset("squad", split="train[:100]")
local_ds = LocalDataset.from_hf(hf_ds, "my-squad", storage)
```
load

```python
load(split: str | None = None) -> pa.Table
```

Load the dataset as a PyArrow Table.
Parameters:
- `split` (`str | None`, default: `None`): Optional split name to load (e.g., "train", "test"). If `None`, loads the first/only file.
Returns:
`Table`: PyArrow Table with the data.
publish
```python
publish(version: str | None = None) -> None
```

Create a DN package for signing and distribution.
This converts the local dataset into a proper Python package with entry points that can be installed and discovered.
Parameters:
- `version` (`str | None`, default: `None`): Version for the package. If `None`, uses the current version.
Raises:
`NotImplementedError`: Package creation not yet implemented.
to_hf

```python
to_hf(split: str | None = None) -> datasets.Dataset
```

Load and convert to a HuggingFace Dataset.
Parameters:
- `split` (`str | None`, default: `None`): Optional split to load.
Returns:
`Dataset`: HuggingFace Dataset with full functionality.
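Because `to_hf()` returns a regular `datasets.Dataset`, the usual HF transforms apply directly. A dependency-free sketch of the same `map`-style transform from the class example, applied to rows as plain dicts (the rows here are made up for illustration):

```python
# Stdlib-only illustration of the map() pattern used with to_hf():
# derive a lowercased column from each row.
rows = [{"question": "What Is CAS?"}, {"question": "Why Parquet?"}]
lowered = [{**row, "lower": row["question"].lower()} for row in rows]
print(lowered[0]["lower"])  # → what is cas?
```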
to_pandas
```python
to_pandas(split: str | None = None) -> Any
```

Load as a pandas DataFrame.
Parameters:
- `split` (`str | None`, default: `None`): Optional split to load.
Returns:
`Any`: pandas DataFrame.
load_dataset
```python
load_dataset(
    path: str | Path,
    *,
    dataset_name: str | None = None,
    storage: Storage | None = None,
    split: str | None = None,
    format: Literal["parquet", "arrow", "feather"] = "parquet",
    version: str | None = None,
    **kwargs: Any,
) -> LocalDataset
```

Load a dataset from the HuggingFace Hub or a local source directory.
Parameters:
- `path` (`str | Path`): HuggingFace dataset path or a local dataset source directory.
- `dataset_name` (`str | None`, default: `None`): Name to store the dataset as locally. Defaults to the path.
- `storage` (`Storage | None`, default: `None`): Storage instance. If `None`, creates default storage.
- `split` (`str | None`, default: `None`): Dataset split to load (e.g., "train", "test", "train[:100]").
- `format` (`Literal['parquet', 'arrow', 'feather']`, default: `'parquet'`): Storage format (parquet, arrow, feather).
- `version` (`str | None`, default: `None`): Version string for the stored dataset.
- `**kwargs` (`Any`): Additional arguments passed to HuggingFace's `load_dataset`.
Returns:
`LocalDataset`: LocalDataset instance with the loaded data.
Example

```python
from dreadnode.datasets import load_dataset

# Load and store a HuggingFace dataset
ds = load_dataset("squad", split="train[:100]")
ds = ds.to_hf().map(lambda x: {"lower": x["question"].lower()})

# Load with custom name and storage
ds = load_dataset("imdb", dataset_name="my-imdb", split="train")
```