
Using in code

Load dataset rows in Python for evaluations, training, and AIRT suites — from HuggingFace, local sources, or published versions.

The SDK gives you two entry points to a dataset: loading a source (from HuggingFace or a local directory) into content-addressable storage, and opening a published package already in the registry.

| Goal | Use |
| --- | --- |
| Cache a HuggingFace dataset or read a local source | `dn.load_dataset(path_or_hf_id, split=...)` |
| Download a registry dataset so code can load it | `dn.pull_package(["dataset://org/name:version"])` |
| Open a registry dataset already cached locally | `dn.load_package("dataset://org/name@version")` or `Dataset("org/name")` |
| Publish a local source back to the registry | `dn.push_dataset("./path")` (see Publishing) |

The loaded object is a LocalDataset (or its subclass Dataset). Both expose the same conversion helpers: to_pandas(), to_hf(), and direct load() for PyArrow.

```python
import dreadnode as dn

local_ds = dn.load_dataset("squad", split="train[:500]")
print(local_ds.to_pandas().head())
```

load_dataset forwards extra keyword arguments to HuggingFace’s datasets.load_dataset. Rows land in Dreadnode’s content-addressable store — re-running the same call reads from disk instead of re-downloading.
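The caching behavior can be pictured in miniature: the call arguments hash to a key in a content-addressable store, and repeat calls read from disk. Everything below (`fetch_remote`, `load_rows`, the key scheme) is an illustrative sketch, not the SDK's internals:

```python
import hashlib
import json
import os
import tempfile

cache_dir = tempfile.mkdtemp()
calls = {"fetch": 0}

def fetch_remote():
    # Pretend download; counts how often we actually go to the network.
    calls["fetch"] += 1
    return b"row data..."

def load_rows(path_or_hf_id, split=None):
    # Key the cache on the exact call arguments.
    blob = json.dumps([path_or_hf_id, split], sort_keys=True).encode()
    path = os.path.join(cache_dir, hashlib.sha256(blob).hexdigest())
    if not os.path.exists(path):  # miss: fetch once and persist
        with open(path, "wb") as f:
            f.write(fetch_remote())
    with open(path, "rb") as f:   # hit: read straight from disk
        return f.read()

first = load_rows("squad", split="train[:500]")
second = load_rows("squad", split="train[:500]")  # no second download
```

The second call resolves to the same key, so `fetch_remote` runs exactly once; that is the property `load_dataset` gives you for repeated runs.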

If the path points at a directory containing dataset.yaml, load_dataset reads it directly:

```python
local_ds = dn.load_dataset("./support-prompts")
train_df = local_ds.to_pandas(split="train")
```

See Authoring for the directory layout.

Pull the registry version first, then open it by name:

```python
import dreadnode as dn
from dreadnode.datasets import Dataset

dn.pull_package(["dataset://acme/support-prompts:1.2.0"])
dataset = Dataset("acme/support-prompts", version="1.2.0")
df = dataset.to_pandas()
```

dn.load_package is equivalent when you already have the package locally:

```python
dataset = dn.load_package("dataset://acme/[email protected]")
```

Both return a Dataset, which shares the full LocalDataset API. Omitting the version opens the latest cached version — fine for inspection, risky for reproducibility.

```python
df = dataset.to_pandas(split="train")
hf_ds = dataset.to_hf(split="train")
```

to_hf() returns a HuggingFace datasets.Dataset — use this for .map(), .filter(), and training loops that expect the HF API. to_pandas() is handier for exploration, notebooks, and custom preprocessing.

For direct PyArrow access, call dataset.load(split="train").

Evaluation expects inline rows or a dataset file path — it doesn’t take a Dataset object directly. Convert first:

```python
from dreadnode.evaluations import Evaluation

rows = dataset.to_pandas().to_dict(orient="records")
evaluation = Evaluation(task="acme.tasks.classify_intent", dataset=rows)
```
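The `to_dict(orient="records")` step is plain pandas: it turns each row into a dict keyed by column name, which is the inline-row shape Evaluation expects. With a toy frame:

```python
import pandas as pd

df = pd.DataFrame({"text": ["where is my order?"], "intent": ["order_status"]})
rows = df.to_dict(orient="records")
# rows == [{"text": "where is my order?", "intent": "order_status"}]
```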

For hosted evaluations, the rows still go into the manifest inline — pull the dataset, shape the rows, and write them into the dataset block. See Evaluations → Inputs for the per-row input mechanics.

Training job configs take DatasetRef objects keyed by pinned reference:

```python
from dreadnode.app.api.models import DatasetRef, TinkerSFTJobConfig

config = TinkerSFTJobConfig(
    dataset_ref=DatasetRef(name="support-prompts", version="1.2.0"),
    eval_dataset_ref=DatasetRef(name="support-eval", version="1.0.0"),
    batch_size=8,
    lora_rank=16,
    learning_rate=1e-4,
    steps=100,
)
```

The training control plane resolves each reference against the registry, so there is no need to call pull_package first. See Supervised fine-tuning or Reinforcement learning for the full submission flow.

Adversarial datasets are loaded like any other published dataset:

```python
from dreadnode.datasets import Dataset

goals = Dataset("acme/airt-goals", version="1.0.0").to_pandas()
for _, row in goals.iterrows():
    # drive your attack loop with row["goal"], row["category"], etc.
    ...
```

See AI Red Teaming → Datasets for AIRT-specific dataset conventions and goal schemas.

```python
dataset.name       # "acme/support-prompts"
dataset.version    # "1.2.0"
dataset.format     # "parquet"
dataset.row_count  # 48_213
dataset.splits     # ["train", "validation", "test"] or None
dataset.schema     # {"ticket_id": "string", "intent": "string", ...}
dataset.files      # list of artifact paths inside the package
dataset.manifest   # DatasetManifest (Pydantic)
```

These are all metadata reads — they hit the local manifest, not the network.