Inputs

Configure what an evaluation runs on — a flat list of task references (task_names) or rows with per-item parameters (dataset).

Every evaluation needs to know which tasks to run and with what per-item context. Pick one of two inputs:

  • task_names — a flat list. Each entry becomes one evaluation item.
  • dataset — rows with per-item parameters. Each row becomes one evaluation item.

Use task_names when every run of the task should be identical. Use dataset when you need per-row inputs — different tenants, difficulties, input URLs — fed into the task through instruction templates.

Each entry is a task reference, optionally pinned to a version:

evaluation.yaml

```yaml
name: nightly-regression
model: openai/gpt-4.1-mini
task_names:
  - flag-file-http
  - [email protected]
```

An unpinned name like flag-file-http resolves to the latest visible version when the worker loads the task. Use name@version when you need a stable regression target.

A dataset is a list of rows. Each row must include task_name; anything else is a per-row field the task instruction can reference:

evaluation.yaml

```yaml
name: regression-by-tenant
model: openai/gpt-4.1-mini
concurrency: 4
dataset:
  rows:
    - task_name: [email protected]
      tenant: acme
      difficulty: 1
    - task_name: [email protected]
      tenant: bravo
      difficulty: 2
    - task_name: [email protected]
      tenant: acme
      difficulty: 3
```

In the task’s instruction, {{tenant}} and {{difficulty}} are filled in at evaluation time. Only string, int, and null row values become template variables — see Instruction templates for the resolution rules.
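As a rough illustration of that filtering rule (not the actual template engine), only string, int, and null values survive; everything else is ignored:

```python
import re

def render_instruction(template: str, row: dict) -> str:
    # Only string, int, and null (None) row values become template variables;
    # other types (floats, lists, nested dicts) are skipped. Illustrative only.
    variables = {
        k: "" if v is None else str(v)
        for k, v in row.items()
        if v is None or isinstance(v, (str, int))
    }
    return re.sub(
        r"\{\{(\w+)\}\}",
        lambda m: variables.get(m.group(1), m.group(0)),
        template,
    )

row = {"task_name": "[email protected]", "tenant": "acme", "difficulty": 1}
print(render_instruction("Scan tenant {{tenant}} at difficulty {{difficulty}}", row))
# Scan tenant acme at difficulty 1
```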

The CLI does not expose row data directly; use --file evaluation.yaml for dataset-backed runs.

Two asymmetries matter:

  • task_names wins. If both task_names and dataset appear in the same request, the worker uses task_names and ignores the dataset. Pick one.
  • Every dataset row needs task_name. There is no mode where task_names picks the tasks and dataset supplies per-row inputs. A dataset-backed run must carry the task reference on every row.
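The two rules above can be sketched as a resolver. This is a hypothetical illustration of the documented behavior, not the worker's actual code:

```python
def resolve_items(manifest: dict) -> list[dict]:
    # Rule 1: task_names wins; a dataset alongside it is ignored.
    if manifest.get("task_names"):
        return [{"task_name": name} for name in manifest["task_names"]]
    # Rule 2: every dataset row must carry its own task reference.
    rows = manifest.get("dataset", {}).get("rows", [])
    for row in rows:
        if "task_name" not in row:
            raise ValueError("every dataset row needs a task_name")
    return rows

both = {"task_names": ["flag-file-http"], "dataset": {"rows": [{"task_name": "x"}]}}
print(resolve_items(both))  # dataset is ignored when task_names is present
```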

Registry datasets have to be pulled and shaped into the manifest yourself; there is no direct ref resolution for the dataset: field today. The common pattern:

```python
import yaml

import dreadnode as dn
from dreadnode.datasets import Dataset

# Pull the dataset package locally, then load it by name and version.
dn.pull_package(["dataset://acme/regression-inputs:1.0.0"])
ds = Dataset("acme/regression-inputs", version="1.0.0")

# Shape the rows into the manifest's dataset.rows form.
rows = ds.to_pandas().to_dict(orient="records")
manifest = {
    "name": "regression",
    "model": "openai/gpt-4.1-mini",
    "dataset": {"rows": rows},
}
with open("evaluation.yaml", "w") as f:
    yaml.safe_dump(manifest, f)
```
```sh
dn evaluation create --file evaluation.yaml --wait
```

See Datasets → Using in code for the full registry-consumer mechanics.