Inputs

Configure what an evaluation runs on — a flat list of task references (task_names) or rows with per-item parameters (dataset).

Every evaluation needs to know which tasks to run and with what per-item context. Pick one of two inputs:

  • task_names — a flat list. Each entry becomes one evaluation item.
  • dataset — rows with per-item parameters. Each row becomes one evaluation item.

Use task_names when every run of the task should be identical. Use dataset when you need per-row inputs — different tenants, difficulties, input URLs — fed into the task through instruction templates.

Each entry is a task reference, optionally pinned to a version:

evaluation.yaml

```yaml
name: nightly-regression
model: openai/gpt-4.1-mini
task_names:
  - flag-file-http
  - [email protected]
```

An unpinned name like flag-file-http resolves to the latest visible version when the worker loads the task. Use name@version when you need a stable regression target.

A dataset is a list of rows. Each row must include task_name; anything else is a per-row field the task instruction can reference:

evaluation.yaml

```yaml
name: regression-by-tenant
model: openai/gpt-4.1-mini
concurrency: 4
dataset:
  rows:
    - task_name: [email protected]
      tenant: acme
      difficulty: 1
    - task_name: [email protected]
      tenant: bravo
      difficulty: 2
    - task_name: [email protected]
      tenant: acme
      difficulty: 3
```

In the task’s instruction, {{tenant}} and {{difficulty}} are filled in at evaluation time. Only string, int, and null row values become template variables — see Instruction templates for the resolution rules.
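As a rough illustration of that filtering rule (not the actual template engine), only string, int, and null values survive; everything else is ignored:

```python
import re

def render_instruction(template: str, row: dict) -> str:
    # Only string, int, and null (None) row values become template variables;
    # other types (floats, lists, nested dicts) are skipped. Illustrative only.
    variables = {
        k: "" if v is None else str(v)
        for k, v in row.items()
        if v is None or isinstance(v, (str, int))
    }
    return re.sub(
        r"\{\{(\w+)\}\}",
        lambda m: variables.get(m.group(1), m.group(0)),
        template,
    )

row = {"task_name": "[email protected]", "tenant": "acme", "difficulty": 1}
print(render_instruction("Scan tenant {{tenant}} at difficulty {{difficulty}}", row))
# Scan tenant acme at difficulty 1
```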

The CLI does not expose row data directly; use --file evaluation.yaml for dataset-backed runs.

Two asymmetries matter:

  • task_names wins. If both task_names and dataset appear in the same request, the worker uses task_names and ignores the dataset. Pick one.
  • Every dataset row needs task_name. There is no mode where task_names picks the tasks and dataset supplies per-row inputs. A dataset-backed run must carry the task reference on every row.
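The two rules above can be sketched as a resolver. This is a hypothetical illustration of the documented behavior, not the worker's actual code:

```python
def resolve_items(manifest: dict) -> list[dict]:
    # Rule 1: task_names wins; a dataset alongside it is ignored.
    if manifest.get("task_names"):
        return [{"task_name": name} for name in manifest["task_names"]]
    # Rule 2: every dataset row must carry its own task reference.
    rows = manifest.get("dataset", {}).get("rows", [])
    for row in rows:
        if "task_name" not in row:
            raise ValueError("every dataset row needs a task_name")
    return rows

both = {"task_names": ["flag-file-http"], "dataset": {"rows": [{"task_name": "x"}]}}
print(resolve_items(both))  # dataset is ignored when task_names is present
```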

Registry datasets have to be pulled and shaped into the manifest yourself; there is no direct ref resolution for the dataset: field today. The common pattern:

```python
import yaml

import dreadnode as dn
from dreadnode.datasets import Dataset

# Pull the dataset package locally, then load it by name and version.
dn.pull_package(["dataset://acme/regression-inputs:1.0.0"])
ds = Dataset("acme/regression-inputs", version="1.0.0")

# Shape the rows into the manifest's dataset.rows form.
rows = ds.to_pandas().to_dict(orient="records")
manifest = {
    "name": "regression",
    "model": "openai/gpt-4.1-mini",
    "dataset": {"rows": rows},
}
with open("evaluation.yaml", "w") as f:
    yaml.safe_dump(manifest, f)
```
```sh
dn evaluation create --file evaluation.yaml --wait
```

See Datasets → Using in code for the full registry-consumer mechanics.