Scorers

Turn outputs into metrics with built-in scorers, composition algebra, and custom scoring functions.

```python
from dreadnode.scorers import contains, detect_pii, system_prompt_leaked

mentions_platform = contains("dreadnode")
pii_risk = detect_pii()
prompt_leak = system_prompt_leaked()
```

A scorer turns an output into a Metric. Use them to check that an agent’s response contains required content, doesn’t leak secrets or PII, passes a pass/fail gate, or rolls up into a single quality-and-safety number you can compare across runs.

Scorers are Python-first and live in the SDK. They plug into local evaluations, agent hooks, and optimization studies — the same scorer can serve as a metric in one context and a gate in another.

The Python SDK ships with 100+ scorers across categories like security, PII detection, exfiltration, MCP/agentic safety, reasoning, and IDE workflows. Start with built-ins — they stay consistent across evaluations and are less likely to drift than one-off local scoring logic.

Combine scorers with operators and helpers:

  • & / | / ~ for logical composition
  • + / - / * for arithmetic composition
  • >> / // to rename scorers (log all vs log primary)
  • threshold(), normalize(), invert(), remap_range(), scale(), clip(), weighted_avg()
```python
import dreadnode as dn
from dreadnode.scorers import contains, detect_pii, normalize, weighted_avg

mentions = contains("agent")
quality = normalize(mentions, known_max=1.0)
safety = ~detect_pii()  # inverted: scores high when no PII is detected

# Roll up into a single weighted metric, renamed for logs
overall = weighted_avg((quality, 0.6), (safety, 0.4)) >> "overall_score"
combined = (quality & safety) // "quality_and_safety"
```
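
To build intuition for how operator composition like the above can work, here is a toy plain-Python sketch. The `Scorer` class, and the choice of min/max/complement semantics for `&`, `|`, and `~`, are illustrative assumptions for this sketch only, not dreadnode internals:

```python
from dataclasses import dataclass
from typing import Callable

# Toy model of scorer composition. Semantics (min for &, max for |,
# 1 - x for ~) are assumptions for illustration, not the SDK's behavior.
@dataclass
class Scorer:
    name: str
    fn: Callable[[str], float]

    def score(self, text: str) -> float:
        return self.fn(text)

    def __and__(self, other: "Scorer") -> "Scorer":
        return Scorer(f"{self.name}_and_{other.name}",
                      lambda t: min(self.fn(t), other.fn(t)))

    def __or__(self, other: "Scorer") -> "Scorer":
        return Scorer(f"{self.name}_or_{other.name}",
                      lambda t: max(self.fn(t), other.fn(t)))

    def __invert__(self) -> "Scorer":
        return Scorer(f"not_{self.name}", lambda t: 1.0 - self.fn(t))

    def __rshift__(self, new_name: str) -> "Scorer":
        return Scorer(new_name, self.fn)

contains_agent = Scorer("contains_agent", lambda t: 1.0 if "agent" in t else 0.0)
short = Scorer("short", lambda t: 1.0 if len(t) < 40 else 0.0)

both = (contains_agent & short) >> "good_response"
print(both.name, both.score("the agent replied"))  # good_response 1.0
```

The key design point carries over regardless of exact semantics: composition returns a new scorer, so combined metrics stay first-class objects you can rename, threshold, and log.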

The usual pattern is:

  • build a few narrow scorers
  • normalize them onto a comparable scale
  • combine them into one or two rollout metrics that are easy to reason about
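
The arithmetic behind that pattern is simple enough to sketch in plain Python. The raw values, bounds, and weights below are hypothetical, and the `normalize`/`weighted_avg` functions here are local stand-ins, not the SDK helpers:

```python
# Hypothetical raw outputs from a few narrow scorers
raw = {"keyword_hits": 3.0, "pii_findings": 0.0, "latency_s": 1.2}

def normalize(value: float, known_max: float) -> float:
    """Clamp a raw score onto [0, 1] using an assumed known bound."""
    return max(0.0, min(1.0, value / known_max))

quality = normalize(raw["keyword_hits"], known_max=5.0)       # 0.6
safety = 1.0 - normalize(raw["pii_findings"], known_max=3.0)  # 1.0
speed = 1.0 - normalize(raw["latency_s"], known_max=5.0)      # 0.76

def weighted_avg(*pairs: tuple[float, float]) -> float:
    """Combine (value, weight) pairs into one rollup metric."""
    total = sum(w for _, w in pairs)
    return sum(v * w for v, w in pairs) / total

overall = weighted_avg((quality, 0.5), (safety, 0.3), (speed, 0.2))
print(round(overall, 3))  # 0.752
```

Normalizing first matters: without it, a scorer that ranges over seconds would dominate one that ranges over [0, 1], and the weights would no longer mean what they say.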

Use scorer thresholds in agent hooks and conditions with .above(), .below(), or .as_condition():

```python
from dreadnode.scorers import contains

quality = contains("well-structured")
must_pass = quality.above(0.5)
just_record = quality.as_condition()
```

Thresholds are especially useful when you want one scorer to do double duty:

  • as a numeric metric in evaluations
  • as a gate in hooks, reactions, or stop conditions

```python
import dreadnode as dn

@dn.scorer(name="length_bonus")
def length_bonus(text: str) -> float:
    return 1.0 if len(text) > 120 else 0.0

# Scoring is async, so call .score() from an async context
metric = await length_bonus.score("Short response.")
print(metric.value)
```
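
As a rough mental model of what such a decorator does — emphatically not the SDK's implementation, all names below are invented for the sketch — it wraps a plain function in an object that scores asynchronously and returns a metric carrying its name and value:

```python
import asyncio
from dataclasses import dataclass
from typing import Callable

# Illustrative mental model only; Metric, ScorerWrapper, and scorer()
# are invented here and do not mirror dreadnode internals.
@dataclass
class Metric:
    name: str
    value: float

class ScorerWrapper:
    def __init__(self, fn: Callable[[str], float], name: str):
        self.fn, self.name = fn, name

    async def score(self, text: str) -> Metric:
        return Metric(self.name, float(self.fn(text)))

def scorer(name: str):
    def decorate(fn: Callable[[str], float]) -> ScorerWrapper:
        return ScorerWrapper(fn, name)
    return decorate

@scorer(name="length_bonus")
def length_bonus(text: str) -> float:
    return 1.0 if len(text) > 120 else 0.0

metric = asyncio.run(length_bonus.score("Short response."))
print(metric.name, metric.value)  # length_bonus 0.0
```

The point of the model: after decoration, `length_bonus` is no longer a bare function but a scorer object, which is why it composes, renames, and thresholds like the built-ins.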

Good custom scorers are:

  • deterministic
  • cheap enough to run repeatedly
  • clearly bounded or normalized when they will be combined with other metrics
  • named in a way that will still make sense in logs and evaluation summaries

If a scorer is intended to be a hard pass/fail condition, either wrap it with threshold(...) or use assert_scores in the evaluation layer so the outcome is explicit.
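
The effect of that wrapping can be sketched in plain Python. The `threshold` function below is a local stand-in with an assumed signature, not the SDK helper of the same name:

```python
from typing import Callable

def threshold(score_fn: Callable[[str], float], cutoff: float) -> Callable[[str], bool]:
    """Turn a numeric scorer into an explicit boolean pass/fail gate."""
    def gate(text: str) -> bool:
        return score_fn(text) >= cutoff
    return gate

# Hypothetical numeric scorer: longer responses score higher, capped at 1.0
length_score = lambda t: min(1.0, len(t) / 200)

length_gate = threshold(length_score, cutoff=0.5)
print(length_gate("x" * 150))   # True: 150/200 = 0.75 >= 0.5
print(length_gate("too short")) # False
```

Making the gate explicit keeps the numeric metric and the pass/fail decision separate in logs, so a failed run shows both the score and the cutoff it missed.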