Scorers

Turn outputs into metrics with built-in scorers, composition algebra, and custom scoring functions.

```python
from dreadnode.scorers import contains, detect_pii, system_prompt_leaked

mentions_platform = contains("dreadnode")
pii_risk = detect_pii()
prompt_leak = system_prompt_leaked()
```

A scorer turns an output into a Metric. Use them to check that an agent’s response contains required content, doesn’t leak secrets or PII, passes a pass/fail gate, or rolls up into a single quality-and-safety number you can compare across runs.

Scorers are Python-first and live in the SDK. They plug into local evaluations, agent hooks, and optimization studies — the same scorer can serve as a metric in one context and a gate in another.

The Python SDK ships with 100+ scorers across categories like security, PII detection, exfiltration, MCP/agentic safety, reasoning, and IDE workflows. Start with built-ins — they stay consistent across evaluations and are less likely to drift than one-off local scoring logic.

Combine scorers with operators and helpers:

  • & / | / ~ for logical composition
  • + / - / * for arithmetic composition
  • >> / // to rename scorers (log all vs log primary)
  • threshold(), normalize(), invert(), remap_range(), scale(), clip(), weighted_avg()
```python
import dreadnode as dn
from dreadnode.scorers import contains, detect_pii, normalize, weighted_avg

mentions = contains("agent")
quality = normalize(mentions, known_max=1.0)
safety = ~detect_pii()  # inverted: scores high when no PII is detected

# Roll up into a single weighted metric, renamed for logs
overall = weighted_avg((quality, 0.6), (safety, 0.4)) >> "overall_score"
combined = (quality & safety) // "quality_and_safety"
```
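
To build intuition for how operator composition like the above can work, here is a toy plain-Python sketch. The `Scorer` class, and the choice of min/max/complement semantics for `&`, `|`, and `~`, are illustrative assumptions for this sketch only, not dreadnode internals:

```python
from dataclasses import dataclass
from typing import Callable

# Toy model of scorer composition. Semantics (min for &, max for |,
# 1 - x for ~) are assumptions for illustration, not the SDK's behavior.
@dataclass
class Scorer:
    name: str
    fn: Callable[[str], float]

    def score(self, text: str) -> float:
        return self.fn(text)

    def __and__(self, other: "Scorer") -> "Scorer":
        return Scorer(f"{self.name}_and_{other.name}",
                      lambda t: min(self.fn(t), other.fn(t)))

    def __or__(self, other: "Scorer") -> "Scorer":
        return Scorer(f"{self.name}_or_{other.name}",
                      lambda t: max(self.fn(t), other.fn(t)))

    def __invert__(self) -> "Scorer":
        return Scorer(f"not_{self.name}", lambda t: 1.0 - self.fn(t))

    def __rshift__(self, new_name: str) -> "Scorer":
        return Scorer(new_name, self.fn)

contains_agent = Scorer("contains_agent", lambda t: 1.0 if "agent" in t else 0.0)
short = Scorer("short", lambda t: 1.0 if len(t) < 40 else 0.0)

both = (contains_agent & short) >> "good_response"
print(both.name, both.score("the agent replied"))  # good_response 1.0
```

The key design point carries over regardless of exact semantics: composition returns a new scorer, so combined metrics stay first-class objects you can rename, threshold, and log.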

The usual pattern is:

  • build a few narrow scorers
  • normalize them onto a comparable scale
  • combine them into one or two rollout metrics that are easy to reason about
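
The arithmetic behind that pattern is simple enough to sketch in plain Python. The raw values, bounds, and weights below are hypothetical, and the `normalize`/`weighted_avg` functions here are local stand-ins, not the SDK helpers:

```python
# Hypothetical raw outputs from a few narrow scorers
raw = {"keyword_hits": 3.0, "pii_findings": 0.0, "latency_s": 1.2}

def normalize(value: float, known_max: float) -> float:
    """Clamp a raw score onto [0, 1] using an assumed known bound."""
    return max(0.0, min(1.0, value / known_max))

quality = normalize(raw["keyword_hits"], known_max=5.0)       # 0.6
safety = 1.0 - normalize(raw["pii_findings"], known_max=3.0)  # 1.0
speed = 1.0 - normalize(raw["latency_s"], known_max=5.0)      # 0.76

def weighted_avg(*pairs: tuple[float, float]) -> float:
    """Combine (value, weight) pairs into one rollup metric."""
    total = sum(w for _, w in pairs)
    return sum(v * w for v, w in pairs) / total

overall = weighted_avg((quality, 0.5), (safety, 0.3), (speed, 0.2))
print(round(overall, 3))  # 0.752
```

Normalizing first matters: without it, a scorer that ranges over seconds would dominate one that ranges over [0, 1], and the weights would no longer mean what they say.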

Use scorer thresholds in agent hooks and conditions with .above(), .below(), or .as_condition():

```python
from dreadnode.scorers import contains

quality = contains("well-structured")
must_pass = quality.above(0.5)
just_record = quality.as_condition()
```

Thresholds are especially useful when you want one scorer to do double duty:

  • as a numeric metric in evaluations
  • as a gate in hooks, reactions, or stop conditions

```python
import dreadnode as dn

@dn.scorer(name="length_bonus")
def length_bonus(text: str) -> float:
    return 1.0 if len(text) > 120 else 0.0

# Scoring is async, so call .score() from an async context
metric = await length_bonus.score("Short response.")
print(metric.value)
```
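
As a rough mental model of what such a decorator does — emphatically not the SDK's implementation, all names below are invented for the sketch — it wraps a plain function in an object that scores asynchronously and returns a metric carrying its name and value:

```python
import asyncio
from dataclasses import dataclass
from typing import Callable

# Illustrative mental model only; Metric, ScorerWrapper, and scorer()
# are invented here and do not mirror dreadnode internals.
@dataclass
class Metric:
    name: str
    value: float

class ScorerWrapper:
    def __init__(self, fn: Callable[[str], float], name: str):
        self.fn, self.name = fn, name

    async def score(self, text: str) -> Metric:
        return Metric(self.name, float(self.fn(text)))

def scorer(name: str):
    def decorate(fn: Callable[[str], float]) -> ScorerWrapper:
        return ScorerWrapper(fn, name)
    return decorate

@scorer(name="length_bonus")
def length_bonus(text: str) -> float:
    return 1.0 if len(text) > 120 else 0.0

metric = asyncio.run(length_bonus.score("Short response."))
print(metric.name, metric.value)  # length_bonus 0.0
```

The point of the model: after decoration, `length_bonus` is no longer a bare function but a scorer object, which is why it composes, renames, and thresholds like the built-ins.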

Good custom scorers are:

  • deterministic
  • cheap enough to run repeatedly
  • clearly bounded or normalized when they will be combined with other metrics
  • named in a way that will still make sense in logs and evaluation summaries

If a scorer is intended to be a hard pass/fail condition, either wrap it with threshold(...) or use assert_scores in the evaluation layer so the outcome is explicit.
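
The effect of that wrapping can be sketched in plain Python. The `threshold` function below is a local stand-in with an assumed signature, not the SDK helper of the same name:

```python
from typing import Callable

def threshold(score_fn: Callable[[str], float], cutoff: float) -> Callable[[str], bool]:
    """Turn a numeric scorer into an explicit boolean pass/fail gate."""
    def gate(text: str) -> bool:
        return score_fn(text) >= cutoff
    return gate

# Hypothetical numeric scorer: longer responses score higher, capped at 1.0
length_score = lambda t: min(1.0, len(t) / 200)

length_gate = threshold(length_score, cutoff=0.5)
print(length_gate("x" * 150))   # True: 150/200 = 0.75 >= 0.5
print(length_gate("too short")) # False
```

Making the gate explicit keeps the numeric metric and the pass/fail decision separate in logs, so a failed run shows both the score and the cutoff it missed.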