May 21, 2026

Compound eval verification — script_and_judge and flag_and_judge

9 new 7 improved 18 fixed

Eval tasks now support compound verification strategies that chain mechanical checks with an LLM judge to close reward-hacking gaps.
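As a rough illustration of the chaining idea, the sketch below runs a mechanical check first and only consults a judge when that check passes. Everything here is hypothetical: `Verdict`, `compound_verify`, `flag_check`, and `stub_judge` are illustrative names, not the platform's API. The judge is a plain function standing in for an LLM call, and the short-circuit on a failed mechanical check is an assumption about ordering rather than documented behavior.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Verdict:
    passed: bool
    reason: str


# Hypothetical shape: a check receives the agent trajectory as text.
Check = Callable[[str], Verdict]


def compound_verify(mechanical: Check, judge: Check, trajectory: str) -> Verdict:
    """Run the mechanical check, then the judge as a second line of defense."""
    first = mechanical(trajectory)
    if not first.passed:
        # Assumption: a failed script/flag check fails the task outright.
        return first
    second = judge(trajectory)
    if not second.passed:
        return Verdict(False, f"judge rejected: {second.reason}")
    return Verdict(True, "mechanical check and judge both passed")


def flag_check(trajectory: str) -> Verdict:
    # Stand-in for a flag check: the expected flag must appear.
    found = "FLAG{demo}" in trajectory
    return Verdict(found, "flag found" if found else "flag missing")


def stub_judge(trajectory: str) -> Verdict:
    # Stand-in for an LLM judge: flags an obvious shortcut.
    cheated = "cat /answers/flag.txt" in trajectory
    reason = "read the answer file directly" if cheated else "looks legitimate"
    return Verdict(not cheated, reason)
```

In this sketch, a trajectory that contains the flag but obtained it by reading an answer file passes the flag check and is still rejected by the judge, which is the reward-hacking gap the compound methods are meant to close.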

New

Compound eval verification: script_and_judge and flag_and_judge. Two new verification methods let eval task authors run a script or flag check first, then invoke an outcome judge as a second line of defense against reward hacking — no custom rubric required. (dreadnode/dreadnode-tiger#1534, dreadnode/dreadnode-tiger#1537)
OutcomeJudge agentic verification. A new outcome_judge verification method lets eval tasks use an agentic LLM to inspect agent trajectories and emit pass/fail verdicts, alongside the existing script and flag methods. It is exposed via the new dn outcome-judge run CLI subcommand. (dreadnode/dreadnode-tiger#1475)
intent_plus_outputs_summary guard strategy. A new opt-in policy replaces raw tool outputs with LLM-generated summaries in judge prompts, reducing prompt-injection surface while preserving tool-call context. (dreadnode/dreadnode-tiger#1524)
Six new web app pentesting skills. ESI injection, gRPC-Web pentest, H2C/WebSocket smuggling, HTTP connection contamination, timing-attack recon, and XSLT injection are now available as knowledge distillations in the web-security capability. (dreadnode/capabilities#16)
waymore added to web-security capability runtime. Historical URL and response retrieval from Wayback Machine, CommonCrawl, OTX, URLScan, and VirusTotal is now available for recon and JS archaeology. (dreadnode/capabilities#17)
Free-text search in evaluation trajectories. Matches highlight inline and auto-expand collapsed tool calls, with the query synced to the URL. (dreadnode/dreadnode-tiger#1491)
Expanded AIRT goal categories. The AI red-teaming agent now surfaces all 15 goal categories — including reasoning_exploitation, supply_chain, and resource_exhaustion — instead of the previous incomplete list of 9. (dreadnode/capabilities#14)
Simplified AIRT workspace layout. The AI red-teaming capability now uses a cleaner ~/.dreadnode/airt/[org]/[workspace]/ path structure with clearer error messages, replacing the previous env-var-based path system. (dreadnode/capabilities#10)
Per-member model access controls for org admins. Organization admins can now control which AI models each member is allowed to use, directly from the org members admin page.
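The intent_plus_outputs_summary guard strategy listed above can be pictured with a small sketch: each tool call keeps its name and arguments (the intent), while the raw output is replaced by a summary before the judge prompt is assembled. The names `ToolCall`, `build_judge_prompt`, and `stub_summarizer` are hypothetical, and the summarizer is a plain function standing in for the LLM-generated summary; the real prompt layout is not described in these notes.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class ToolCall:
    name: str
    arguments: str
    output: str  # raw, untrusted tool output


def build_judge_prompt(calls: List[ToolCall], summarize: Callable[[str], str]) -> str:
    """Keep tool-call intent, but never copy raw output into the prompt."""
    lines = []
    for i, call in enumerate(calls, start=1):
        lines.append(f"{i}. {call.name}({call.arguments})")
        lines.append(f"   summary: {summarize(call.output)}")
    return "\n".join(lines)


def stub_summarizer(output: str) -> str:
    # Stand-in for an LLM summary: reports only the size of the output.
    return f"{len(output.splitlines())} line(s) of output"
```

Because only the summary reaches the judge, instructions embedded in tool output are not passed through verbatim, while the judge still sees which tools were called and with what arguments.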

Improvements

TUI keystroke performance. Keystroke routing is now ~400× faster in long conversations — typing in the composer no longer slows down as conversation history grows. (dreadnode/dreadnode-tiger#1507)
TUI event loop and grep tool overhaul. The TUI no longer stalls when sync tools run blocking calls; grep gains output modes, context lines, and smarter filtering; tool result summaries now reflect actual output. (dreadnode/dreadnode-tiger#1510)
TUI session commands rationalized. /clear and /reset both now start a fresh session (preserving the old one), and the destructive in-place wipe is removed. (dreadnode/dreadnode-tiger#1511)
Assistant-led CLI guide modals. The in-product CLI guides for capabilities, tasks, environments, and models are rewritten with an assistant-led flow, accurate commands, and an ‘Ask an agent’ tab that generates a copyable dn --prompt launcher. (dreadnode/dreadnode-tiger#1538, dreadnode/dreadnode-tiger#1541, dreadnode/dreadnode-tiger#1543)
Consistent empty states across the platform. Empty states now distinguish ‘nothing here yet’ from ‘no filter matches’, include inline docs links, and show ghost visuals that hint at what each surface will contain. (dreadnode/dreadnode-tiger#1498)
AIRT Details glossary consolidated. AIRT assessments now show a single glossary popover at the section header instead of three separate per-cell tooltips. (dreadnode/dreadnode-tiger#1493)
Consistent browser tab titles. All pages now show titles in the format {Page} | Dreadnode, replacing a mix of inconsistent formats and blank tabs. (dreadnode/dreadnode-tiger#1490)

Fixes

Compound eval verification false negatives resolved. script_and_judge and flag_and_judge no longer report failed results when the agent successfully completes a task. (dreadnode/dreadnode-tiger#1545)
AIRT CLI assessment creation now surfaces in the UI. dn airt run and dn airt run-suite now correctly connect to the platform, so runs and results appear in the assessment UI instead of silently disappearing or showing Assessment: None. (dreadnode/dreadnode-tiger#1499, dreadnode/dreadnode-tiger#1501)
Monitoring tab reports now display content. For reports over 10 KB, the Reports panel now shows the actual markdown report body instead of an S3 path pointer. (dreadnode/dreadnode-tiger#1514)
Reports tab no longer stuck on ‘Loading report content…’. The Reports tab spinner no longer gets stuck after a deep-link reload when the report body was already cached. (dreadnode/dreadnode-tiger#1520)
Sandbox readiness timeout raised to 300 seconds. Capability-heavy sandboxes (e.g. web-security + zero-day-research) no longer time out during startup. (dreadnode/dreadnode-tiger#1503)
Cross-provider AIRT runs label the correct target model. The assessment UI now shows the target model instead of the attacker/orchestrator model for cross-provider runs. (dreadnode/dreadnode-tiger#1519)
Bundled capability disable state now persists. Toggling off bundled capabilities (e.g. self-improvement) in the TUI correctly disables them on reload instead of silently reverting to enabled. (dreadnode/dreadnode-tiger#1517)
Capability preflight checks now use the capability root as working directory. Checks referencing relative paths no longer fail with ‘No such file or directory’. (dreadnode/dreadnode-tiger#1516)
Invitation links show Sign In prompt for unauthenticated users. Unauthenticated visitors who open an invitation link now see a Sign In / Create Account prompt instead of ‘Invalid Invitation’. (dreadnode/dreadnode-tiger#1518)
BYOK OpenAI quota errors surface immediately. Agents using BYOK OpenAI no longer retry up to 8 times on insufficient_quota — the failure surfaces on the first attempt. (dreadnode/dreadnode-tiger#1512)
save_workflow now detects silent overwrites. The tool verifies file writes succeeded and warns when content is unchanged, preventing agents from operating on stale workflow scripts. (dreadnode/capabilities#15)
Deep links to evaluation samples no longer show ‘sample not found’. Navigating directly to a specific evaluation sample now resolves correctly. (dreadnode/dreadnode-tiger#1484)
Hosted sandboxes page shows accurate active counts. Sandbox counts are now backed by a facets API that aggregates state correctly. (dreadnode/dreadnode-tiger#1483)
Errored evaluation statuses tracked correctly. Errored statuses are now included in eval phase summaries instead of being silently dropped. (dreadnode/dreadnode-tiger#1523)
TUI tool meta line no longer drops. Tool calls now display the ↳ <summary> meta line beneath the tool name instead of showing a blank gutter. (dreadnode/dreadnode-tiger#1546)
TUI spinner stays visible when Esc drains a queued message. The spinner no longer disappears while a queued message is still running. (dreadnode/dreadnode-tiger#1513)
TUI tool output wrapping stays within the gutter. Wrapped lines at narrow terminal widths now stay aligned under the gutter border instead of leaking to column 0. (dreadnode/dreadnode-tiger#1522)
Docs code blocks render correct spacing. CLI commands like dn --capability now display with correct spacing in docs code blocks (Geist Mono ligatures were collapsing the space before --). (dreadnode/dreadnode-tiger#1500)
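For the BYOK OpenAI quota fix listed above, the general pattern is to treat some error codes as non-retryable so they surface on the first attempt. The sketch below illustrates that pattern only. `ProviderError`, `NON_RETRYABLE`, and `call_with_retry` are hypothetical names, the default of 8 attempts mirrors the figure in the entry, and backoff between attempts is omitted for brevity.

```python
from typing import Callable, TypeVar

T = TypeVar("T")


class ProviderError(Exception):
    """Hypothetical provider error carrying a machine-readable code."""

    def __init__(self, code: str):
        super().__init__(code)
        self.code = code


# Assumption: codes for which retrying cannot succeed.
NON_RETRYABLE = {"insufficient_quota"}


def call_with_retry(fn: Callable[[], T], max_attempts: int = 8) -> T:
    """Retry transient failures, but re-raise non-retryable ones immediately."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except ProviderError as err:
            # Fail fast on quota errors, or when attempts are exhausted.
            if err.code in NON_RETRYABLE or attempt == max_attempts:
                raise
    raise AssertionError("unreachable")
```

With this shape, a quota error costs one request, while a transient error such as a rate limit is still retried up to the attempt limit.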