ADR-052 — Tri-Judge Grader: Harden (not Rip)

Date: 2026-05-15
Status: Accepted
Deciders: Mishaal Murawala
Supersedes: (none — first ADR for grader subsystem)

Context

The tri-judge grader (src/grader/*) + supporting crons (judge-model-discovery, grader-tick) has produced empty or timed-out verdicts across multiple sessions dating to the initial deploy. Four production consumers gate deploy-safety on grader_verdicts.quality:

Consumer	File	How it uses quality
Canary t-test	`src/cron/bridge-controller.ts:232,255`	Welch’s t-test over quality scores
Regression detection	`src/cron/harness-verify.ts:175`	Flags quality drops post-deploy
Auto-triage	`src/cron/harness-triage.ts:331`	Severity routing
Baseline snapshot	`scripts/post-deploy-record-version.ts:130`	Seeds baseline on promotion

Ripping the grader blinds deploy-canary detection. Decision: Harden, with an explicit kill-switch that rips automatically if the harden fails.

Root Causes (confirmed)

H1 (PRIME) — Token budget too small for reasoning-class models

src/grader/judges.ts had a hardcoded MAX_OUTPUT_TOKENS = 512. Reasoning models (e.g. deepseek/deepseek-r1) emit chain-of-thought tokens before the visible tool_calls payload, so the 512-token budget is consumed entirely by CoT and the response ends with finish_reason: "length" and empty tool_calls. The failure then propagates silently: runOpenRouterJudge returns VALIDATION_ERROR "no tool_call returned", runTriJudge records null, and grader-tick skips the trace without raising an alert.

Direct empirical evidence: DeepSeek-flash attempt logged finish_reason=length with ≥1200 reasoning tokens visible and zero tool output.

H2 (PRIME) — Discovery cron selected reasoning-class models as judges

src/cron/judge-model-discovery.ts filtered OpenRouter /models?supported_parameters=tools by promptPrice > $0.30/M and selected the max-priced model per author. No exclusion of reasoning-class models. deepseek-r1 won because it was the priciest DeepSeek tools-capable model. The same logic also selected legacy openai/gpt-4 (over GPT-4.1) and x-ai/grok-3-beta (beta channel).

Additionally, the discovery filter only required that supported_parameters include 'tools'; it did not require 'tool_choice'. Missing tool_choice support is a second silent-failure path.

Secondary

api_config:grader key absent in KV — cron relied on hardcoded defaults.
No per-judge integration test — bad model selections only surfaced when grader-tick consumed them and silently skipped traces.
STALE_THRESHOLD_MS = 36h in resolve-judge-models.ts — tolerated two missed daily runs before firing, allowing stale models to linger undetected.

Decision

Four layers of hardening, shipped in one PR:

Layer 1 — Token budget + error code (structural fix)

src/grader/judges.ts

Replace MAX_OUTPUT_TOKENS = 512 with tokenBudgetForModel(model): number
- Reasoning-class models: 8192 (covers CoT + tool_call envelope)
- All other models: 4096 (was 512 — 8× increase for non-reasoning)
Reasoning detection: isReasoningModel(model) — regex /(^|[\/-])(r1|o[1-9]|reasoner|reasoning|thinking|thought)([\/-]|$)/i on model ID
New TRUNCATED error code: fires when finish_reason === 'length' AND tool_calls empty. Previously both paths returned generic VALIDATION_ERROR, masking budget exhaustion in error_ledger.
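
A minimal sketch of the Layer 1 changes in src/grader/judges.ts, assuming the regex above is applied to the raw model ID. The names tokenBudgetForModel and isReasoningModel come from this ADR; classifyJudgeFailure and the JudgeChoice shape are hypothetical and only illustrate when TRUNCATED would fire instead of VALIDATION_ERROR.

```typescript
// Sketch only: function names mirror the ADR, bodies are assumptions.
const REASONING_PATTERN =
  /(^|[\/-])(r1|o[1-9]|reasoner|reasoning|thinking|thought)([\/-]|$)/i;

function isReasoningModel(model: string): boolean {
  return REASONING_PATTERN.test(model);
}

function tokenBudgetForModel(model: string): number {
  // 8192 leaves room for CoT plus the tool_call envelope; 4096 for everything else.
  return isReasoningModel(model) ? 8192 : 4096;
}

type JudgeErrorCode = "TRUNCATED" | "VALIDATION_ERROR";

// Hypothetical minimal view of one chat-completion choice.
interface JudgeChoice {
  finish_reason: string | null;
  tool_calls?: unknown[];
}

// Returns null when a tool call is present, otherwise the error code to record.
function classifyJudgeFailure(choice: JudgeChoice): JudgeErrorCode | null {
  const hasToolCall = (choice.tool_calls?.length ?? 0) > 0;
  if (hasToolCall) return null;
  return choice.finish_reason === "length" ? "TRUNCATED" : "VALIDATION_ERROR";
}
```

Under this pattern, deepseek/deepseek-r1 and openai/o3-mini receive the 8192 budget, while openai/gpt-4o does not match because "4o" is not an o[1-9] segment. An ID with a variant suffix after a colon (a hypothetical vendor/model-r1:free, say) would not match either, since ':' is not treated as a segment boundary.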

src/cron/judge-model-discovery.ts

isReasoningModelId() filter: excludes reasoning-class models from judge candidates. Judges need fast structured output; CoT models are the wrong tool.
Require supported_parameters includes both 'tools' AND 'tool_choice'.
On startup: if api_config:grader absent in KV, seed default { enabled_providers: ['openai','xai','deepseek'], disabled_providers: [] }.
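
The hardened candidate filter could look roughly like the sketch below. It assumes the existing price floor and the per-author "highest prompt price wins" rule are kept unchanged, and that isReasoningModelId uses the same pattern as the judges.ts detector; the ADR does not state either. The CandidateModel shape, isEligibleJudge, and selectJudgeCandidates are hypothetical names.

```typescript
// Hypothetical, pre-normalized view of one entry from the models listing.
interface CandidateModel {
  id: string; // "<author>/<model>"
  promptPricePerMTok: number; // USD per million prompt tokens
  supportedParameters: string[];
}

const REASONING_PATTERN =
  /(^|[\/-])(r1|o[1-9]|reasoner|reasoning|thinking|thought)([\/-]|$)/i;
const MIN_PROMPT_PRICE_PER_MTOK = 0.3;

function isReasoningModelId(id: string): boolean {
  return REASONING_PATTERN.test(id);
}

// A candidate must clear the price floor, support both tool parameters,
// and not look like a reasoning-class model.
function isEligibleJudge(m: CandidateModel): boolean {
  return (
    m.promptPricePerMTok > MIN_PROMPT_PRICE_PER_MTOK &&
    m.supportedParameters.includes("tools") &&
    m.supportedParameters.includes("tool_choice") &&
    !isReasoningModelId(m.id)
  );
}

// Assumption: one judge per author, chosen by highest prompt price.
function selectJudgeCandidates(models: CandidateModel[]): Map<string, CandidateModel> {
  const best = new Map<string, CandidateModel>();
  for (const m of models) {
    if (!isEligibleJudge(m)) continue;
    const author = m.id.split("/")[0] ?? m.id;
    const current = best.get(author);
    if (current === undefined || m.promptPricePerMTok > current.promptPricePerMTok) {
      best.set(author, m);
    }
  }
  return best;
}
```

With these rules a reasoning-class model is rejected before the price comparison, so it can no longer win simply by being the most expensive entry for its author.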

src/grader/resolve-judge-models.ts

STALE_THRESHOLD_MS from 36h → 30h (tolerates one missed daily run).
JudgeResolutionContext.onStale callback — caller (grader-tick) fires Slack alert via ctx.waitUntil() with per-provider KV dedupe (grader:stale_alert:{provider}, 1h TTL).
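
As a rough illustration of the staleness path, the sketch below checks a model's age against the 30h threshold and invokes onStale, with a 1h dedupe window standing in for the grader:stale_alert:{provider} KV key. Only STALE_THRESHOLD_MS and JudgeResolutionContext.onStale are named in this ADR; the now field, the callback signature, checkStale, and the in-memory Map are assumptions made to keep the example self-contained.

```typescript
const STALE_THRESHOLD_MS = 30 * 60 * 60 * 1000; // 30h
const STALE_ALERT_TTL_MS = 60 * 60 * 1000; // 1h dedupe window

// Assumed shape: only onStale is named in the ADR.
interface JudgeResolutionContext {
  now: number; // epoch milliseconds
  onStale?: (provider: string, ageMs: number) => void;
}

// Returns true (and notifies the caller) when the stored model is too old.
function checkStale(
  provider: string,
  updatedAtMs: number,
  ctx: JudgeResolutionContext,
): boolean {
  const ageMs = ctx.now - updatedAtMs;
  const stale = ageMs > STALE_THRESHOLD_MS;
  if (stale) ctx.onStale?.(provider, ageMs);
  return stale;
}

// In-memory stand-in for the per-provider KV dedupe key with a 1h TTL.
function shouldSendStaleAlert(
  sentAt: Map<string, number>,
  provider: string,
  now: number,
): boolean {
  const key = `grader:stale_alert:${provider}`;
  const last = sentAt.get(key);
  if (last !== undefined && now - last < STALE_ALERT_TTL_MS) return false;
  sentAt.set(key, now);
  return true;
}
```

In the real worker the dedupe write would go through the KV binding with an expiration TTL, and the alert itself would be handed to ctx.waitUntil() so it does not block the tick.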

Layer 2 — Per-judge integration test (guard)

New: src/cron/judge-canary.ts

Multiplexed into 0 6 * * * cron slot, runs after judge-model-discovery within the same invocation (CF free plan is at 5/5 cron slots).
Reads judge_config:{provider}:current_model for all 3 providers.
Sends each a known-good fixture trace (HubSpot contacts fetch, unambiguous crm_read) with submit_evaluation tool and forced tool_choice.
Asserts: HTTP 200 · tool_calls[0].function.arguments parses as JSON · VerdictSchema.safeParse passes.
On success: writes judge_config:{provider}:last_good_model (10d TTL). This key is the only rollback target.
On failure: reads last_good_model, overwrites current_model with it (rollback). Writes error_ledger row. Fires deduplicated Slack alert (grader:canary_alert:{provider}, 24h TTL).
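
The pass/fail handling above can be sketched as two small functions. This is an illustration under stated assumptions, not the canary's actual code: the response is assumed to follow the OpenAI-compatible choices[0].message.tool_calls layout, hasNumericQuality is a hypothetical stand-in for VerdictSchema.safeParse, and a Map replaces the async KV binding (TTLs, the error_ledger write, and the Slack alert are omitted).

```typescript
interface ToolCall {
  function: { name?: string; arguments: string };
}
interface ChatCompletionBody {
  choices?: { message?: { tool_calls?: ToolCall[] } }[];
}

// Hypothetical stand-in for VerdictSchema.safeParse(...).success.
function hasNumericQuality(v: unknown): boolean {
  return (
    typeof v === "object" &&
    v !== null &&
    typeof (v as { quality?: unknown }).quality === "number"
  );
}

// The three assertions: HTTP 200, arguments parse as JSON, schema passes.
function canaryPassed(
  status: number,
  body: ChatCompletionBody,
  isValidVerdict: (v: unknown) => boolean,
): boolean {
  if (status !== 200) return false;
  const args = body.choices?.[0]?.message?.tool_calls?.[0]?.function.arguments;
  if (typeof args !== "string") return false;
  try {
    return isValidVerdict(JSON.parse(args));
  } catch {
    return false; // arguments were not valid JSON
  }
}

type CanaryOutcome = "recorded" | "rolled_back" | "no_rollback_target";

// Map stands in for KV; the real bindings are async and carry TTLs.
function applyCanaryResult(
  kv: Map<string, string>,
  provider: string,
  currentModel: string,
  passed: boolean,
): CanaryOutcome {
  const currentKey = `judge_config:${provider}:current_model`;
  const lastGoodKey = `judge_config:${provider}:last_good_model`;
  if (passed) {
    kv.set(lastGoodKey, currentModel);
    return "recorded";
  }
  const lastGood = kv.get(lastGoodKey);
  if (lastGood === undefined) return "no_rollback_target";
  kv.set(currentKey, lastGood);
  return "rolled_back";
}
```

The "no_rollback_target" outcome is worth handling explicitly: because last_good_model is the only rollback target and expires after 10 days, a provider that has never passed (or has failed for longer than the TTL) has nothing to roll back to.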

scripts/checks/21-judge-model-quality.ts

Added reasoning-model pattern check with segment-boundary guard (avoids false-positive on bare property names like 'reasoning').
Token budget floor assertion: compiled default must be ≥ 4096.

Layer 3 — Per-judge 7d observability (weekly digest)

src/cron/bridge-controller.ts — runJudgeWeeklyAudit()

Runs on Mondays within the 0 6 * * * slot, alongside the existing weekly digest.
Two queries (D1 cold path):
- fetchJudgeVerdictCounts — counts rows in grader_verdicts per provider in last 7d (only successful verdicts are stored in that table; failures are tracked via canary).
- fetchCanaryFailureCounts — counts error_ledger rows where provider LIKE 'judge-canary:%' in last 7d.
Posts Slack info message with per-judge verdict counts and canary failure counts. Warning flag if any provider has canary failures > 0.
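
One way to merge the two query results into the digest is sketched below. It assumes the error_ledger provider value has the form judge-canary:<provider>, so the prefix can be stripped to line up with grader_verdicts; buildJudgeAudit, formatAuditLine, and both row shapes are hypothetical.

```typescript
interface ProviderCount {
  provider: string;
  count: number;
}
interface JudgeAuditRow {
  provider: string;
  verdictCount: number;
  canaryFailures: number;
}

const CANARY_PREFIX = "judge-canary:";

// Providers are passed in explicitly so that a provider with zero rows in
// either query still appears in the digest with a count of 0.
function buildJudgeAudit(
  providers: string[],
  verdicts: ProviderCount[],
  canaryFailures: ProviderCount[],
): { rows: JudgeAuditRow[]; warning: boolean } {
  const verdictBy = new Map<string, number>();
  for (const v of verdicts) verdictBy.set(v.provider, v.count);

  const failuresBy = new Map<string, number>();
  for (const f of canaryFailures) {
    const p = f.provider.startsWith(CANARY_PREFIX)
      ? f.provider.slice(CANARY_PREFIX.length)
      : f.provider;
    failuresBy.set(p, (failuresBy.get(p) ?? 0) + f.count);
  }

  const rows = providers.map((provider) => ({
    provider,
    verdictCount: verdictBy.get(provider) ?? 0,
    canaryFailures: failuresBy.get(provider) ?? 0,
  }));
  return { rows, warning: rows.some((r) => r.canaryFailures > 0) };
}

function formatAuditLine(r: JudgeAuditRow): string {
  return `${r.provider}: ${r.verdictCount} verdicts, ${r.canaryFailures} canary failures (7d)`;
}
```

Filling in zeros matters here because a grouped count query returns no row at all for a provider that produced nothing, and that is exactly the case the digest needs to surface.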

Layer 4 — Kill-switch (auto-rip if harden fails)

Trip conditions in evaluateKillSwitch() (weekly):

Condition A: ≥2 providers with verdictCount === 0 AND canaryFailures ≥ 1 (grader producing nothing AND canary confirmed bad)
Condition B: ≥2 providers with canaryFailures ≥ 5 (sustained canary failures)
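
The two trip conditions reduce to a pure function over the weekly per-provider counts. The sketch below reads Condition A as "at least two providers that each have zero verdicts and at least one canary failure"; if the intent is a different grouping, the filter would change. The row shape and return type are assumptions.

```typescript
interface JudgeAuditRow {
  provider: string;
  verdictCount: number;
  canaryFailures: number;
}

interface KillSwitchDecision {
  trip: boolean;
  condition: "A" | "B" | null;
}

function evaluateKillSwitch(rows: JudgeAuditRow[]): KillSwitchDecision {
  // Condition A: grader producing nothing AND canary confirmed bad.
  const deadProviders = rows.filter(
    (r) => r.verdictCount === 0 && r.canaryFailures >= 1,
  ).length;
  if (deadProviders >= 2) return { trip: true, condition: "A" };

  // Condition B: sustained canary failures, regardless of verdict count.
  const failingProviders = rows.filter((r) => r.canaryFailures >= 5).length;
  if (failingProviders >= 2) return { trip: true, condition: "B" };

  return { trip: false, condition: null };
}
```

Keeping the evaluation free of KV and Slack side effects makes the thresholds easy to unit-test separately from tripKillSwitch().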

On trip, tripKillSwitch():

Sets api_config:grader.mode = 'single-judge-fallback' and api_config:grader.fallback_model = 'deepseek/deepseek-chat' in KV.
Writes grader:kill_switch_tripped with 7d TTL (prevents re-trip within window).
Fires critical Slack page via ctx.waitUntil(sendAlert(...)).
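
A sketch of the trip sequence, using an in-memory class in place of the async KV binding so the TTL guard can be exercised directly. MemoryKv, the notify callback, and the stored guard value are assumptions; in this ADR's own table the guard key is read by evaluateKillSwitch, whereas the sketch checks it inside tripKillSwitch for brevity.

```typescript
interface Entry {
  value: string;
  expiresAtMs: number | null;
}

// In-memory stand-in for a KV namespace with per-key TTL.
class MemoryKv {
  private store = new Map<string, Entry>();

  get(key: string, nowMs: number): string | null {
    const e = this.store.get(key);
    if (e === undefined) return null;
    if (e.expiresAtMs !== null && nowMs >= e.expiresAtMs) {
      this.store.delete(key);
      return null;
    }
    return e.value;
  }

  put(key: string, value: string, nowMs: number, ttlSeconds?: number): void {
    const expiresAtMs = ttlSeconds === undefined ? null : nowMs + ttlSeconds * 1000;
    this.store.set(key, { value, expiresAtMs });
  }
}

const CONFIG_KEY = "api_config:grader";
const TRIPPED_KEY = "grader:kill_switch_tripped";
const SEVEN_DAYS_S = 7 * 24 * 60 * 60;

// Returns false without side effects if the guard key is still live.
function tripKillSwitch(
  kv: MemoryKv,
  nowMs: number,
  notify: (message: string) => void,
): boolean {
  if (kv.get(TRIPPED_KEY, nowMs) !== null) return false;

  const raw = kv.get(CONFIG_KEY, nowMs);
  const config: Record<string, unknown> = raw === null ? {} : JSON.parse(raw);
  config["mode"] = "single-judge-fallback";
  config["fallback_model"] = "deepseek/deepseek-chat";
  kv.put(CONFIG_KEY, JSON.stringify(config), nowMs);

  kv.put(TRIPPED_KEY, new Date(nowMs).toISOString(), nowMs, SEVEN_DAYS_S);
  notify("grader kill-switch tripped: mode=single-judge-fallback");
  return true;
}
```

The config is merged rather than replaced so that existing fields such as enabled_providers survive the trip.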

runTriJudge (future follow-on): reads mode flag → if 'single-judge-fallback', routes to deepseek/deepseek-chat only (non-reasoning, confirmed working, accepts vendor-bias temporarily). Full rip PR is auto-proposed if mode stays tripped at the next weekly tick.
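
Since the runTriJudge change is a future follow-on, the routing below is only a guess at its shape: a helper that picks which models to call based on the mode flag. GraderConfig and judgesForRun are hypothetical names.

```typescript
// Assumed subset of the api_config:grader value.
interface GraderConfig {
  mode?: string;
  fallback_model?: string;
}

const DEFAULT_FALLBACK_MODEL = "deepseek/deepseek-chat";

// Returns the models to query for one grading run.
function judgesForRun(config: GraderConfig, triJudgeModels: string[]): string[] {
  if (config.mode === "single-judge-fallback") {
    return [config.fallback_model ?? DEFAULT_FALLBACK_MODEL];
  }
  return triJudgeModels;
}
```

Resolving the model list up front keeps the fallback decision in one place, so the rest of the grading path does not need to know whether it is running one judge or three.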

New KV Keys

Key	Written by	Read by	TTL
`judge_config:{p}:last_good_model`	`judge-canary` on pass	`judge-canary` on fail (rollback)	10d
`grader:canary_alert:{p}`	`judge-canary` on first fail	`judge-canary` (dedupe)	24h
`grader:kill_switch_tripped`	`tripKillSwitch`	`evaluateKillSwitch` (guard)	7d
`grader:stale_alert:{p}`	`grader-tick` on stale model	`grader-tick` (dedupe)	1h

Consequences

Positive

Reasoning models can no longer be selected as judge candidates.
Even if a bad model slips through discovery, the daily canary catches it and rolls back within 24h without manual intervention.
TRUNCATED error code makes budget-exhaustion failures distinguishable in error_ledger from schema validation failures.
Kill-switch guarantees no zombie grader: if the harden fails, the system degrades gracefully to a known-working single-judge path rather than silently producing zero verdicts for weeks.

Neutral

Cron slot multiplexing means judge-canary runs sequentially after discovery in the same 0 6 * * * invocation rather than in its own slot (e.g. 30 6 * * *). Ordering is preserved (discovery writes → canary validates), so there is no functional difference.
grader_verdicts schema unchanged. No migration needed.

Negative / Risks

tokenBudgetForModel returns 8192 for reasoning models. If a reasoning model somehow slips past the discovery filter (new naming convention not matching the regex), it consumes more tokens per call. Acceptable: canary will catch and roll back within 24h.
Kill-switch Condition A requires verdictCount === 0, not just low counts. A grader producing 1 verdict/week for a provider won’t trip it. Accepted: intent is to catch total failure, not low-quality output — low-quality is caught by the canary integration test.

Invariants Preserved

#2 (KV-only hot path): all new D1 access is in cron context.
#9 (CF Cron): multiplexed into existing slots, no external cron services.
#11 (30s AbortController): canary runner sets JUDGE_TIMEOUT_MS = 30_000 on every outbound fetch.
#12 (LLM via AI Gateway): canary routes via resolveProviderBaseUrl (AI Gateway when env is configured, direct otherwise).

Stop Criterion

This ADR is satisfied when:

Per-judge canary green 7 consecutive days.
grader_verdicts 7d success rate ≥ 90% across all providers.
Bridge weekly digest shows no judge in warning or red state.

If not met within 8 weeks → execute Layer 4 rip path (full grader removal, replace 4 consumers with rollback-rate + error-rate deltas).