ADR-052 — Tri-Judge Grader: Harden (not Rip)
Date: 2026-05-15
Status: Accepted
Deciders: Mishaal Murawala
Supersedes: (none — first ADR for grader subsystem)
Context
The tri-judge grader (src/grader/*) and its supporting crons (judge-model-discovery,
grader-tick) have produced empty or timed-out verdicts across multiple sessions
dating to the initial deploy. Four production consumers gate deploy-safety on
grader_verdicts.quality:
| Consumer | File | How it uses quality |
|---|---|---|
| Canary t-test | src/cron/bridge-controller.ts:232,255 | Welch’s t-test over quality scores |
| Regression detection | src/cron/harness-verify.ts:175 | Flags quality drops post-deploy |
| Auto-triage | src/cron/harness-triage.ts:331 | Severity routing |
| Baseline snapshot | scripts/post-deploy-record-version.ts:130 | Seeds baseline on promotion |
Ripping the grader blinds deploy-canary detection. Decision: Harden, with an explicit kill-switch that rips automatically if the harden fails.
Root Causes (confirmed)
H1 (PRIME) — Token budget too small for reasoning-class models
src/grader/judges.ts had a hardcoded MAX_OUTPUT_TOKENS = 512. Reasoning models
(e.g. deepseek/deepseek-r1) emit chain-of-thought tokens before the visible
tool_calls payload. 512 tokens is consumed entirely by CoT, producing
finish_reason: "length" with empty tool_calls. runOpenRouterJudge returns
VALIDATION_ERROR "no tool_call returned". runTriJudge records null. grader-tick
skips trace silently.
Direct empirical evidence: DeepSeek-flash attempt logged finish_reason=length with
≥1200 reasoning tokens visible and zero tool output.
H2 (PRIME) — Discovery cron selected reasoning-class models as judges
src/cron/judge-model-discovery.ts filtered OpenRouter /models?supported_parameters=tools
by promptPrice > $0.30/M and selected the max-priced model per author. No exclusion
of reasoning-class models. deepseek-r1 won because it was the priciest
DeepSeek tools-capable model. The same logic also selected legacy openai/gpt-4
(over GPT-4.1) and x-ai/grok-3-beta (beta channel).
Additionally, the discovery filter only required supported_parameters includes 'tools';
it did not require 'tool_choice'. Missing tool_choice support is a second
silent-failure path.
Secondary
- api_config:grader key absent in KV; cron relied on hardcoded defaults.
- No per-judge integration test; bad model selections only surfaced when grader-tick consumed them and silently skipped traces.
- STALE_THRESHOLD_MS = 36h in resolve-judge-models.ts tolerated two missed daily runs before firing, allowing stale models to linger undetected.
Decision
Four layers of hardening, shipped in one PR:
Layer 1 — Token budget + error code (structural fix)
src/grader/judges.ts
- Replace MAX_OUTPUT_TOKENS = 512 with tokenBudgetForModel(model): number
  - Reasoning-class models: 8192 (covers CoT + tool_call envelope)
  - All other models: 4096 (was 512, an 8× increase for non-reasoning)
- Reasoning detection: isReasoningModel(model), using the regex /(^|[\/-])(r1|o[1-9]|reasoner|reasoning|thinking|thought)([\/-]|$)/i on the model ID
- New TRUNCATED error code: fires when finish_reason === 'length' AND tool_calls is empty. Previously both paths returned generic VALIDATION_ERROR, masking budget exhaustion in error_ledger.
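The judges.ts changes above can be sketched as follows. The regex and the 8192/4096 budgets are taken from this ADR; the function bodies, the JudgeChoice shape, and the classifyJudgeChoice helper are illustrative assumptions, not the shipped code.

```typescript
// Sketch only. Pattern and budgets come from this ADR; everything else is assumed.
const REASONING_PATTERN =
  /(^|[\/-])(r1|o[1-9]|reasoner|reasoning|thinking|thought)([\/-]|$)/i;

const REASONING_TOKEN_BUDGET = 8192; // covers CoT + tool_call envelope
const DEFAULT_TOKEN_BUDGET = 4096;   // was 512

function isReasoningModel(model: string): boolean {
  // Segment-boundary match: "r1" must be delimited by "/", "-", or string start/end.
  return REASONING_PATTERN.test(model);
}

function tokenBudgetForModel(model: string): number {
  return isReasoningModel(model) ? REASONING_TOKEN_BUDGET : DEFAULT_TOKEN_BUDGET;
}

// Hypothetical, simplified view of one chat-completion choice.
interface JudgeChoice {
  finish_reason: string;
  tool_calls?: unknown[];
}

type JudgeErrorCode = "TRUNCATED" | "VALIDATION_ERROR";

// Returns null when a tool call is present, otherwise the error code to record.
function classifyJudgeChoice(choice: JudgeChoice): JudgeErrorCode | null {
  const hasToolCall = (choice.tool_calls?.length ?? 0) > 0;
  if (hasToolCall) return null;
  // Budget exhaustion is reported separately from other missing-tool-call cases.
  return choice.finish_reason === "length" ? "TRUNCATED" : "VALIDATION_ERROR";
}
```

Under this sketch, deepseek/deepseek-r1 gets the 8192 budget and deepseek/deepseek-chat gets 4096, and the H1 failure mode (finish_reason "length" with no tool call) surfaces as TRUNCATED.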
src/cron/judge-model-discovery.ts
- isReasoningModelId() filter: excludes reasoning-class models from judge candidates. Judges need fast structured output; CoT models are the wrong tool.
- Require supported_parameters to include both 'tools' AND 'tool_choice'.
- On startup: if api_config:grader is absent in KV, seed default { enabled_providers: ['openai','xai','deepseek'], disabled_providers: [] }.
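A minimal sketch of the hardened candidate filter, under stated assumptions: the ModelEntry shape and the promptPricePerMTok field are hypothetical stand-ins for whatever the cron derives from the OpenRouter listing, and the max-price-per-author selection is carried over from the behaviour described under H2 (this ADR does not say whether it changed).

```typescript
// Hypothetical, trimmed-down model entry; field names are assumptions.
interface ModelEntry {
  id: string;                     // e.g. "deepseek/deepseek-chat"
  promptPricePerMTok: number;     // USD per million prompt tokens
  supported_parameters: string[];
}

const REASONING_ID_PATTERN =
  /(^|[\/-])(r1|o[1-9]|reasoner|reasoning|thinking|thought)([\/-]|$)/i;
const MIN_PROMPT_PRICE = 0.3; // the $0.30/M floor from this ADR

function isReasoningModelId(id: string): boolean {
  return REASONING_ID_PATTERN.test(id);
}

function isJudgeCandidate(m: ModelEntry): boolean {
  return (
    m.promptPricePerMTok > MIN_PROMPT_PRICE &&
    m.supported_parameters.includes("tools") &&
    m.supported_parameters.includes("tool_choice") && // new requirement
    !isReasoningModelId(m.id)                          // new exclusion
  );
}

// Assumed selection rule: highest-priced surviving candidate per author.
function pickJudges(models: ModelEntry[]): Map<string, ModelEntry> {
  const best = new Map<string, ModelEntry>();
  for (const m of models.filter(isJudgeCandidate)) {
    const author = m.id.split("/")[0] ?? m.id;
    const current = best.get(author);
    if (!current || m.promptPricePerMTok > current.promptPricePerMTok) {
      best.set(author, m);
    }
  }
  return best;
}
```

With both new conditions in place, a reasoning-class model can no longer win on price alone, and a model that advertises tools without tool_choice is dropped before selection.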
src/grader/resolve-judge-models.ts
- STALE_THRESHOLD_MS from 36h → 30h (tolerates one missed daily run).
- JudgeResolutionContext.onStale callback: caller (grader-tick) fires Slack alert via ctx.waitUntil() with per-provider KV dedupe (grader:stale_alert:{provider}, 1h TTL).
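The staleness check and alert dedupe might look roughly like this. The 30h threshold, the key format, and the 1h TTL are from this ADR; KvLike is a minimal stand-in for the Workers KV binding, and isStale and alertIfNotDeduped are hypothetical helper names.

```typescript
const STALE_THRESHOLD_MS = 30 * 60 * 60 * 1000; // 30h
const STALE_ALERT_TTL_SECONDS = 60 * 60;        // 1h dedupe window

// Minimal subset of a KV namespace (get, and put with expirationTtl in seconds).
interface KvLike {
  get(key: string): Promise<string | null>;
  put(key: string, value: string, options?: { expirationTtl?: number }): Promise<void>;
}

function isStale(updatedAtMs: number, nowMs: number): boolean {
  return nowMs - updatedAtMs > STALE_THRESHOLD_MS;
}

// Returns true if the alert was sent, false if the dedupe key suppressed it.
async function alertIfNotDeduped(
  kv: KvLike,
  provider: string,
  send: (message: string) => Promise<void>,
): Promise<boolean> {
  const key = `grader:stale_alert:${provider}`;
  if ((await kv.get(key)) !== null) return false;
  // Write the marker first so a slow Slack call does not widen the race window.
  await kv.put(key, "1", { expirationTtl: STALE_ALERT_TTL_SECONDS });
  await send(`Judge model for ${provider} is stale`);
  return true;
}
```

The get-then-put sequence is not atomic, so two concurrent invocations could both alert; for an hourly-deduped Slack message that is probably an acceptable trade-off.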
Layer 2 — Per-judge integration test (guard)
New: src/cron/judge-canary.ts
- Multiplexed into the 0 6 * * * cron slot; runs after judge-model-discovery within the same invocation (CF free plan is at 5/5 cron slots).
- Reads judge_config:{provider}:current_model for all 3 providers.
- Sends each a known-good fixture trace (HubSpot contacts fetch, unambiguous crm_read) with submit_evaluation tool and forced tool_choice.
- Asserts: HTTP 200 · tool_calls[0].function.arguments parses as JSON · VerdictSchema.safeParse passes.
- On success: writes judge_config:{provider}:last_good_model (10d TTL). This key is the only rollback target.
- On failure: reads last_good_model, overwrites current_model with it (rollback). Writes error_ledger row. Fires deduplicated Slack alert (grader:canary_alert:{provider}, 24h TTL).
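The canary's pass/fail check and rollback could be sketched as below. Key names and the 10d TTL are from this ADR. The response shape, the isValidVerdict callback (standing in for VerdictSchema.safeParse), and the "no_rollback_target" outcome are assumptions; the error_ledger write and Slack alert are omitted.

```typescript
const LAST_GOOD_TTL_SECONDS = 10 * 24 * 60 * 60; // 10d

interface KvLike {
  get(key: string): Promise<string | null>;
  put(key: string, value: string, options?: { expirationTtl?: number }): Promise<void>;
}

// Hypothetical, simplified response wrapper for one judge call.
interface CanaryResponse {
  status: number;
  body: {
    choices?: { message?: { tool_calls?: { function: { arguments: string } }[] } }[];
  };
}

// All three assertions: HTTP 200, arguments parse as JSON, verdict validates.
function canaryPassed(
  res: CanaryResponse,
  isValidVerdict: (value: unknown) => boolean,
): boolean {
  if (res.status !== 200) return false;
  const args = res.body.choices?.[0]?.message?.tool_calls?.[0]?.function.arguments;
  if (typeof args !== "string") return false;
  try {
    return isValidVerdict(JSON.parse(args));
  } catch {
    return false; // arguments were not valid JSON
  }
}

type CanaryOutcome = "promoted" | "rolled_back" | "no_rollback_target";

async function applyCanaryResult(
  kv: KvLike,
  provider: string,
  passed: boolean,
): Promise<CanaryOutcome> {
  const currentKey = `judge_config:${provider}:current_model`;
  const lastGoodKey = `judge_config:${provider}:last_good_model`;
  if (passed) {
    const current = await kv.get(currentKey);
    if (current !== null) {
      await kv.put(lastGoodKey, current, { expirationTtl: LAST_GOOD_TTL_SECONDS });
    }
    return "promoted";
  }
  const lastGood = await kv.get(lastGoodKey);
  if (lastGood === null) return "no_rollback_target"; // assumed handling
  await kv.put(currentKey, lastGood); // rollback
  return "rolled_back";
}
```

Because last_good_model is only written on a passing run, a rollback always lands on a model that satisfied all three assertions within the last 10 days.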
scripts/checks/21-judge-model-quality.ts
- Added reasoning-model pattern check with segment-boundary guard (avoids false positives on bare property names like 'reasoning').
- Token budget floor assertion: compiled default must be ≥ 4096.
Layer 3 — Per-judge 7d observability (weekly digest)
src/cron/bridge-controller.ts — runJudgeWeeklyAudit()
- Runs in the Monday 0 6 * * * slot alongside the existing weekly digest.
- Two queries (D1 cold path):
  - fetchJudgeVerdictCounts: counts rows in grader_verdicts per provider in last 7d (only successful verdicts are stored in that table; failures are tracked via canary).
  - fetchCanaryFailureCounts: counts error_ledger rows where provider LIKE 'judge-canary:%' in last 7d.
- Posts Slack info message with per-judge verdict counts and canary failure counts. Warning flag if any provider has canary failures > 0.
Layer 4 — Kill-switch (auto-rip if harden fails)
Trip conditions in evaluateKillSwitch() (weekly):
- Condition A: ≥2 providers with verdictCount === 0 AND canaryFailures ≥ 1 (grader producing nothing AND canary confirmed bad)
- Condition B: ≥2 providers with canaryFailures ≥ 5 (sustained canary failures)
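The two trip conditions reduce to a small pure function. This is a sketch under assumptions: the ProviderWeeklyStats shape and the return type are hypothetical, and alreadyTripped stands for the presence of the grader:kill_switch_tripped guard key.

```typescript
// Hypothetical per-provider input, as produced by the weekly audit queries.
interface ProviderWeeklyStats {
  provider: string;
  verdictCount: number;   // rows in grader_verdicts, last 7d
  canaryFailures: number; // judge-canary rows in error_ledger, last 7d
}

type KillSwitchDecision = { trip: false } | { trip: true; condition: "A" | "B" };

function evaluateKillSwitch(
  stats: ProviderWeeklyStats[],
  alreadyTripped: boolean,
): KillSwitchDecision {
  // Guard: do not re-trip while the 7d marker is still present.
  if (alreadyTripped) return { trip: false };

  // Condition A: at least 2 providers produced nothing AND failed the canary.
  const deadAndFailing = stats.filter(
    (s) => s.verdictCount === 0 && s.canaryFailures >= 1,
  ).length;
  if (deadAndFailing >= 2) return { trip: true, condition: "A" };

  // Condition B: at least 2 providers with sustained canary failures.
  const sustained = stats.filter((s) => s.canaryFailures >= 5).length;
  if (sustained >= 2) return { trip: true, condition: "B" };

  return { trip: false };
}
```

Note that a provider with even one verdict in the window never counts toward Condition A, which matches the strict verdictCount === 0 wording.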
On trip, tripKillSwitch():
- Sets api_config:grader.mode = 'single-judge-fallback' and api_config:grader.fallback_model = 'deepseek/deepseek-chat' in KV.
- Writes grader:kill_switch_tripped with 7d TTL (prevents re-trip within window).
- Fires critical Slack page via ctx.waitUntil(sendAlert(...)).
runTriJudge (future follow-on): reads mode flag → if 'single-judge-fallback',
routes to deepseek/deepseek-chat only (non-reasoning, confirmed working, accepts
vendor-bias temporarily). Full rip PR is auto-proposed if mode stays tripped at the
next weekly tick.
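Since the runTriJudge routing is a future follow-on, the following is only one possible shape for it. GraderConfig and selectJudgeModels are hypothetical names; the mode value and the fallback model ID are from this ADR.

```typescript
// Hypothetical subset of the api_config:grader value.
interface GraderConfig {
  mode?: string;
  fallback_model?: string;
}

const DEFAULT_FALLBACK_MODEL = "deepseek/deepseek-chat";

// Returns the list of models to call for one grading pass.
function selectJudgeModels(config: GraderConfig, triJudgeModels: string[]): string[] {
  if (config.mode === "single-judge-fallback") {
    // Kill-switch tripped: route to the single fallback judge only.
    return [config.fallback_model ?? DEFAULT_FALLBACK_MODEL];
  }
  return triJudgeModels;
}
```

Keeping the decision in a pure function like this would let the mode flag be unit-tested without touching KV or the judge endpoints.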
New KV Keys
| Key | Written by | Read by | TTL |
|---|---|---|---|
| judge_config:{p}:last_good_model | judge-canary on pass | judge-canary on fail (rollback) | 10d |
| grader:canary_alert:{p} | judge-canary on first fail | judge-canary (dedupe) | 24h |
| grader:kill_switch_tripped | tripKillSwitch | evaluateKillSwitch (guard) | 7d |
Consequences
Positive
- Reasoning models can no longer be selected as judge candidates.
- Even if a bad model slips through discovery, the nightly canary catches it and rolls back within 24h without manual intervention.
- TRUNCATED error code makes budget-exhaustion failures distinguishable in error_ledger from schema validation failures.
- Kill-switch guarantees no zombie grader: if the harden fails, the system degrades gracefully to a known-working single-judge path rather than silently producing zero verdicts for weeks.
Neutral
- Cron slot multiplexing means judge-canary runs sequentially after discovery in the same 0 6 * * * invocation rather than having its own 30 6 * * * slot. Ordering is preserved (discovery writes → canary validates). No functional difference.
- grader_verdicts schema unchanged. No migration needed.
Negative / Risks
- tokenBudgetForModel returns 8192 for reasoning models. If a reasoning model somehow slips past the discovery filter (new naming convention not matching the regex), it consumes more tokens per call. Acceptable: canary will catch and roll back within 24h.
- Kill-switch Condition A requires verdictCount === 0, not just low counts. A grader producing 1 verdict/week for a provider won't trip it. Accepted: the intent is to catch total failure, not low-quality output; low-quality output is caught by the canary integration test.
Invariants Preserved
- #2 (KV-only hot path): all new D1 access is in cron context.
- #9 (CF Cron): multiplexed into existing slots, no external cron services.
- #11 (30s AbortController): canary runner sets JUDGE_TIMEOUT_MS = 30_000 on every outbound fetch.
- #12 (LLM via AI Gateway): canary routes via resolveProviderBaseUrl (AI Gateway when env is configured, direct otherwise).
Stop Criterion
This ADR is satisfied when:
- Per-judge canary green 7 consecutive days.
- grader_verdicts 7d success rate ≥ 90% across all providers.
- Bridge weekly digest shows no judge in warning or red state.
If not met within 8 weeks → execute Layer 4 rip path (full grader removal, replace 4 consumers with rollback-rate + error-rate deltas).