Tri-Judge Grader: Harden (not Rip)

ADR-052 — Tri-Judge Grader: Harden (not Rip)

Date: 2026-05-15
Status: Accepted
Deciders: Mishaal Murawala
Supersedes: (none — first ADR for grader subsystem)


Context

The tri-judge grader (src/grader/*) + supporting crons (judge-model-discovery, grader-tick) has produced empty or timed-out verdicts across multiple sessions dating to the initial deploy. Four production consumers gate deploy-safety on grader_verdicts.quality:

| Consumer | File | How it uses quality |
| --- | --- | --- |
| Canary t-test | src/cron/bridge-controller.ts:232,255 | Welch’s t-test over quality scores |
| Regression detection | src/cron/harness-verify.ts:175 | Flags quality drops post-deploy |
| Auto-triage | src/cron/harness-triage.ts:331 | Severity routing |
| Baseline snapshot | scripts/post-deploy-record-version.ts:130 | Seeds baseline on promotion |

Ripping out the grader would blind deploy-canary detection for all four consumers. Decision: harden, with an explicit kill-switch that rips automatically if the hardening fails.


Root Causes (confirmed)

H1 (PRIME) — Token budget too small for reasoning-class models

src/grader/judges.ts had a hardcoded MAX_OUTPUT_TOKENS = 512. Reasoning models (e.g. deepseek/deepseek-r1) emit chain-of-thought (CoT) tokens before the visible tool_calls payload. The 512-token budget is consumed entirely by CoT, so the response ends with finish_reason: "length" and empty tool_calls. The failure then propagates silently: runOpenRouterJudge returns VALIDATION_ERROR "no tool_call returned", runTriJudge records null, and grader-tick skips the trace without raising an alert.

Direct empirical evidence: DeepSeek-flash attempt logged finish_reason=length with ≥1200 reasoning tokens visible and zero tool output.

H2 (PRIME) — Discovery cron selected reasoning-class models as judges

src/cron/judge-model-discovery.ts filtered OpenRouter /models?supported_parameters=tools by promptPrice > $0.30/M and selected the max-priced model per author. No exclusion of reasoning-class models. deepseek-r1 won because it was the priciest DeepSeek tools-capable model. The same logic also selected legacy openai/gpt-4 (over GPT-4.1) and x-ai/grok-3-beta (beta channel).

Additionally, the discovery filter only required supported_parameters includes 'tools'; it did not require 'tool_choice'. Missing tool_choice support is a second silent-failure path.

Secondary

  • api_config:grader key absent in KV — cron relied on hardcoded defaults.
  • No per-judge integration test — bad model selections only surfaced when grader-tick consumed them and silently skipped traces.
  • STALE_THRESHOLD_MS = 36h in resolve-judge-models.ts: with a 24h discovery cadence, a model selection could be 12h overdue before the stale alert fired, allowing stale models to linger undetected.

Decision

Four layers of hardening, shipped in one PR:

Layer 1 — Token budget + error code (structural fix)

src/grader/judges.ts

  • Replace MAX_OUTPUT_TOKENS = 512 with tokenBudgetForModel(model): number
    • Reasoning-class models: 8192 (covers CoT + tool_call envelope)
    • All other models: 4096 (was 512 — 8× increase for non-reasoning)
  • Reasoning detection: isReasoningModel(model) — regex /(^|[\/-])(r1|o[1-9]|reasoner|reasoning|thinking|thought)([\/-]|$)/i on model ID
  • New TRUNCATED error code: fires when finish_reason === 'length' AND tool_calls empty. Previously both paths returned generic VALIDATION_ERROR, masking budget exhaustion in error_ledger.
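
The Layer 1 changes to src/grader/judges.ts can be sketched as follows. This is a minimal illustration under stated assumptions, not the shipped code: the regex and the two budgets come from this ADR, while the JudgeChoice shape and the classifyMissingToolCall helper name are hypothetical.

```typescript
// Regex as stated in this ADR: reasoning markers must sit on a segment
// boundary (start, "/", "-", or end of the model ID).
const REASONING_PATTERN =
  /(^|[\/-])(r1|o[1-9]|reasoner|reasoning|thinking|thought)([\/-]|$)/i;

function isReasoningModel(model: string): boolean {
  return REASONING_PATTERN.test(model);
}

// 8192 for reasoning-class models (CoT + tool_call envelope), 4096 otherwise.
function tokenBudgetForModel(model: string): number {
  return isReasoningModel(model) ? 8192 : 4096;
}

// Hypothetical subset of a chat-completion choice: only the two fields
// the classification needs.
interface JudgeChoice {
  finish_reason: string | null;
  tool_calls?: unknown[];
}

type JudgeErrorCode = "TRUNCATED" | "VALIDATION_ERROR";

// Hypothetical helper: returns null when a tool call is present, otherwise
// the error code to record. TRUNCATED is reserved for budget exhaustion.
function classifyMissingToolCall(choice: JudgeChoice): JudgeErrorCode | null {
  const hasToolCall = (choice.tool_calls?.length ?? 0) > 0;
  if (hasToolCall) return null;
  return choice.finish_reason === "length" ? "TRUNCATED" : "VALIDATION_ERROR";
}
```

Keeping the classification separate from the HTTP call makes the TRUNCATED path testable without a live provider.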

src/cron/judge-model-discovery.ts

  • isReasoningModelId() filter: excludes reasoning-class models from judge candidates. Judges need fast structured output; CoT models are the wrong tool.
  • Require supported_parameters includes both 'tools' AND 'tool_choice'.
  • On startup: if api_config:grader absent in KV, seed default { enabled_providers: ['openai','xai','deepseek'], disabled_providers: [] }.
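
A rough sketch of the tightened discovery filter and the KV seeding rule described above. The ModelEntry shape is an assumed subset of what the OpenRouter model listing returns, and isJudgeCandidate and seedIfAbsent are hypothetical names; only isReasoningModelId, the regex, and the default config come from this ADR.

```typescript
// Assumed subset of a model listing entry; field names are an assumption.
interface ModelEntry {
  id: string; // e.g. "deepseek/deepseek-chat"
  supported_parameters: string[];
}

// Same segment-boundary regex this ADR gives for reasoning detection.
const REASONING_ID =
  /(^|[\/-])(r1|o[1-9]|reasoner|reasoning|thinking|thought)([\/-]|$)/i;

function isReasoningModelId(id: string): boolean {
  return REASONING_ID.test(id);
}

// Hypothetical predicate: a candidate must support BOTH tools and
// tool_choice, and must not be a reasoning-class model.
function isJudgeCandidate(m: ModelEntry): boolean {
  const params = new Set(m.supported_parameters);
  return (
    params.has("tools") &&
    params.has("tool_choice") &&
    !isReasoningModelId(m.id)
  );
}

// Default from this ADR, written only when api_config:grader is absent.
const DEFAULT_GRADER_CONFIG = {
  enabled_providers: ["openai", "xai", "deepseek"],
  disabled_providers: [] as string[],
};

// Hypothetical helper: takes the raw KV value (null when absent) and
// returns the value that should be stored. An existing value always wins.
function seedIfAbsent(existing: string | null): string {
  return existing ?? JSON.stringify(DEFAULT_GRADER_CONFIG);
}
```

Price filtering and per-author selection are left out here because this ADR does not change them.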

src/grader/resolve-judge-models.ts

  • STALE_THRESHOLD_MS from 36h → 30h (a 24h daily cadence plus 6h of slack, so a missed daily run is flagged sooner).
  • JudgeResolutionContext.onStale callback — caller (grader-tick) fires Slack alert via ctx.waitUntil() with per-provider KV dedupe (grader:stale_alert:{provider}, 1h TTL).
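
The staleness check and per-provider alert dedupe might look roughly like this. It is a sketch under assumptions: KVLike is a minimal stand-in for the KV binding, shouldFireStaleAlert is a hypothetical helper, and treating "stale" as strictly older than the threshold is a guess at the comparison used.

```typescript
const HOUR_MS = 60 * 60 * 1000;
const STALE_THRESHOLD_MS = 30 * HOUR_MS; // value from this ADR
const STALE_ALERT_TTL_SECONDS = 60 * 60; // 1h dedupe window from this ADR

// Assumption: a selection is stale once it is strictly older than the threshold.
function isStale(selectedAtMs: number, nowMs: number): boolean {
  return nowMs - selectedAtMs > STALE_THRESHOLD_MS;
}

// Dedupe key format from this ADR: grader:stale_alert:{provider}.
function staleAlertKey(provider: string): string {
  return `grader:stale_alert:${provider}`;
}

// Minimal stand-in for the two KV methods the sketch needs.
interface KVLike {
  get(key: string): Promise<string | null>;
  put(
    key: string,
    value: string,
    options?: { expirationTtl?: number },
  ): Promise<void>;
}

// Hypothetical helper: resolves to true at most once per provider per
// dedupe window, and only when the selection is actually stale.
async function shouldFireStaleAlert(
  kv: KVLike,
  provider: string,
  selectedAtMs: number,
  nowMs: number,
): Promise<boolean> {
  if (!isStale(selectedAtMs, nowMs)) return false;
  const key = staleAlertKey(provider);
  if ((await kv.get(key)) !== null) return false; // already alerted recently
  await kv.put(key, "1", { expirationTtl: STALE_ALERT_TTL_SECONDS });
  return true;
}
```

The caller would pass the resulting alert promise to ctx.waitUntil() so the Slack call does not block the tick.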

Layer 2 — Per-judge integration test (guard)

New: src/cron/judge-canary.ts

  • Multiplexed into 0 6 * * * cron slot, runs after judge-model-discovery within the same invocation (CF free plan is at 5/5 cron slots).
  • Reads judge_config:{provider}:current_model for all 3 providers.
  • Sends each a known-good fixture trace (HubSpot contacts fetch, unambiguous crm_read) with submit_evaluation tool and forced tool_choice.
  • Asserts: HTTP 200 · tool_calls[0].function.arguments parses as JSON · VerdictSchema.safeParse passes.
  • On success: writes judge_config:{provider}:last_good_model (10d TTL). This key is the only rollback target.
  • On failure: reads last_good_model, overwrites current_model with it (rollback). Writes error_ledger row. Fires deduplicated Slack alert (grader:canary_alert:{provider}, 24h TTL).
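
The canary's pass/fail handling can be sketched as a pure decision function, with KV writes, error_ledger rows, and Slack alerts left to the caller. All type and function names below are hypothetical, and the no_rollback_target branch is an assumption: this ADR does not say what happens when last_good_model has expired or was never written.

```typescript
// Hypothetical view of the two KV values the canary reads per provider.
interface CanaryState {
  currentModel: string; // judge_config:{provider}:current_model
  lastGoodModel: string | null; // judge_config:{provider}:last_good_model
}

type CanaryAction =
  | { kind: "record_good"; lastGoodModel: string }
  | { kind: "rollback"; currentModel: string }
  | { kind: "no_rollback_target" }; // assumption: nothing to roll back to

type ParseResult = { ok: true; value: unknown } | { ok: false };

// Second assertion from the list above: the tool-call arguments must
// parse as JSON. Schema validation would run on the parsed value.
function parseToolArguments(raw: string | undefined): ParseResult {
  if (raw === undefined) return { ok: false };
  try {
    return { ok: true, value: JSON.parse(raw) };
  } catch {
    return { ok: false };
  }
}

// On pass, the current model becomes the rollback target.
// On fail, current_model is overwritten with last_good_model if one exists.
function decideCanaryAction(state: CanaryState, passed: boolean): CanaryAction {
  if (passed) {
    return { kind: "record_good", lastGoodModel: state.currentModel };
  }
  if (state.lastGoodModel === null) {
    return { kind: "no_rollback_target" };
  }
  return { kind: "rollback", currentModel: state.lastGoodModel };
}
```

Separating the decision from the side effects keeps the rollback rule easy to unit-test without a KV namespace.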

scripts/checks/21-judge-model-quality.ts

  • Added reasoning-model pattern check with segment-boundary guard (avoids false-positive on bare property names like 'reasoning').
  • Token budget floor assertion: compiled default must be ≥ 4096.

Layer 3 — Per-judge 7d observability (weekly digest)

src/cron/bridge-controller.ts → runJudgeWeeklyAudit()

  • Runs on Mondays within the 0 6 * * * slot, alongside the existing weekly digest.
  • Two queries (D1 cold path):
    • fetchJudgeVerdictCounts — counts rows in grader_verdicts per provider in last 7d (only successful verdicts are stored in that table; failures are tracked via canary).
    • fetchCanaryFailureCounts — counts error_ledger rows where provider LIKE 'judge-canary:%' in last 7d.
  • Posts Slack info message with per-judge verdict counts and canary failure counts. Warning flag if any provider has canary failures > 0.

Layer 4 — Kill-switch (auto-rip if harden fails)

Trip conditions in evaluateKillSwitch() (weekly):

  • Condition A: ≥2 providers with verdictCount === 0 AND canaryFailures ≥ 1 (grader producing nothing AND canary confirmed bad)
  • Condition B: ≥2 providers with canaryFailures ≥ 5 (sustained canary failures)
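
The two trip conditions above can be sketched as a pure function over the weekly per-provider counts. The ProviderWeeklyStats shape, the alreadyTripped parameter (standing in for a read of grader:kill_switch_tripped), and the return type are assumptions for illustration; checking Condition A before Condition B is also an arbitrary choice.

```typescript
// Hypothetical per-provider summary built from the two weekly queries.
interface ProviderWeeklyStats {
  provider: string;
  verdictCount: number; // grader_verdicts rows in the last 7d
  canaryFailures: number; // canary error_ledger rows in the last 7d
}

interface KillSwitchDecision {
  trip: boolean;
  condition: "A" | "B" | null;
}

function evaluateKillSwitch(
  stats: ProviderWeeklyStats[],
  alreadyTripped: boolean,
): KillSwitchDecision {
  // Guard: the 7d grader:kill_switch_tripped key prevents a re-trip.
  if (alreadyTripped) return { trip: false, condition: null };

  // Condition A: no verdicts at all AND the canary confirmed a failure.
  const deadProviders = stats.filter(
    (s) => s.verdictCount === 0 && s.canaryFailures >= 1,
  ).length;
  if (deadProviders >= 2) return { trip: true, condition: "A" };

  // Condition B: sustained canary failures, regardless of verdict count.
  const failingProviders = stats.filter((s) => s.canaryFailures >= 5).length;
  if (failingProviders >= 2) return { trip: true, condition: "B" };

  return { trip: false, condition: null };
}
```

For example, two providers with zero verdicts and one canary failure each would trip under Condition A, while a single failing provider never trips the switch on its own.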

On trip, tripKillSwitch():

  1. Sets api_config:grader.mode = 'single-judge-fallback' and api_config:grader.fallback_model = 'deepseek/deepseek-chat' in KV.
  2. Writes grader:kill_switch_tripped with 7d TTL (prevents re-trip within window).
  3. Fires critical Slack page via ctx.waitUntil(sendAlert(...)).

runTriJudge (future follow-on): reads mode flag → if 'single-judge-fallback', routes to deepseek/deepseek-chat only (non-reasoning, confirmed working, accepts vendor-bias temporarily). Full rip PR is auto-proposed if mode stays tripped at the next weekly tick.


New KV Keys

| Key | Written by | Read by | TTL |
| --- | --- | --- | --- |
| judge_config:{p}:last_good_model | judge-canary on pass | judge-canary on fail (rollback) | 10d |
| grader:canary_alert:{p} | judge-canary on first fail | judge-canary (dedupe) | 24h |
| grader:kill_switch_tripped | tripKillSwitch | evaluateKillSwitch (guard) | 7d |

Consequences

Positive

  • Reasoning models can no longer be selected as judge candidates.
  • Even if a bad model slips through discovery, the nightly canary catches it and rolls back within 24h without manual intervention.
  • TRUNCATED error code makes budget-exhaustion failures distinguishable in error_ledger from schema validation failures.
  • Kill-switch guarantees no zombie grader: if the harden fails, the system degrades gracefully to a known-working single-judge path rather than silently producing zero verdicts for weeks.

Neutral

  • Cron slot multiplexing means judge-canary runs sequentially after discovery in the same 0 6 * * * invocation rather than in its own 30 6 * * * slot. Ordering is preserved (discovery writes → canary validates), so there is no functional difference.
  • grader_verdicts schema unchanged. No migration needed.

Negative / Risks

  • tokenBudgetForModel returns 8192 for reasoning models. If a reasoning model somehow slips past the discovery filter (new naming convention not matching the regex), it consumes more tokens per call. Acceptable: canary will catch and roll back within 24h.
  • Kill-switch Condition A requires verdictCount === 0, not just low counts. A grader producing 1 verdict/week for a provider won’t trip it. Accepted: intent is to catch total failure, not low-quality output — low-quality is caught by the canary integration test.

Invariants Preserved

  • #2 (KV-only hot path): all new D1 access is in cron context.
  • #9 (CF Cron): multiplexed into existing slots, no external cron services.
  • #11 (30s AbortController): canary runner sets JUDGE_TIMEOUT_MS = 30_000 on every outbound fetch.
  • #12 (LLM via AI Gateway): canary routes via resolveProviderBaseUrl (AI Gateway when env is configured, direct otherwise).

Stop Criterion

This ADR is satisfied when:

  1. Per-judge canary green 7 consecutive days.
  2. grader_verdicts 7d success rate ≥ 90% across all providers.
  3. Bridge weekly digest shows no judge in warning or red state.

If not met within 8 weeks → execute Layer 4 rip path (full grader removal, replace 4 consumers with rollback-rate + error-rate deltas).