ADR-034: Self-Healing Agent Harness — Grader, Engineering Pipeline, Bridge

Status: Proposed
Date: 2026-05-01
Author: Claude Code (claude/quality-harness-and-knowledge-layer)
Supersedes: none
Related: ADR-024 (OAuth 2.1), ADR-027 (Hosted-OSS-first routing), ADR-031 (Hermes), ADR-035 (KL+BF), ADR-036 (Risk + Rationale)
Plan: V5-QUALITY-HARNESS-AND-KNOWLEDGE-LAYER.md

Context

V5 currently ships ~31 MCP tools to ~5 active surfaces (Claude Code, Cursor, Codex, ChatGPT, Hermes). Every error is logged to error_ledger. Every successful tool call is logged to kv_audit. Neither tells us whether the response was actually any good. The silent “agent gave a confident wrong answer” failure mode — the most damaging one for a B2B agent platform — is invisible to us today.

CREAO published their production self-healing harness pattern in April 2026. It maps almost 1:1 onto V5’s existing primitives:

AI Gateway → judges’ inference path.
D1 (cold path) → verdict storage.
CF Cron + CF Workflows → engineering pipeline.
Worker Versioning + Gradual Deployments (CF feature) → Bridge canary infrastructure.

We have everything required to build this. The only thing we lack is the harness itself.

The cost of waiting:

Every silent regression we ship erodes trust in V5.
Every undocumented quality drift makes the claim of “the MCP layer competitors can’t touch” less defensible.
Every manual deploy approval is a bottleneck.

The cost of building:

~30 engineering hours over Phases 4-6.
~$50/mo grader inference budget at current volume.

Decision

Build a three-component self-healing harness. All three components live inside the existing two-plane architecture (invariant 1). No new top-level Worker.

Component 1 — The Grader

Async tri-judge panel scoring sampled agent turns out-of-band.

Trigger. Fired via ctx.executionCtx.waitUntil() from src/handlers/mcp.ts after every tool-call response is returned to the user. Zero added latency to the user-facing response (invariant 10 preserved).
Categorical router (Job 0). Single-shot claude-haiku-4-5-20251001 call that maps the turn to one of 12 V5 domains: crm_read, crm_write, ads_read, ads_mutate, analytics_report, content_generate, email_send, calendar_op, search_query, data_query, agent_orchestration, error_recovery. Cached per message_id in KV for 1h.
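The router's per-message caching could look roughly like the sketch below. It assumes a KV binding exposing Workers-KV-style `get` and `put` with an `expirationTtl` in seconds. `KVLike`, `routeTurn`, the `router:` key prefix, and the `classify` callback (standing in for the single-shot Haiku call) are illustrative names, and returning `null` for an unrecognized label is an assumption rather than something this ADR specifies.

```typescript
// Minimal subset of the Workers KV surface used here (assumed shape).
interface KVLike {
  get(key: string): Promise<string | null>;
  put(key: string, value: string, options?: { expirationTtl?: number }): Promise<void>;
}

// The 12 V5 domains listed in the router description.
const DOMAINS = [
  "crm_read", "crm_write", "ads_read", "ads_mutate", "analytics_report",
  "content_generate", "email_send", "calendar_op", "search_query",
  "data_query", "agent_orchestration", "error_recovery",
] as const;
type Domain = (typeof DOMAINS)[number];

const ROUTER_TTL_SECONDS = 60 * 60; // 1h cache window

function isDomain(value: string): value is Domain {
  return (DOMAINS as readonly string[]).includes(value);
}

// `classify` is a hypothetical stand-in for the router model call.
async function routeTurn(
  kv: KVLike,
  messageId: string,
  classify: () => Promise<string>,
): Promise<Domain | null> {
  const key = `router:${messageId}`; // key format is an assumption
  const cached = await kv.get(key);
  if (cached !== null && isDomain(cached)) return cached;

  const label = (await classify()).trim();
  if (!isDomain(label)) return null; // unknown label: do not cache it
  await kv.put(key, label, { expirationTtl: ROUTER_TTL_SECONDS });
  return label;
}
```

Validating the label against the fixed domain list before caching keeps a malformed router reply from being served for the next hour.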
Three judges, three model families (CREAO bias-mitigation). Parallel via AI Gateway:
- claude-haiku-4-5-20251001 (Anthropic)
- gpt-4o-mini (OpenAI)
- gemini-1.5-flash (Google)
Schema-locked verdict via tool call submit_evaluation. Five required fields:
- reasoning: string (2-3 sentences).
- category: string (echoed for cross-validation).
- quality: 'poor' | 'acceptable' | 'good' | 'excellent' → mapped to 1-4 scale.
- issues: string[] drawn from 9-item taxonomy: incomplete, hallucination, tool_misuse, missed_context, verbose, wrong_domain, unsafe_action, format_violation, regression.
- confidence: number (0-1 float).
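For illustration, the five fields and the 1-4 mapping can be written as a TypeScript type with a defensive parser. This is a sketch, not the actual `submit_evaluation` tool definition: `parseEvaluation` and `qualityScore` are hypothetical helpers, and rejecting a malformed payload outright (rather than repairing it) is an assumption.

```typescript
const QUALITY_SCORE = { poor: 1, acceptable: 2, good: 3, excellent: 4 } as const;
type Quality = keyof typeof QUALITY_SCORE;

const ISSUE_TAXONOMY = [
  "incomplete", "hallucination", "tool_misuse", "missed_context", "verbose",
  "wrong_domain", "unsafe_action", "format_violation", "regression",
] as const;
type Issue = (typeof ISSUE_TAXONOMY)[number];

interface Evaluation {
  reasoning: string;
  category: string;
  quality: Quality;
  issues: Issue[];
  confidence: number; // float in [0, 1]
}

// Returns a typed Evaluation, or null if the judge's payload breaks the schema.
function parseEvaluation(raw: unknown): Evaluation | null {
  if (typeof raw !== "object" || raw === null) return null;
  const { reasoning, category, quality, issues, confidence } = raw as Record<string, unknown>;
  if (typeof reasoning !== "string" || typeof category !== "string") return null;
  if (typeof quality !== "string" || !Object.keys(QUALITY_SCORE).includes(quality)) return null;
  if (!Array.isArray(issues)) return null;
  const known = ISSUE_TAXONOMY as readonly string[];
  if (!issues.every((i) => typeof i === "string" && known.includes(i))) return null;
  if (typeof confidence !== "number" || !(confidence >= 0 && confidence <= 1)) return null;
  return { reasoning, category, quality: quality as Quality, issues: issues as Issue[], confidence };
}

// Maps the categorical label onto the 1-4 scale used for averaging.
function qualityScore(e: Evaluation): number {
  return QUALITY_SCORE[e.quality];
}
```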
Mathematical consensus. Average across surviving judges (timeouts allowed → quorum size flexes). Per-judge persistence so drift between judges is auditable.
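A minimal sketch of the consensus step, assuming each judge yields either a 1-4 score or nothing (timeout or failure). The `JudgeResult` and `Consensus` shapes are illustrative, and returning `null` when no judge survives is an assumption about how an empty quorum would be handled.

```typescript
interface JudgeResult {
  judge: string;        // model identifier
  score: number | null; // 1-4 quality score, or null on timeout/failure
}

interface Consensus {
  avgQuality: number;
  quorum: number; // how many judges actually contributed
}

// Plain arithmetic mean over the judges that responded.
function consensus(results: JudgeResult[]): Consensus | null {
  const scores = results
    .map((r) => r.score)
    .filter((s): s is number => s !== null);
  if (scores.length === 0) return null; // nothing to average
  const sum = scores.reduce((acc, s) => acc + s, 0);
  return { avgQuality: sum / scores.length, quorum: scores.length };
}
```

With scores of 3 and 4 and one timed-out judge, this yields an `avgQuality` of 3.5 with a `quorum` of 2. Storing the quorum next to the average lets later analysis distinguish a three-judge verdict from a one-judge verdict.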
Sampling. Per-model, not per-traffic.
- Dominant model (the one serving the response): 10%.
- Minority models: 100%.
- First 7 days post-deploy: 100% for everything (cold-start baseline).
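The three sampling rules reduce to a small pure function. The sketch below assumes the caller already knows whether the responding model is the dominant one and when the current deploy went live; `SamplingInput`, `samplingRate`, and `shouldGrade` are hypothetical names.

```typescript
const COLD_START_MS = 7 * 24 * 60 * 60 * 1000; // first 7 days post-deploy

interface SamplingInput {
  isDominantModel: boolean; // whether the responding model is the dominant one
  deployedAt: number;       // epoch ms of the current deploy
  now: number;              // epoch ms
}

function samplingRate({ isDominantModel, deployedAt, now }: SamplingInput): number {
  if (now - deployedAt < COLD_START_MS) return 1.0; // cold-start baseline
  return isDominantModel ? 0.1 : 1.0;
}

// `rand` is injectable so the decision can be tested deterministically.
function shouldGrade(input: SamplingInput, rand: () => number = Math.random): boolean {
  return rand() < samplingRate(input);
}
```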
Storage. D1 table grader_verdicts. Cold path. Invariant 2 preserved.
Calibration. Each day, 5% of verdicts are sampled into a manual-review queue, which Mishaal triages weekly. A persistent gap of more than 0.3 between judge consensus and human review is treated as a rubric bug.
Failure mode. Best-effort. AI Gateway down → verdict not persisted. Never blocks the user response. Never retries (invariant 3).
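Cost. Budgeted at ~$0.0003 per grade × 10% sampling × ~10K calls/day ≈ $0.30/day, or ~$9/mo at steady state. The 100% cold-start week and 100% minority-model sampling run above that figure. Hard cap of $50/mo, enforced by the cost-tracker cron with auto-disable if exceeded.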
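The Trigger and Failure mode rules above (out-of-band, best-effort, no retries) can be sketched as a thin wrapper around `waitUntil`. `ExecutionContextLike` is a local stand-in for the Workers execution context, and `scheduleGrading` and `onDrop` are hypothetical names, not code from `src/handlers/mcp.ts`.

```typescript
// Minimal stand-in for the Workers execution context; only waitUntil is used.
interface ExecutionContextLike {
  waitUntil(promise: Promise<unknown>): void;
}

// Registers grading as background work. Any failure is swallowed and never
// retried, so a grader or gateway outage cannot surface in the user response.
function scheduleGrading(
  ctx: ExecutionContextLike,
  grade: () => Promise<void>,
  onDrop: (err: unknown) => void = () => {},
): void {
  ctx.waitUntil(
    Promise.resolve()
      .then(grade)
      .catch((err: unknown) => {
        onDrop(err); // e.g. bump a dropped-verdict counter (hypothetical)
      }),
  );
}
```

In the handler this would be called as something like `scheduleGrading(ctx.executionCtx, () => gradeTurn(turn))` while building the response, where `gradeTurn` is a hypothetical function. The runtime keeps the promise alive after the response has been sent, which is what keeps grading off the latency path.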

Component 2 — The Engineering Pipeline

Six scheduled or event-triggered jobs that turn low scores into verified fixes.

Job 1 — Detect & Triage. CF Cron, hourly. Pulls last 24h verdicts with avg_quality < 2.5. Clusters by (category, tool_called, issue_type, error_signature). Scores each cluster on a 9-dimension severity engine. Above urgency cutoff → opens GitHub issue. Below → log to grader_clusters for trend tracking.
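The filter-and-cluster step of Job 1 might be sketched as follows. The `Verdict` shape is an assumption (camelCase field names, one row per issue type), and the severity engine is left out because its nine dimensions are not defined here.

```typescript
interface Verdict {
  category: string;
  toolCalled: string;
  issueType: string;
  errorSignature: string;
  avgQuality: number; // consensus score on the 1-4 scale
}

const LOW_QUALITY_THRESHOLD = 2.5;

// Groups low-scoring verdicts by (category, tool, issue type, error signature).
function clusterLowScores(verdicts: Verdict[]): Map<string, Verdict[]> {
  const clusters = new Map<string, Verdict[]>();
  for (const v of verdicts) {
    if (v.avgQuality >= LOW_QUALITY_THRESHOLD) continue; // only avg_quality < 2.5
    // JSON-encode the tuple so separators inside values cannot collide.
    const key = JSON.stringify([v.category, v.toolCalled, v.issueType, v.errorSignature]);
    const bucket = clusters.get(key);
    if (bucket) bucket.push(v);
    else clusters.set(key, [v]);
  }
  return clusters;
}
```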
Job 2 — Investigate. CF Workflow (per invariant 9 — multi-step, idempotent steps via step.do). Triggered for top-3 clusters per Job 1 run. Walks error_ledger stack traces, pulls Workers logs (last 24h), checks recent deploys against the worker_version field on verdicts, queries D1 for related clusters. Posts root-cause hypothesis + evidence as an issue comment.
Job 3 — Auto-Fix. CF Workflow. Triggered only for high-confidence Job 2 results (confidence >= 0.85). Branches code, applies fix, runs npm run typecheck && npm test (offloaded to a GitHub Action via gh workflow run since CF Workers can’t run tsc), opens draft PR. Hard guardrails:
- Max 3 PRs per run.
- Auto-close any diff touching .env*, .github/, wrangler.toml, src/core/auth.ts, src/core/tokens.ts.
- Type errors block PR creation.
- Failing tests block PR creation.
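The four guardrails can be expressed as one pre-flight check. This sketch folds the auto-close rule into the same gate as the PR-creation blocks, which is a simplification. Matching `.env*` on the file's basename and `.github/` as a root prefix is an assumed interpretation of the patterns, and `PrGate` and `checkPrGate` are hypothetical names.

```typescript
const PROTECTED_FILES = ["wrangler.toml", "src/core/auth.ts", "src/core/tokens.ts"];
const MAX_PRS_PER_RUN = 3;

function touchesProtectedPath(changedFiles: string[]): boolean {
  return changedFiles.some((file) => {
    const path = file.replace(/^\.\//, "");
    const basename = path.split("/").pop() ?? "";
    return (
      basename.startsWith(".env") ||   // .env, .env.local, ...
      path.startsWith(".github/") ||   // workflows and repo config
      PROTECTED_FILES.includes(path)
    );
  });
}

interface PrGate {
  changedFiles: string[];
  typecheckPassed: boolean;
  testsPassed: boolean;
  prsOpenedThisRun: number;
}

type GateResult = { allowed: true } | { allowed: false; reason: string };

function checkPrGate(g: PrGate): GateResult {
  if (g.prsOpenedThisRun >= MAX_PRS_PER_RUN) return { allowed: false, reason: "pr_cap_reached" };
  if (touchesProtectedPath(g.changedFiles)) return { allowed: false, reason: "protected_path" };
  if (!g.typecheckPassed) return { allowed: false, reason: "type_errors" };
  if (!g.testsPassed) return { allowed: false, reason: "failing_tests" };
  return { allowed: true };
}
```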
Job 4 — Verify. CF Cron, every 6h. For tickets In Review: query verdicts for the past 6h in the same cluster. Zero recurrences → close with telemetry evidence. Still failing → update with new occurrence count.
Job 5 — Re-grade. Modifies sampling for closed clusters: 100% sampling for 24h post-close. Regression → reopen ticket + revert PR via gh pr revert.
Job 6 — Report. Nightly digest cron. Posts to Slack #v5-quality: clusters detected, PRs shipped, PRs reverted, score deltas per category, per-model leaderboard.

Component 3 — The Bridge

AI-gated grey rollouts on every deploy. No staging environment.

Routing. Cloudflare Workers Versioning + Gradual Deployments. Every wrangler deploy creates a new version; routing starts at 10% canary, 90% baseline.
Bridge controller. New cron src/cron/bridge-controller.ts, runs every 10 min while a canary is active. Pulls verdicts for the active canary, compares avg_quality to baseline (last 1000 pre-canary verdicts in the same category mix). Welch’s t-test, 200-interaction minimum window, p<0.05.
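As a rough illustration of the comparison, the sketch below computes Welch's t statistic and the Welch-Satterthwaite degrees of freedom for canary versus baseline scores. Two points are assumptions rather than part of this ADR: the p-value is two-sided, and it uses a normal approximation in place of the Student's t CDF, which is reasonable only because the 200-interaction minimum keeps the degrees of freedom large. A production controller would more likely use a statistics library with an exact t distribution.

```typescript
interface WelchResult {
  meanDiff: number;  // canary mean minus baseline mean
  t: number;
  df: number;        // Welch-Satterthwaite degrees of freedom
  pTwoSided: number; // normal approximation (assumption, see lead-in)
}

function mean(xs: number[]): number {
  return xs.reduce((a, x) => a + x, 0) / xs.length;
}

function sampleVariance(xs: number[], m: number): number {
  return xs.reduce((a, x) => a + (x - m) ** 2, 0) / (xs.length - 1);
}

// Standard normal CDF via the Abramowitz-Stegun 7.1.26 erf approximation.
function normalCdf(z: number): number {
  const x = Math.abs(z) / Math.SQRT2;
  const t = 1 / (1 + 0.3275911 * x);
  const poly =
    t * (0.254829592 + t * (-0.284496736 + t * (1.421413741 + t * (-1.453152027 + t * 1.061405429))));
  const erf = 1 - poly * Math.exp(-x * x);
  return z >= 0 ? 0.5 * (1 + erf) : 0.5 * (1 - erf);
}

// Requires at least two observations per sample and non-zero pooled variance.
function welchTest(canary: number[], baseline: number[]): WelchResult {
  const m1 = mean(canary);
  const m2 = mean(baseline);
  const v1 = sampleVariance(canary, m1) / canary.length;     // s1^2 / n1
  const v2 = sampleVariance(baseline, m2) / baseline.length; // s2^2 / n2
  const t = (m1 - m2) / Math.sqrt(v1 + v2);
  const df = (v1 + v2) ** 2 / (v1 ** 2 / (canary.length - 1) + v2 ** 2 / (baseline.length - 1));
  const pTwoSided = 2 * (1 - normalCdf(Math.abs(t)));
  return { meanDiff: m1 - m2, t, df, pTwoSided };
}
```

Welch's variant is the natural fit here because the canary window (200 or more verdicts) and the baseline (1000 verdicts) differ in size and need not share a variance.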
Decision tree.
- avg_quality drop >= 0.15 (statistically significant) → abort: flip to 100% baseline, post a Slack alert with the cohort attached, and file an issue as input to Job 1. Loop closed.
- Hold or improve → promote: 10% → 20% → 50% → 100%, each step gated by a fresh window.
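The decision tree can be sketched as a pure function over one window's statistics. Names such as `WindowStats`, `BridgeAction`, and `decide` are hypothetical. The `wait` outcome for an under-filled window and the `done` outcome at 100% are assumptions added to make the function total, and any window that does not meet both abort conditions is treated as "hold".

```typescript
const PROMOTION_LADDER = [10, 20, 50, 100] as const; // canary traffic percentages
const MIN_WINDOW = 200;  // minimum graded interactions per window
const ABORT_DROP = 0.15; // avg_quality drop that triggers abort
const ALPHA = 0.05;      // significance level

interface WindowStats {
  canaryPercent: number; // current canary share of traffic
  interactions: number;  // graded canary interactions in this window
  canaryAvg: number;
  baselineAvg: number;
  pValue: number;        // from the significance test on this window
}

type BridgeAction =
  | { kind: "wait" }
  | { kind: "abort"; drop: number }
  | { kind: "promote"; toPercent: number }
  | { kind: "done" };

function decide(w: WindowStats): BridgeAction {
  if (w.interactions < MIN_WINDOW) return { kind: "wait" }; // window not yet full
  const drop = w.baselineAvg - w.canaryAvg;
  if (drop >= ABORT_DROP && w.pValue < ALPHA) return { kind: "abort", drop };
  const next = PROMOTION_LADDER.find((p) => p > w.canaryPercent);
  return next === undefined ? { kind: "done" } : { kind: "promote", toPercent: next };
}
```

Keeping the decision separate from the side effects (routing flip, Slack alert, issue filing) makes the abort logic straightforward to unit-test against recorded windows.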

Consequences

Positive

First-ever quality signal on V5’s own production output. We become aware of regressions before customers report them.
Eval and QA fuse into one loop (CREAO thesis). No separate “ML evaluation team” vs “engineering QA team” — both are the same automated pipeline.
Deploy velocity goes up, not down. Manual approval bottleneck removed. Bridge does the safety check.
Compounding moat. Every customer interaction makes the grader’s per-category baselines tighter, which tightens every future canary decision.
Defensible commercial positioning. “V5 grades its own output and refuses to ship regressions” is a feature competitors can’t ship without rebuilding their stack.
Costs are tractable — $50/mo cap is far below the savings from one prevented production incident.

Negative

Engineering investment — ~30 hrs across Phases 4-6, plus ongoing rubric maintenance (~1 hr/wk).
Cost floor — even at low volume, the harness costs ~$9-50/mo.
Judge bias correlation risk — three frontier models may agree because of shared training data. Mitigated by weekly human calibration + persistent per-judge tracking.
False-positive abort risk on Bridge. A deploy whose scores are high on average but highly variable might trip abort. Mitigated by per-category thresholds, the 200-interaction minimum window, and the p<0.05 requirement.
Auto-fix Job 3 risk surface. Generated PRs could be wrong. Mitigated by the hard denylist of protected paths, the type/test gates, and the Bridge, which catches anything that does ship.

Neutral

New D1 tables: grader_verdicts, grader_clusters, grader_calibration_samples. All cold path.
New cron triggers (5 added): grader-calibration, harness-job1-detect, harness-job4-verify, harness-job6-report, bridge-controller.
New CF Workflows (2 added): harness-investigate, harness-autofix.
Tool count unchanged by this ADR.

Alternatives considered

Continue with error_ledger only. Rejected — silent quality regressions stay invisible. Already the status quo and already failing on the “is the answer good” question.
Single-judge grader. Rejected — known self-preference bias when a Claude judge grades a Claude response. CREAO’s tri-judge mitigates this.
Synchronous grader in the request path. Rejected — adds 500ms-2s of judge latency to every user response. Violates invariant 10.
Run the harness as a third Worker. Rejected — invariant 1. The harness lives inside ascend-gateway-v5 (graders + Bridge) and ascend-context-worker (verdict aggregation if needed). Service Bindings between the two planes are sufficient.
Use a third-party eval vendor (Braintrust, Helicone, Galileo). Rejected — vendor lock-in on the load-bearing quality signal of the product. Also, they don’t ship the engineering pipeline (Component 2) or the Bridge (Component 3) — only the Grader. Building all three in-house is the moat.
Use Anthropic’s own MCP-side eval features. Not viable today: Anthropic does not currently ship a tri-judge eval primitive. Revisit if they ship one in Q3 2026.

Reversal criteria

Roll back the harness if any one of the following holds for 30 consecutive days:

Tri-judge consensus correlates >0.95 with claude-haiku alone, indicating no panel value.
Bridge aborts ≥30% of canaries with no real regression (false-positive overload).
Cost exceeds $200/mo (4× budget).
Mishaal weekly triage burden exceeds 2 hrs/wk on average.
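The first criterion needs a concrete correlation measure, which this ADR does not name. The sketch below assumes Pearson's r over paired per-turn scores (tri-judge consensus versus the Haiku judge alone); `pearson` is an illustrative helper, and a rank correlation would be an equally defensible choice.

```typescript
// Pearson correlation coefficient for two equal-length samples.
// Returns NaN if either sample has zero variance.
function pearson(xs: number[], ys: number[]): number {
  if (xs.length !== ys.length || xs.length < 2) {
    throw new Error("need two equal-length samples of size >= 2");
  }
  const n = xs.length;
  const mx = xs.reduce((a, x) => a + x, 0) / n;
  const my = ys.reduce((a, y) => a + y, 0) / n;
  let sxy = 0;
  let sxx = 0;
  let syy = 0;
  for (let i = 0; i < n; i++) {
    const dx = xs[i] - mx;
    const dy = ys[i] - my;
    sxy += dx * dy;
    sxx += dx * dx;
    syy += dy * dy;
  }
  return sxy / Math.sqrt(sxx * syy);
}

// True when the panel adds little beyond the single judge (criterion threshold 0.95).
function panelAddsNoValue(consensusScores: number[], haikuScores: number[]): boolean {
  return pearson(consensusScores, haikuScores) > 0.95;
}
```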

If reversed, retain the grader_verdicts table for offline analysis even if scoring stops.

Acceptance

Mishaal Murawala approves the architecture and authorizes Phase 4 implementation upon plan-first PR merge.