Self-Healing Agent Harness — Grader, Engineering Pipeline, Bridge
ADR-034: Self-Healing Agent Harness — Grader, Engineering Pipeline, Bridge
Status: Proposed Date: 2026-05-01 Author: Claude Code (claude/quality-harness-and-knowledge-layer) Supersedes: — Related: ADR-024 (OAuth 2.1), ADR-027 (Hosted-OSS-first routing), ADR-031 (Hermes), ADR-035 (KL+BF), ADR-036 (Risk + Rationale) Plan: V5-QUALITY-HARNESS-AND-KNOWLEDGE-LAYER.md
Context
V5 currently ships ~31 MCP tools to ~5 active surfaces (Claude Code, Cursor, Codex, ChatGPT, Hermes). Every error is logged to error_ledger. Every successful tool call is logged to kv_audit. Neither tells us whether the response was actually any good. The silent “agent gave a confident wrong answer” failure mode — the most damaging one for a B2B agent platform — is invisible to us today.
CREAO published their production self-healing harness pattern in April 2026. It maps almost 1:1 onto V5’s existing primitives:
- AI Gateway → judges’ inference path.
- D1 (cold path) → verdict storage.
- CF Cron + CF Workflows → engineering pipeline.
- Worker Versioning + Gradual Deployments (CF feature) → Bridge canary infrastructure.
We have everything required to build this. The only thing we lack is the harness itself.
The cost of waiting:
- Every silent regression we ship erodes trust in V5.
- Every undocumented quality drift makes “the MCP layer competitors can’t touch” claim less defensible.
- Every manual deploy approval is a bottleneck.
The cost of building:
- ~30 engineering hours over Phases 4-6.
- ~$50/mo grader inference budget at current volume.
Decision
Build a three-component self-healing harness. All three components live inside the existing two-plane architecture (invariant 1). No new top-level Worker.
Component 1 — The Grader
Async tri-judge panel scoring every agent turn out-of-band.
- Trigger. Fired via
ctx.executionCtx.waitUntil()fromsrc/handlers/mcp.tsafter every tool-call response is returned to the user. Zero added latency to the user-facing response (invariant 10 preserved). - Categorical router (Job 0). Single-shot
claude-haiku-4-5-20251001call that maps the turn to one of 12 V5 domains:crm_read,crm_write,ads_read,ads_mutate,analytics_report,content_generate,email_send,calendar_op,search_query,data_query,agent_orchestration,error_recovery. Cached permessage_idin KV for 1h. - Three judges, three model families (CREAO bias-mitigation). Parallel via AI Gateway:
claude-haiku-4-5-20251001(Anthropic)gpt-4o-mini(OpenAI)gemini-1.5-flash(Google)
- Schema-locked verdict via tool call
submit_evaluation. Five required fields:reasoning: string(2-3 sentences).category: string(echoed for cross-validation).quality: 'poor' | 'acceptable' | 'good' | 'excellent'→ mapped to 1-4 scale.issues: string[]drawn from 9-item taxonomy:incomplete,hallucination,tool_misuse,missed_context,verbose,wrong_domain,unsafe_action,format_violation,regression.confidence: number(0-1 float).
- Mathematical consensus. Average across surviving judges (timeouts allowed → quorum size flexes). Per-judge persistence so drift between judges is auditable.
- Sampling. Per-model, not per-traffic.
- Dominant model (the one serving the response): 10%.
- Minority models: 100%.
- First 7 days post-deploy: 100% for everything (cold-start baseline).
- Storage. D1 table
grader_verdicts. Cold path. Invariant 2 preserved. - Calibration. 5% of verdicts daily-sampled to a manual-review queue. Mishaal triages weekly. >0.3 persistent gap between judge consensus and human review = rubric bug.
- Failure mode. Best-effort. AI Gateway down → verdict not persisted. Never blocks the user response. Never retries (invariant 3).
- Cost. Budgeted at ~$0.0003 per grade × 10% sampling × current ~10K calls/day = ~$9/mo. Hard cap $50/mo via cost-tracker cron with auto-disable if exceeded.
Component 2 — The Engineering Pipeline
Six daily/hourly jobs that turn low scores into verified fixes.
- Job 1 — Detect & Triage. CF Cron, hourly. Pulls last 24h verdicts with
avg_quality < 2.5. Clusters by(category, tool_called, issue_type, error_signature). Scores each cluster on a 9-dimension severity engine. Above urgency cutoff → opens GitHub issue. Below → log tograder_clustersfor trend tracking. - Job 2 — Investigate. CF Workflow (per invariant 9 — multi-step, idempotent steps via
step.do). Triggered for top-3 clusters per Job 1 run. Walkserror_ledgerstack traces, pulls Workers logs (last 24h), checks recent deploys against theworker_versionfield on verdicts, queries D1 for related clusters. Posts root-cause hypothesis + evidence as an issue comment. - Job 3 — Auto-Fix. CF Workflow. Triggered only for high-confidence Job 2 results (
confidence >= 0.85). Branches code, applies fix, runsnpm run typecheck && npm test(offloaded to a GitHub Action viagh workflow runsince CF Workers can’t run tsc), opens draft PR. Hard guardrails:- Max 3 PRs per run.
- Auto-close any diff touching
.env*,.github/,wrangler.toml,src/core/auth.ts,src/core/tokens.ts. - Type errors block PR creation.
- Failing tests block PR creation.
- Job 4 — Verify. CF Cron, every 6h. For tickets
In Review: query verdicts for the past 6h in the same cluster. Zero recurrences → close with telemetry evidence. Still failing → update with new occurrence count. - Job 5 — Re-grade. Modifies sampling for closed clusters: 100% sampling for 24h post-close. Regression → reopen ticket + revert PR via
gh pr revert. - Job 6 — Report. Nightly digest cron. Posts to Slack
#v5-quality: clusters detected, PRs shipped, PRs reverted, score deltas per category, per-model leaderboard.
Component 3 — The Bridge
AI-gated grey rollouts on every deploy. No staging environment.
- Routing. Cloudflare Workers Versioning + Gradual Deployments. Every
wrangler deploycreates a new version; routing starts at 10% canary, 90% baseline. - Bridge controller. New cron
src/cron/bridge-controller.ts, runs every 10 min while a canary is active. Pulls verdicts for the active canary, compares avg_quality to baseline (last 1000 pre-canary verdicts in the same category mix). Welch’s t-test, 200-interaction minimum window, p<0.05. - Decision tree.
avg_quality drop >= 0.15(statistically significant) → abort: flip to 100% baseline, post Slack alert with cohort attached, file Job 1 input issue. Loop closed.- Hold or improve → promote: 10% → 20% → 50% → 100%, each step gated by a fresh window.
Consequences
Positive
- First-ever quality signal on V5’s own production output. We become aware of regressions before customers report them.
- Eval and QA fuse into one loop (CREAO thesis). No separate “ML evaluation team” vs “engineering QA team” — both are the same automated pipeline.
- Deploy velocity goes up, not down. Manual approval bottleneck removed. Bridge does the safety check.
- Compounding moat. Every customer interaction makes the grader’s per-category baselines tighter, which tightens every future canary decision.
- Defensible commercial positioning. “V5 grades its own output and refuses to ship regressions” is a feature competitors can’t ship without rebuilding their stack.
- Costs are tractable — $50/mo cap is far below the savings from one prevented production incident.
Negative
- Engineering investment — ~30 hrs across Phases 4-6, plus ongoing rubric maintenance (~1 hr/wk).
- Cost floor — even at low volume, the harness costs ~$9-50/mo.
- Judge bias correlation risk — three frontier models may agree because of shared training data. Mitigated by weekly human calibration + persistent per-judge tracking.
- False-positive abort risk on Bridge — a deploy with high but variant scores might trip abort. Mitigated by per-category thresholds + 200-interaction minimum window + p<0.05.
- Auto-fix Job 3 risk surface — generated PRs could be wrong. Mitigated by hard allowlist on touchable files + type/test gates + Bridge catches anything that does ship.
Neutral
- New D1 tables:
grader_verdicts,grader_clusters,grader_calibration_samples. All cold path. - New cron triggers (5 added): grader-calibration, harness-job1-detect, harness-job4-verify, harness-job6-report, bridge-controller.
- New CF Workflows (2 added): harness-investigate, harness-autofix.
- Tool count unchanged by this ADR.
Alternatives considered
- Continue with
error_ledgeronly. Rejected — silent quality regressions stay invisible. Already the status quo and already failing on the “is the answer good” question. - Single-judge grader. Rejected — known self-preference bias when a Claude judge grades a Claude response. CREAO’s tri-judge mitigates this.
- Synchronous grader in the request path. Rejected — adds 500ms-2s of judge latency to every user response. Violates invariant 10.
- Run the harness as a third Worker. Rejected — invariant 1. The harness lives inside
ascend-gateway-v5(graders + Bridge) andascend-context-worker(verdict aggregation if needed). Service Bindings between the two planes are sufficient. - Use a third-party eval vendor (Braintrust, Helicone, Galileo). Rejected — vendor lock-in on the load-bearing quality signal of the product. Also, they don’t ship the engineering pipeline (Component 2) or the Bridge (Component 3) — only the Grader. Building all three in-house is the moat.
- Use Anthropic’s own MCP-side eval features. Anthropic does not currently ship a tri-judge eval primitive. If they ship one in Q3 2026, revisit.
Reversal criteria
Roll back the harness if any one holds for 30 consecutive days:
- Tri-judge consensus correlates >0.95 with
claude-haikualone, indicating no panel value. - Bridge aborts ≥30% of canaries with no real regression (false-positive overload).
- Cost exceeds $200/mo (4× budget).
- Mishaal weekly triage burden exceeds 2 hrs/wk on average.
If reversed, retain the verdicts table for offline analysis even if scoring stops.
Acceptance
Mishaal Murawala approves the architecture and authorizes Phase 4 implementation upon plan-first PR merge.