Skip to content

Self-Healing Agent Harness — Grader, Engineering Pipeline, Bridge

ADR-034: Self-Healing Agent Harness — Grader, Engineering Pipeline, Bridge

Status: Proposed Date: 2026-05-01 Author: Claude Code (claude/quality-harness-and-knowledge-layer) Supersedes:Related: ADR-024 (OAuth 2.1), ADR-027 (Hosted-OSS-first routing), ADR-031 (Hermes), ADR-035 (KL+BF), ADR-036 (Risk + Rationale) Plan: V5-QUALITY-HARNESS-AND-KNOWLEDGE-LAYER.md


Context

V5 currently ships ~31 MCP tools to ~5 active surfaces (Claude Code, Cursor, Codex, ChatGPT, Hermes). Every error is logged to error_ledger. Every successful tool call is logged to kv_audit. Neither tells us whether the response was actually any good. The silent “agent gave a confident wrong answer” failure mode — the most damaging one for a B2B agent platform — is invisible to us today.

CREAO published their production self-healing harness pattern in April 2026. It maps almost 1:1 onto V5’s existing primitives:

  • AI Gateway → judges’ inference path.
  • D1 (cold path) → verdict storage.
  • CF Cron + CF Workflows → engineering pipeline.
  • Worker Versioning + Gradual Deployments (CF feature) → Bridge canary infrastructure.

We have everything required to build this. The only thing we lack is the harness itself.

The cost of waiting:

  • Every silent regression we ship erodes trust in V5.
  • Every undocumented quality drift makes “the MCP layer competitors can’t touch” claim less defensible.
  • Every manual deploy approval is a bottleneck.

The cost of building:

  • ~30 engineering hours over Phases 4-6.
  • ~$50/mo grader inference budget at current volume.

Decision

Build a three-component self-healing harness. All three components live inside the existing two-plane architecture (invariant 1). No new top-level Worker.

Component 1 — The Grader

Async tri-judge panel scoring every agent turn out-of-band.

  • Trigger. Fired via ctx.executionCtx.waitUntil() from src/handlers/mcp.ts after every tool-call response is returned to the user. Zero added latency to the user-facing response (invariant 10 preserved).
  • Categorical router (Job 0). Single-shot claude-haiku-4-5-20251001 call that maps the turn to one of 12 V5 domains: crm_read, crm_write, ads_read, ads_mutate, analytics_report, content_generate, email_send, calendar_op, search_query, data_query, agent_orchestration, error_recovery. Cached per message_id in KV for 1h.
  • Three judges, three model families (CREAO bias-mitigation). Parallel via AI Gateway:
    • claude-haiku-4-5-20251001 (Anthropic)
    • gpt-4o-mini (OpenAI)
    • gemini-1.5-flash (Google)
  • Schema-locked verdict via tool call submit_evaluation. Five required fields:
    • reasoning: string (2-3 sentences).
    • category: string (echoed for cross-validation).
    • quality: 'poor' | 'acceptable' | 'good' | 'excellent' → mapped to 1-4 scale.
    • issues: string[] drawn from 9-item taxonomy: incomplete, hallucination, tool_misuse, missed_context, verbose, wrong_domain, unsafe_action, format_violation, regression.
    • confidence: number (0-1 float).
  • Mathematical consensus. Average across surviving judges (timeouts allowed → quorum size flexes). Per-judge persistence so drift between judges is auditable.
  • Sampling. Per-model, not per-traffic.
    • Dominant model (the one serving the response): 10%.
    • Minority models: 100%.
    • First 7 days post-deploy: 100% for everything (cold-start baseline).
  • Storage. D1 table grader_verdicts. Cold path. Invariant 2 preserved.
  • Calibration. 5% of verdicts daily-sampled to a manual-review queue. Mishaal triages weekly. >0.3 persistent gap between judge consensus and human review = rubric bug.
  • Failure mode. Best-effort. AI Gateway down → verdict not persisted. Never blocks the user response. Never retries (invariant 3).
  • Cost. Budgeted at ~$0.0003 per grade × 10% sampling × current ~10K calls/day = ~$9/mo. Hard cap $50/mo via cost-tracker cron with auto-disable if exceeded.

Component 2 — The Engineering Pipeline

Six daily/hourly jobs that turn low scores into verified fixes.

  • Job 1 — Detect & Triage. CF Cron, hourly. Pulls last 24h verdicts with avg_quality < 2.5. Clusters by (category, tool_called, issue_type, error_signature). Scores each cluster on a 9-dimension severity engine. Above urgency cutoff → opens GitHub issue. Below → log to grader_clusters for trend tracking.
  • Job 2 — Investigate. CF Workflow (per invariant 9 — multi-step, idempotent steps via step.do). Triggered for top-3 clusters per Job 1 run. Walks error_ledger stack traces, pulls Workers logs (last 24h), checks recent deploys against the worker_version field on verdicts, queries D1 for related clusters. Posts root-cause hypothesis + evidence as an issue comment.
  • Job 3 — Auto-Fix. CF Workflow. Triggered only for high-confidence Job 2 results (confidence >= 0.85). Branches code, applies fix, runs npm run typecheck && npm test (offloaded to a GitHub Action via gh workflow run since CF Workers can’t run tsc), opens draft PR. Hard guardrails:
    • Max 3 PRs per run.
    • Auto-close any diff touching .env*, .github/, wrangler.toml, src/core/auth.ts, src/core/tokens.ts.
    • Type errors block PR creation.
    • Failing tests block PR creation.
  • Job 4 — Verify. CF Cron, every 6h. For tickets In Review: query verdicts for the past 6h in the same cluster. Zero recurrences → close with telemetry evidence. Still failing → update with new occurrence count.
  • Job 5 — Re-grade. Modifies sampling for closed clusters: 100% sampling for 24h post-close. Regression → reopen ticket + revert PR via gh pr revert.
  • Job 6 — Report. Nightly digest cron. Posts to Slack #v5-quality: clusters detected, PRs shipped, PRs reverted, score deltas per category, per-model leaderboard.

Component 3 — The Bridge

AI-gated grey rollouts on every deploy. No staging environment.

  • Routing. Cloudflare Workers Versioning + Gradual Deployments. Every wrangler deploy creates a new version; routing starts at 10% canary, 90% baseline.
  • Bridge controller. New cron src/cron/bridge-controller.ts, runs every 10 min while a canary is active. Pulls verdicts for the active canary, compares avg_quality to baseline (last 1000 pre-canary verdicts in the same category mix). Welch’s t-test, 200-interaction minimum window, p<0.05.
  • Decision tree.
    • avg_quality drop >= 0.15 (statistically significant) → abort: flip to 100% baseline, post Slack alert with cohort attached, file Job 1 input issue. Loop closed.
    • Hold or improvepromote: 10% → 20% → 50% → 100%, each step gated by a fresh window.

Consequences

Positive

  • First-ever quality signal on V5’s own production output. We become aware of regressions before customers report them.
  • Eval and QA fuse into one loop (CREAO thesis). No separate “ML evaluation team” vs “engineering QA team” — both are the same automated pipeline.
  • Deploy velocity goes up, not down. Manual approval bottleneck removed. Bridge does the safety check.
  • Compounding moat. Every customer interaction makes the grader’s per-category baselines tighter, which tightens every future canary decision.
  • Defensible commercial positioning. “V5 grades its own output and refuses to ship regressions” is a feature competitors can’t ship without rebuilding their stack.
  • Costs are tractable — $50/mo cap is far below the savings from one prevented production incident.

Negative

  • Engineering investment — ~30 hrs across Phases 4-6, plus ongoing rubric maintenance (~1 hr/wk).
  • Cost floor — even at low volume, the harness costs ~$9-50/mo.
  • Judge bias correlation risk — three frontier models may agree because of shared training data. Mitigated by weekly human calibration + persistent per-judge tracking.
  • False-positive abort risk on Bridge — a deploy with high but variant scores might trip abort. Mitigated by per-category thresholds + 200-interaction minimum window + p<0.05.
  • Auto-fix Job 3 risk surface — generated PRs could be wrong. Mitigated by hard allowlist on touchable files + type/test gates + Bridge catches anything that does ship.

Neutral

  • New D1 tables: grader_verdicts, grader_clusters, grader_calibration_samples. All cold path.
  • New cron triggers (5 added): grader-calibration, harness-job1-detect, harness-job4-verify, harness-job6-report, bridge-controller.
  • New CF Workflows (2 added): harness-investigate, harness-autofix.
  • Tool count unchanged by this ADR.

Alternatives considered

  1. Continue with error_ledger only. Rejected — silent quality regressions stay invisible. Already the status quo and already failing on the “is the answer good” question.
  2. Single-judge grader. Rejected — known self-preference bias when a Claude judge grades a Claude response. CREAO’s tri-judge mitigates this.
  3. Synchronous grader in the request path. Rejected — adds 500ms-2s of judge latency to every user response. Violates invariant 10.
  4. Run the harness as a third Worker. Rejected — invariant 1. The harness lives inside ascend-gateway-v5 (graders + Bridge) and ascend-context-worker (verdict aggregation if needed). Service Bindings between the two planes are sufficient.
  5. Use a third-party eval vendor (Braintrust, Helicone, Galileo). Rejected — vendor lock-in on the load-bearing quality signal of the product. Also, they don’t ship the engineering pipeline (Component 2) or the Bridge (Component 3) — only the Grader. Building all three in-house is the moat.
  6. Use Anthropic’s own MCP-side eval features. Anthropic does not currently ship a tri-judge eval primitive. If they ship one in Q3 2026, revisit.

Reversal criteria

Roll back the harness if any one holds for 30 consecutive days:

  • Tri-judge consensus correlates >0.95 with claude-haiku alone, indicating no panel value.
  • Bridge aborts ≥30% of canaries with no real regression (false-positive overload).
  • Cost exceeds $200/mo (4× budget).
  • Mishaal weekly triage burden exceeds 2 hrs/wk on average.

If reversed, retain the verdicts table for offline analysis even if scoring stops.


Acceptance

Mishaal Murawala approves the architecture and authorizes Phase 4 implementation upon plan-first PR merge.