ADR-036: Tool Risk Ratings, Rationale Field, and Agent Feedback Tool

Status: Proposed
Date: 2026-05-01
Author: Claude Code (claude/quality-harness-and-knowledge-layer)
Supersedes: none
Related: ADR-024 (OAuth 2.1), ADR-030 (Universal MCP Access), ADR-034 (Harness), ADR-035 (KL+BF)
Plan: V5-QUALITY-HARNESS-AND-KNOWLEDGE-LAYER.md

Context

Three findings from production agent platforms (Ramp, OpenAI, multiple early-V5 user reports) point to the same gap: V5 today does not give calling agents (Claude Code, Cursor, ChatGPT, Codex, Hermes) enough structure to reliably succeed.

OpenAI’s Practical Guide to Building Agents (2025) §Tool safeguards. Every tool should carry a risk rating (low / med / high). High-risk tools should automatically pause for confirmation or escalate to a human. V5 today executes aws_ses_send (sends real customer emails) on the same code path as ga4_report (read-only).
Ramp’s “Designing for Agents” (Apr 2026, Teddy Riker). Ramp shipped two near-zero-cost upgrades to its MCP that materially improved agent success: (a) every tool requires a rationale parameter, used to reconstruct intent from chat traffic Ramp can’t see; (b) a standalone submit_feedback tool that calling agents invoke when blocked. Ramp learns which product features to build directly from that feedback. V5 today has neither.
Notion’s MCP tool description pattern. notion-create-pages opens with: “For the complete Markdown specification, always first fetch the MCP resource at notion://docs/enhanced-markdown-spec. Do NOT guess or hallucinate Markdown syntax.” Notion MCP writes made through Claude succeed on the first attempt. V5’s tool descriptions today are generic, and agents routinely hallucinate property names on hubspot_crm, salesforce_crm, and meta_ads writes.

These three findings call for independent fixes that share the same shape: small upgrades to every existing tool that compound dramatically across the 31-tool surface.

Decision

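Four coordinated upgrades to the MCP tool surface. Decisions 2 through 4 are additive and non-breaking for existing callers. Decision 1 is non-breaking for MCP callers, which read schemas from the server, but it makes rationale required for non-MCP REST callers after a two-week grace period (see Consequences, Negative).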

Decision 1 — `rationale` parameter required on every MCP tool

Add to every Zod input schema in src/tools/*.ts:

rationale: z.string().min(10).max(500).describe(
  "One-sentence explanation of why you are making this call. " +
  "Used by V5 to reconstruct intent for product analytics — never shown to end users. " +
  "Required."
)

Persist rationale to kv_audit (D1) on every call.
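Validation. Enforced by the Zod schema itself: rationale is a non-optional field, so a call that omits it fails parsing and returns VALIDATION_ERROR at the gateway. Zod .strict() (already an invariant) is a separate check that rejects unknown keys.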
Cost. ~10 input tokens per call. Negligible. Rationale is logged but does not modify upstream behavior.
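
As a dependency-free illustration of the gateway behavior described above, the sketch below mirrors the schema's length bounds in plain TypeScript. The names `checkRationale` and `ValidationResult` are hypothetical; in the real gateway the Zod schema would do this work, so treat this as a reading aid rather than the implementation.

```typescript
// Hypothetical mirror of the rationale constraints from the Zod schema
// (string, min 10, max 500). Names here are illustrative, not V5's API.
type ValidationResult =
  | { ok: true; rationale: string }
  | { ok: false; code: "VALIDATION_ERROR"; message: string };

function checkRationale(input: Record<string, unknown>): ValidationResult {
  const value = input["rationale"];
  // A missing or non-string rationale fails, just as a required Zod field would.
  if (typeof value !== "string") {
    return { ok: false, code: "VALIDATION_ERROR", message: "rationale is required" };
  }
  if (value.length < 10 || value.length > 500) {
    return {
      ok: false,
      code: "VALIDATION_ERROR",
      message: "rationale must be between 10 and 500 characters",
    };
  }
  return { ok: true, rationale: value };
}
```

A call such as `checkRationale({ query: "sessions" })` fails with VALIDATION_ERROR because the field is absent, not because of any unknown-key check.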

Decision 2 — `submit_feedback` MCP tool

New tool, registered in src/handlers/mcp.ts.

Inputs:

{
  attempted_action: string;       // "Tried to create a HubSpot deal with custom property X"
  expected_outcome: string;       // "Deal created with property X populated"
  actual_outcome: string;         // "API returned error: property X does not exist on deal object"
  tool_called?: string;           // "hubspot_crm"
  error_seen?: string;            // raw error message
  severity: 'low' | 'med' | 'high';
  rationale: string;              // (required by Decision 1)
}

Persists to D1 agent_feedback (migration 0009_agent_feedback.sql).
Admin endpoint GET /admin/feedback?since=&severity= for weekly triage.
Mishaal triages weekly. Patterns become product features.
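
The triage query can be sketched in memory to show what the `since` and `severity` filters are meant to do. The `created_at` field, `filterFeedback`, and `countByTool` are assumptions for illustration; the ADR does not specify the table's timestamp column or any grouping helper.

```typescript
type Severity = "low" | "med" | "high";

interface FeedbackRow {
  attempted_action: string;
  expected_outcome: string;
  actual_outcome: string;
  tool_called?: string;
  error_seen?: string;
  severity: Severity;
  rationale: string;
  created_at: string; // assumed ISO-8601 timestamp column; not specified in the ADR
}

// Hypothetical in-memory equivalent of GET /admin/feedback?since=&severity=
function filterFeedback(rows: FeedbackRow[], since?: string, severity?: Severity): FeedbackRow[] {
  const sinceMs = since === undefined ? Number.NEGATIVE_INFINITY : Date.parse(since);
  return rows.filter(
    (row) =>
      Date.parse(row.created_at) >= sinceMs &&
      (severity === undefined || row.severity === severity)
  );
}

// Count submissions per tool so repeated failure patterns stand out during triage.
function countByTool(rows: FeedbackRow[]): Map<string, number> {
  const counts = new Map<string, number>();
  for (const row of rows) {
    const key = row.tool_called ?? "(unspecified)";
    counts.set(key, (counts.get(key) ?? 0) + 1);
  }
  return counts;
}
```

Grouping by `tool_called` is one plausible way to turn individual submissions into the patterns the weekly triage looks for.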

Decision 3 — Tool risk ratings + auto-approve gate

New field on ProviderMeta (and per-tool override): risk: 'low' | 'medium' | 'high'.
Tagging rules:
- low — read-only operations (ga4_report, gsc_performance, search_knowledge, context_query, *_query for read-only intents).
- medium — write operations on internal records (hubspot_crm upsert, salesforce_crm update, agent_state write).
- high — financial / destructive / external-messaging operations (aws_ses_send, gmail with send, any *_mutate on advertising platforms, salesforce_crm delete, hubspot_crm delete).
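
One way to read the tagging rules is as a provider-level default with narrower overrides. The sketch below assumes a hypothetical `operationRisk` map on ProviderMeta and a hypothetical `resolveRisk` helper; the ADR only states that a per-tool override exists, not its shape.

```typescript
type Risk = "low" | "medium" | "high";

interface ProviderMeta {
  risk: Risk; // provider-wide default rating
  operationRisk?: Record<string, Risk>; // hypothetical override keyed by operation name
}

// Return the override for the operation if one is defined, else the provider default.
function resolveRisk(meta: ProviderMeta, operation?: string): Risk {
  const override = operation === undefined ? undefined : meta.operationRisk?.[operation];
  return override ?? meta.risk;
}

// Example tags following the rules above (operation names are illustrative).
const exampleMeta: Record<string, ProviderMeta> = {
  ga4_report: { risk: "low" },
  hubspot_crm: { risk: "medium", operationRisk: { delete: "high" } },
  aws_ses_send: { risk: "high" },
};
```

With these tags, a hubspot_crm upsert resolves to medium while a hubspot_crm delete resolves to high, matching the rules above.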

Execution gate: when a high-risk tool is called without confirm: true and the tenant config has auto_approve_high_risk = false (the default), the call is not executed and the gateway returns:

{
  "requires_human_approval": true,
  "approval_id": "uuid-here",
  "approval_url": "/admin/approvals/uuid-here",
  "summary": "<rendered action summary>",
  "ttl_seconds": 86400
}

Pending approval persisted to KV pending_approval:{id} with 24h TTL.
Admin endpoints: POST /admin/approvals/{id}/approve → executes the original call; POST /admin/approvals/{id}/reject → marks rejected.
Tenant override: tenant_config:{tenant}.auto_approve_high_risk: true → high-risk calls execute directly. Documented as opt-in for advanced tenants only.
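
The gate logic above can be sketched as a pure function. The response fields match the payload shown earlier; `gate`, `ToolCall`, `GateDecision`, and the injected `newId` generator are hypothetical names, and persistence to KV is left out.

```typescript
interface ToolCall {
  tool: string;
  risk: "low" | "medium" | "high";
  confirm?: boolean;
  summary: string; // rendered action summary
}

interface TenantConfig {
  auto_approve_high_risk?: boolean; // treated as false when absent
}

interface ApprovalRequired {
  requires_human_approval: true;
  approval_id: string;
  approval_url: string;
  summary: string;
  ttl_seconds: number;
}

type GateDecision =
  | { action: "execute" }
  | { action: "hold"; response: ApprovalRequired };

const APPROVAL_TTL_SECONDS = 86400; // 24h, matching the KV TTL

// newId is injected so the sketch stays deterministic and runtime-agnostic.
function gate(call: ToolCall, tenant: TenantConfig, newId: () => string): GateDecision {
  const autoApprove = tenant.auto_approve_high_risk ?? false;
  // Only high-risk calls with neither confirm: true nor tenant opt-in are held.
  if (call.risk !== "high" || call.confirm === true || autoApprove) {
    return { action: "execute" };
  }
  const id = newId();
  return {
    action: "hold",
    response: {
      requires_human_approval: true,
      approval_id: id,
      approval_url: `/admin/approvals/${id}`,
      summary: call.summary,
      ttl_seconds: APPROVAL_TTL_SECONDS,
    },
  };
}
```

Keeping the decision separate from execution and storage makes the rule easy to unit-test; the caller would persist the held call under pending_approval:{id} before returning the response.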

Decision 4 — Spec-first tool descriptions (Notion pattern)

For every provider with non-trivial schema (HubSpot, Salesforce, Google Ads, Meta Ads, LinkedIn Ads, GA4): create one MCP Resource at mcp://{provider}/cheatsheet containing property names, association types, common query patterns, gotchas.
Every affected tool description opens with: “Before any write or non-trivial query, fetch mcp://{provider}/cheatsheet first. Do NOT guess property names or association types.”
Cheatsheets are markdown files in src/resources/cheatsheets/{provider}.md, registered as MCP Resources alongside tools.
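
To keep the preamble consistent across tools, it could be generated rather than hand-typed. The URI scheme and wording below come from this decision; the helper names (`cheatsheetUri`, `specFirstPreamble`, `withPreamble`, `missingPreamble`) and the `ToolDescription` shape are hypothetical.

```typescript
interface ToolDescription {
  name: string;
  provider: string;
  description: string;
}

function cheatsheetUri(provider: string): string {
  return `mcp://${provider}/cheatsheet`;
}

function specFirstPreamble(provider: string): string {
  return (
    `Before any write or non-trivial query, fetch ${cheatsheetUri(provider)} first. ` +
    "Do NOT guess property names or association types."
  );
}

// Prepend the preamble unless the description already opens with it.
function withPreamble(tool: ToolDescription): ToolDescription {
  const preamble = specFirstPreamble(tool.provider);
  if (tool.description.startsWith(preamble)) return tool;
  return { ...tool, description: `${preamble} ${tool.description}` };
}

// Names of tools whose descriptions do not open with the preamble (useful as a lint check).
function missingPreamble(tools: ToolDescription[]): string[] {
  return tools
    .filter((tool) => !tool.description.startsWith(specFirstPreamble(tool.provider)))
    .map((tool) => tool.name);
}
```

A check like `missingPreamble` could run in CI so that newly added tools for cheatsheeted providers cannot ship with a generic description.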

Consequences

Positive

rationale reconstructs intent. Even though V5 can’t see ChatGPT/Cursor/Codex chat content, rationales aggregate into clear product-discovery signal. Cluster the rationales weekly → see what calling agents are actually trying to do.
submit_feedback is a direct line from calling agents to our roadmap. Ramp’s experience: agents are more specific and more consistent in feedback than human users. We get to the third decimal place of “what to build next” without surveying.
Risk ratings prevent expensive accidents. A miscalibrated agent can no longer auto-fire an SES batch send to 10K subscribers without explicit confirmation.
Spec-first descriptions cut the most common failure mode. “Agent hallucinated a property name and the call 4xx’d” is expected to go from a routine failure to a rare one on the providers we cheatsheet.
All four decisions compose. The Grader (ADR-034) reads rationale to category-condition rubrics. Risk ratings inform Bridge abort thresholds (high-risk regressions abort faster). Feedback feeds the engineering pipeline directly.

Negative

Breaking change risk for existing API callers — rationale becomes required. Mitigated by:
- Two-week grace period: rationale optional with deprecation warning logged + Slack alert per offending caller.
- Then enforced strictly.
- Only non-MCP REST callers are affected; MCP callers (/mcp route) get the new schema automatically because they read schemas from the server.
Per-call latency from cheatsheet fetches. Each mcp://*/cheatsheet resource read is one extra round-trip. Mitigated by aggressive caching (ETag + 1h TTL on the resource).
Approval flow adds friction for legitimate batch operations. Mitigated by auto_approve_high_risk: true opt-in on a per-tenant basis.
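
The two-week grace period for REST callers described above can be sketched as a date-based policy switch. The rollout start date is not given in this ADR, so it is a parameter here; `rationalePolicy`, `checkRestCall`, and the outcome shape are hypothetical, and the Slack alert is only noted in a comment.

```typescript
type RationalePolicy = "warn" | "enforce";

const GRACE_PERIOD_MS = 14 * 24 * 60 * 60 * 1000; // two weeks

// During the grace period a missing rationale only warns; afterwards it is enforced.
function rationalePolicy(rolloutStart: Date, now: Date): RationalePolicy {
  return now.getTime() - rolloutStart.getTime() < GRACE_PERIOD_MS ? "warn" : "enforce";
}

type RestOutcome =
  | { status: "accepted"; deprecationWarning?: string }
  | { status: "rejected"; code: "VALIDATION_ERROR" };

function checkRestCall(hasRationale: boolean, policy: RationalePolicy): RestOutcome {
  if (hasRationale) return { status: "accepted" };
  if (policy === "warn") {
    // The real gateway would also log the warning and alert Slack per offending caller.
    return {
      status: "accepted",
      deprecationWarning: "rationale is missing and will become required after the grace period",
    };
  }
  return { status: "rejected", code: "VALIDATION_ERROR" };
}
```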

Neutral

New D1 tables: agent_feedback, and possibly pending_approvals (Decision 3 specifies KV for pending approvals; whether a D1 table is also needed is TBD in implementation).
Tool count: +1 (submit_feedback). Combined with ADR-035’s +3, hits the 35 ceiling.
kv_audit schema gains a rationale column. Migration 0011_kv_audit_rationale.sql.

Alternatives considered

Skip rationale, infer intent from tool sequence. Rejected — too lossy. Ramp’s quoted experience (“rationales reconstruct intent we can’t see”) is the empirical case for this.
Skip risk ratings, rely on per-tool human-approval flags ad hoc. Rejected — already failing today (we have no flags). Categorical taxonomy is cleaner than per-tool flags.
Make rationale optional. Rejected — optional fields are silently skipped. Required is the only enforcement that survives.
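Use a standardized “tool risk” field from the MCP spec, if one exists. Not available today: MCP spec 2025-11-25 does not define a standardized risk field; its tool annotations (readOnlyHint, destructiveHint) are advisory hints, not an enforced risk tier. We use the V5-internal pattern; if the MCP spec adds one in 2026, migrate.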
Build cheatsheets dynamically (LLM-generated from API docs). Rejected for v1 — manual cheatsheets are higher quality and only need to be written once per provider. Revisit if cheatsheets get out of sync with provider versions.
Approval flow via Slack interactive message instead of admin endpoint. Acceptable enhancement. Phase 2 ships the admin endpoint; Slack interactivity can be added in a follow-up if Mishaal wants approve-from-mobile.

Reversal criteria

rationale regret signal: <5% of calls have meaningful rationales after 30 days, AND clustering rationales produces no actionable product insights. Then make rationale optional and downgrade to kv_audit-only logging.
submit_feedback regret signal: <10 unique feedback submissions per month after 90 days. Then deprecate the tool (don’t delete the table).
Risk-rating regret signal: >50% of high-risk calls trigger the approval flow, and every one of them is approved within 5 min. Then either re-tier those tools or default auto_approve_high_risk: true and document.
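
The three regret signals above can be written as explicit checks so that the thresholds are unambiguous. This is one reading of the criteria: how a rationale is judged “meaningful” and how insights are counted are not defined in this ADR, so they appear here as precomputed inputs, and all names are hypothetical.

```typescript
interface RationaleStats {
  totalCalls: number;
  meaningfulRationales: number; // assumed to be judged upstream; method unspecified
  actionableInsights: number; // insights produced by clustering rationales
}

// <5% meaningful rationales AND no actionable insights (evaluated after 30 days).
function rationaleRegret(stats: RationaleStats): boolean {
  if (stats.totalCalls === 0) return false;
  return stats.meaningfulRationales / stats.totalCalls < 0.05 && stats.actionableInsights === 0;
}

// <10 unique submissions per month (evaluated after 90 days).
function feedbackRegret(uniqueSubmissionsThisMonth: number): boolean {
  return uniqueSubmissionsThisMonth < 10;
}

interface ApprovalStats {
  highRiskCalls: number;
  heldForApproval: number;
  approvedWithin5Min: number;
}

// >50% of high-risk calls are held, and every held call is approved within 5 minutes.
function riskRatingRegret(stats: ApprovalStats): boolean {
  if (stats.highRiskCalls === 0 || stats.heldForApproval === 0) return false;
  return (
    stats.heldForApproval / stats.highRiskCalls > 0.5 &&
    stats.approvedWithin5Min === stats.heldForApproval
  );
}
```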

Acceptance

Mishaal Murawala approves all four decisions and authorizes Phase 1 + Phase 2 implementation upon plan-first PR merge.