ADR-047 — OAuth Re-Auth Escalation Path (Fail-Fast + Human-In-The-Loop)
- Status: Accepted
- Date: 2026-05-07
- Decider: Mishaal Murawala (delegated engineering judgment to Claude Code as engineering lead)
- Supersedes: none
- Related: ASCEND_OPERATOR_OS_ENGINEERING_STANDARD.md §7, ADR-024, ADR-038, V5 Invariant 6 (request path never touches DO), V5 Invariant 8 (multi-account tokens:{tenant}:{provider}:{account_id})
Context
The Operator OS depends on tokens for ~25 external SaaS APIs (HubSpot, Google Ads, Salesforce, Gmail, GA4, GSC, LinkedIn Ads, Microsoft Ads, Meta Ads, SEMrush, Apollo, Gong, etc.). Tokens fail in three distinct modes, and conflating them produces silent multi-day outages:
- Transient refresh failure. Network blip, upstream 5xx, brief rate limit. Resolves on retry.
- Recoverable refresh failure. Refresh token still valid but auth server rejected the specific refresh attempt (clock skew, header misconfig). Resolves on a different attempt with corrected request.
- Terminal refresh failure. Refresh token revoked, scopes changed, connection deleted, account password reset, IT admin rotated SSO. Cannot be auto-fixed. Requires a human to re-authorize via the OAuth consent screen.
Today, V5’s TokenManager DO retries refresh on alarm and writes terminal failures into a status field, but there is no enforced escalation path: the gateway proxy logs TOKEN_EXPIRED and the agent run dies without surfacing the actionable next step (re-auth) to the human who can fix it.
The cost of getting this wrong is high: a portco’s SDR Agent runs for 3 days returning empty results because the Gmail token expired and nobody noticed. Multi-model review flagged this as the second-highest-likelihood Q1 incident class (after Workflows payload limits, addressed in ADR-046).
This ADR codifies the contract: fail-fast in the request path, escalate to a queued human-in-the-loop re-auth, alert on terminal failure, and report SLA on time-to-reauth.
Decision
Three failure-mode contracts
Mode 1 — Transient refresh failure
- TokenManager DO alarm fires; refresh succeeds on retry within the alarm tick.
- Implementation: exponential backoff inside the alarm, max 3 attempts within a single alarm fire, total wall time ≤ 30 s.
- KV is updated with the new token; status remains active.
- No human notification.
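The Mode 1 retry budget can be sanity-checked with a small sketch. Only the limits of 3 attempts and 30 s come from this ADR; the 1 s base delay, the doubling factor, the per-attempt timeout, and the helper names backoffDelaysMs and fitsBudget are assumptions for illustration.

```typescript
// Limits taken from the Mode 1 contract.
const MAX_ATTEMPTS = 3;
const WALL_TIME_BUDGET_MS = 30000;

// Hypothetical helper: waits (ms) between attempts inside one alarm fire.
// The first attempt runs immediately, so there are MAX_ATTEMPTS - 1 waits.
function backoffDelaysMs(baseMs: number = 1000, factor: number = 2): number[] {
  const delays: number[] = [];
  for (let retry = 0; retry < MAX_ATTEMPTS - 1; retry++) {
    delays.push(baseMs * Math.pow(factor, retry));
  }
  return delays;
}

// Hypothetical helper: true if the waits plus the worst-case time of every
// attempt stay within the 30 s wall-time budget.
function fitsBudget(delays: number[], perAttemptTimeoutMs: number): boolean {
  const totalWaitMs = delays.reduce((sum, d) => sum + d, 0);
  return totalWaitMs + MAX_ATTEMPTS * perAttemptTimeoutMs <= WALL_TIME_BUDGET_MS;
}
```

Under these assumed values the waits are 1 s and 2 s, which leaves at most 9 s per refresh attempt before the budget is exceeded.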
Mode 2 — Recoverable refresh failure
- First alarm fire fails after 3 attempts.
- Status updated to refresh_failing (transient state, not yet terminal).
- Next scheduled alarm (at the normal 10-minute-before-expiry mark) tries again.
- If next alarm also fails, escalate to Mode 3.
- Slack info-level alert on first transition into refresh_failing so an operator sees it but no action is required.
Mode 3 — Terminal refresh failure
- Two consecutive alarm fires fail OR the upstream returned an unrecoverable error (e.g., HTTP 400 with error: invalid_grant, indicating refresh token revoked).
- TokenManager DO marks the token row in KV with status: needs_reauth AND writes a row to D1 oauth_reauth_queue (cold-path table; see schema below).
- Slack alert at WARN level: tenant, provider, account, last error, link to admin re-auth flow.
- DO stops trying. No further alarms fire on this token until human re-auth resolves it.
- The DO’s alarm is rescheduled to the next normal interval only after re-auth completes.
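The three contracts above amount to a small state machine over the token status. The sketch below is one possible reading, not the TokenManager DO's actual code: the AlarmOutcome shape and the function names are hypothetical, and it assumes a successful refresh from refresh_failing returns the token to active, which the contracts imply but do not state.

```typescript
type TokenStatus = "active" | "refresh_failing" | "needs_reauth";

// Hypothetical summary of one alarm fire (after its up-to-3 attempts).
interface AlarmOutcome {
  ok: boolean;          // true if any attempt in this alarm fire succeeded
  httpStatus?: number;  // last upstream HTTP status, if a response arrived
  oauthError?: string;  // last upstream OAuth `error` field, if any
}

// Mode 3 shortcut: HTTP 400 with invalid_grant means the refresh token is gone.
function isUnrecoverable(outcome: AlarmOutcome): boolean {
  return outcome.httpStatus === 400 && outcome.oauthError === "invalid_grant";
}

function nextStatus(current: TokenStatus, outcome: AlarmOutcome): TokenStatus {
  // Only a completed human re-auth clears needs_reauth; alarms never do.
  if (current === "needs_reauth") return "needs_reauth";
  if (outcome.ok) return "active";                        // Mode 1, or Mode 2 recovering
  if (isUnrecoverable(outcome)) return "needs_reauth";    // Mode 3, immediate
  // First failed alarm enters Mode 2; a second consecutive one escalates to Mode 3.
  return current === "refresh_failing" ? "needs_reauth" : "refresh_failing";
}
```

Keeping the transition a pure function makes the escalation rule testable without a Durable Object; the side effects (KV write, D1 row, Slack alert) would hang off the returned status.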
Hot-path behavior
- Per V5 Invariant 6, the gateway request path never touches a DO and never attempts a refresh.
- When the gateway proxy reads tokens:{tenant}:{provider}:{account_id} and finds either status: needs_reauth OR an expired token without a fresh value, it returns immediately:

    {
      "error": "token requires re-authorization",
      "code": "TOKEN_EXPIRED",
      "status": 401,
      "tenant_id": "<tenant>",
      "provider": "<provider>",
      "account_id": "<account>",
      "reauth_url": "https://ascend-gateway-v5.ascendgtm.workers.dev/oauth/{provider}/start?tenant={tenant}&account={account}"
    }

- The agent runner that consumed this response surfaces the reauth_url to its caller (the human or the orchestration layer): fail-fast, no retry.
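The hot-path check can be sketched as a pure function over the KV record. The response fields and the reauth URL pattern come from the contract above; the TokenRecord shape (in particular the expires_at field name) and the function names are assumptions, and no DO call or refresh attempt appears anywhere in it.

```typescript
type TokenStatus = "active" | "refresh_failing" | "needs_reauth";

// Assumed shape of the value stored under tokens:{tenant}:{provider}:{account_id}.
interface TokenRecord {
  status: TokenStatus;
  expires_at: number; // unix seconds (hypothetical field name)
}

interface TokenKey {
  tenant: string;
  provider: string;
  account: string;
}

interface TokenExpiredBody {
  error: string;
  code: "TOKEN_EXPIRED";
  status: 401;
  tenant_id: string;
  provider: string;
  account_id: string;
  reauth_url: string;
}

const GATEWAY_ORIGIN = "https://ascend-gateway-v5.ascendgtm.workers.dev";

function buildReauthUrl(key: TokenKey): string {
  const provider = encodeURIComponent(key.provider);
  const tenant = encodeURIComponent(key.tenant);
  const account = encodeURIComponent(key.account);
  return `${GATEWAY_ORIGIN}/oauth/${provider}/start?tenant=${tenant}&account=${account}`;
}

// Returns null when the token is usable, otherwise the fail-fast 401 body.
function checkToken(record: TokenRecord, key: TokenKey, nowSec: number): TokenExpiredBody | null {
  const expired = record.expires_at <= nowSec;
  if (record.status !== "needs_reauth" && !expired) {
    return null;
  }
  return {
    error: "token requires re-authorization",
    code: "TOKEN_EXPIRED",
    status: 401,
    tenant_id: key.tenant,
    provider: key.provider,
    account_id: key.account,
    reauth_url: buildReauthUrl(key),
  };
}
```

A token in refresh_failing that has not yet expired still passes this check, which matches Mode 2 being a non-terminal state.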
oauth_reauth_queue D1 table
    CREATE TABLE IF NOT EXISTS oauth_reauth_queue (
      id INTEGER PRIMARY KEY AUTOINCREMENT,
      tenant_id TEXT NOT NULL,
      provider TEXT NOT NULL,
      account_id TEXT NOT NULL,
      failed_at INTEGER NOT NULL,            -- unix seconds, time of first terminal failure
      last_error TEXT NOT NULL,              -- short error string from upstream
      status TEXT NOT NULL DEFAULT 'queued', -- queued | in_progress | resolved | abandoned
      resolved_at INTEGER,                   -- unix seconds, set when re-auth completes
      resolved_by TEXT,                      -- email of the operator who re-authed
      notes TEXT
    );
    CREATE INDEX IF NOT EXISTS idx_oauth_reauth_tenant ON oauth_reauth_queue(tenant_id, status);
    CREATE INDEX IF NOT EXISTS idx_oauth_reauth_status ON oauth_reauth_queue(status, failed_at);

Class label: ascend_platform_metric (the operating queue itself), with tenant_id foreign-key column allowing per-tenant filtering. Per ADR-045.
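As a rough illustration of the Mode 3 write into this table, the sketch below builds a parameterized INSERT. The column names come from the schema; the TerminalFailure shape, the 200-character cap on last_error, and the helper name are assumptions. The status column is left to its 'queued' default.

```typescript
// Hypothetical description of a terminal failure observed by the TokenManager DO.
interface TerminalFailure {
  tenantId: string;
  provider: string;
  accountId: string;
  failedAt: number;  // unix seconds
  lastError: string; // raw upstream error text
}

interface Statement {
  sql: string;
  params: (string | number)[];
}

// Assumed cap that keeps last_error a "short error string".
const MAX_ERROR_LENGTH = 200;

function buildEnqueueStatement(f: TerminalFailure): Statement {
  return {
    sql:
      "INSERT INTO oauth_reauth_queue " +
      "(tenant_id, provider, account_id, failed_at, last_error) " +
      "VALUES (?, ?, ?, ?, ?)",
    params: [
      f.tenantId,
      f.provider,
      f.accountId,
      f.failedAt,
      f.lastError.slice(0, MAX_ERROR_LENGTH),
    ],
  };
}
```

With a D1 binding (the binding name is deployment-specific), the result would typically be executed as db.prepare(stmt.sql).bind(...stmt.params).run(). Binding values rather than interpolating them keeps upstream error text from ever reaching the SQL string.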
Admin dashboard surface
- GET /admin/oauth-reauth-queue (CF Access gated per V5 Invariant 13) returns the queue filtered by status.
- Each row links to /oauth/{provider}/start?tenant={tenant}&account={account} to begin the re-auth flow.
- On successful re-auth callback, the OAuth handler updates the token in KV AND marks the queue row status: resolved.
SLA reporting
- Dashboard surfaces time_to_reauth = resolved_at - failed_at per resolved row.
- Weekly cron computes p50/p95/p99 across all resolved rows in the past 7 days, writes to decision_log for trend tracking.
- Goal during open beta: p95 < 24 hours. Post-launch goal: p95 < 4 hours.
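The weekly SLA numbers could be computed along these lines. The time_to_reauth formula and the 24-hour beta goal come from the bullets above; the nearest-rank percentile method and the function names are assumptions, since this ADR does not prescribe how percentiles are calculated.

```typescript
// Subset of an oauth_reauth_queue row with status = 'resolved'.
interface ResolvedRow {
  failed_at: number;   // unix seconds
  resolved_at: number; // unix seconds
}

// Nearest-rank percentile over an ascending-sorted, non-empty sample.
function percentile(sortedAsc: number[], p: number): number {
  if (sortedAsc.length === 0) throw new Error("no samples");
  const rank = Math.ceil((p * sortedAsc.length) / 100);
  return sortedAsc[Math.max(rank, 1) - 1];
}

interface ReauthSla {
  p50: number;
  p95: number;
  p99: number;
  meetsBetaGoal: boolean; // p95 < 24 hours
}

function timeToReauthStats(rows: ResolvedRow[]): ReauthSla {
  // time_to_reauth = resolved_at - failed_at, in seconds.
  const durations = rows
    .map((r) => r.resolved_at - r.failed_at)
    .sort((a, b) => a - b);
  const p95 = percentile(durations, 95);
  return {
    p50: percentile(durations, 50),
    p95,
    p99: percentile(durations, 99),
    meetsBetaGoal: p95 < 24 * 3600,
  };
}
```

With only a handful of resolved rows per week, p95 and p99 will both collapse to the slowest re-auth, so early trend lines should be read with that in mind.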
Slack alert template
    :rotating_light: OAuth re-auth required
    Tenant: {tenant_id}
    Provider: {provider}
    Account: {account_id}
    Failed since: {failed_at_iso} ({elapsed_minutes} min ago)
    Last error: {last_error}
    Re-auth URL: {reauth_url}
    Queue status: {dashboard_url}

Alternatives considered
- Auto-retry forever. Already rejected by V5 Invariant 3 (fail-fast, no retries in request path) and operationally: a revoked refresh token never recovers. Indefinite retry consumes CPU and pollutes alerts.
- Email user directly. Some failure modes correspond to a specific tenant admin (the person who originally authorized the connection) and emailing them feels personal. But during open beta, Mishaal is the operator for every tenant, and email is slower than Slack. Reversal trigger: first paying client where the tenant admin is not Mishaal; at that point, extend the Slack alert to also email a per-tenant admin contact stored in tenant_config:{tenant}.admin_email.
- Auto-disable agent runs that touch the failed provider. Considered but rejected: agents legitimately can run on tools that don’t need the failed provider (e.g., SDR Agent might use Gmail + HubSpot; if Gmail breaks but HubSpot works, the agent should still partially function and surface the Gmail breakage). Fail-fast at the tool level is the right granularity.
- Use a third-party token broker (Nango, Composio). Rejected by V5 Invariant 5 (no external vendors in the token path). The ADR-038 Nango-light-management decision specifically excludes ceding the token store. Re-auth UX is a UX problem we own, not a vendor problem we delegate.
Consequences
Wins
- Multi-day silent failures are eliminated. Every token that breaks gets a Slack alert and a queue entry.
- Hot path stays fast: no refresh attempts, no DO calls, no surprise latency.
- Tenant isolation: re-auth queue is per-tenant; one tenant’s broken connection never affects another’s.
- Auditable: queue + Slack + decision_log together produce a clear timeline of every connection failure.
Costs
- One new D1 table + indexes (cold path; cheap).
- One new admin dashboard surface (reuses existing CF-Access gate).
- One Slack channel needs an oauth-reauth topic (or repurpose gateway-alerts).
- Operator overhead: re-authing connections is now a human task. Acceptable; this work is unsafe to automate per ADR rationale above.
Open items (tracked)
- D1 migration creating oauth_reauth_queue: to be added when the first Q1 Workflow needs it OR when the first terminal failure happens, whichever comes first. Tech-debt row OO-047-001.
- GET /admin/oauth-reauth-queue admin route: same.
- Slack alert wiring in TokenManager DO: same.
- SLA cron for weekly p50/p95/p99: Q2 follow-up; not blocking.
Reversal criteria
- Cloudflare or a partner ships a managed re-auth UX that meets V5 Invariants 5 and 13 — could replace our admin dashboard surface but not the underlying queue table.
- A paying client requires a stricter SLA than this contract supports — tighten the alert thresholds and possibly add SMS, but the queue + fail-fast + HITL pattern remains.
The contract itself is not reversed. Auto-retrying terminal failures is a known anti-pattern and never worth revisiting.