ADR-047 — OAuth Re-Auth Escalation Path (Fail-Fast + Human-In-The-Loop)

Status: Accepted
Date: 2026-05-07
Decider: Mishaal Murawala (delegated engineering judgment to Claude Code as engineering lead)
Supersedes: none
Related: ASCEND_OPERATOR_OS_ENGINEERING_STANDARD.md §7, ADR-024, ADR-038, V5 Invariant 6 (request path never touches DO), V5 Invariant 8 (multi-account tokens:{tenant}:{provider}:{account_id})

Context

The Operator OS depends on tokens for ~25 external SaaS APIs (HubSpot, Google Ads, Salesforce, Gmail, GA4, GSC, LinkedIn Ads, Microsoft Ads, Meta Ads, SEMrush, Apollo, Gong, etc.). Tokens fail in three distinct modes, and conflating them produces silent multi-day outages:

Transient refresh failure. Network blip, upstream 5xx, brief rate limit. Resolves on retry.
Recoverable refresh failure. The refresh token is still valid, but the auth server rejected the specific refresh attempt (clock skew, header misconfiguration). Resolves on a later attempt with a corrected request.
Terminal refresh failure. Refresh token revoked, scopes changed, connection deleted, account password reset, IT admin rotated SSO. Cannot be auto-fixed. Requires a human to re-authorize via the OAuth consent screen.

Until now, V5’s TokenManager DO has retried refresh on alarm and written terminal failures into a status field, but there has been no enforced escalation path: the gateway proxy logs TOKEN_EXPIRED and the agent run dies without surfacing the actionable next step (re-auth) to the human who can fix it.

The cost of getting this wrong is high: a portco’s SDR Agent runs for 3 days returning empty results because the Gmail token expired and nobody noticed. Multi-model review flagged this as the second-highest-likelihood Q1 incident class (after Workflows payload limits, addressed in ADR-046).

This ADR codifies the contract: fail-fast in the request path, escalate to a queued human-in-the-loop re-auth, alert on terminal failure, and report SLA on time-to-reauth.

Decision

Three failure-mode contracts

Mode 1 — Transient refresh failure

TokenManager DO alarm fires; refresh succeeds on retry within the alarm tick.
Implementation: exponential backoff inside the alarm, max 3 attempts within a single alarm fire, total wall time ≤ 30 s.
KV is updated with the new token; status remains active.
No human notification.
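
The Mode 1 retry budget can be sketched as a small planning function. This is a minimal sketch under stated assumptions, not the TokenManager DO's actual code: the names `planRefreshAttempts` and `worstCaseWallTimeMs`, the 1 s base delay, the doubling factor, and the 8 s per-attempt timeout are all hypothetical. Only the 3-attempt cap and the 30 s wall-time ceiling come from the contract above.

```typescript
// Hypothetical plan for one alarm fire. Only maxAttempts = 3 and
// budgetMs = 30 s come from the ADR; the other defaults are assumptions.
interface AttemptPlan {
  delayBeforeMs: number; // sleep before this attempt (0 for the first)
  timeoutMs: number;     // abort the refresh request after this long
}

function planRefreshAttempts(
  maxAttempts = 3,
  baseDelayMs = 1_000,
  perAttemptTimeoutMs = 8_000,
  budgetMs = 30_000,
): AttemptPlan[] {
  const plan: AttemptPlan[] = [];
  let worstCaseMs = 0;
  for (let i = 0; i < maxAttempts; i++) {
    // Exponential backoff: 0, base, 2*base, 4*base, ...
    const delayBeforeMs = i === 0 ? 0 : baseDelayMs * 2 ** (i - 1);
    const cost = delayBeforeMs + perAttemptTimeoutMs;
    // Drop attempts that could push the alarm past its wall-time budget.
    if (worstCaseMs + cost > budgetMs) break;
    plan.push({ delayBeforeMs, timeoutMs: perAttemptTimeoutMs });
    worstCaseMs += cost;
  }
  return plan;
}

// Upper bound on wall time if every attempt runs to its timeout.
function worstCaseWallTimeMs(plan: AttemptPlan[]): number {
  return plan.reduce((sum, a) => sum + a.delayBeforeMs + a.timeoutMs, 0);
}
```

Budgeting the per-attempt timeout together with the sleep keeps the worst case inside the 30 s ceiling even when every request hangs, which a sleep-only schedule would not guarantee.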

Mode 2 — Recoverable refresh failure

First alarm fire fails after 3 attempts.
Status updated to refresh_failing (transient state, not yet terminal).
Next scheduled alarm (at the normal 10-minute-before-expiry mark) tries again.
If next alarm also fails, escalate to Mode 3.
Slack info-level alert on first transition into refresh_failing so an operator sees it but no action is required.

Mode 3 — Terminal refresh failure

Two consecutive alarm fires fail OR the upstream returned an unrecoverable error (e.g., HTTP 400 with error: invalid_grant, indicating refresh token revoked).
TokenManager DO marks the token row in KV with status: needs_reauth AND writes a row to D1 oauth_reauth_queue (cold-path table; see schema below).
Slack alert at WARN level: tenant, provider, account, last error, link to admin re-auth flow.
DO stops trying. No further alarms fire on this token until human re-auth resolves it.
The DO’s alarm is rescheduled to the next normal interval only after re-auth completes.
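
The three mode contracts above can be read as one state transition evaluated at the end of each alarm fire. The sketch below is illustrative only: the type and function names are hypothetical, and it treats HTTP 400 with `invalid_grant` as the sole unrecoverable signal because that is the only example the ADR gives. Other provider-specific terminal errors would need to be added.

```typescript
type TokenStatus = "active" | "refresh_failing" | "needs_reauth";

// Result of one alarm fire, after its in-alarm retries are exhausted.
interface AlarmOutcome {
  ok: boolean;         // true if any attempt in this alarm fire succeeded
  httpStatus?: number; // status of the last failed refresh response
  oauthError?: string; // "error" field of the token endpoint response
}

interface Transition {
  status: TokenStatus;
  slack: "none" | "info" | "warn";
  enqueueReauth: boolean;     // write a row to oauth_reauth_queue
  scheduleNextAlarm: boolean; // false once the DO stops trying
}

// Assumption: only the ADR's own example is classified as unrecoverable.
function isUnrecoverable(o: AlarmOutcome): boolean {
  return o.httpStatus === 400 && o.oauthError === "invalid_grant";
}

function nextTransition(current: TokenStatus, outcome: AlarmOutcome): Transition {
  // Mode 3 is sticky: only a human re-auth leaves needs_reauth.
  if (current === "needs_reauth") {
    return { status: "needs_reauth", slack: "none", enqueueReauth: false, scheduleNextAlarm: false };
  }
  // Mode 1: refresh succeeded within the alarm tick.
  if (outcome.ok) {
    return { status: "active", slack: "none", enqueueReauth: false, scheduleNextAlarm: true };
  }
  // Mode 3: unrecoverable error, or second consecutive failed alarm fire.
  if (isUnrecoverable(outcome) || current === "refresh_failing") {
    return { status: "needs_reauth", slack: "warn", enqueueReauth: true, scheduleNextAlarm: false };
  }
  // Mode 2: first failed alarm fire.
  return { status: "refresh_failing", slack: "info", enqueueReauth: false, scheduleNextAlarm: true };
}
```

Keeping the decision in a pure function makes the escalation rules testable without a Durable Object runtime; the DO would then apply the returned side effects (KV write, D1 insert, Slack post, alarm scheduling).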

Hot-path behavior

Per V5 Invariant 6, the gateway request path never touches a DO and never attempts a refresh.

When the gateway proxy reads tokens:{tenant}:{provider}:{account_id} and finds either status: needs_reauth OR an expired token without a fresh value, it returns immediately:

{
  "error": "token requires re-authorization",
  "code": "TOKEN_EXPIRED",
  "status": 401,
  "tenant_id": "<tenant>",
  "provider": "<provider>",
  "account_id": "<account>",
  "reauth_url": "https://ascend-gateway-v5.ascendgtm.workers.dev/oauth/{provider}/start?tenant={tenant}&account={account}"
}

The agent runner that consumes this response surfaces the reauth_url to its caller (the human or the orchestration layer). It fails fast and does not retry.
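
The hot-path check can be sketched as a pure function over the KV record. This is an illustration under assumptions: the `TokenRecord` shape (in particular the `expires_at` field) and the function names are hypothetical, since the ADR does not specify the stored record layout. The response fields and URL pattern follow the JSON above.

```typescript
// Assumed KV record shape; the ADR only confirms the "status" field.
interface TokenRecord {
  status: "active" | "refresh_failing" | "needs_reauth";
  access_token: string;
  expires_at: number; // unix seconds (assumption)
}

interface ReauthError {
  error: string;
  code: "TOKEN_EXPIRED";
  status: 401;
  tenant_id: string;
  provider: string;
  account_id: string;
  reauth_url: string;
}

const GATEWAY_ORIGIN = "https://ascend-gateway-v5.ascendgtm.workers.dev";

function buildReauthUrl(tenant: string, provider: string, account: string): string {
  return (
    `${GATEWAY_ORIGIN}/oauth/${encodeURIComponent(provider)}/start` +
    `?tenant=${encodeURIComponent(tenant)}&account=${encodeURIComponent(account)}`
  );
}

// Returns null when the token is usable, otherwise the fail-fast error body.
// No refresh attempt and no DO call happen here.
function checkToken(
  record: TokenRecord,
  tenant: string,
  provider: string,
  account: string,
  nowSec: number,
): ReauthError | null {
  const expired = record.expires_at <= nowSec;
  if (record.status !== "needs_reauth" && !expired) return null;
  return {
    error: "token requires re-authorization",
    code: "TOKEN_EXPIRED",
    status: 401,
    tenant_id: tenant,
    provider,
    account_id: account,
    reauth_url: buildReauthUrl(tenant, provider, account),
  };
}
```

Percent-encoding the tenant and account values matters when an account identifier is an email address, since characters such as `@` or `+` would otherwise be misread in the query string.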

`oauth_reauth_queue` D1 table

CREATE TABLE IF NOT EXISTS oauth_reauth_queue (
  id INTEGER PRIMARY KEY AUTOINCREMENT,
  tenant_id TEXT NOT NULL,
  provider TEXT NOT NULL,
  account_id TEXT NOT NULL,
  failed_at INTEGER NOT NULL,                  -- unix seconds, time of first terminal failure
  last_error TEXT NOT NULL,                    -- short error string from upstream
  status TEXT NOT NULL DEFAULT 'queued',       -- queued | in_progress | resolved | abandoned
  resolved_at INTEGER,                         -- unix seconds, set when re-auth completes
  resolved_by TEXT,                            -- email of the operator who re-authed
  notes TEXT
);
CREATE INDEX IF NOT EXISTS idx_oauth_reauth_tenant ON oauth_reauth_queue(tenant_id, status);
CREATE INDEX IF NOT EXISTS idx_oauth_reauth_status ON oauth_reauth_queue(status, failed_at);

Class label (per ADR-045): ascend_platform_metric (the operating queue itself). The tenant_id column acts as a foreign key for per-tenant filtering; the schema above does not declare a REFERENCES constraint for it.

Admin dashboard surface

GET /admin/oauth-reauth-queue (CF Access gated per V5 Invariant 13) returns the queue filtered by status.
Each row links to /oauth/{provider}/start?tenant={tenant}&account={account} to begin the re-auth flow.
On successful re-auth callback, the OAuth handler updates the token in KV AND marks the queue row status: resolved.
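
The queue update performed on a successful callback might look like the following. This is a sketch, not the OAuth handler's code: `buildResolveStatement` and `BoundStatement` are hypothetical names, and closing every open row for the connection (rather than one row by id) is an assumption. Table and column names are taken from the schema above.

```typescript
// Hypothetical container for a parameterized statement.
interface BoundStatement {
  sql: string;
  params: (string | number)[];
}

// Builds the UPDATE that closes all open queue rows for one connection.
function buildResolveStatement(
  tenantId: string,
  provider: string,
  accountId: string,
  resolvedBy: string, // email of the operator who re-authed
  nowSec: number,     // unix seconds, stored as resolved_at
): BoundStatement {
  return {
    sql:
      "UPDATE oauth_reauth_queue " +
      "SET status = 'resolved', resolved_at = ?, resolved_by = ? " +
      "WHERE tenant_id = ? AND provider = ? AND account_id = ? " +
      "AND status IN ('queued', 'in_progress')",
    params: [nowSec, resolvedBy, tenantId, provider, accountId],
  };
}
```

In a Worker with a D1 binding, the result could be executed with `db.prepare(stmt.sql).bind(...stmt.params).run()`. Restricting the WHERE clause to open rows leaves earlier resolved or abandoned rows, and their timestamps, intact for SLA reporting.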

SLA reporting

Dashboard surfaces time_to_reauth = resolved_at - failed_at per resolved row.
Weekly cron computes p50/p95/p99 across all resolved rows in the past 7 days, writes to decision_log for trend tracking.
Goal during open beta: p95 < 24 hours. Post-launch goal: p95 < 4 hours.
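
The weekly SLA computation can be sketched as below. The ADR does not say which percentile method the cron uses; this sketch assumes nearest-rank, and the names `percentile`, `reauthSla`, and `ResolvedRow` are hypothetical. Durations are in seconds because failed_at and resolved_at are stored as unix seconds.

```typescript
interface ResolvedRow {
  failed_at: number;   // unix seconds
  resolved_at: number; // unix seconds
}

// Nearest-rank percentile (assumed method); null for an empty input.
function percentile(values: number[], p: number): number | null {
  if (values.length === 0) return null;
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.max(1, Math.ceil((p * sorted.length) / 100));
  return sorted[rank - 1] ?? null;
}

// time_to_reauth = resolved_at - failed_at, summarized across resolved rows.
function reauthSla(rows: ResolvedRow[]) {
  const durations = rows.map((r) => r.resolved_at - r.failed_at);
  return {
    p50: percentile(durations, 50),
    p95: percentile(durations, 95),
    p99: percentile(durations, 99),
  };
}

// Open-beta goal from the ADR: p95 under 24 hours.
function meetsBetaGoal(p95Sec: number | null): boolean {
  return p95Sec !== null && p95Sec < 24 * 3600;
}
```

With only a handful of resolved rows per week, p95 and p99 under nearest-rank both collapse to the slowest re-auth, so early trend lines in decision_log mostly track the worst case.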

Slack alert template

:rotating_light: OAuth re-auth required
Tenant: {tenant_id}
Provider: {provider}
Account: {account_id}
Failed since: {failed_at_iso} ({elapsed_minutes} min ago)
Last error: {last_error}
Re-auth URL: {reauth_url}
Queue status: {dashboard_url}
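
Filling the template above is mostly a matter of deriving failed_at_iso and elapsed_minutes from the stored unix timestamp. A minimal sketch follows; the `ReauthAlert` interface and `renderReauthAlert` name are hypothetical, and posting the text to Slack is left out.

```typescript
// Hypothetical input shape; field meanings follow the template placeholders.
interface ReauthAlert {
  tenantId: string;
  provider: string;
  accountId: string;
  failedAt: number; // unix seconds, as stored in oauth_reauth_queue.failed_at
  lastError: string;
  reauthUrl: string;
  dashboardUrl: string;
}

function renderReauthAlert(a: ReauthAlert, nowSec: number): string {
  const failedAtIso = new Date(a.failedAt * 1000).toISOString();
  // Clamp at 0 so clock skew never produces a negative "min ago".
  const elapsedMinutes = Math.max(0, Math.floor((nowSec - a.failedAt) / 60));
  return [
    ":rotating_light: OAuth re-auth required",
    `Tenant: ${a.tenantId}`,
    `Provider: ${a.provider}`,
    `Account: ${a.accountId}`,
    `Failed since: ${failedAtIso} (${elapsedMinutes} min ago)`,
    `Last error: ${a.lastError}`,
    `Re-auth URL: ${a.reauthUrl}`,
    `Queue status: ${a.dashboardUrl}`,
  ].join("\n");
}
```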

Alternatives considered

Auto-retry forever. Rejected on two grounds. V5 Invariant 3 (fail-fast, no retries in request path) already rules it out, and operationally a revoked refresh token never recovers, so indefinite retry only consumes CPU and pollutes alerts.
Email user directly. Some failure modes correspond to a specific tenant admin (the person who originally authorized the connection) and emailing them feels personal. But during open beta, Mishaal is the operator for every tenant, and email is slower than Slack. Reversal trigger: first paying client where the tenant admin is not Mishaal — extend Slack alert to also email a per-tenant admin contact stored in tenant_config:{tenant}.admin_email.
Auto-disable agent runs that touch the failed provider. Considered but rejected: agents legitimately can run on tools that don’t need the failed provider (e.g., SDR Agent might use Gmail + HubSpot; if Gmail breaks but HubSpot works, the agent should still partially function and surface the Gmail breakage). Fail-fast at the tool level is the right granularity.
Use a third-party token broker (Nango, Composio). Rejected by V5 Invariant 5 (no external vendors in the token path). The ADR-038 Nango-light-management decision specifically excludes ceding the token store. Re-auth UX is a UX problem we own, not a vendor problem we delegate.

Consequences

Wins

Multi-day silent failures are eliminated. Every token that fails terminally gets a WARN-level Slack alert and a queue entry, and every token that starts failing gets an info-level alert.
Hot path stays fast: no refresh attempts, no DO calls, no surprise latency.
Tenant isolation: queue rows and token keys are scoped per tenant; one tenant’s broken connection never affects another’s.
Auditable: queue + Slack + decision_log together produce a clear timeline of every connection failure.

Costs

One new D1 table + indexes (cold path; cheap).
One new admin dashboard surface (reuses existing CF-Access gate).
One Slack channel needs an oauth-reauth topic (or repurpose gateway-alerts).
Operator overhead: re-authorizing connections is now a human task. Acceptable; terminal failures cannot be auto-fixed, per the rationale above.

Open items (tracked)

D1 migration creating oauth_reauth_queue: to be added when the first Q1 Workflow needs it or when the first terminal failure happens, whichever comes first. Tech-debt row OO-047-001.
GET /admin/oauth-reauth-queue admin route: same trigger.
Slack alert wiring in TokenManager DO: same trigger.
SLA cron for weekly p50/p95/p99: Q2 follow-up; not blocking.

Reversal criteria

Cloudflare or a partner ships a managed re-auth UX that meets V5 Invariants 5 and 13 — could replace our admin dashboard surface but not the underlying queue table.
A paying client requires a stricter SLA than this contract supports — tighten the alert thresholds and possibly add SMS, but the queue + fail-fast + HITL pattern remains.

The contract itself is not reversed. Auto-retrying terminal failures is a known anti-pattern and never worth revisiting.