
Ascend Cloud-Native Platform v2 — Engineering Plan

Author: Engineering Leadership (Claude + Mishaal)
Version: 1.0 (2026-04-24)
Status: Proposed, awaiting sign-off
Scope: The entire Ascend GTM stack (gateway, context plane, agent workflow, integrations, observability, security), brought to 2026 cutting-edge and made fully cloud-native.
Non-goal: Incremental patching. This plan is a structural upgrade.

Executive summary

Ascend’s V5 stack is architecturally sound: the hot path is edge-native, integrations are unified through a single gateway, and the Phase 2 Context Plane just landed. But we are carrying stale invariants, hand-rolled patterns that 2026 primitives replace, and a laptop-dependent agent workflow. This plan closes those gaps in three waves over ~4 weeks, ending with:

  1. Agent workflow 100% cloud-native. Any surface (phone, 10-year-old laptop, library PC) reaches full engineering capability via claude.ai/code + Routines. No local process. No SSH. No Tailscale. No laptop dependency.
  2. V5 Gateway registered as a first-class remote MCP server. Streamable HTTP + OAuth 2.1 (per MCP spec 2025-11-25). Discoverable via the Anthropic registry at net.ascendgtm/gateway. Per-tenant isolation via Cloudflare MCP Server Portals.
  3. Observability, security, and reliability upgraded to 2026 edge primitives. AI Gateway in front of every LLM call. Workers Logs + Logpush for audit. Access + WebAuthn on admin endpoints. Workflows replacing DIY cron-for-multi-step. SQLite-backed Durable Objects for every new class.

Research receipts: CF platform audit · MCP ecosystem audit · Claude Code cloud architecture · initial cloud-only audit.


Part I — Current state (where we are)

What’s already cutting-edge (keep)

| System | Status | Why it’s right |
| --- | --- | --- |
| V5 Gateway on Workers + Hono 4 | | Single binary, edge-native, <10 ms overhead invariant |
| KV-only hot path | | Spec-correct: request latency bounded by KV reads |
| Durable Objects for OAuth | | Alarm-based proactive refresh; no vendor tax |
| D1 cold path for audit | | Correct hot/cold separation |
| Phase 2 Context Worker | | Two-plane architecture per ADR-016 |
| CF Cron for scheduled work | | Free-plan limit (5) respected; replaces n8n |
| R2 weekly KV backups | | Disaster recovery on the correct storage |
| GitHub Actions CI (typecheck + test + drift check) | | Two jobs: gateway-worker + context-worker-typecheck |

What’s drifted (repo docs vs reality)

Caught during this plan’s research pass:

| File | Stale claim | Reality | Fix |
| --- | --- | --- | --- |
| .claude/CLAUDE.md §Architecture Invariants | “ONE Worker only … No Service Bindings … No multi-Worker” | Phase 2 shipped Service Binding to ascend-context-worker | Rewrite invariant #1 to reflect ADR-016’s scoped exception |
| .claude/CLAUDE.md §Architecture Invariants | “18 MCP tools” | 28 tools post-Phase-2 | Rewrite invariant #7 with the current count + category breakdown |
| .claude/rules/v5-invariants.md | Same as above (duplicated) | Same | Rewrite both files from one canonical source |
| Global ~/.claude/CLAUDE.md | References legacy n8n/DataTable/VPS architecture | V5 is CF-native | Strip the legacy block; the V5 project-level config already notes “IGNORE” but cleaner to remove the source |

This is table stakes — not cutting-edge. It’s doc-hygiene that ships in Wave 1 alongside real architecture work.

What’s not cutting-edge (real gaps)

  1. Gateway speaks MCP over SSE (the 2024-11-05 spec). The current spec is 2025-11-25; Streamable HTTP is the only supported remote transport going forward. SSE is formally deprecated.
  2. No OAuth 2.1 on the MCP surface. We use bearer tokens derived from ASCEND_TENANT_BEARER. The MCP spec mandates OAuth 2.1 + RFC 9728 Protected Resource Metadata + PKCE.
  3. Every LLM call goes direct to DeepSeek / Anthropic / Gemini / OpenRouter / Groq / Cerebras. No observability, no fallback chains, no cost caps, no semantic cache.
  4. DIY multi-step pipelines in Queues + DO alarms. Workers Workflows (GA 2025-04) is purpose-built for this. Gong/SFDC ingestion = textbook Workflows use case.
  5. Admin endpoints (/admin/*) behind a static API-key hash. Cloudflare Access + WebAuthn is the 2026 pattern.
  6. Workers Logs not configured. We persist everything to error_ledger D1 manually. CF now offers structured logs + Logpush to R2/S3/Datadog with 7+ day retention for free.
  7. KV-backed Durable Object for TokenManager. SQLite-backed DOs are GA and explicitly recommended for all new namespaces; the old KV-backed pattern is labeled “(Legacy)” in the CF sidebar.
  8. wrangler secret put for every secret. Cloudflare Secrets Store (open beta, 2026-04-16) gives per-secret rotation, version history, and account-scoped access. Not GA yet → land ADR now, migrate at GA.
  9. Agent workflow runs on a MacBook. Every agent surface reviewed above (scheduled tasks, worktrees, stdio MCPs, the Playwright tunnel) runs on the user’s MacBook. This is the biggest gap and the main driver of this plan.
  10. Stdio MCPs for Cloudflare + GitHub + n8n. Every one of those has a registered hosted remote-MCP (or equivalent) as of 2026. Stdio = legacy.
  11. Playwright tunnel on localhost:8931. 2026 pattern is Workers Browser Run (Quick Actions + Stagehand + Playwright MCP) + Chrome MCP for agent-driven browsing.

Part II — Target architecture (where we’re going)

Two-plane, fully-registered, OAuth-guarded, observable from anywhere

┌──────────────────────────────────────────────────────────────────────┐
│ AGENT WORKFLOW (zero laptop dependency)                              │
│                                                                      │
│   Any browser / iPhone / tablet / 10-yo laptop                       │
│         │                                                            │
│         ▼                                                            │
│   claude.ai/code ── Anthropic-managed VMs ──┐                        │
│   (Web + Routines + Dispatch + Channels)    │                        │
│                                             │ MCP (Streamable HTTP   │
│                                             │   + OAuth 2.1)         │
│                                             ▼                        │
│  ┌────────────────────────────────────────────────────────────────┐  │
│  │ Cloudflare MCP Server Portal: one URL per tenant               │  │
│  │ portal.ascendgtm.workers.dev/mcp/{tenant}                      │  │
│  │ (auth: CF Access SSO + WebAuthn)                               │  │
│  └───────────┬────────────────────────────────────────────────────┘  │
│              │                                                       │
└──────────────┼───────────────────────────────────────────────────────┘
        ┌──────┴──────┐
        ▼             ▼
┌──────────────┐ ┌─────────────────────────────────────────────────────┐
│ V5 Gateway   │ │ Hosted third-party MCPs (registered, OAuth'd)       │
│ (EXECUTION)  │ │ • Cloudflare (bindings.mcp.cloudflare.com)          │
│              │ │ • Linear, Notion, Atlassian, Stripe, Supabase       │
│ All 34 tools │ │ • GitHub, Figma, Monday, HuggingFace                │
│ Streamable   │ │ • Slack (Anthropic official)                        │
│ HTTP /mcp    │ │ • ... 216 registered commercial MCPs total          │
│ OAuth 2.1    │ └─────────────────────────────────────────────────────┘
└──┬───────────┘
   │ Service Binding (RPC, Streamable HTTP)
┌──────────────────────────────────────────────────────────────────────┐
│ V5 Context Worker (CONTEXT PLANE)                                    │
│ • D1: entities + facts + signal_evaluations                          │
│ • Vectorize: ctx_v5_facts (bge-small 384-dim cosine)                 │
│ • Workers Workflows: Gong/SFDC extraction pipelines                  │
│ • CF Queues: Gong + SFDC ingestion (producer only)                   │
│ • Tools: context_query, context_explain                              │
└──────────────────────────────────────────────────────────────────────┘
┌─── Observability & Security plane ───────────────────────────────────┐
│                                                                      │
│ Every outbound LLM call → Cloudflare AI Gateway                      │
│                           (observability, caching, fallback)         │
│ Every admin request     → Cloudflare Access + WebAuthn               │
│ Every Worker log        → Workers Logs → Logpush → R2 (30d)          │
│ Every tool invocation   → Analytics Engine (weekly rollup)           │
│ Every secret            → Secrets Store (GA) with rotation           │
└──────────────────────────────────────────────────────────────────────┘

Core invariants (new canonical set — supersede both .claude/CLAUDE.md and v5-invariants.md)

  1. Two-plane architecture. Execution (gateway) + Context (context-worker). A third plane is forbidden without an ADR. Service Bindings between these two planes only — no external service bindings.
  2. KV-only hot path on the gateway. Request latency budgeted at ≤10 ms overhead. D1 only in cold paths (error_ledger, kv_audit, decision_log). Context-worker D1 is cold path by definition — it’s the context worker’s own cold path.
  3. Fail-fast, no retries in the request path. Callers retry. Proxy does not.
  4. OAuth 2.1 + Streamable HTTP on every MCP surface — per spec 2025-11-25. No SSE transport.
  5. Composio owns OAuth end-to-end (revised by ADR-057, 2026-05-19). The V5 gateway no longer holds OAuth tokens for any SaaS provider covered by Composio. tokens:{tenant}:{provider}:{account_id} KV is retained only for providers Composio does not cover (AWS via aws4fetch signing keys, Anthropic API key, etc.). No DO-based token refresh. No external auth brokers beyond Composio.
  6. Request path never touches a DO. DO writes KV 10 min before expiry; request reads KV.
  7. Capability index, not static registration. (ADR-042, 2026-05-07) 3 always-on platform tools (call_api, discover_apis, batch_execute) registered statically. All other tools indexed in Vectorize capability_index and retrieved semantically (≤20 per LLM context). Catalog unbounded. Adding a tool: docs/tools/<slug>.md + TOOLS.md row + embed script re-run. No hard ceiling.
  8. Multi-account support is load-bearing. KV key: tokens:{tenant}:{provider}:{account_id}.
  9. CF Cron + CF Workflows for scheduled and multi-step work. No external cron services. No n8n for orchestration.
  10. Gateway overhead ≤10 ms. auth + token + route + AI Gateway callback included.
  11. 30 s AbortController timeout on every outbound fetch.
  12. Every LLM call goes through AI Gateway — observability + fallback chain + budget cap.
  13. Every admin endpoint gated by Cloudflare Access. Static API key never sufficient alone.
  14. Secrets live in Secrets Store when GA; wrangler secrets acceptable during open beta with documented migration trigger.
  15. Sources of truth: KV (config), D1 (audit), Vectorize (facts-embedded), R2 (backups), GitHub (code). No other source-of-truth systems without an ADR.
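
Two of the invariants above are mechanical enough to sketch: the multi-account KV key layout (invariant 8) and the 30 s AbortController timeout (invariant 11). The helper names tokenKey and outboundTimeout are hypothetical, and the rule that segments may not contain ":" is our assumption for keeping keys unambiguous, not something the plan specifies.

```typescript
// Sketch only: hypothetical helpers illustrating invariants 8 and 11.

/** Builds the KV key tokens:{tenant}:{provider}:{account_id}. */
export function tokenKey(tenant: string, provider: string, accountId: string): string {
  for (const part of [tenant, provider, accountId]) {
    // Assumption: empty segments and embedded colons are rejected so keys stay parseable.
    if (part.length === 0 || part.includes(":")) {
      throw new Error(`invalid key segment: "${part}"`);
    }
  }
  return `tokens:${tenant}:${provider}:${accountId}`;
}

/** Returns a signal that aborts after `ms` (default 30 s) and a cancel function. */
export function outboundTimeout(ms: number = 30_000): { signal: AbortSignal; cancel: () => void } {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), ms);
  return { signal: controller.signal, cancel: () => clearTimeout(timer) };
}
```

A caller would pass the signal to a single fetch (`fetch(url, { signal })`) and call cancel() in a finally block. Consistent with invariant 3, there is no retry loop: if the signal fires, the error propagates and the caller decides whether to retry.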

Part III — Adoption roadmap

Wave 1 — Agent workflow cloud-native + drift cleanup (Week 1)

Goal: any agent work can happen from any device. Ascend’s MacBook becomes a thin client.

| # | Task | Owner | Size | Cloud destination |
| --- | --- | --- | --- | --- |
| 1.1 | Move Non-Stop Protocol + global rules into repo .claude/ | agent | 45 min | Repo |
| 1.2 | Copy hooks + relevant skills + rules into repo | agent | 30 min | Repo |
| 1.3 | Rewrite invariants to canonical set above. Update .claude/CLAUDE.md + .claude/rules/v5-invariants.md + ~/CLAUDE.md (strip legacy n8n block) | agent | 45 min | Repo |
| 1.4 | Author .mcp.json at repo root declaring remote MCPs: gateway (to be OAuth’d in Wave 2), hindsight, Cloudflare hosted, GitHub hosted | agent | 30 min | Repo |
| 1.5 | Write cloud environment setup script (installs wrangler + gh + npm global deps); commit as scripts/setup-claude-cloud-env.sh | agent | 20 min | Repo |
| 1.6 | Write docs/cloud-env-seed.md: list of exact env-var pairs to paste once into claude.ai/code | agent | 15 min | Repo |
| 1.7 | User: install Claude GitHub App, paste env vars into cloud environment, enable Code on the Web | Mishaal | 15 min | claude.ai/code |
| 1.8 | Convert 4 local scheduled tasks → Anthropic Routines (token-health, error-pattern, docs-freshness, backup-verify) | agent | 1 hr | Anthropic |
| 1.9 | Archive 9 local worktrees (tag then delete) | agent | 10 min | |
| 1.10 | First cloud-native validation: close laptop, fire a claude --remote task, verify PR opens on GitHub without laptop | both | 30 min | |

Wave 1 exit criterion: Mishaal powers the MacBook off for 24 hours. Upon reopening, at least one Routine has run, at least one --remote task has completed, all work visible on GitHub + claude.ai/code.
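
As a rough illustration of task 1.4, the sketch below builds a .mcp.json that declares only remote (HTTP) MCP servers and flags any stdio entries. The gateway URL is a placeholder, the /mcp path on the Cloudflare host and the GitHub endpoint are assumptions to verify against each vendor's docs, and the hindsight server is omitted because its URL is not given in this plan. The `type: "http"` entry shape is our understanding of the Claude Code project config format.

```typescript
// Sketch only: generates a .mcp.json with remote servers; names and URLs are assumptions.

type RemoteMcpServer = { type: "http"; url: string };
type AnyMcpServer = { type?: string; url?: string; command?: string };

// Placeholder: the real gateway URL is not stated in this plan.
const GATEWAY_MCP_URL = "https://gateway.example.com/mcp";

export const mcpConfig: { mcpServers: Record<string, RemoteMcpServer> } = {
  mcpServers: {
    "ascend-gateway": { type: "http", url: GATEWAY_MCP_URL },
    cloudflare: { type: "http", url: "https://bindings.mcp.cloudflare.com/mcp" }, // path assumed
    github: { type: "http", url: "https://api.githubcopilot.com/mcp/" },          // verify before use
  },
};

/** Serializes the config as it would be committed at the repo root. */
export function renderMcpJson(): string {
  return JSON.stringify(mcpConfig, null, 2) + "\n";
}

/** Returns the names of entries that still launch a local process (stdio). */
export function findStdioEntries(config: { mcpServers: Record<string, AnyMcpServer> }): string[] {
  return Object.entries(config.mcpServers)
    .filter(([, server]) => server.type === "stdio" || server.command !== undefined)
    .map(([name]) => name);
}
```

A check like findStdioEntries is a slightly stricter version of the grep-based exit criterion used later in the plan, because it also catches entries that omit the type field but set a command.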


Wave 2 — MCP + Gateway cutting-edge (Week 2)

Goal: the V5 gateway speaks the 2026 MCP spec, authenticates with OAuth 2.1, is discoverable via the Anthropic registry, and lives behind a per-tenant MCP Server Portal.

| # | Task | Size | Primitive |
| --- | --- | --- | --- |
| 2.1 | Upgrade gateway /mcp from SSE → Streamable HTTP transport. Use McpAgent + OAuthProvider from agents SDK (already a dep: agents@0.9.0) | 1 day | Streamable HTTP MCP |
| 2.2 | Implement OAuth 2.1 authorization server (Streamable HTTP + PKCE + RFC 8707 Resource Indicators + Dynamic Client Registration) using @cloudflare/workers-oauth-provider | 2 days | CF OAuth Provider |
| 2.3 | D1 table for OAuth client registrations + KV for session state | 0.5 day | D1 + KV |
| 2.4 | Implement the MCP elicitation/create spec: URL-mode to off-ramp tenant 3rd-party OAuth onboarding (HubSpot/Salesforce/Google). Agent says “I need Gmail access” → elicitation URL → tenant consent → token stored via existing DO | 1 day | MCP Elicitation |
| 2.5 | Migrate in-stash stdio MCPs → hosted registered MCPs. .mcp.json points to bindings.mcp.cloudflare.com, GitHub official, etc. Remove stdio entries from all configs. | 2 hrs | Anthropic MCP Registry |
| 2.6 | Cloudflare MCP Server Portal: one URL per tenant (portal.ascendgtm.workers.dev/mcp/{tenant}). Gates on CF Access SSO + WebAuthn. Forwards to gateway /mcp with tenant context pre-derived. | 4 hrs | CF MCP Portals |
| 2.7 | Register V5 gateway as net.ascendgtm/gateway with visibility: private in Anthropic MCP Registry | 30 min | Anthropic MCP Registry |
| 2.8 | Add worksWith: [claude-code, claude-api, claude-desktop] metadata; ship server card at .well-known/mcp-server-card | 1 hr | MCP Server Cards (roadmap 2026-Q3) |
| 2.9 | Audit every MCP tool for OAuth scope correctness. Tool-level scopes → DCR → never request more than needed | 0.5 day | OAuth 2.1 scope discipline |
| 2.10 | Ship telemetry: every MCP call writes {tenant, tool, auth_method, duration_ms, success, error_code} to Analytics Engine | 2 hrs | Analytics Engine |
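
The telemetry record in task 2.10 maps fairly directly onto an Analytics Engine data point. The sketch below is one possible layout; the function name toDataPoint, the field order, and the choice of tenant as the index are assumptions rather than the shipped schema.

```typescript
// Sketch only: maps one MCP call to the {indexes, blobs, doubles} shape
// accepted by an Analytics Engine dataset's writeDataPoint().

interface McpCallEvent {
  tenant: string;
  tool: string;
  authMethod: "oauth" | "bearer";
  durationMs: number;
  success: boolean;
  errorCode?: string;
}

interface DataPoint {
  indexes: string[]; // Analytics Engine accepts a single index per data point
  blobs: string[];   // string dimensions
  doubles: number[]; // numeric measures
}

export function toDataPoint(e: McpCallEvent): DataPoint {
  return {
    indexes: [e.tenant],
    blobs: [e.tool, e.authMethod, e.errorCode ?? ""],
    doubles: [e.durationMs, e.success ? 1 : 0],
  };
}
```

In a Worker this would be written with something like `env.TOOL_METRICS.writeDataPoint(toDataPoint(event))`, where the binding name is an assumption. Encoding success as 0/1 in doubles lets an error rate be computed with a plain average in the SQL API.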

Wave 2 exit criterion: Claude Code web session connects to portal.ascendgtm.workers.dev/mcp/ascend via OAuth 2.1 (browser consent flow, no copy-pasted tokens), successfully invokes all 34 tools. Registry lookup api.anthropic.com/mcp-registry/.../net.ascendgtm/gateway returns the registered entry. Stdio MCPs removed from every config file — grep returns zero hits.
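
The browser consent flow in this criterion starts with discovery: an unauthenticated request gets a 401 that points the client at the gateway's RFC 9728 Protected Resource Metadata. A minimal sketch of that step follows. The function names, the issuer URL in the example, and the scope list are assumptions; in practice @cloudflare/workers-oauth-provider may already serve these endpoints.

```typescript
// Sketch only: RFC 9728 metadata document and the 401 challenge that advertises it.

interface ProtectedResourceMetadata {
  resource: string;
  authorization_servers: string[];
  bearer_methods_supported: string[];
  scopes_supported: string[];
}

/** Well-known URL: the suffix is inserted between the host and the resource path. */
export function metadataUrl(resource: string): string {
  const u = new URL(resource);
  const path = u.pathname === "/" ? "" : u.pathname;
  return `${u.origin}/.well-known/oauth-protected-resource${path}`;
}

export function protectedResourceMetadata(
  resource: string,
  issuer: string,
  scopes: string[],
): ProtectedResourceMetadata {
  return {
    resource,
    authorization_servers: [issuer],
    bearer_methods_supported: ["header"],
    scopes_supported: scopes,
  };
}

/** Status and headers for a request that arrives without a valid token. */
export function unauthorizedChallenge(resource: string): { status: 401; headers: Record<string, string> } {
  return {
    status: 401,
    headers: { "WWW-Authenticate": `Bearer resource_metadata="${metadataUrl(resource)}"` },
  };
}
```

For the per-tenant portal URL, metadataUrl("https://portal.ascendgtm.workers.dev/mcp/ascend") yields https://portal.ascendgtm.workers.dev/.well-known/oauth-protected-resource/mcp/ascend. The client fetches that document, finds the authorization server, and runs the PKCE flow from there, which is what removes copy-pasted tokens.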


Wave 3 — Edge primitives + observability (Weeks 3–4)

Goal: replace hand-rolled patterns with CF 2026 primitives. Observability, security, reliability all go up a class.

| # | Task | Size | CF primitive |
| --- | --- | --- | --- |
| 3.1 | Workers Logs enabled on gateway + context-worker. Structured console.log with trace IDs. Logpush job → R2 bucket ascend-logs with 30-day retention | 2 hrs | Workers Logs + Logpush |
| 3.2 | Cloudflare AI Gateway in front of every LLM call. Replace direct calls in llm_invoke, claude, perplexity, aws_bedrock_invoke. One gateway per provider (DeepSeek/Anthropic/Gemini/OpenRouter/Groq/Cerebras). Enable semantic cache + fallback + cost caps. | 1–2 days | AI Gateway |
| 3.3 | Cloudflare Access in front of /admin/*. WebAuthn hardware-key enrollment for Mishaal. Maintain static API key fallback for Routine access via service token. | 3 hrs | CF Access |
| 3.4 | Workers Workflows for Gong + SFDC ingestion (Phase 2 Tasks 2.5/2.6). Replaces Queue + DO-alarm-orchestrated multi-step with declarative workflow.step(). | 1 day (then 0.5 per additional pipeline) | Workers Workflows |
| 3.5 | Migrate TokenManager DO + any new DO class to SQLite-backed storage. SQLite DOs are recommended for all new namespaces; KV-backed is legacy. | 1 day | SQLite DO |
| 3.6 | Workers Browser Run integration: replace any scraping/screenshot need with the Quick Actions endpoints (/screenshot, /pdf, /json, /crawl). Deprecate local Playwright tunnel. | 0.5 day | Browser Run |
| 3.7 | Analytics Engine dashboards for 5 key metrics: gateway P95 latency, tool invocation rate, OAuth refresh rate, AI Gateway spend, error rate. Shareable URLs. | 0.5 day | Analytics Engine |
| 3.8 | ADR for Secrets Store adoption (open beta today). Document migration trigger: “within 30 days of GA announcement, run migration script.” Until then: wrangler secrets with doc’d rotation procedure. | 1 hr | Secrets Store |
| 3.9 | Runbook for /incident-response: Access → Workers Logs → AI Gateway dashboard → /admin/errors | 1 hr | Process |

Wave 3 exit criterion: AI Gateway shows traffic from every LLM tool. /admin/errors reachable only via CF Access login. Workers Logs query retrieves a full request trace across gateway + context-worker in <5 s. Gong ingestion Workflow has 3 successful runs in production without manual intervention.
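
The cross-worker trace in this criterion depends on both Workers logging the same trace ID in a consistent structure. A minimal sketch of that convention follows; the field names, the helper names, and the idea of reusing an incoming ID before generating a new one are assumptions, not the shipped logging schema.

```typescript
// Sketch only: a shared structured-log shape for gateway + context-worker.

type WorkerName = "gateway" | "context-worker";

interface LogRecord {
  ts: string;       // ISO-8601 timestamp
  trace_id: string; // same value on both sides of the Service Binding call
  worker: WorkerName;
  event: string;
  fields: Record<string, unknown>;
}

/** Reuses the caller's trace ID when present, otherwise generates a fresh one. */
export function resolveTraceId(incoming: string | null | undefined, generate: () => string): string {
  return incoming !== null && incoming !== undefined && incoming.length > 0 ? incoming : generate();
}

export function buildLogRecord(
  traceId: string,
  worker: WorkerName,
  event: string,
  fields: Record<string, unknown> = {},
  now: Date = new Date(),
): LogRecord {
  return { ts: now.toISOString(), trace_id: traceId, worker, event, fields };
}
```

Each Worker would emit the record with `console.log(buildLogRecord(...))` and forward the ID on the Service Binding call (for example in a header), so a single Workers Logs query on trace_id returns the full request path. In a Worker, `() => crypto.randomUUID()` is a reasonable generator.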


Wave 4 — Polish + hardening (Week 4, optional)

Not blocking. Fire these once Waves 1–3 are stable.

| # | Task | Why |
| --- | --- | --- |
| 4.1 | GitHub OIDC → Cloudflare for tokenless deploys from CI | Eliminates the CLOUDFLARE_API_TOKEN secret entirely |
| 4.2 | Environment separation (resolves tech-debt row #16): prod vs dev KV namespaces, wrangler --env consistently applied | Prevents a dev write from hitting prod KV |
| 4.3 | Property-based tests on auth layer (resolves tech-debt row #17) | Fuzz coverage on token validation |
| 4.4 | Retire Tailscale + decommissioned-VPS references from all docs | Doc hygiene |
| 4.5 | Telegram + Slack → Channels (Claude Code Channels feature) as alternative to API triggers | Additional surface for mobile command |
| 4.6 | Ultraplan + Ultrareview workflow adoption for multi-session features | Higher-quality planning layer |

Part IV — Trade-offs + risks

Trade-offs

| Decision | Gain | Cost |
| --- | --- | --- |
| Streamable HTTP + OAuth 2.1 on MCP | Spec-correct; clients discover + connect without copy-pasted tokens | 3–4 days of gateway work; breaks any existing caller relying on bearer token auth until they migrate |
| MCP Server Portal per tenant | Per-tenant isolation at the network layer; SSO gate | One more Worker to deploy + maintain |
| AI Gateway in front of every LLM | Observability, caching, cost caps, fallback | Small per-request hop through another Worker (<2 ms); new cost line-item (AI Gateway is free-tier friendly but adds a line) |
| SQLite DOs | Cheaper, faster, better transaction semantics than KV-backed | One-time migration: export → import per class |
| Workflows for ingestion | Durable multi-step with retries + replay | New primitive to learn; monitoring dashboard to build |
| Cloud-only agent workflow | Zero laptop dependency, persistent sessions, iOS app | Env var pasting is manual; no dedicated secrets store yet (visible to env editors) |

Risks + mitigations

| Risk | Likelihood | Impact | Mitigation |
| --- | --- | --- | --- |
| OAuth 2.1 rollout breaks existing n8n workflows calling gateway | Medium | n8n automations go dark until migrated | Phase: keep bearer-token auth active in parallel for 30 days; cut-over after all 83 n8n workflows updated |
| Routine daily cap hit | Low initially, medium at scale | Scheduled jobs skip | Enable Extra Usage billing; monitor cap via claude.ai/settings/usage |
| Cloud env secrets leaked to someone with env-edit access | Low | Credential exposure | Use scoped short-lived tokens; rotate quarterly; migrate to Secrets Store at GA |
| AI Gateway fallback misconfigured → wrong model answers | Medium | Quality drop | Test fallback chains with synthetic bad responses before enabling |
| Workflows + Queues dual-ownership for Phase 2 ingestion | Low | Confusion about which is source of truth | Queues become producer-only to Workflows; all business logic in Workflows |
| Claude Code web VM cap hit during overnight (currently ~unlimited for Pro) | Low | Overnight run fails | Monitor; fall back to Agent SDK self-hosted on a CF Worker for truly constrained cases |
| Stale invariants doc leaks old assumptions into new PRs | High if not fixed | Architectural drift compounds | Wave 1 includes invariant rewrite + .claude/rules/ canonicalization |

Part V — Success criteria

Must all be true at end of Wave 3:

  • MacBook off for 24 h → at least one Routine completed, at least one --remote task completed, PRs visible on GitHub.
  • Claude Code web session authenticates to V5 Gateway MCP via OAuth 2.1 (no pasted tokens).
  • net.ascendgtm/gateway returned from api.anthropic.com/mcp-registry/.../net.ascendgtm/gateway.
  • grep -rn "stdio" .claude/ .mcp.json → zero matches.
  • Every LLM call visible in AI Gateway dashboard.
  • /admin/errors returns 401 without CF Access auth.
  • Logpush writing to R2 ascend-logs bucket; last 30 days of gateway logs queryable.
  • Gong or SFDC ingestion Workflow has ≥3 successful runs.
  • .claude/CLAUDE.md + .claude/rules/v5-invariants.md reflect the canonical 15-invariant set.
  • 34 MCP tools still registered, all declared in TOOLS.md.
  • 567/567 gateway tests + 76/76 context-worker tests passing (+ new OAuth + Workflows tests).
  • /admin/health reports all bindings green including AI Gateway.

Part VI — Operating model after migration

How work happens day-to-day

  1. Morning: iOS app shows any Routine-generated PRs from overnight. Mishaal reviews on phone, merges or comments.
  2. Deep work block: Mishaal opens claude.ai/code from any browser. Fires claude --remote "do X". Closes browser. Goes to meeting.
  3. On-call: Sentry alert → Routine API trigger → Claude session investigates + drafts PR → Slack ping → on-call reviews via CF Access SSO.
  4. Weekly: Routine-driven weekly digest summarizes tech-debt, open PRs, LEDGER drift. No manual report.
  5. New integration request: Mishaal types “add Klaviyo” → cloud session reads spec, scaffolds provider, writes tests, opens PR. Elicitation URL triggers if OAuth consent needed.

How we measure health

  • AI Gateway dashboard — weekly review: fallback rate, cache hit rate, cost per tool per tenant.
  • Analytics Engine — weekly rollup: tool usage histogram (the ADR-023 decision was “keep all 25 tools”; this telemetry is what eventually retires anything genuinely zero-use).
  • Logpush + R2 — monthly random-sample audit: 10 requests traced end-to-end, confirm every step has logs.
  • Routine dashboard — daily: every scheduled Routine has a successful run in the last 24 h.
  • LEDGER.md — weekly: zero rows with “last touched >7 days” and no PR.

Operating invariants

  • Plan-first PR rule stays. Multi-session project = first commit is plan-doc + LEDGER row on main.
  • Non-Stop Execution Protocol stays. Now lives in repo .claude/CLAUDE.md; cloud sessions load it automatically.
  • Research-first mandate stays. Every new API / version / limit → live docs fetch before writing a value into code.
  • Parallelization rule stays. Independent tool calls batch in one message. No sequential ladders.

Part VII — What gets deleted

Migration generates noise. To keep the system elegant, delete these after Waves 1–3 ship:

  • ~/.claude/CLAUDE.md legacy n8n section (strip, keep identity + rules)
  • Global references to Tailscale, VPS, Mac bridge (all decommissioned)
  • ~/.claude/scheduled-tasks/ (replaced by Routines)
  • 9 local worktrees under .claude/worktrees/ (replaced by cloud sessions)
  • .claude/mcp.json + stdio claude mcp add history
  • mcp-server/ directory in the old monorepo (no longer used — CF Worker replaces)
  • decommission-plan-vps.md ceremony files if present

Each deletion gets a commit in Wave 4 cleanup.


Part VIII — Wave 4 — Hosted-OSS-first inference (cost discipline, zero hardware)

Goal: route default internal workloads through open-source models hosted on managed cloud (Cloudflare Workers AI, DeepSeek direct API, OpenRouter) instead of premium frontier APIs. Frontier tier stays for novel / customer-facing work. Zero new hardware. Added 2026-04-24 after the DeepSeek V4 launch (4/23) + Qwen3.6 / Workers AI catalog review showed ~90% cost reduction opportunity on bulk workloads.

Invariant update

The repo .claude/CLAUDE.md Non-Stop Protocol already encodes “No hardware dependencies — everything cloud.” Wave 4 operationalizes that for the inference layer specifically: every LLM and embedding call must land on either Cloudflare’s managed inference (Workers AI), a serverless OSS endpoint (DeepSeek API, OpenRouter), or a frontier provider when quality demands it — never on a laptop or self-hosted box.

Cost comparison (live-doc-verified 2026-04-24)

| Model | Hosted at | $/1M input | $/1M output | Tier |
| --- | --- | --- | --- | --- |
| Qwen3-30B-A3B-FP8 | Workers AI (@cf/qwen/qwen3-30b-a3b-fp8) | $0.051 | $0.34 | bulk (default) |
| Llama-3.3-70B-FP8-Fast | Workers AI | $0.29 | $0.56 | bulk (high-quality) |
| Qwen2.5-Coder-32B | Workers AI | $0.16 | $0.48 | bulk (code) |
| Kimi-K2.6 / GLM-4.7-Flash | Workers AI | varies | varies | bulk (Chinese OSS tier) |
| DeepSeek V4-Flash | DeepSeek direct API | $0.14 | $0.28 | standard (1M ctx, MIT) |
| DeepSeek V4-Pro | DeepSeek direct API | $1.74 | $3.48 | standard (heavy reasoning) |
| GPT-5.5 | OpenAI via AI Gateway | $5.00 | $30.00 | frontier |
| Opus 4.7 | Anthropic via AI Gateway | $15.00 | $75.00 | frontier |

Phase A (shipped in Wave 4 PR) — tier-aware llm_invoke

  • New workers_ai provider uses env.AI.run() binding (zero egress, <10 ms).
  • New deepseek-v4-flash + deepseek-v4-pro model IDs; V3 deepseek-chat/deepseek-reasoner auto-aliased to V4 (deprecate 2026-07-24 per DeepSeek docs).
  • tier param: bulk | standard | frontier. Default = bulk.
  • Routes through CF AI Gateway ascend-workers-ai when CF_AI_GATEWAY_WORKERS_AI_SLUG is set.
  • ADR-027 documents decision + cost projections.

Phase B (shipped in Wave 4 PR) — context-plane upgrade + reranking

  • Vectorize index migrated from bge-small-en-v1.5 (384 dim) → bge-m3 (1024 dim, multilingual).
  • @cf/baai/bge-reranker-base reranks Vectorize top-50 → top-10 before D1 hydration in context_query.
  • All Workers AI calls route through AI Gateway for unified observability + caching.
  • ADR-028 (bge-m3 migration) + ADR-029 (LoRA adapter roadmap for future tenant-specific tuning).
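
The rerank step in Phase B reduces to a small pure function once the reranker scores are in hand. The sketch below shows only that selection logic; the type names and the function selectTopK are hypothetical, and the calls to Vectorize and @cf/baai/bge-reranker-base that would produce the inputs are out of scope here.

```typescript
// Sketch only: keep the k best candidates by reranker score before D1 hydration.

interface Candidate {
  id: string;          // fact ID returned by the vector search
  vectorScore: number; // similarity score from the vector search
}

interface Reranked extends Candidate {
  rerankScore: number; // relevance score from the reranker
}

export function selectTopK(candidates: Candidate[], rerankScores: number[], k: number = 10): Reranked[] {
  if (candidates.length !== rerankScores.length) {
    throw new Error("expected exactly one rerank score per candidate");
  }
  return candidates
    .map((c, i) => ({ ...c, rerankScore: rerankScores[i] ?? Number.NEGATIVE_INFINITY }))
    .sort((a, b) => b.rerankScore - a.rerankScore) // highest relevance first
    .slice(0, k);
}
```

Hydrating only the survivors keeps the D1 read to at most k rows, so the extra reranker call is partly offset by a smaller cold-path query.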

Invariant #12 update

Invariant #12 “Every LLM call goes through AI Gateway” now includes Workers AI calls — not just external provider calls. The AI Gateway wraps both via the gateway: option on env.AI.run().
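
A helper along the lines of task A.3 could look like the sketch below. It only builds the third argument for env.AI.run(); the function name, the metadata value types, and the choice to return undefined when no slug is configured are assumptions, and the real helper lives in src/lib/ai-gateway.ts.

```typescript
// Sketch only: builds the {gateway: {id, metadata}} options object for env.AI.run().

type GatewayMetadata = Record<string, string | number | boolean>;

interface GatewayRunOptions {
  gateway: { id: string; metadata: GatewayMetadata };
}

/**
 * Returns run options when a gateway slug is configured, otherwise undefined
 * (the call would then go to Workers AI directly, without AI Gateway).
 */
export function aiGatewayOptions(
  slug: string | undefined,
  metadata: GatewayMetadata = {},
): GatewayRunOptions | undefined {
  if (slug === undefined || slug === "") return undefined;
  return { gateway: { id: slug, metadata } };
}
```

Usage would be roughly `env.AI.run("@cf/qwen/qwen3-30b-a3b-fp8", { messages }, aiGatewayOptions(env.CF_AI_GATEWAY_WORKERS_AI_SLUG, { tenant }))`. Since invariant 12 requires every call to pass through AI Gateway, a stricter variant might throw instead of returning undefined when the slug is missing.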

Cost projection

At Phase 2 Gong-extraction volumes (~500 transcripts/day once Kahuna is in production):

  • Before Wave 4 (DeepSeek V3): ~$6,300/month
  • After Wave 4 (Workers AI Qwen3-30B bulk tier): ~$450/month

Monthly savings: ~$5,850. Hardware-free. Scales with tenant count, not headcount.
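
The arithmetic behind these figures can be checked in a few lines. The two monthly totals are taken from the projection above; the per-token prices are the Qwen3-30B row of the cost comparison table. The function names are hypothetical, and no per-transcript token counts are assumed because the plan does not state them.

```typescript
// Sketch only: per-call cost from $/1M-token prices, plus the savings arithmetic.

interface Price {
  inPerM: number;  // USD per 1M input tokens
  outPerM: number; // USD per 1M output tokens
}

export function costUsd(inputTokens: number, outputTokens: number, price: Price): number {
  return (inputTokens / 1_000_000) * price.inPerM + (outputTokens / 1_000_000) * price.outPerM;
}

export const QWEN3_30B: Price = { inPerM: 0.051, outPerM: 0.34 };

const MONTHLY_BEFORE_USD = 6300; // DeepSeek V3, from the projection above
const MONTHLY_AFTER_USD = 450;   // Workers AI Qwen3-30B bulk tier, from the projection above

export const monthlySavingsUsd = MONTHLY_BEFORE_USD - MONTHLY_AFTER_USD;  // 5850
export const shareOfBaseline = MONTHLY_AFTER_USD / MONTHLY_BEFORE_USD;    // about 0.071
```

The projected spend is roughly 7% of the baseline, which sits comfortably inside the "<20% of pre-Wave-4 baseline" success criterion listed below.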

Success criteria (added to Part V)

  • llm_invoke default tier = bulk → workers_ai qwen3-30b
  • DeepSeek V4 replaces V3 as standard tier default
  • Vectorize index at 1024 dim; bge-m3 embeddings live
  • Reranker reduces “wrong-semantic-match” false-positive rate measurably (A/B on 100 Kahuna queries)
  • AI Gateway ascend-workers-ai dashboard shows traffic from both gateway + context-worker
  • TOOL_METRICS dataset aggregates show >80% of llm_invoke calls land on workers_ai tier
  • Monthly LLM spend drops to <20% of pre-Wave-4 baseline within 30 days of Phase 2 GA

Phase A shipped receipt (2026-04-24)

Status: Shipped 2026-04-24 via PR “Wave 4 Phase A — Workers AI + DeepSeek V4 + tier-aware routing (hosted OSS first)”
ADR: ADR-027, hosted-OSS-first routing

Tier routing table (config-driven, zod-schema’d)

| Tier | Provider | Model | $/1M (in/out) | Call path |
| --- | --- | --- | --- | --- |
| bulk (default) | workers_ai | qwen3-30b (@cf/qwen/qwen3-30b-a3b-fp8) | $0.051 / $0.34 | env.AI.run() binding, zero egress |
| standard | deepseek | deepseek-v4-flash | $0.14 / $0.28 (cache-miss) · $0.028 cached input | HTTPS |
| frontier | caller-set | caller-set | frontier-priced | HTTPS |
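
The routing table above can be read as a small lookup with one special case. The sketch below is an illustration of that reading, not the code in src/tools/llm-invoke.ts: the names resolveRoute and TIER_DEFAULTS are hypothetical, and the shipped version is described as zod-validated, which is omitted here to keep the example dependency-free.

```typescript
// Sketch only: tier -> (provider, model) resolution with a caller-driven frontier tier.

type Tier = "bulk" | "standard" | "frontier";

interface Route {
  provider: string;
  model: string;
}

const TIER_DEFAULTS: Record<Exclude<Tier, "frontier">, Route> = {
  bulk: { provider: "workers_ai", model: "@cf/qwen/qwen3-30b-a3b-fp8" },
  standard: { provider: "deepseek", model: "deepseek-v4-flash" },
};

export function resolveRoute(tier: Tier = "bulk", callerChoice?: Route): Route {
  if (tier === "frontier") {
    // Frontier has no default: the caller must name provider and model explicitly.
    if (callerChoice === undefined) {
      throw new Error("frontier tier requires an explicit provider and model");
    }
    return callerChoice;
  }
  return TIER_DEFAULTS[tier];
}
```

Leaving frontier without a default means a missing argument fails loudly instead of silently falling through to the most expensive model.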

Tasks shipped

| # | Task | Files touched |
| --- | --- | --- |
| A.1 | Add [ai] binding = "AI" to wrangler.toml (prod + staging) | wrangler.toml |
| A.2 | Extend Env type with AI binding, CF_AI_GATEWAY_WORKERS_AI_SLUG, explicit DEEPSEEK_API_KEY | src/lib/types.ts |
| A.3 | New helper in src/lib/ai-gateway.ts: builds the {gateway:{id,metadata}} third-arg for env.AI.run() | src/lib/ai-gateway.ts |
| A.4 | Extend src/tools/llm-invoke.ts: workers_ai provider, tier router, WORKERS_AI_MODELS catalog, DeepSeek V4 aliases, binding call path, response translator | src/tools/llm-invoke.ts |
| A.5 | Update TOOLS.md llm-invoke row to reflect tier routing | docs/requirements/TOOLS.md |
| A.6 | ADR-027 authored with cost math, live URLs, alternatives | docs/decisions/ADR-027-*.md |
| A.7 | 21 new tests (router, aliases, binding call, AI Gateway option, cost math, model catalog) | test/tools/llm-invoke-tier-routing.test.ts |

Phase A exit criteria (all met)

  • npm run typecheck — clean
  • npm test — 588→609 green, zero regressions
  • npm run check:pre-commit — 11/11
  • wrangler deploy --dry-run — bundle within soft target (1102.72 KiB, +10.72 KiB)
  • ADR-027 merged with live-docs citations for every pricing claim
  • TOOLS.md row rewritten

Research receipts (all verified 2026-04-24)

What’s NOT in Phase A (and why)

  • Cost-cap enforcement inside the gateway. Deferred to AI Gateway dashboard config (operator sets daily spend cap per gateway). Phase B of Wave 4 will pull the cap into KV for per-tenant differentiation.
  • tool-scopes.ts integration. The scope-map file doesn’t exist yet (slated for Wave 2 of Cloud-Native v2). llm_invoke stays mcp:read by convention; when scopes land, this tool gets one row, no behavior change.
  • Frontier-tier alias table. frontier is explicitly caller-driven to avoid accidental Opus spend. A future ADR may add explicit frontier-sonnet / frontier-opus aliases once usage is clearly understood.

Sign-off

This plan reflects 2026 edge-native engineering best practice for a multi-tenant GTM automation platform. Every recommendation cites live docs; every trade-off is named; every risk has a mitigation.

Wave 1 starts with your approval. Waves 2 + 3 proceed under the Non-Stop Protocol once Wave 1 is green. Wave 4 Phase A shipped 2026-04-24 ahead of Waves 2+3 because its blast radius is contained to llm_invoke and the cost win is immediate.

References