Ascend Cloud-Native Platform v2 — Engineering Plan
Author: Engineering Leadership (Claude + Mishaal)
Version: 1.0 (2026-04-24)
Status: Proposed, awaiting sign-off
Scope: The entire Ascend GTM stack (gateway, context plane, agent workflow, integrations, observability, security) brought to 2026 cutting-edge and made fully cloud-native.
Non-goal: Incremental patching. This plan is a structural upgrade.
Executive summary
Ascend’s V5 stack is architecturally sound: the hot path is edge-native, integrations are unified through a single gateway, and the Phase 2 Context Plane just landed. But we are carrying stale invariants, hand-rolled patterns that 2026 primitives replace, and a laptop-dependent agent workflow. This plan closes those gaps in three waves over ~4 weeks, ending with:
- Agent workflow 100% cloud-native. Any surface (phone, 10-year-old laptop, library PC) reaches full engineering capability via claude.ai/code + Routines. No local process. No SSH. No Tailscale. No laptop dependency.
- V5 Gateway registered as a first-class remote MCP server. Streamable HTTP + OAuth 2.1 (per MCP spec 2025-11-25). Discoverable via the Anthropic registry at net.ascendgtm/gateway. Per-tenant isolation via Cloudflare MCP Server Portals.
- Observability, security, and reliability upgraded to 2026 edge primitives. AI Gateway in front of every LLM call. Workers Logs + Logpush for audit. Access + WebAuthn on admin endpoints. Workflows replacing DIY cron-for-multi-step. SQLite-backed Durable Objects for every new class.
Research receipts: CF platform audit · MCP ecosystem audit · Claude Code cloud architecture · initial cloud-only audit.
Part I — Current state (where we are)
What’s already cutting-edge (keep)
| System | Status | Why it’s right |
|---|---|---|
| V5 Gateway on Workers + Hono 4 | ✅ | Single binary, edge-native, <10 ms overhead invariant |
| KV-only hot path | ✅ | Spec-correct — request latency bounded by KV reads |
| Durable Objects for OAuth | ✅ | Alarm-based proactive refresh; no vendor tax |
| D1 cold path for audit | ✅ | Correct hot/cold separation |
| Phase 2 Context Worker | ✅ | Two-plane architecture per ADR-016 |
| CF Cron for scheduled work | ✅ | Free-plan limit (5) respected; replaces n8n |
| R2 weekly KV backups | ✅ | Disaster recovery on the correct storage |
| GitHub Actions CI (typecheck + test + drift check) | ✅ | Two jobs: gateway-worker + context-worker-typecheck |
What’s drifted (repo docs vs reality)
Caught during this plan’s research pass:
| File | Stale claim | Reality | Fix |
|---|---|---|---|
| .claude/CLAUDE.md §Architecture Invariants | “ONE Worker only … No Service Bindings … No multi-Worker” | Phase 2 shipped Service Binding to ascend-context-worker | Rewrite invariant #1 to reflect ADR-016’s scoped exception |
| .claude/CLAUDE.md §Architecture Invariants | “18 MCP tools” | 28 tools post-Phase-2 | Rewrite invariant #7 with the current count + category breakdown |
| .claude/rules/v5-invariants.md | Same as above (duplicated) | Same | Rewrite both files from one canonical source |
| Global ~/.claude/CLAUDE.md | References legacy n8n/DataTable/VPS architecture | V5 is CF-native | Strip the legacy block; the V5 project-level config already notes “IGNORE” but cleaner to remove the source |
This is table stakes — not cutting-edge. It’s doc-hygiene that ships in Wave 1 alongside real architecture work.
What’s not cutting-edge (real gaps)
- Gateway speaks MCP over SSE (the 2024-11-05 spec). Current spec is 2025-11-25 → Streamable HTTP is the only supported transport going forward. SSE is formally deprecated.
- No OAuth 2.1 on the MCP surface. We use bearer tokens derived from ASCEND_TENANT_BEARER. The MCP spec mandates OAuth 2.1 + RFC 9728 Protected Resource Metadata + PKCE.
- Every LLM call goes direct to DeepSeek / Anthropic / Gemini / OpenRouter / Groq / Cerebras. No observability, no fallback chains, no cost caps, no semantic cache.
- DIY multi-step pipelines in Queues + DO alarms. Workers Workflows (GA 2025-04) is purpose-built for this. Gong/SFDC ingestion = textbook Workflows use case.
- Admin endpoints (/admin/*) behind a static API-key hash. Cloudflare Access + WebAuthn is the 2026 pattern.
- Workers Logs not configured. We persist everything to error_ledger D1 manually. CF now offers structured logs + Logpush to R2/S3/Datadog with 7+ day retention for free.
- KV-backed Durable Object for TokenManager. SQLite-backed DOs are GA and explicitly recommended for all new namespaces; the old KV-backed pattern is labeled “(Legacy)” in the CF sidebar.
- wrangler secret put for every secret. Cloudflare Secrets Store (open beta, 2026-04-16) gives per-secret rotation, version history, and account-scoped access. Not GA yet → land ADR now, migrate at GA.
- Agent workflow runs on a MacBook. Every workflow reviewed above currently runs on the user’s MacBook. This is the biggest gap and the main driver of this plan.
- Stdio MCPs for Cloudflare + GitHub + n8n. Every one of those has a registered hosted remote-MCP (or equivalent) as of 2026. Stdio = legacy.
- Playwright tunnel on localhost:8931. 2026 pattern is Workers Browser Run (Quick Actions + Stagehand + Playwright MCP) + Chrome MCP for agent-driven browsing.
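To make the RFC 9728 gap concrete, the following is a minimal sketch of the Protected Resource Metadata document an OAuth-guarded MCP endpoint would publish at /.well-known/oauth-protected-resource. The field names follow RFC 9728; the helper name, the example URLs, and the mcp:write scope are illustrative assumptions, not values taken from the Ascend codebase.

```typescript
// Sketch only. Field names come from RFC 9728; the helper name, URLs and
// scope strings below are hypothetical placeholders.
interface ProtectedResourceMetadata {
  resource: string;                   // canonical URL of the protected MCP endpoint
  authorization_servers: string[];    // issuer(s) a client should use to obtain tokens
  scopes_supported: string[];
  bearer_methods_supported: string[]; // "header" = Authorization: Bearer <token>
}

function buildProtectedResourceMetadata(
  resourceUrl: string,
  authorizationServer: string,
  scopes: string[],
): ProtectedResourceMetadata {
  return {
    resource: resourceUrl,
    authorization_servers: [authorizationServer],
    scopes_supported: scopes.slice(),
    bearer_methods_supported: ["header"],
  };
}

// Example values are assumptions for illustration.
const exampleMetadata = buildProtectedResourceMetadata(
  "https://gateway.example.com/mcp",
  "https://gateway.example.com",
  ["mcp:read", "mcp:write"],
);
console.log(JSON.stringify(exampleMetadata, null, 2));
```

A client that receives a 401 from the MCP endpoint can fetch this document to learn which authorization server to start the OAuth 2.1 + PKCE flow against, which is what removes the need for copy-pasted bearer tokens.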
Part II — Target architecture (where we’re going)
Two-plane, fully-registered, OAuth-guarded, observable from anywhere
┌───────────────────────────────────────────────────────────────────────┐
│  AGENT WORKFLOW (zero laptop dependency)                              │
│                                                                       │
│  Any browser / iPhone / tablet / 10-yo laptop                         │
│          │                                                            │
│          ▼                                                            │
│  claude.ai/code ── Anthropic-managed VMs ──┐                          │
│  (Web + Routines + Dispatch + Channels)    │                          │
│                                            │ MCP (Streamable HTTP     │
│                                            │  + OAuth 2.1)            │
│                                            ▼                          │
│  ┌───────────────────────────────────────────────────────────────┐    │
│  │ Cloudflare MCP Server Portal: one URL per tenant              │    │
│  │ portal.ascendgtm.workers.dev/mcp/{tenant}                     │    │
│  │ (auth: CF Access SSO + WebAuthn)                              │    │
│  └───────────┬───────────────────────────────────────────────────┘    │
│              │                                                        │
└──────────────┼────────────────────────────────────────────────────────┘
               │
        ┌──────┴──────┐
        ▼             ▼
┌──────────────┐ ┌──────────────────────────────────────────────────┐
│ V5 Gateway   │ │ Hosted third-party MCPs (registered, OAuth'd)    │
│ (EXECUTION)  │ │  • Cloudflare (bindings.mcp.cloudflare.com)      │
│              │ │  • Linear, Notion, Atlassian, Stripe, Supabase   │
│ All 34 tools │ │  • GitHub, Figma, Monday, HuggingFace            │
│ Streamable   │ │  • Slack (Anthropic official)                    │
│ HTTP /mcp    │ │  • ... 216 registered commercial MCPs total      │
│ OAuth 2.1    │ └──────────────────────────────────────────────────┘
└──┬───────────┘
   │ Service Binding (RPC, Streamable HTTP)
   ▼
┌──────────────────────────────────────────────────────────────────────┐
│  V5 Context Worker (CONTEXT PLANE)                                   │
│  • D1: entities + facts + signal_evaluations                         │
│  • Vectorize: ctx_v5_facts (bge-small 384-dim cosine)                │
│  • Workers Workflows: Gong/SFDC extraction pipelines                 │
│  • CF Queues: Gong + SFDC ingestion (producer only)                  │
│  • Tools: context_query, context_explain                             │
└──────────────────────────────────────────────────────────────────────┘
┌─── Observability & Security plane ───────────────────────────────────┐
│                                                                      │
│  Every outbound LLM call  → Cloudflare AI Gateway                    │
│                             (observability, caching, fallback)       │
│  Every admin request      → Cloudflare Access + WebAuthn             │
│  Every Worker log         → Workers Logs → Logpush → R2 (30d)        │
│  Every tool invocation    → Analytics Engine (weekly rollup)         │
│  Every secret             → Secrets Store (GA) with rotation         │
└──────────────────────────────────────────────────────────────────────┘
Core invariants (new canonical set; supersedes both .claude/CLAUDE.md and v5-invariants.md)
1. Two-plane architecture. Execution (gateway) + Context (context-worker). A third plane is forbidden without an ADR. Service Bindings between these two planes only; no external service bindings.
2. KV-only hot path on the gateway. Request latency budgeted at ≤10 ms overhead. D1 only in cold paths (error_ledger, kv_audit, decision_log). Context-worker D1 is cold path by definition; it’s the context worker’s own cold path.
3. Fail-fast, no retries in the request path. Callers retry. Proxy does not.
4. OAuth 2.1 + Streamable HTTP on every MCP surface, per spec 2025-11-25. No SSE transport.
5. Composio owns OAuth end-to-end (revised by ADR-057, 2026-05-19). The V5 gateway no longer holds OAuth tokens for any SaaS provider covered by Composio. tokens:{tenant}:{provider}:{account_id} KV is retained only for providers Composio does not cover (AWS via aws4fetch signing keys, Anthropic API key, etc.). No DO-based token refresh. No external auth brokers beyond Composio.
6. Request path never touches a DO. DO writes KV 10 min before expiry; request reads KV.
7. Capability index, not static registration. (ADR-042, 2026-05-07) 3 always-on platform tools (call_api, discover_apis, batch_execute) registered statically. All other tools indexed in Vectorize capability_index and retrieved semantically (≤20 per LLM context). Catalog unbounded. Adding a tool: docs/tools/<slug>.md + TOOLS.md row + embed script re-run. No hard ceiling.
8. Multi-account support is load-bearing. KV key: tokens:{tenant}:{provider}:{account_id}.
9. CF Cron + CF Workflows for scheduled and multi-step work. No external cron services. No n8n for orchestration.
10. Gateway overhead ≤10 ms. auth + token + route + AI Gateway callback included.
11. 30 s AbortController timeout on every outbound fetch.
12. Every LLM call goes through AI Gateway: observability + fallback chain + budget cap.
13. Every admin endpoint gated by Cloudflare Access. Static API key never sufficient alone.
14. Secrets live in Secrets Store when GA; wrangler secrets acceptable during open beta with documented migration trigger.
15. Sources of truth: KV (config), D1 (audit), Vectorize (facts-embedded), R2 (backups), GitHub (code). No other source-of-truth systems without an ADR.
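As an illustration of the tokens:{tenant}:{provider}:{account_id} key format that the multi-account invariant relies on, here is a small sketch of a key builder. The function name and the rule that segments may not contain a colon are assumptions made for the example; the plan itself only fixes the key shape.

```typescript
// Sketch only: the key shape is from the plan; the helper name and the
// "no colon inside a segment" validation are assumptions for illustration.
interface TokenKeyParts {
  tenant: string;
  provider: string;
  accountId: string;
}

function tokenKey(parts: TokenKeyParts): string {
  const segments = [parts.tenant, parts.provider, parts.accountId];
  for (const segment of segments) {
    // A colon inside a segment would make two different accounts collide
    // or make the key ambiguous to split, so reject it up front.
    if (segment.length === 0 || segment.indexOf(":") !== -1) {
      throw new Error("token key segments must be non-empty and must not contain ':'");
    }
  }
  return "tokens:" + segments.join(":");
}

// Two accounts on the same provider resolve to two distinct KV keys.
const primaryKey = tokenKey({ tenant: "ascend", provider: "hubspot", accountId: "acct_1" });
const secondaryKey = tokenKey({ tenant: "ascend", provider: "hubspot", accountId: "acct_2" });
console.log(primaryKey, secondaryKey);
```

Keeping key construction in one helper means the hot path performs a single KV read per request, and a second account for the same provider never overwrites the first.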
Part III — Adoption roadmap
Wave 1 — Agent workflow cloud-native + drift cleanup (Week 1)
Goal: any agent work can happen from any device. Ascend’s MacBook becomes a thin client.
| # | Task | Owner | Size | Cloud destination |
|---|---|---|---|---|
| 1.1 | Move Non-Stop Protocol + global rules into repo .claude/ | agent | 45 min | Repo |
| 1.2 | Copy hooks + relevant skills + rules into repo | agent | 30 min | Repo |
| 1.3 | Rewrite invariants to canonical set above. Update .claude/CLAUDE.md + .claude/rules/v5-invariants.md + ~/CLAUDE.md (strip legacy n8n block) | agent | 45 min | Repo |
| 1.4 | Author .mcp.json at repo root declaring remote MCPs — gateway (to be OAuth’d in Wave 2), hindsight, Cloudflare hosted, GitHub hosted | agent | 30 min | Repo |
| 1.5 | Write cloud environment setup script — installs wrangler + gh + npm global deps — commit as scripts/setup-claude-cloud-env.sh | agent | 20 min | Repo |
| 1.6 | Write docs/cloud-env-seed.md — list of exact env-var pairs to paste once into claude.ai/code | agent | 15 min | Repo |
| 1.7 | User: install Claude GitHub App, paste env vars into cloud environment, enable Code on the Web | Mishaal | 15 min | claude.ai/code |
| 1.8 | Convert 4 local scheduled tasks → Anthropic Routines (token-health, error-pattern, docs-freshness, backup-verify) | agent | 1 hr | Anthropic |
| 1.9 | Archive 9 local worktrees (tag then delete) | agent | 10 min | — |
| 1.10 | First cloud-native validation — close laptop, fire a claude --remote task, verify PR opens on GitHub without laptop | both | 30 min | — |
Wave 1 exit criterion: Mishaal powers the MacBook off for 24 hours. Upon reopening, at least one Routine has run, at least one --remote task has completed, all work visible on GitHub + claude.ai/code.
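Task 1.4 can be pictured with a small generator for the repo-root .mcp.json. This is a sketch under assumptions: the mcpServers / type: "http" / url shape is the format Claude Code is assumed to read, the Cloudflare path suffix and the GitHub endpoint are assumed URLs, and the hindsight server is left out because its endpoint is not stated in this plan.

```typescript
// Sketch only: builds the JSON that would be committed as .mcp.json.
// The file shape and every URL below are assumptions to verify against
// live docs before committing.
interface RemoteMcpServer {
  type: "http";
  url: string;
}

interface McpConfig {
  mcpServers: Record<string, RemoteMcpServer>;
}

function buildMcpConfig(servers: Record<string, string>): McpConfig {
  const mcpServers: Record<string, RemoteMcpServer> = {};
  for (const name of Object.keys(servers)) {
    const url = servers[name];
    // Remote MCPs only: anything that is not an https URL (for example a
    // stdio command line) is rejected.
    if (!/^https:\/\//.test(url)) {
      throw new Error(name + ": remote MCP servers must use an https URL");
    }
    mcpServers[name] = { type: "http", url: url };
  }
  return { mcpServers: mcpServers };
}

const mcpConfig = buildMcpConfig({
  "ascend-gateway": "https://portal.ascendgtm.workers.dev/mcp/ascend",
  "cloudflare": "https://bindings.mcp.cloudflare.com/mcp",
  "github": "https://api.githubcopilot.com/mcp/",
});
console.log(JSON.stringify(mcpConfig, null, 2));
```

Generating the file from one list keeps stdio entries out by construction, which lines up with the later grep-for-stdio exit check.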
Wave 2 — MCP + Gateway cutting-edge (Week 2)
Goal: the V5 gateway speaks the 2026 MCP spec, authenticates with OAuth 2.1, is discoverable via the Anthropic registry, and lives behind a per-tenant MCP Server Portal.
| # | Task | Size | Primitive |
|---|---|---|---|
| 2.1 | Upgrade gateway /mcp from SSE → Streamable HTTP transport. Use McpAgent + OAuthProvider from agents SDK (already a dep — agents@0.9.0) | 1 day | Streamable HTTP MCP |
| 2.2 | Implement OAuth 2.1 authorization server (Streamable HTTP + PKCE + RFC 8707 Resource Indicators + Dynamic Client Registration) using @cloudflare/workers-oauth-provider | 2 days | CF OAuth Provider |
| 2.3 | D1 table for OAuth client registrations + KV for session state | 0.5 day | D1 + KV |
| 2.4 | Implement the MCP elicitation/create spec — URL-mode to off-ramp tenant 3rd-party OAuth onboarding (HubSpot/Salesforce/Google). Agent says “I need Gmail access” → elicitation URL → tenant consent → token stored via existing DO | 1 day | MCP Elicitation |
| 2.5 | Migrate in-stash stdio MCPs → hosted registered MCPs. .mcp.json points to bindings.mcp.cloudflare.com, GitHub official, etc. Remove stdio entries from all configs. | 2 hrs | Anthropic MCP Registry |
| 2.6 | Cloudflare MCP Server Portal — one URL per tenant (portal.ascendgtm.workers.dev/mcp/{tenant}). Gates on CF Access SSO + WebAuthn. Forwards to gateway /mcp with tenant context pre-derived. | 4 hrs | CF MCP Portals |
| 2.7 | Register V5 gateway as net.ascendgtm/gateway with visibility: private in Anthropic MCP Registry | 30 min | Anthropic MCP Registry |
| 2.8 | Add worksWith: [claude-code, claude-api, claude-desktop] metadata; ship server card at .well-known/mcp-server-card | 1 hr | MCP Server Cards (roadmap 2026-Q3) |
| 2.9 | Audit every MCP tool for OAuth scope correctness. Tool-level scopes → DCR → never request more than needed | 0.5 day | OAuth 2.1 scope discipline |
| 2.10 | Ship telemetry: every MCP call writes {tenant, tool, auth_method, duration_ms, success, error_code} to Analytics Engine | 2 hrs | Analytics Engine |
Wave 2 exit criterion: Claude Code web session connects to portal.ascendgtm.workers.dev/mcp/ascend via OAuth 2.1 (browser consent flow, no copy-pasted tokens), successfully invokes all 34 tools. Registry lookup api.anthropic.com/mcp-registry/.../net.ascendgtm/gateway returns the registered entry. Stdio MCPs removed from every config file — grep returns zero hits.
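The telemetry record in task 2.10 maps naturally onto an Analytics Engine data point. The sketch below shows one possible mapping; the interface names, the column order, and the choice of tenant as the index are assumptions, and the actual write would go through the dataset binding's writeDataPoint method inside the Worker.

```typescript
// Sketch only: field names come from task 2.10; the blob/double ordering
// and the use of tenant as the index are assumptions for illustration.
interface McpCallEvent {
  tenant: string;
  tool: string;
  auth_method: "oauth" | "bearer";
  duration_ms: number;
  success: boolean;
  error_code: string | null;
}

interface AnalyticsDataPoint {
  indexes: string[]; // Analytics Engine takes a single index value
  blobs: string[];   // string dimensions
  doubles: number[]; // numeric measures
}

function toDataPoint(event: McpCallEvent): AnalyticsDataPoint {
  return {
    indexes: [event.tenant],
    blobs: [event.tool, event.auth_method, event.error_code === null ? "" : event.error_code],
    doubles: [event.duration_ms, event.success ? 1 : 0],
  };
}

const sampleEvent: McpCallEvent = {
  tenant: "ascend",
  tool: "context_query",
  auth_method: "oauth",
  duration_ms: 7,
  success: true,
  error_code: null,
};
const samplePoint = toDataPoint(sampleEvent);
console.log(samplePoint);
// Inside a Worker (binding name is hypothetical):
//   env.TOOL_METRICS.writeDataPoint(samplePoint);
```

Fixing the column order in one function matters because Analytics Engine columns are positional (blob1, blob2, double1, ...), so dashboards break silently if writers disagree on the order.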
Wave 3 — Edge primitives + observability (Weeks 3–4)
Goal: replace hand-rolled patterns with CF 2026 primitives. Observability, security, reliability all go up a class.
| # | Task | Size | CF primitive |
|---|---|---|---|
| 3.1 | Workers Logs enabled on gateway + context-worker. Structured console.log with trace IDs. Logpush job → R2 bucket ascend-logs with 30-day retention | 2 hrs | Workers Logs + Logpush |
| 3.2 | Cloudflare AI Gateway in front of every LLM call. Replace direct calls in llm_invoke, claude, perplexity, aws_bedrock_invoke. One gateway per provider (DeepSeek/Anthropic/Gemini/OpenRouter/Groq/Cerebras). Enable semantic cache + fallback + cost caps. | 1–2 days | AI Gateway |
| 3.3 | Cloudflare Access in front of /admin/*. WebAuthn hardware-key enrollment for Mishaal. Maintain static API key fallback for Routine access via service token. | 3 hrs | CF Access |
| 3.4 | Workers Workflows for Gong + SFDC ingestion (Phase 2 Tasks 2.5/2.6). Replaces Queue + DO-alarm-orchestrated multi-step with durable step.do() calls. | 1 day (then 0.5 day per additional pipeline) | Workers Workflows |
| 3.5 | Migrate TokenManager DO + any new DO class to SQLite-backed storage. SQLite DOs are recommended for all new namespaces; KV-backed is legacy. | 1 day | SQLite DO |
| 3.6 | Workers Browser Run integration: replace any scraping/screenshot need with the Quick Actions endpoints (/screenshot, /pdf, /json, /crawl). Deprecate local Playwright tunnel. | 0.5 day | Browser Run |
| 3.7 | Analytics Engine dashboards for 5 key metrics: gateway P95 latency, tool invocation rate, OAuth refresh rate, AI Gateway spend, error rate. Shareable URLs. | 0.5 day | Analytics Engine |
| 3.8 | ADR for Secrets Store adoption — open beta today. Document migration trigger: “within 30 days of GA announcement, run migration script.” Until then: wrangler secrets with doc’d rotation procedure. | 1 hr | Secrets Store |
| 3.9 | Runbook for /incident-response — Access → Workers Logs → AI Gateway dashboard → /admin/errors | 1 hr | Process |
Wave 3 exit criterion: AI Gateway shows traffic from every LLM tool. /admin/errors reachable only via CF Access login. Workers Logs query retrieves a full request trace across gateway + context-worker in <5 s. Gong ingestion Workflow has 3 successful runs in production without manual intervention.
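The cross-worker trace query in the exit criterion depends on both Workers logging the same trace ID in a consistent shape. A minimal sketch of such a record builder follows; the field names (trace_id, worker) and the helper name are assumptions, not an existing Ascend convention.

```typescript
// Sketch only: one log shape shared by gateway and context-worker so a
// single trace_id filter returns the whole request. Names are assumptions.
type LogLevel = "info" | "warn" | "error";
type LogValue = string | number | boolean | null;

interface LogRecord {
  level: LogLevel;
  worker: string;
  trace_id: string;
  message: string;
  [field: string]: LogValue;
}

function logRecord(
  level: LogLevel,
  worker: string,
  traceId: string,
  message: string,
  fields: Record<string, LogValue> = {},
): LogRecord {
  // Extra fields are spread first so they cannot overwrite the core keys.
  return { ...fields, level: level, worker: worker, trace_id: traceId, message: message };
}

const traceId = "trace-123"; // in practice generated once per inbound request
const gatewayLog = logRecord("info", "gateway", traceId, "tool invoked", {
  tool: "context_query",
  duration_ms: 7,
});
const contextLog = logRecord("info", "context-worker", traceId, "query served", { hits: 10 });
console.log(gatewayLog);
console.log(contextLog);
```

Logging an object rather than a formatted string keeps each field individually queryable once the logs are indexed and pushed to R2.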
Wave 4 — Polish + hardening (Week 4, optional)
Not blocking. Fire these once Waves 1–3 are stable.
| # | Task | Why |
|---|---|---|
| 4.1 | GitHub OIDC → Cloudflare for tokenless deploys from CI | Eliminates the CLOUDFLARE_API_TOKEN secret entirely |
| 4.2 | Environment separation (resolves tech-debt row #16) — prod vs dev KV namespaces, wrangler --env consistently applied | Prevents a dev write from hitting prod KV |
| 4.3 | Property-based tests on auth layer (resolves tech-debt row #17) | Fuzz coverage on token validation |
| 4.4 | Retire Tailscale + decommissioned-VPS references from all docs | Doc hygiene |
| 4.5 | Telegram + Slack → Channels (Claude Code Channels feature) as alternative to API triggers | Additional surface for mobile command |
| 4.6 | Ultraplan + Ultrareview workflow adoption for multi-session features | Higher-quality planning layer |
Part IV — Trade-offs + risks
Trade-offs
| Decision | Gain | Cost |
|---|---|---|
| Streamable HTTP + OAuth 2.1 on MCP | Spec-correct; clients discover + connect without copy-pasted tokens | 3–4 days of gateway work; breaks any existing caller relying on bearer token auth until they migrate |
| MCP Server Portal per tenant | Per-tenant isolation at the network layer; SSO gate | One more Worker to deploy + maintain |
| AI Gateway in front of every LLM | Observability, caching, cost caps, fallback | Small per-request hop through another Worker (<2 ms); new cost line-item (AI Gateway is free-tier friendly but adds a line) |
| SQLite DOs | Cheaper, faster, better transaction semantics than KV-backed | One-time migration: export → import per class |
| Workflows for ingestion | Durable multi-step with retries + replay | New primitive to learn; monitoring dashboard to build |
| Cloud-only agent workflow | Zero laptop dependency, persistent sessions, iOS app | Env var pasting is manual; no dedicated secrets store yet (visible to env editors) |
Risks + mitigations
| Risk | Likelihood | Impact | Mitigation |
|---|---|---|---|
| OAuth 2.1 rollout breaks existing n8n workflows calling gateway | Medium | n8n automations go dark until migrated | Phase: keep bearer-token auth active in parallel for 30 days; cut-over after all 83 n8n workflows updated |
| Routine daily cap hit | Low initially, medium at scale | Scheduled jobs skip | Enable Extra Usage billing; monitor cap via claude.ai/settings/usage |
| Cloud env secrets leaked to someone with env-edit access | Low | Credential exposure | Use scoped short-lived tokens; rotate quarterly; migrate to Secrets Store at GA |
| AI Gateway fallback misconfigured → wrong model answers | Medium | Quality drop | Test fallback chains with synthetic bad responses before enabling |
| Workflows + Queues dual-ownership for Phase 2 ingestion | Low | Confusion about which is source of truth | Queues become producer-only to Workflows; all business logic in Workflows |
| Claude Code web VM cap hit during overnight (currently ~unlimited for Pro) | Low | Overnight run fails | Monitor; fall back to Agent SDK self-hosted on a CF Worker for truly constrained cases |
| Stale invariants doc leaks old assumptions into new PRs | High if not fixed | Architectural drift compounds | Wave 1 includes invariant rewrite + .claude/rules/ canonicalization |
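The first mitigation above (bearer-token auth kept alive in parallel for 30 days) can be expressed as a small time-boxed check. This is a sketch under assumptions: the function names and the example dates are hypothetical, and it models only the 30-day window, not the separate condition that all n8n workflows have been migrated.

```typescript
// Sketch only: time-boxed acceptance of legacy bearer auth during the
// OAuth 2.1 rollout. Names and dates are hypothetical.
type AuthMethod = "oauth" | "bearer";

const BEARER_GRACE_DAYS = 30;
const MS_PER_DAY = 24 * 60 * 60 * 1000;

function isAuthMethodAllowed(method: AuthMethod, rolloutStart: Date, now: Date): boolean {
  if (method === "oauth") return true;
  // Bearer tokens stay valid until the grace window closes.
  const cutoffMs = rolloutStart.getTime() + BEARER_GRACE_DAYS * MS_PER_DAY;
  return now.getTime() < cutoffMs;
}

const rolloutStart = new Date("2026-05-01T00:00:00Z"); // hypothetical rollout date
const midWindow = new Date("2026-05-15T00:00:00Z");
const afterWindow = new Date("2026-06-01T00:00:00Z");

console.log(isAuthMethodAllowed("bearer", rolloutStart, midWindow));   // inside the window
console.log(isAuthMethodAllowed("bearer", rolloutStart, afterWindow)); // window closed
console.log(isAuthMethodAllowed("oauth", rolloutStart, afterWindow));
```

Passing the clock in as an argument keeps the cut-over testable before the real date arrives.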
Part V — Success criteria
Must all be true at end of Wave 3:
- MacBook off for 24 h → at least one Routine completed, at least one --remote task completed, PRs visible on GitHub.
- Claude Code web session authenticates to V5 Gateway MCP via OAuth 2.1 (no pasted tokens).
- net.ascendgtm/gateway returned from api.anthropic.com/mcp-registry/.../net.ascendgtm/gateway.
- grep -rn "stdio" .claude/ .mcp.json → zero matches.
- Every LLM call visible in AI Gateway dashboard.
- /admin/errors returns 401 without CF Access auth.
- Logpush writing to R2 ascend-logs bucket; last 30 days of gateway logs queryable.
- Gong or SFDC ingestion Workflow has ≥3 successful runs.
- .claude/CLAUDE.md + .claude/rules/v5-invariants.md reflect the canonical 15-invariant set.
- 34 MCP tools still registered, all declared in TOOLS.md.
- 567/567 gateway tests + 76/76 context-worker tests passing (+ new OAuth + Workflows tests).
- /admin/health reports all bindings green including AI Gateway.
Part VI — Operating model after migration
How work happens day-to-day
- Morning: iOS app shows any Routine-generated PRs from overnight. Mishaal reviews on phone, merges or comments.
- Deep work block: Mishaal opens claude.ai/code from any browser. Fires claude --remote "do X". Closes browser. Goes to meeting.
- On-call: Sentry alert → Routine API trigger → Claude session investigates + drafts PR → Slack ping → on-call reviews via CF Access SSO.
- Weekly: Routine-driven weekly digest summarizes tech-debt, open PRs, LEDGER drift. No manual report.
- New integration request: Mishaal types “add Klaviyo” → cloud session reads spec, scaffolds provider, writes tests, opens PR. Elicitation URL triggers if OAuth consent needed.
How we measure health
- AI Gateway dashboard — weekly review: fallback rate, cache hit rate, cost per tool per tenant.
- Analytics Engine — weekly rollup: tool usage histogram (the ADR-023 decision was “keep all 25 tools”; this telemetry is what eventually retires anything genuinely zero-use).
- Logpush + R2 — monthly random-sample audit: 10 requests traced end-to-end, confirm every step has logs.
- Routine dashboard — daily: every scheduled Routine has a successful run in the last 24 h.
- LEDGER.md — weekly: zero rows with “last touched >7 days” and no PR.
Operating invariants
- Plan-first PR rule stays. Multi-session project = first commit is plan-doc + LEDGER row on main.
- Non-Stop Execution Protocol stays. Now lives in repo .claude/CLAUDE.md; cloud sessions load it automatically.
- Research-first mandate stays. Every new API / version / limit → live docs fetch before writing a value into code.
- Parallelization rule stays. Independent tool calls batch in one message. No sequential ladders.
Part VII — What gets deleted
Migration generates noise. To keep the system elegant, delete these after Waves 1–3 ship:
- ~/.claude/CLAUDE.md legacy n8n section (strip, keep identity + rules)
- Global references to Tailscale, VPS, Mac bridge (all decommissioned)
- ~/.claude/scheduled-tasks/ (replaced by Routines)
- 9 local worktrees under .claude/worktrees/ (replaced by cloud sessions)
- .claude/mcp.json + stdio claude mcp add history
- mcp-server/ directory in the old monorepo (no longer used; CF Worker replaces)
- decommission-plan-vps.md ceremony files if present
Each deletion gets a commit in Wave 4 cleanup.
Part VIII — Wave 4 — Hosted-OSS-first inference (cost discipline, zero hardware)
Goal: route default internal workloads through open-source models hosted on managed cloud (Cloudflare Workers AI, DeepSeek direct API, OpenRouter) instead of premium frontier APIs. Frontier tier stays for novel / customer-facing work. Zero new hardware. Added 2026-04-24 after the DeepSeek V4 launch (4/23) + Qwen3.6 / Workers AI catalog review showed ~90% cost reduction opportunity on bulk workloads.
Invariant update
The repo .claude/CLAUDE.md Non-Stop Protocol already encodes “No hardware dependencies — everything cloud.” Wave 4 operationalizes that for the inference layer specifically: every LLM and embedding call must land on either Cloudflare’s managed inference (Workers AI), a serverless OSS endpoint (DeepSeek API, OpenRouter), or a frontier provider when quality demands it — never on a laptop or self-hosted box.
Cost comparison (live-doc-verified 2026-04-24)
| Model | Hosted at | $/1M input | $/1M output | Tier |
|---|---|---|---|---|
| Qwen3-30B-A3B-FP8 | Workers AI (@cf/qwen/qwen3-30b-a3b-fp8) | $0.051 | $0.34 | bulk (default) |
| Llama-3.3-70B-FP8-Fast | Workers AI | $0.29 | $0.56 | bulk (high-quality) |
| Qwen2.5-Coder-32B | Workers AI | $0.16 | $0.48 | bulk (code) |
| Kimi-K2.6 / GLM-4.7-Flash | Workers AI | varies | varies | bulk (Chinese OSS tier) |
| DeepSeek V4-Flash | DeepSeek direct API | $0.14 | $0.28 | standard (1M ctx, MIT) |
| DeepSeek V4-Pro | DeepSeek direct API | $1.74 | $3.48 | standard (heavy reasoning) |
| GPT-5.5 | OpenAI via AI Gateway | $5.00 | $30.00 | frontier |
| Opus 4.7 | Anthropic via AI Gateway | $15.00 | $75.00 | frontier |
Phase A (shipped in Wave 4 PR) — tier-aware llm_invoke
- New workers_ai provider uses env.AI.run() binding (zero egress, <10 ms).
- New deepseek-v4-flash + deepseek-v4-pro model IDs; V3 deepseek-chat / deepseek-reasoner auto-aliased to V4 (deprecate 2026-07-24 per DeepSeek docs).
- tier param: bulk | standard | frontier. Default = bulk.
- Routes through CF AI Gateway ascend-workers-ai when CF_AI_GATEWAY_WORKERS_AI_SLUG is set.
- ADR-027 documents decision + cost projections.
Phase B (shipped in Wave 4 PR) — context-plane upgrade + reranking
- Vectorize index migrated from bge-small-en-v1.5 (384 dim) → bge-m3 (1024 dim, multilingual).
- @cf/baai/bge-reranker-base reranks Vectorize top-50 → top-10 before D1 hydration in context_query.
- All Workers AI calls route through AI Gateway for unified observability + caching.
- ADR-028 (bge-m3 migration) + ADR-029 (LoRA adapter roadmap for future tenant-specific tuning).
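The recall-then-rerank step in Phase B can be sketched as follows. To keep the example self-contained, the reranker is injected as a plain synchronous function standing in for the Workers AI reranker call; all type and function names here are assumptions for illustration.

```typescript
// Sketch only: vector recall (top-N by similarity) followed by reranking
// (top-K by reranker score). The synchronous Reranker type is a stand-in
// for the real, asynchronous model call.
interface Candidate {
  id: string;
  vectorScore: number; // similarity score returned by the vector index
}

type Reranker = (query: string, ids: string[]) => number[];

function retrieveAndRerank(
  query: string,
  candidates: Candidate[],
  rerank: Reranker,
  recall: number = 50,
  keep: number = 10,
): string[] {
  // Stage 1: keep the `recall` best candidates by vector similarity.
  const recalled = candidates
    .slice()
    .sort((a, b) => b.vectorScore - a.vectorScore)
    .slice(0, recall);
  // Stage 2: rescore only the recalled set and keep the `keep` best.
  const scores = rerank(query, recalled.map((c) => c.id));
  return recalled
    .map((c, i) => ({ id: c.id, score: scores[i] }))
    .sort((a, b) => b.score - a.score)
    .slice(0, keep)
    .map((r) => r.id);
}

// Toy data: "c" has the best reranker score but is never recalled.
const toyScores: Record<string, number> = { a: 0.1, b: 0.5, c: 0.99 };
const toyReranker: Reranker = (_query, ids) => ids.map((id) => toyScores[id]);
const toyCandidates: Candidate[] = [
  { id: "a", vectorScore: 0.9 },
  { id: "b", vectorScore: 0.8 },
  { id: "c", vectorScore: 0.7 },
];
const reranked = retrieveAndRerank("renewal risk", toyCandidates, toyReranker, 2, 1);
console.log(reranked);
```

Only the surviving IDs would then be hydrated from D1, which is why the rerank happens before hydration: the expensive lookup runs on 10 rows instead of 50.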
Invariant #12 update
Invariant #12 “Every LLM call goes through AI Gateway” now includes Workers AI calls — not just external provider calls. The AI Gateway wraps both via the gateway: option on env.AI.run().
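A sketch of the third-argument helper this implies is shown below. The {gateway: {id, metadata}} shape is the one named in this plan; the helper name, the metadata value types, and the decision to return undefined when no slug is configured are assumptions.

```typescript
// Sketch only: builds the optional third argument for env.AI.run().
// When no gateway slug is configured the call goes direct, so the helper
// returns undefined. Names are hypothetical.
type GatewayMetadata = Record<string, string | number | boolean | null>;

interface AiRunOptions {
  gateway: { id: string; metadata?: GatewayMetadata };
}

function buildGatewayOptions(
  slug: string | undefined,
  metadata?: GatewayMetadata,
): AiRunOptions | undefined {
  if (!slug) return undefined;
  return metadata ? { gateway: { id: slug, metadata: metadata } } : { gateway: { id: slug } };
}

const withGateway = buildGatewayOptions("ascend-workers-ai", { tenant: "ascend", tool: "llm_invoke" });
const withoutGateway = buildGatewayOptions(undefined);
console.log(withGateway, withoutGateway);
// Inside a Worker (illustrative):
//   await env.AI.run(model, inputs, buildGatewayOptions(env.CF_AI_GATEWAY_WORKERS_AI_SLUG, { tenant }));
```

Tagging each call with tenant and tool metadata is what would make the per-tool, per-tenant cost review in the health metrics possible.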
Cost projection
At Phase 2 Gong-extraction volumes (~500 transcripts/day once Kahuna is in production):
- Before Wave 4 (DeepSeek V3): ~$6,300/month
- After Wave 4 (Workers AI Qwen3-30B bulk tier): ~$450/month
Monthly savings: ~$5,850. Hardware-free. Scales with tenant count, not headcount.
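The arithmetic behind these figures, and its relation to the "<20% of baseline" spend target, can be checked with a few lines. The monthly totals are taken as given from this section (they are not re-derived from token counts here); the per-token helper uses the Qwen3-30B prices from the cost table, and all names are assumptions.

```typescript
// Sketch only: sanity-checks the projection figures quoted in this section.
interface Price {
  inputPerM: number;  // USD per 1M input tokens
  outputPerM: number; // USD per 1M output tokens
}

function costUsd(inputTokens: number, outputTokens: number, price: Price): number {
  return (inputTokens / 1e6) * price.inputPerM + (outputTokens / 1e6) * price.outputPerM;
}

const QWEN3_30B: Price = { inputPerM: 0.051, outputPerM: 0.34 };

const beforeUsdPerMonth = 6300; // DeepSeek V3 baseline, from this section
const afterUsdPerMonth = 450;   // Workers AI bulk tier, from this section

const monthlySavings = beforeUsdPerMonth - afterUsdPerMonth;   // 5850
const shareOfBaseline = afterUsdPerMonth / beforeUsdPerMonth;  // about 0.071
const meetsSpendTarget = shareOfBaseline < 0.2;

console.log(monthlySavings, shareOfBaseline.toFixed(3), meetsSpendTarget);
console.log(costUsd(1e6, 1e6, QWEN3_30B).toFixed(3)); // cost of 1M in + 1M out on the bulk tier
```

At roughly 7% of baseline, the projection leaves headroom under the 20% target, so some traffic can stay on the standard tier without missing it.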
Success criteria (added to Part V)
- llm_invoke default tier = bulk → workers_ai qwen3-30b
- DeepSeek V4 replaces V3 as standard tier default
- Vectorize index at 1024 dim; bge-m3 embeddings live
- Reranker reduces “wrong-semantic-match” false-positive rate measurably (A/B on 100 Kahuna queries)
- AI Gateway ascend-workers-ai dashboard shows traffic from both gateway + context-worker
- TOOL_METRICS dataset aggregates show >80% of llm_invoke calls land on workers_ai tier
- Monthly LLM spend drops to <20% of pre-Wave-4 baseline within 30 days of Phase 2 GA
Phase A shipped receipt (2026-04-24)
Status: Shipped 2026-04-24 via PR “Wave 4 Phase A — Workers AI + DeepSeek V4 + tier-aware routing (hosted OSS first)”
ADR: ADR-027 — hosted-OSS-first routing
Tier routing table (config-driven, zod-schema’d)
| Tier | Provider | Model | $/1M (in/out) | Call path |
|---|---|---|---|---|
| bulk (default) | workers_ai | qwen3-30b (@cf/qwen/qwen3-30b-a3b-fp8) | $0.051 / $0.34 | env.AI.run() binding, zero egress |
| standard | deepseek | deepseek-v4-flash | $0.14 / $0.28 (cache-miss) · $0.028 cached input | HTTPS |
| frontier | caller-set | caller-set | frontier-priced | HTTPS |
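The routing table above can be read as a small resolver. The sketch below is one possible shape, not the shipped llm-invoke.ts code: the type and function names are assumptions, and throwing when the frontier tier is requested without an explicit route is an assumed way to enforce "caller-set".

```typescript
// Sketch only: tier -> provider/model resolution mirroring the table.
// Names and the frontier error behavior are assumptions for illustration.
type Tier = "bulk" | "standard" | "frontier";

interface Route {
  provider: string;
  model: string;
}

const TIER_DEFAULTS: Record<"bulk" | "standard", Route> = {
  bulk: { provider: "workers_ai", model: "@cf/qwen/qwen3-30b-a3b-fp8" },
  standard: { provider: "deepseek", model: "deepseek-v4-flash" },
};

function resolveRoute(tier: Tier = "bulk", callerRoute?: Route): Route {
  if (tier === "frontier") {
    // Frontier has no default on purpose: the caller must name the model,
    // which avoids accidental premium spend.
    if (!callerRoute) {
      throw new Error("frontier tier requires an explicit provider and model");
    }
    return callerRoute;
  }
  return TIER_DEFAULTS[tier];
}

console.log(resolveRoute());           // default tier is bulk
console.log(resolveRoute("standard"));
// The frontier provider/model pair here is a placeholder, not a real model ID.
console.log(resolveRoute("frontier", { provider: "anthropic", model: "example-frontier-model" }));
```

Making bulk the default parameter value means a caller who passes nothing lands on the cheapest tier, and reaching the frontier tier always takes a deliberate, visible argument.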
Tasks shipped
| # | Task | Files touched |
|---|---|---|
| A.1 | Add [ai] binding = "AI" to wrangler.toml (prod + staging) | wrangler.toml |
| A.2 | Extend Env type with AI binding, CF_AI_GATEWAY_WORKERS_AI_SLUG, explicit DEEPSEEK_API_KEY | src/lib/types.ts |
| A.3 | New helper in src/lib/ai-gateway.ts — builds the {gateway:{id,metadata}} third-arg for env.AI.run() | src/lib/ai-gateway.ts |
| A.4 | Extend src/tools/llm-invoke.ts — workers_ai provider, tier router, WORKERS_AI_MODELS catalog, DeepSeek V4 aliases, binding call path, response translator | src/tools/llm-invoke.ts |
| A.5 | Update TOOLS.md llm-invoke row to reflect tier routing | docs/requirements/TOOLS.md |
| A.6 | ADR-027 authored with cost math, live URLs, alternatives | docs/decisions/ADR-027-*.md |
| A.7 | 21 new tests (router, aliases, binding call, AI Gateway option, cost math, model catalog) | test/tools/llm-invoke-tier-routing.test.ts |
Phase A exit criteria (all met)
- npm run typecheck: clean
- npm test: 588→609 green, zero regressions
- npm run check:pre-commit: 11/11
- wrangler deploy --dry-run: bundle within soft target (1102.72 KiB, +10.72 KiB)
- ADR-027 merged with live-docs citations for every pricing claim
- TOOLS.md row rewritten
Research receipts (all verified 2026-04-24)
- Qwen3-30B pricing, ctx, tool calling: developers.cloudflare.com/workers-ai/models/qwen3-30b-a3b-fp8/
- env.AI.run() signature + [ai] binding: developers.cloudflare.com/workers-ai/configuration/bindings/
- AI Gateway third-arg gateway option: developers.cloudflare.com/ai-gateway/integrations/aig-workers-ai-binding/
- DeepSeek V4 pricing + alias deprecation: api-docs.deepseek.com/quick_start/pricing
What’s NOT in Phase A (and why)
- Cost-cap enforcement inside the gateway. Deferred to AI Gateway dashboard config (operator sets daily spend cap per gateway). Phase B of Wave 4 will pull the cap into KV for per-tenant differentiation.
- tool-scopes.ts integration. The scope-map file doesn’t exist yet (slated for Wave 2 of Cloud-Native v2). llm_invoke stays mcp:read by convention; when scopes land, this tool gets one row, no behavior change.
- Frontier-tier alias table. frontier is explicitly caller-driven to avoid accidental Opus spend. A future ADR may add explicit frontier-sonnet / frontier-opus aliases once usage is clearly understood.
Sign-off
This plan reflects 2026 edge-native engineering best practice for a multi-tenant GTM automation platform. Every recommendation cites live docs; every trade-off is named; every risk has a mitigation.
Wave 1 starts with your approval. Waves 2 + 3 proceed under the Non-Stop Protocol once Wave 1 is green. Wave 4 Phase A shipped 2026-04-24 ahead of Waves 2+3 because its blast radius is contained to llm_invoke and the cost win is immediate.