Hosted-OSS-first tier routing for `llm_invoke`
ADR-027 — Hosted-OSS-first tier routing for llm_invoke
Status: Accepted Date: 2026-04-24 Deciders: Mishaal Murawala Relates to: ADR-017 (DeepSeek default), ADR-011 (frontier eval tier), ADR-023 (retain low-use tools), Cloud-Native v2 Engineering Plan Wave 4 Phase A
Context
Two facts drove this revision:
- CF Workers AI now hosts every OSS model we actually use — Qwen3-30B, Llama-3.3-70B fast, GLM-4.7-Flash, Kimi-K2.6, Gemma-4-26B, Mistral-Small-3.1, Gpt-OSS-120B, Nemotron-3-120B, Qwen-Coder-32B. The binding call path (
env.AI.run()) is a Worker-to-Worker dispatch with zero egress, single-digit-ms overhead. - DeepSeek shipped V4 (V4-Flash + V4-Pro) on 2026-Q2. Pricing stayed at the V3 floor ($0.14/$0.28 cache-miss for V4-Flash; $0.028 cached input) while output quality moved up. The legacy
deepseek-chat/deepseek-reasonermodel names are being deprecated (live docs, 2026-04-24).
Prior llm_invoke (ADR-017) defaulted to DeepSeek. That was correct for the Q1 2026 workload (heavy prompt caching). It is no longer the cost-optimal default now that Workers AI hosts Qwen3-30B at $0.051/$0.34 per 1M tokens (live CF model card, 2026-04-24) — a ~3× cheaper input-token cost than DeepSeek V4-Flash cache-miss, zero egress, and all the benefits of CF’s edge (auth-binding authz, AI Gateway integration, no API-key management for the default path).
Decision
Introduce a three-tier router inside llm_invoke with hosted OSS as the default.
| Tier | Provider | Model | $/1M tokens (in/out) | Call path |
|---|---|---|---|---|
bulk (default) | workers_ai | qwen3-30b → @cf/qwen/qwen3-30b-a3b-fp8 | $0.051 / $0.34 | env.AI.run() binding — zero egress |
standard | deepseek | deepseek-v4-flash | $0.14 / $0.28 (cache-miss) · $0.028 cached input | HTTPS |
frontier | caller-set | caller-set | frontier-priced | HTTPS |
Resolution rules:
- If
providerandmodelare both explicitly set → honor verbatim (explicit wins, aliased per below). - Else if
tieris set → route per table. - Else default
tier="bulk". tier="frontier"with no explicit{provider, model}→VALIDATION_ERROR.
DeepSeek V3 alias auto-mapping:
| Legacy name | Auto-mapped to |
|---|---|
deepseek-chat | deepseek-v4-flash |
deepseek-reasoner | deepseek-v4-pro |
The deprecation date is not published by DeepSeek beyond “in the future” (see live pricing page, captured 2026-04-24). We auto-alias defensively so no caller breaks; the alias disappears if DeepSeek hard-deprecates and we’ll revisit at that point.
AI Gateway for Workers AI (preferred): when env.CF_AI_GATEWAY_WORKERS_AI_SLUG is set, the binding call is made with a { gateway: { id, metadata: { tenant_id, tool, tier } } } third-argument. Per live CF docs, 2026-04-24. This gives unified observability, semantic cache, cost caps, and per-tenant attribution.
Rationale
Cost math (per-request, 10k input / 2k output — typical classification shape)
| Tier / provider | Input cost | Output cost | Per-request | vs Sonnet-4-6 |
|---|---|---|---|---|
bulk (workers_ai qwen3-30b) | $0.00051 | $0.00068 | $0.00119 | 50× cheaper |
standard (deepseek-v4-flash cache-miss) | $0.00140 | $0.00056 | $0.00196 | 30× cheaper |
standard (deepseek-v4-flash cached) | $0.00028 | $0.00056 | $0.00084 | 71× cheaper |
frontier (claude-sonnet-4-6) | $0.03000 | $0.03000 | $0.06000 | baseline |
At 1M bulk calls/mo (ICP scoring + Gong summarization + signal classification):
- Workers AI bulk tier: ~$1,190/mo
- DeepSeek V4-Flash cache-miss: ~$1,960/mo
- DeepSeek V3 (ADR-017 default): ~$2,800/mo (old price)
- Sonnet: ~$60,000/mo
Why workers_ai as the default (not DeepSeek)
- Price: Qwen3-30B at $0.051 input is ~2.7× cheaper than DeepSeek V4-Flash cache-miss, and ~1.5× cheaper than V4-Flash cache-hit at scale.
- Zero egress: binding calls don’t leave the CF edge. No network latency, no API-key leak surface, no third-party auth.
- AI Gateway native: one gateway, one dashboard, one cost cap. Every third-party provider (DeepSeek, OpenRouter, Gemini) has to be wrapped separately.
- Auth scoped to the Worker: account-level permission; no per-tenant DEEPSEEK_API_KEY to rotate.
- Tool calling + 32k ctx + function calling on Qwen3-30B FP8 per CF model card.
Why DeepSeek stays as standard (not retired)
- Prompt-caching 90% discount materially beats workers_ai at workloads where the system prompt is constant across a batch (ICP scoring against a fixed rubric). DeepSeek’s cached input at $0.028 is actually cheaper than Qwen3-30B input at $0.051.
- Reasoning-optimized V4-Pro fills a gap between Qwen3-30B and frontier.
- Portability: callers who already switched to DeepSeek (post-ADR-017) don’t break — legacy aliases map silently.
Why frontier requires explicit provider+model
The router is for the common case. Frontier calls are rare, intentional, and expensive. Forcing explicit escape-hatch selection prevents accidental Opus-4-7 spends via a fat-finger tier="frontier". A caller who truly wants Opus-via-OpenRouter writes { tier: "frontier", provider: "openrouter", model: "anthropic/claude-opus-4-7" }.
Consequences
Positive
- Default path is ~3× cheaper than previous DeepSeek default, zero egress, AI-Gateway observable.
- DeepSeek V4 migration is transparent — legacy callers using
deepseek-chat/deepseek-reasonercontinue to work, name auto-aliased. - All three tiers behind one MCP tool; callers pick semantic tier, not infra knob.
workers_aishort-key catalog is the contract; CF model IDs are implementation detail.
Negative
[ai] binding = "AI"must be present inwrangler.toml(added in this PR). Staging env needs the same block.- Workers AI respects the Worker’s CPU limit (not a 30s AbortController). If a CF-hosted model has a pathologically slow inference, it burns CPU quota. Mitigation: keep bulk-tier max_tokens ≤ 2048.
- Adds a model catalog (
WORKERS_AI_MODELS) that must track CF’s available models. Drift risk: new CF models aren’t usable until catalog update. Escape hatch: callers can pass an explicit@cf/...model ID (runtime accepts). - Bundle size: +10.72 KiB minified (1092 → 1102.72 KiB). Still far from the 2 MB soft target.
Neutral
tool-scopes.tsis not yet present (slated for Wave 2 of Cloud-Native v2). When it lands,llm_invokestays atmcp:readacross all tiers — the scope reflects caller intent (read-only LLM invocation), not which model runs it.
Alternatives considered + rejected
| Alternative | Why rejected |
|---|---|
| Mac Studio / local-LLM hardware | Violates the zero-hardware invariant of V5 (everything runs on CF edge). No operator mandate to manage on-prem hardware. |
| OpenRouter as default | Adds 15–40% aggregation markup on every call; no prompt-cache passthrough; already available as explicit provider: "openrouter". Same rationale as ADR-017’s rejection. |
| Groq / Cerebras as default | Ultra-fast but narrower catalog (Llama + a few OSS only). Doesn’t cover GLM / Kimi / Qwen-Coder. Keep as explicit opt-ins. |
| Keep DeepSeek as default, add Workers AI as option | Leaves the wrong-default trap: every reflex call misses the ~3× savings on input tokens. Explicit opt-in makes the common case more expensive by default. |
| Route everything via AI Gateway HTTP (even Workers AI via HTTPS) | Binding call is strictly faster, cheaper, and more secure than AI-Gateway-proxied HTTP. AI Gateway’s WAI binding integration is the CF-recommended pattern. |
Implementation
Landed in this PR:
src/lib/ai-gateway.ts— newbuildWorkersAiGatewayOption()helper. Returns the{gateway: {id, metadata}}third-arg forenv.AI.run()whenCF_AI_GATEWAY_WORKERS_AI_SLUGis set, else undefined.src/tools/llm-invoke.ts— tier router,WORKERS_AI_MODELScatalog, DeepSeek V4 aliases,workers_aiprovider with binding-call path,workersAiToOpenAIShape()translator, optionaltierschema param.src/lib/types.ts—env.AIbinding,CF_AI_GATEWAY_WORKERS_AI_SLUG,DEEPSEEK_API_KEYexplicit inEnv.wrangler.toml—[ai] binding = "AI"added to production + staging envs.docs/requirements/TOOLS.md—llm-invokerow rewritten to reflect tier routing.test/tools/llm-invoke-tier-routing.test.ts— 21 new tests covering router, alias mapping, binding call, response translation, AI Gateway option, cost math sanity.
Operator runbook
- Enable the binding (already in this PR via
[ai] binding = "AI"). - Create the AI Gateway in CF dashboard → AI → AI Gateway. Name:
ascend-workers-ai. Enable: log request bodies (sampled), semantic cache (7-day TTL), cost cap ($10/day per tenant). - Set the env var:
wrangler secret put CF_AI_GATEWAY_WORKERS_AI_SLUG→ascend-workers-ai. (Stored as a secret to avoid shipping the slug in wrangler.toml diffs.) - Verify post-deploy:
Expected:
Terminal window curl -sS https://ascend-gateway-v5.ascendgtm.workers.dev/mcp \-H "authorization: Bearer <tenant-token>" \-H "content-type: application/json" \-d '{"jsonrpc":"2.0","id":1,"method":"tools/call","params":{"name":"llm_invoke","arguments":{"messages":[{"role":"user","content":"return the word OK"}]}}}'success: true,tier: "bulk",provider: "workers_ai",model: "@cf/qwen/qwen3-30b-a3b-fp8",ai_gateway: "enabled".
Verification
After merge + deploy:
- Test suite: 588 → 609 (21 new) green.
- Typecheck: clean.
- Pre-commit gate: 11/11.
- Bundle size: 1102.72 KiB (<2 MB soft target).
- AI Gateway dashboard (CF): shows
llm_invoketraffic with tenant/tool/tier metadata. - DeepSeek legacy names (
deepseek-chat,deepseek-reasoner) still work end-to-end.
References
- Qwen3-30B-A3B-FP8 model card (pricing, ctx, tool calling, 2026-04-24)
- Workers AI bindings (
[ai] binding = "AI",env.AI.run()signature) - AI Gateway — Workers AI binding integration (third-arg
gatewayoption, metadata attribution) - DeepSeek V4 pricing + deprecation (V4-Flash $0.14/$0.28, V4-Pro $1.74/$3.48, legacy aliases flagged for deprecation)
- ADR-017 — prior DeepSeek-default decision
- Wave 4 Phase A plan