ADR-027 — Hosted-OSS-first tier routing for llm_invoke

Status: Accepted
Date: 2026-04-24
Deciders: Mishaal Murawala
Relates to: ADR-017 (DeepSeek default), ADR-011 (frontier eval tier), ADR-023 (retain low-use tools), Cloud-Native v2 Engineering Plan Wave 4 Phase A

Context

Two facts drove this revision:

  1. CF Workers AI now hosts every OSS model we actually use — Qwen3-30B, Llama-3.3-70B fast, GLM-4.7-Flash, Kimi-K2.6, Gemma-4-26B, Mistral-Small-3.1, Gpt-OSS-120B, Nemotron-3-120B, Qwen-Coder-32B. The binding call path (env.AI.run()) is a Worker-to-Worker dispatch with zero egress, single-digit-ms overhead.
  2. DeepSeek shipped V4 (V4-Flash + V4-Pro) in Q2 2026. Pricing stayed at the V3 floor (V4-Flash: $0.14/$0.28 per 1M tokens cache-miss, $0.028 cached input) while output quality moved up. The legacy deepseek-chat / deepseek-reasoner model names are being deprecated (live docs, 2026-04-24).

The previous llm_invoke (ADR-017) defaulted to DeepSeek. That was correct for the Q1 2026 workload (heavy prompt caching). It is no longer the cost-optimal default now that Workers AI hosts Qwen3-30B at $0.051/$0.34 per 1M tokens (live CF model card, 2026-04-24). Input tokens cost ~2.7× less than DeepSeek V4-Flash cache-miss ($0.14), calls incur zero egress, and the default path gains the benefits of CF’s edge: binding-scoped authorization, AI Gateway integration, and no API-key management.

Decision

Introduce a three-tier router inside llm_invoke with hosted OSS as the default.

| Tier | Provider | Model | $/1M tokens (in/out) | Call path |
|---|---|---|---|---|
| bulk (default) | workers_ai | qwen3-30b (@cf/qwen/qwen3-30b-a3b-fp8) | $0.051 / $0.34 | env.AI.run() binding, zero egress |
| standard | deepseek | deepseek-v4-flash | $0.14 / $0.28 (cache-miss) · $0.028 cached input | HTTPS |
| frontier | caller-set | caller-set | frontier-priced | HTTPS |

Resolution rules:

  1. If provider and model are both explicitly set → honor verbatim (explicit wins, aliased per below).
  2. Else if tier is set → route per table.
  3. Else default tier="bulk".
  4. tier="frontier" with no explicit {provider, model} → VALIDATION_ERROR.
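
The four rules can be sketched as a single resolver. This is a minimal illustration, not the actual router in src/tools/llm-invoke.ts: the names RouteRequest, Route, TIER_DEFAULTS, and resolveRoute are hypothetical, and two behaviors are assumptions the rules do not spell out (a request with only one of provider/model falls through to tier routing, and an explicit override without a tier is reported as "bulk").

```typescript
// Sketch only: all type and function names are hypothetical.
// Providers and model IDs are taken from the tier table above.
type Tier = "bulk" | "standard" | "frontier";

interface RouteRequest {
  tier?: Tier;
  provider?: string;
  model?: string;
}

interface Route {
  tier: Tier;
  provider: string;
  model: string;
}

// Only bulk and standard have router-chosen defaults; frontier is caller-set.
const TIER_DEFAULTS: Record<"bulk" | "standard", { provider: string; model: string }> = {
  bulk: { provider: "workers_ai", model: "@cf/qwen/qwen3-30b-a3b-fp8" },
  standard: { provider: "deepseek", model: "deepseek-v4-flash" },
};

function resolveRoute(req: RouteRequest): Route {
  // Rule 3: no tier means bulk.
  const tier: Tier = req.tier ?? "bulk";

  // Rule 1: explicit provider + model wins, whatever the tier.
  // Assumption: a partial override (only one of the two) is ignored here.
  if (req.provider !== undefined && req.model !== undefined) {
    return { tier, provider: req.provider, model: req.model };
  }

  // Rule 4: frontier never gets a router-chosen model.
  if (tier === "frontier") {
    throw Object.assign(
      new Error('tier="frontier" requires explicit provider and model'),
      { code: "VALIDATION_ERROR" },
    );
  }

  // Rule 2: route per table.
  return { tier, ...TIER_DEFAULTS[tier] };
}
```

Checking rule 1 before rule 4 is what lets { tier: "frontier", provider, model } pass while a bare tier: "frontier" fails fast.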

DeepSeek V3 alias auto-mapping:

| Legacy name | Auto-mapped to |
|---|---|
| deepseek-chat | deepseek-v4-flash |
| deepseek-reasoner | deepseek-v4-pro |

DeepSeek has not published a deprecation date beyond “in the future” (see live pricing page, captured 2026-04-24). We auto-alias defensively so no caller breaks; if DeepSeek hard-deprecates the legacy names, the alias goes away and we’ll revisit at that point.
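
A minimal sketch of the alias step, assuming it runs on the model name before the provider call. DEEPSEEK_LEGACY_ALIASES and normalizeDeepSeekModel are hypothetical names; only the two mappings come from this ADR.

```typescript
// Sketch only: names are hypothetical, mappings are from the alias table.
// A Map avoids accidental hits on inherited object keys such as "constructor".
const DEEPSEEK_LEGACY_ALIASES = new Map<string, string>([
  ["deepseek-chat", "deepseek-v4-flash"],
  ["deepseek-reasoner", "deepseek-v4-pro"],
]);

// Returns the V4 name for a legacy name, and any other name unchanged.
function normalizeDeepSeekModel(model: string): string {
  return DEEPSEEK_LEGACY_ALIASES.get(model) ?? model;
}
```

Names that are not in the map pass through untouched, so callers already on deepseek-v4-flash or deepseek-v4-pro see no change.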

AI Gateway for Workers AI (preferred): when env.CF_AI_GATEWAY_WORKERS_AI_SLUG is set, the binding call passes { gateway: { id, metadata: { tenant_id, tool, tier } } } as its third argument (per live CF docs, 2026-04-24). This gives unified observability, semantic cache, cost caps, and per-tenant attribution.
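
The helper's contract can be sketched as follows. The function name buildWorkersAiGatewayOption and the option shape come from this ADR; the parameter list, the GatewayEnv and GatewayMetadata interfaces, and the body are assumptions for illustration, not the code in src/lib/ai-gateway.ts.

```typescript
// Sketch only: interfaces and parameter list are hypothetical.
interface GatewayEnv {
  CF_AI_GATEWAY_WORKERS_AI_SLUG?: string;
}

interface GatewayMetadata {
  tenant_id: string;
  tool: string;
  tier: string;
}

interface WorkersAiGatewayOption {
  gateway: { id: string; metadata: GatewayMetadata };
}

// Returns the third argument for the binding call when a gateway slug is
// configured, and undefined otherwise so the call goes out without a gateway.
function buildWorkersAiGatewayOption(
  env: GatewayEnv,
  metadata: GatewayMetadata,
): WorkersAiGatewayOption | undefined {
  const id = env.CF_AI_GATEWAY_WORKERS_AI_SLUG;
  if (!id) return undefined; // unset or empty slug: gateway disabled
  return { gateway: { id, metadata } };
}
```

Under these assumptions the call site would pass the result straight through as the third argument of env.AI.run(), so an unset slug degrades to a plain binding call.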

Rationale

Cost math (per-request, 10k input / 2k output — typical classification shape)

| Tier / provider | Input cost | Output cost | Per-request | vs Sonnet-4-6 |
|---|---|---|---|---|
| bulk (workers_ai qwen3-30b) | $0.00051 | $0.00068 | $0.00119 | 50× cheaper |
| standard (deepseek-v4-flash cache-miss) | $0.00140 | $0.00056 | $0.00196 | 30× cheaper |
| standard (deepseek-v4-flash cached) | $0.00028 | $0.00056 | $0.00084 | 71× cheaper |
| frontier (claude-sonnet-4-6) | $0.03000 | $0.03000 | $0.06000 | baseline |

At 1M bulk calls/mo (ICP scoring + Gong summarization + signal classification):

  • Workers AI bulk tier: ~$1,190/mo
  • DeepSeek V4-Flash cache-miss: ~$1,960/mo
  • DeepSeek V3 (ADR-017 default): ~$2,800/mo (old price)
  • Sonnet: ~$60,000/mo
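
The arithmetic behind the table and the monthly figures can be reproduced with a few lines. The names Price, PRICES, and requestCostUsd are hypothetical; the Sonnet rates of $3/$15 per 1M tokens are inferred from the table row ($0.03 for 10k input, $0.03 for 2k output), not quoted from a price list.

```typescript
// Sketch only: names are hypothetical. Prices are USD per 1M tokens.
interface Price {
  inPerM: number;
  outPerM: number;
}

const PRICES: Record<"bulk" | "standardMiss" | "standardCached" | "frontier", Price> = {
  bulk: { inPerM: 0.051, outPerM: 0.34 },           // workers_ai qwen3-30b
  standardMiss: { inPerM: 0.14, outPerM: 0.28 },    // deepseek-v4-flash, cache-miss
  standardCached: { inPerM: 0.028, outPerM: 0.28 }, // deepseek-v4-flash, cached input
  frontier: { inPerM: 3, outPerM: 15 },             // inferred from the Sonnet row
};

// Cost of one request in USD for the given token counts.
function requestCostUsd(price: Price, inputTokens: number, outputTokens: number): number {
  return (inputTokens * price.inPerM + outputTokens * price.outPerM) / 1e6;
}

// Typical classification shape from the table: 10k input, 2k output.
const bulkPerRequest = requestCostUsd(PRICES.bulk, 10_000, 2_000);
const bulkPerMonth = bulkPerRequest * 1_000_000; // at 1M calls/mo
```

With the 10k/2k shape, bulkPerRequest comes to about $0.00119 and bulkPerMonth to about $1,190, matching the figures above.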

Why workers_ai as the default (not DeepSeek)

  1. Price: Qwen3-30B input at $0.051 is ~2.7× cheaper than DeepSeek V4-Flash cache-miss ($0.14). DeepSeek’s cached input ($0.028) is still ~1.8× cheaper than Qwen3-30B, which is why DeepSeek stays as the standard tier.
  2. Zero egress: binding calls don’t leave the CF edge. No public-internet round trip, no API-key leak surface, no third-party auth.
  3. AI Gateway native: one gateway, one dashboard, one cost cap. Every third-party provider (DeepSeek, OpenRouter, Gemini) has to be wrapped separately.
  4. Auth scoped to the Worker: account-level permission; no per-tenant DEEPSEEK_API_KEY to rotate.
  5. Capabilities: Qwen3-30B FP8 supports function (tool) calling and a 32k context window, per the CF model card.

Why DeepSeek stays as standard (not retired)

  1. Prompt caching: the 80% cached-input discount ($0.14 → $0.028) materially beats workers_ai on workloads where the system prompt is constant across a batch (ICP scoring against a fixed rubric). DeepSeek’s cached input at $0.028 is cheaper than Qwen3-30B input at $0.051.
  2. Reasoning-optimized V4-Pro fills a gap between Qwen3-30B and frontier.
  3. Portability: callers who already switched to DeepSeek (post-ADR-017) don’t break — legacy aliases map silently.

Why frontier requires explicit provider+model

The router is for the common case. Frontier calls are rare, intentional, and expensive. Forcing explicit escape-hatch selection prevents accidental Opus-4-7 spends via a fat-finger tier="frontier". A caller who truly wants Opus-via-OpenRouter writes { tier: "frontier", provider: "openrouter", model: "anthropic/claude-opus-4-7" }.

Consequences

Positive

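  • Default path is ~2.7× cheaper on input tokens than DeepSeek V4-Flash cache-miss and ~2.4× cheaper per request than the previous DeepSeek V3 default ($1,190 vs $2,800 per 1M calls), with zero egress and AI Gateway observability.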
  • DeepSeek V4 migration is transparent — legacy callers using deepseek-chat / deepseek-reasoner continue to work, name auto-aliased.
  • All three tiers behind one MCP tool; callers pick semantic tier, not infra knob.
  • workers_ai short-key catalog is the contract; CF model IDs are implementation detail.

Negative

  • [ai] binding = "AI" must be present in wrangler.toml (added in this PR). Staging env needs the same block.
  • Workers AI respects the Worker’s CPU limit (not a 30s AbortController). If a CF-hosted model has a pathologically slow inference, it burns CPU quota. Mitigation: keep bulk-tier max_tokens ≤ 2048.
  • Adds a model catalog (WORKERS_AI_MODELS) that must track CF’s available models. Drift risk: new CF models aren’t usable until catalog update. Escape hatch: callers can pass an explicit @cf/... model ID (runtime accepts).
  • Bundle size: +10.72 KiB minified (1092 → 1102.72 KiB). Still far from the 2 MB soft target.
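
The catalog-plus-escape-hatch behavior described above might look like the sketch below. The constant name WORKERS_AI_MODELS comes from this ADR, but its structure here is assumed, resolveWorkersAiModel is a hypothetical name, and only the qwen3-30b entry is filled in because it is the only short-key mapping this document shows.

```typescript
// Sketch only: the real catalog lives in src/tools/llm-invoke.ts and has more
// entries. Only the mapping shown in the tier table is reproduced here.
const WORKERS_AI_MODELS = new Map<string, string>([
  ["qwen3-30b", "@cf/qwen/qwen3-30b-a3b-fp8"],
]);

// Resolves a caller-supplied model to a CF model ID.
// Escape hatch: anything already shaped like a CF model ID is passed through,
// so new CF models are usable before the catalog is updated.
function resolveWorkersAiModel(model: string): string {
  if (model.startsWith("@cf/")) return model;
  const id = WORKERS_AI_MODELS.get(model);
  if (id === undefined) {
    throw new Error(`unknown Workers AI model key: ${model}`);
  }
  return id;
}
```

Rejecting unknown short keys rather than guessing keeps catalog drift visible: a typo fails loudly instead of silently hitting the wrong model.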

Neutral

  • tool-scopes.ts is not yet present (slated for Wave 2 of Cloud-Native v2). When it lands, llm_invoke stays at mcp:read across all tiers — the scope reflects caller intent (read-only LLM invocation), not which model runs it.

Alternatives considered + rejected

| Alternative | Why rejected |
|---|---|
| Mac Studio / local-LLM hardware | Violates the zero-hardware invariant of V5 (everything runs on CF edge). No operator mandate to manage on-prem hardware. |
| OpenRouter as default | Adds 15–40% aggregation markup on every call; no prompt-cache passthrough; already available as explicit provider: "openrouter". Same rationale as ADR-017’s rejection. |
| Groq / Cerebras as default | Ultra-fast but narrower catalog (Llama + a few OSS only). Doesn’t cover GLM / Kimi / Qwen-Coder. Keep as explicit opt-ins. |
| Keep DeepSeek as default, add Workers AI as option | Leaves the wrong-default trap: every reflex call misses the ~2.7× savings on input tokens. Explicit opt-in makes the common case more expensive by default. |
| Route everything via AI Gateway HTTP (even Workers AI via HTTPS) | Binding call is strictly faster, cheaper, and more secure than AI-Gateway-proxied HTTP. AI Gateway’s Workers AI binding integration is the CF-recommended pattern. |

Implementation

Landed in this PR:

  1. src/lib/ai-gateway.ts: new buildWorkersAiGatewayOption() helper. Returns the {gateway: {id, metadata}} third argument for env.AI.run() when CF_AI_GATEWAY_WORKERS_AI_SLUG is set, else undefined.
  2. src/tools/llm-invoke.ts: tier router, WORKERS_AI_MODELS catalog, DeepSeek V4 aliases, workers_ai provider with binding-call path, workersAiToOpenAIShape() translator, optional tier schema param.
  3. src/lib/types.ts: env.AI binding, CF_AI_GATEWAY_WORKERS_AI_SLUG, DEEPSEEK_API_KEY explicit in Env.
  4. wrangler.toml: [ai] binding = "AI" added to production + staging envs.
  5. docs/requirements/TOOLS.md: llm-invoke row rewritten to reflect tier routing.
  6. test/tools/llm-invoke-tier-routing.test.ts: 21 new tests covering router, alias mapping, binding call, response translation, AI Gateway option, cost math sanity.

Operator runbook

  1. Enable the binding (already in this PR via [ai] binding = "AI").
  2. Create the AI Gateway in CF dashboard → AI → AI Gateway. Name: ascend-workers-ai. Enable: log request bodies (sampled), semantic cache (7-day TTL), cost cap ($10/day per tenant).
  3. Set the env var: run wrangler secret put CF_AI_GATEWAY_WORKERS_AI_SLUG and enter ascend-workers-ai when prompted. (Stored as a secret to avoid shipping the slug in wrangler.toml diffs.)
  4. Verify post-deploy:
    ```shell
    curl -sS https://ascend-gateway-v5.ascendgtm.workers.dev/mcp \
      -H "authorization: Bearer <tenant-token>" \
      -H "content-type: application/json" \
      -d '{"jsonrpc":"2.0","id":1,"method":"tools/call","params":{"name":"llm_invoke","arguments":{"messages":[{"role":"user","content":"return the word OK"}]}}}'
    ```
    Expected: success: true, tier: "bulk", provider: "workers_ai", model: "@cf/qwen/qwen3-30b-a3b-fp8", ai_gateway: "enabled".

Verification

After merge + deploy:

  • Test suite: 588 → 609 (21 new) green.
  • Typecheck: clean.
  • Pre-commit gate: 11/11.
  • Bundle size: 1102.72 KiB (<2 MB soft target).
  • AI Gateway dashboard (CF): shows llm_invoke traffic with tenant/tool/tier metadata.
  • DeepSeek legacy names (deepseek-chat, deepseek-reasoner) still work end-to-end.

References