Hosted-OSS-first tier routing for `llm_invoke`

ADR-027 — Hosted-OSS-first tier routing for `llm_invoke`

Status: Accepted Date: 2026-04-24 Deciders: Mishaal Murawala Relates to: ADR-017 (DeepSeek default), ADR-011 (frontier eval tier), ADR-023 (retain low-use tools), Cloud-Native v2 Engineering Plan Wave 4 Phase A

Context

Two facts drove this revision:

CF Workers AI now hosts every OSS model we actually use — Qwen3-30B, Llama-3.3-70B fast, GLM-4.7-Flash, Kimi-K2.6, Gemma-4-26B, Mistral-Small-3.1, Gpt-OSS-120B, Nemotron-3-120B, Qwen-Coder-32B. The binding call path (env.AI.run()) is a Worker-to-Worker dispatch with zero egress, single-digit-ms overhead.
DeepSeek shipped V4 (V4-Flash + V4-Pro) on 2026-Q2. Pricing stayed at the V3 floor ($0.14/$0.28 cache-miss for V4-Flash; $0.028 cached input) while output quality moved up. The legacy deepseek-chat / deepseek-reasoner model names are being deprecated (live docs, 2026-04-24).

Prior llm_invoke (ADR-017) defaulted to DeepSeek. That was correct for the Q1 2026 workload (heavy prompt caching). It is no longer the cost-optimal default now that Workers AI hosts Qwen3-30B at $0.051/$0.34 per 1M tokens (live CF model card, 2026-04-24) — a ~3× cheaper input-token cost than DeepSeek V4-Flash cache-miss, zero egress, and all the benefits of CF’s edge (auth-binding authz, AI Gateway integration, no API-key management for the default path).

Decision

Introduce a three-tier router inside llm_invoke with hosted OSS as the default.

Tier	Provider	Model	$/1M tokens (in/out)	Call path
`bulk` (default)	`workers_ai`	`qwen3-30b` → `@cf/qwen/qwen3-30b-a3b-fp8`	$0.051 / $0.34	`env.AI.run()` binding — zero egress
`standard`	`deepseek`	`deepseek-v4-flash`	$0.14 / $0.28 (cache-miss) · $0.028 cached input	HTTPS
`frontier`	caller-set	caller-set	frontier-priced	HTTPS

Resolution rules:

If provider and model are both explicitly set → honor verbatim (explicit wins, aliased per below).
Else if tier is set → route per table.
Else default tier="bulk".
tier="frontier" with no explicit {provider, model} → VALIDATION_ERROR.

DeepSeek V3 alias auto-mapping:

Legacy name	Auto-mapped to
`deepseek-chat`	`deepseek-v4-flash`
`deepseek-reasoner`	`deepseek-v4-pro`

The deprecation date is not published by DeepSeek beyond “in the future” (see live pricing page, captured 2026-04-24). We auto-alias defensively so no caller breaks; the alias disappears if DeepSeek hard-deprecates and we’ll revisit at that point.

AI Gateway for Workers AI (preferred): when env.CF_AI_GATEWAY_WORKERS_AI_SLUG is set, the binding call is made with a { gateway: { id, metadata: { tenant_id, tool, tier } } } third-argument. Per live CF docs, 2026-04-24. This gives unified observability, semantic cache, cost caps, and per-tenant attribution.

Rationale

Cost math (per-request, 10k input / 2k output — typical classification shape)

Tier / provider	Input cost	Output cost	Per-request	vs Sonnet-4-6
`bulk` (workers_ai qwen3-30b)	$0.00051	$0.00068	$0.00119	50× cheaper
`standard` (deepseek-v4-flash cache-miss)	$0.00140	$0.00056	$0.00196	30× cheaper
`standard` (deepseek-v4-flash cached)	$0.00028	$0.00056	$0.00084	71× cheaper
`frontier` (claude-sonnet-4-6)	$0.03000	$0.03000	$0.06000	baseline

At 1M bulk calls/mo (ICP scoring + Gong summarization + signal classification):

Workers AI bulk tier: ~$1,190/mo
DeepSeek V4-Flash cache-miss: ~$1,960/mo
DeepSeek V3 (ADR-017 default): ~$2,800/mo (old price)
Sonnet: ~$60,000/mo

Why workers_ai as the default (not DeepSeek)

Price: Qwen3-30B at $0.051 input is ~2.7× cheaper than DeepSeek V4-Flash cache-miss, and ~1.5× cheaper than V4-Flash cache-hit at scale.
Zero egress: binding calls don’t leave the CF edge. No network latency, no API-key leak surface, no third-party auth.
AI Gateway native: one gateway, one dashboard, one cost cap. Every third-party provider (DeepSeek, OpenRouter, Gemini) has to be wrapped separately.
Auth scoped to the Worker: account-level permission; no per-tenant DEEPSEEK_API_KEY to rotate.
Tool calling + 32k ctx + function calling on Qwen3-30B FP8 per CF model card.

Why DeepSeek stays as `standard` (not retired)

Prompt-caching 90% discount materially beats workers_ai at workloads where the system prompt is constant across a batch (ICP scoring against a fixed rubric). DeepSeek’s cached input at $0.028 is actually cheaper than Qwen3-30B input at $0.051.
Reasoning-optimized V4-Pro fills a gap between Qwen3-30B and frontier.
Portability: callers who already switched to DeepSeek (post-ADR-017) don’t break — legacy aliases map silently.

Why `frontier` requires explicit `provider+model`

The router is for the common case. Frontier calls are rare, intentional, and expensive. Forcing explicit escape-hatch selection prevents accidental Opus-4-7 spends via a fat-finger tier="frontier". A caller who truly wants Opus-via-OpenRouter writes { tier: "frontier", provider: "openrouter", model: "anthropic/claude-opus-4-7" }.

Consequences

Positive

Default path is ~3× cheaper than previous DeepSeek default, zero egress, AI-Gateway observable.
DeepSeek V4 migration is transparent — legacy callers using deepseek-chat / deepseek-reasoner continue to work, name auto-aliased.
All three tiers behind one MCP tool; callers pick semantic tier, not infra knob.
workers_ai short-key catalog is the contract; CF model IDs are implementation detail.

Negative

[ai] binding = "AI" must be present in wrangler.toml (added in this PR). Staging env needs the same block.
Workers AI respects the Worker’s CPU limit (not a 30s AbortController). If a CF-hosted model has a pathologically slow inference, it burns CPU quota. Mitigation: keep bulk-tier max_tokens ≤ 2048.
Adds a model catalog (WORKERS_AI_MODELS) that must track CF’s available models. Drift risk: new CF models aren’t usable until catalog update. Escape hatch: callers can pass an explicit @cf/... model ID (runtime accepts).
Bundle size: +10.72 KiB minified (1092 → 1102.72 KiB). Still far from the 2 MB soft target.

Neutral

tool-scopes.ts is not yet present (slated for Wave 2 of Cloud-Native v2). When it lands, llm_invoke stays at mcp:read across all tiers — the scope reflects caller intent (read-only LLM invocation), not which model runs it.

Alternatives considered + rejected

Alternative	Why rejected
Mac Studio / local-LLM hardware	Violates the zero-hardware invariant of V5 (everything runs on CF edge). No operator mandate to manage on-prem hardware.
OpenRouter as default	Adds 15–40% aggregation markup on every call; no prompt-cache passthrough; already available as explicit `provider: "openrouter"`. Same rationale as ADR-017’s rejection.
Groq / Cerebras as default	Ultra-fast but narrower catalog (Llama + a few OSS only). Doesn’t cover GLM / Kimi / Qwen-Coder. Keep as explicit opt-ins.
Keep DeepSeek as default, add Workers AI as option	Leaves the wrong-default trap: every reflex call misses the ~3× savings on input tokens. Explicit opt-in makes the common case more expensive by default.
Route everything via AI Gateway HTTP (even Workers AI via HTTPS)	Binding call is strictly faster, cheaper, and more secure than AI-Gateway-proxied HTTP. AI Gateway’s WAI binding integration is the CF-recommended pattern.

Implementation

Landed in this PR:

src/lib/ai-gateway.ts — new buildWorkersAiGatewayOption() helper. Returns the {gateway: {id, metadata}} third-arg for env.AI.run() when CF_AI_GATEWAY_WORKERS_AI_SLUG is set, else undefined.
src/tools/llm-invoke.ts — tier router, WORKERS_AI_MODELS catalog, DeepSeek V4 aliases, workers_ai provider with binding-call path, workersAiToOpenAIShape() translator, optional tier schema param.
src/lib/types.ts — env.AI binding, CF_AI_GATEWAY_WORKERS_AI_SLUG, DEEPSEEK_API_KEY explicit in Env.
wrangler.toml — [ai] binding = "AI" added to production + staging envs.
docs/requirements/TOOLS.md — llm-invoke row rewritten to reflect tier routing.
test/tools/llm-invoke-tier-routing.test.ts — 21 new tests covering router, alias mapping, binding call, response translation, AI Gateway option, cost math sanity.

Operator runbook

Enable the binding (already in this PR via [ai] binding = "AI").
Create the AI Gateway in CF dashboard → AI → AI Gateway. Name: ascend-workers-ai. Enable: log request bodies (sampled), semantic cache (7-day TTL), cost cap ($10/day per tenant).
Set the env var: wrangler secret put CF_AI_GATEWAY_WORKERS_AI_SLUG → ascend-workers-ai. (Stored as a secret to avoid shipping the slug in wrangler.toml diffs.)

Verify post-deploy:

curl -sS https://ascend-gateway-v5.ascendgtm.workers.dev/mcp \
  -H "authorization: Bearer <tenant-token>" \
  -H "content-type: application/json" \
  -d '{"jsonrpc":"2.0","id":1,"method":"tools/call","params":{"name":"llm_invoke","arguments":{"messages":[{"role":"user","content":"return the word OK"}]}}}'

Expected: success: true, tier: "bulk", provider: "workers_ai", model: "@cf/qwen/qwen3-30b-a3b-fp8", ai_gateway: "enabled".

Verification

After merge + deploy:

Test suite: 588 → 609 (21 new) green.
Typecheck: clean.
Pre-commit gate: 11/11.
Bundle size: 1102.72 KiB (<2 MB soft target).
AI Gateway dashboard (CF): shows llm_invoke traffic with tenant/tool/tier metadata.
DeepSeek legacy names (deepseek-chat, deepseek-reasoner) still work end-to-end.

References

Qwen3-30B-A3B-FP8 model card (pricing, ctx, tool calling, 2026-04-24)
Workers AI bindings ([ai] binding = "AI", env.AI.run() signature)
AI Gateway — Workers AI binding integration (third-arg gateway option, metadata attribution)
DeepSeek V4 pricing + deprecation (V4-Flash $0.14/$0.28, V4-Pro $1.74/$3.48, legacy aliases flagged for deprecation)
ADR-017 — prior DeepSeek-default decision
Wave 4 Phase A plan