ADR-029: LoRA adapters for per-tenant tuning — design accepted, implementation deferred
Status: Accepted (design) / Deferred (implementation; next tenant onboarding)
Date: 2026-04-24
Deciders: Mishaal Murawala
Related: ADR-016: Context Plane · ADR-017: llm_invoke DeepSeek default · ADR-028: bge-m3 embeddings upgrade · Wave 4 Phase B of Cloud-Native v2 Engineering Plan
Context
The platform’s LLM surface (llm_invoke, claude, aws_bedrock_invoke, plus the context-worker’s embed + rerank calls) is provider-generic — same prompt, same model, same behavior across every tenant.
This works at N=1 tenant (Ascend). It starts to dilute as tenants diverge:
- Kahuna cares about PE sector taxonomy, fund-cycle-aware scoring, and a specific ICP definition.
- Point Field Partners cares about a different ICP, Microsoft 365 + Box + DealCloud data shapes, and UK/EU regulatory sensitivity.
- Future clients will each have their own firmographic skew, tone-of-voice guardrails, and compliance posture.
Three known options to specialize model behavior per tenant:
- Prompt engineering alone. Inject tenant-specific system prompts + few-shots. Scales to maybe 5 tenants before context budgets burn and the prompts drift out of sync with reality.
- Full fine-tune per tenant. Quality ceiling is highest. Cost floor is prohibitive — even at Together.ai rates, a tenant-scale fine-tune on Qwen-30B runs hundreds of dollars per training iteration, and we lose Cloudflare’s shared-model pricing (pay full per-token rates against a dedicated endpoint).
- LoRA adapters per tenant. Train a small adapter (~1% of base-model parameters) on tenant data. The base model stays shared, so billing stays at Workers AI’s shared-model rates. Training takes minutes and costs cents per run.
Cloudflare Workers AI supports LoRAs natively as of 2026-04-24 (verified against live docs — https://developers.cloudflare.com/workers-ai/fine-tunes/loras/). Inference API:

    await env.AI.run(BASE_MODEL, {
      messages: [...],
      raw: true,
      lora: 'finetune-id-or-name',
    });

Workers AI allows 100 adapters per account: enough for ~100 tenants at one adapter each, or roughly 33 tenants at three adapters each, on a single account.
Decision
Adopt CF Workers AI LoRA adapters as the per-tenant tuning mechanism. Each paying tenant gets adapters along the naming scheme:

    lora:{tenant_id}:icp_classifier
    lora:{tenant_id}:tone_and_voice
    lora:{tenant_id}:objection_mapper   # future

Implementation deferred until the next tenant onboarding (post-Kahuna-v6 + post-Point-Field-scoping). This ADR locks in the design so the onboarding path is clear when we get there.
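The naming scheme above can be kept consistent with a small build/parse helper. This is a minimal sketch: the function and type names are hypothetical, and it treats `lora:{tenant_id}:{task}` as a logical name on our side; whether Workers AI accepts `:` in fine-tune names is not confirmed here.

```typescript
// Hypothetical helper. The task list mirrors the naming scheme in this ADR;
// all identifiers below are illustrative, not existing platform code.
const ADAPTER_TASKS = ["icp_classifier", "tone_and_voice", "objection_mapper"] as const;
type AdapterTask = (typeof ADAPTER_TASKS)[number];

interface AdapterRef {
  tenantId: string;
  task: AdapterTask;
}

/** Build the logical adapter name `lora:{tenant_id}:{task}`. */
function adapterName(ref: AdapterRef): string {
  // A ":" inside the tenant id would make the name ambiguous to parse.
  if (ref.tenantId.length === 0 || ref.tenantId.includes(":")) {
    throw new Error(`invalid tenant id: "${ref.tenantId}"`);
  }
  return `lora:${ref.tenantId}:${ref.task}`;
}

/** Parse a logical adapter name; returns null when it does not match the scheme. */
function parseAdapterName(value: string): AdapterRef | null {
  const parts = value.split(":");
  if (parts.length !== 3) return null;
  const [prefix, tenantId, taskName] = parts;
  if (prefix !== "lora" || !tenantId) return null;
  const task = ADAPTER_TASKS.find((t) => t === taskName);
  return task === undefined ? null : { tenantId, task };
}
```

Rejecting unknown task names at parse time keeps a typo in a KV entry from silently selecting no adapter.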
Training pipeline (to-be-built, next tenant)
Zero dedicated GPU hardware:
- Data collection. Exported Gong transcripts + SFDC fields + the tenant’s manually-labeled ICP judgments from the context-worker D1. We already have this schema — ADR-016 §source_authority covers the provenance.
- Training. Use Together.ai or Hugging Face AutoTrain for initial runs (GPU-free for us). Fall back to Modal or RunPod if spot-GPU pricing breaks their way.
- Adapter export. Outputs are `adapter_model.safetensors` + `adapter_config.json`.
- Upload to CF Workers AI. Per the docs:

      POST /accounts/{ACCOUNT_ID}/ai/finetunes                         # create fine-tune
      POST /accounts/{ACCOUNT_ID}/ai/finetunes/{ID}/finetune-assets/   # upload each file

- Runtime activation. Gateway’s `llm_invoke` reads the tenant context, looks up `lora_adapters:{tenant}` in KV, and passes the matching adapter id via the `lora` field on `env.AI.run()`.
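The upload and runtime-activation steps above can be sketched as pure functions that plan the two REST calls and build the `env.AI.run()` input. This is a sketch under stated assumptions: the base URL is Cloudflare's public API host, request bodies and auth headers are left out, and every function and type name here is hypothetical.

```typescript
// Assumption: Cloudflare's public REST base URL. The two paths are the ones quoted in the pipeline steps.
const CF_API_BASE = "https://api.cloudflare.com/client/v4";

// File names from the "Adapter export" step.
const ADAPTER_FILES = ["adapter_model.safetensors", "adapter_config.json"] as const;

interface PlannedRequest {
  method: "POST";
  url: string;
  purpose: string;
}

/** Request that creates the fine-tune record (first upload call). */
function createFinetuneRequest(accountId: string): PlannedRequest {
  return {
    method: "POST",
    url: `${CF_API_BASE}/accounts/${accountId}/ai/finetunes`,
    purpose: "create fine-tune",
  };
}

/** One upload request per adapter file, issued once the create call has returned an id. */
function uploadAssetRequests(accountId: string, finetuneId: string): PlannedRequest[] {
  return ADAPTER_FILES.map((file) => ({
    method: "POST" as const,
    url: `${CF_API_BASE}/accounts/${accountId}/ai/finetunes/${finetuneId}/finetune-assets/`,
    purpose: `upload ${file}`,
  }));
}

/** KV key holding a tenant's task -> adapter id map. */
function adapterKvKey(tenant: string): string {
  return `lora_adapters:${tenant}`;
}

interface ChatMessage {
  role: "system" | "user" | "assistant";
  content: string;
}

interface RunInput {
  messages: ChatMessage[];
  raw?: true;
  lora?: string;
}

/** Input for env.AI.run(): sets `raw` and `lora` only when the tenant has an adapter for the task. */
function buildRunInput(
  messages: ChatMessage[],
  adapters: Record<string, string> | null,
  task: string,
): RunInput {
  const adapterId = adapters ? adapters[task] : undefined;
  // An empty or missing id means "no adapter": fall back to the plain base-model call.
  return adapterId ? { messages, raw: true, lora: adapterId } : { messages };
}
```

In a Worker, the adapter map would come from something like `await env.TENANT_KV.get(adapterKvKey(tenant), "json")`, where `TENANT_KV` is a hypothetical binding name.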
Compatible base models (live docs, 2026-04-24)
Verified against https://developers.cloudflare.com/workers-ai/fine-tunes/loras/ :
- `@cf/mistralai/mistral-7b-instruct-v0.2-lora`
- `@cf/meta-llama/llama-2-7b-chat-hf-lora`
- `@cf/google/gemma-*-lora` (variants)
None of these are our primary Qwen3-30B. That’s fine — the LoRA path initially targets tasks where a 7B + tenant adapter beats a raw 30B (ICP scoring, tone enforcement, objection classification — all narrow supervised problems). The primary llm_invoke call path remains DeepSeek / Anthropic / Qwen per ADR-017 routing, unmodified.
If / when CF Workers AI ships Qwen-30B with LoRA support, we revisit and consider promoting some Qwen-LoRA flows into the main routing.
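The split between LoRA flows and the primary path could look like the following. This is an illustrative sketch only: the task labels are taken from the adapter naming scheme, while `selectRoute` and the `Route` type are hypothetical and the ADR-017 routing itself is represented by an opaque `primary` result.

```typescript
// Narrow supervised tasks that this ADR names as LoRA candidates.
const LORA_ELIGIBLE_TASKS = new Set(["icp_classifier", "tone_and_voice", "objection_mapper"]);

type Route =
  | { kind: "lora"; model: string; lora: string } // small base model + tenant adapter
  | { kind: "primary" };                          // unchanged ADR-017 routing

/**
 * Choose the LoRA path only when the task is eligible AND the tenant has an adapter id for it.
 * Everything else stays on the primary routing, unmodified.
 */
function selectRoute(
  task: string,
  tenantAdapters: Record<string, string>,
  loraBaseModel: string,
): Route {
  const adapterId = tenantAdapters[task];
  if (LORA_ELIGIBLE_TASKS.has(task) && adapterId) {
    return { kind: "lora", model: loraBaseModel, lora: adapterId };
  }
  return { kind: "primary" };
}
```

Defaulting to `primary` means a tenant without adapters, or a task outside the eligible set, sees no behavior change.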
Alternatives considered, rejected
| Alternative | Why rejected |
|---|---|
| Full fine-tune per tenant on dedicated endpoints | Cost 10–100× LoRA; fragmentation of the shared-model advantage |
| Prompt engineering alone | Doesn’t converge for structured-output tasks (ICP scoring); context-budget tax compounds linearly with tenant count |
| RAG-only (no tuning, better retrieval) | Already partially done via context-worker. Complements LoRA; doesn’t replace it. Tenant-specific judgment (e.g. “Acme is good ICP because X” learned from labels) doesn’t come out of pure retrieval. |
| Self-hosted open-source finetune (vLLM on a CF Worker-attached GPU) | Not a 2026 primitive on CF; would require leaving the CF-native path |
Consequences
Positive
- Per-tenant specialization at ~cents per training iteration.
- Base model remains shared → Workers AI pricing stays competitive.
- 100-adapter ceiling per account comfortably supports our next-3-year tenant roadmap.
- Sets up “onboard a tenant” playbook: connect data → label ICP → train LoRA → deploy adapter → measure uplift.
Negative
- One more binding / KV key per tenant (`lora_adapters:{tenant}`) to manage.
- LoRA drift: adapter quality decays as tenant data evolves. Need a retraining cadence policy, deferred to implementation time but flagged: quarterly re-train is the starting cadence.
- Not compatible with our primary Qwen-30B path today → LoRA flows run on smaller base models and must justify the model swap vs. staying on prompt-engineered Qwen. Case-by-case decision when we implement.
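The quarterly starting cadence for retraining can be expressed as a simple due-date check. This is a sketch, assuming "quarterly" is approximated as 90 days and that a training timestamp is stored alongside each adapter; neither detail is specified in this ADR, and `isRetrainDue` is a hypothetical name.

```typescript
const MS_PER_DAY = 24 * 60 * 60 * 1000;

/**
 * True when at least `cadenceDays` have elapsed since the adapter was trained.
 * Assumption: 90 days stands in for "quarterly".
 */
function isRetrainDue(trainedAt: Date, now: Date, cadenceDays = 90): boolean {
  return now.getTime() - trainedAt.getTime() >= cadenceDays * MS_PER_DAY;
}
```

Passing `now` explicitly keeps the check deterministic, which makes it easy to run from a scheduled job or a test.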
Invariants preserved
- No external vendor in the token path (V5 invariant #5): LoRA training runs on Together.ai / HF / Modal, which are training-time providers, not runtime token brokers. Runtime inference is 100% Workers AI.
- Research-first mandate: every LoRA base model + adapter upload step is cited to live CF docs in this ADR.
- No D1 in interactive path (V5 invariant #2): `lora_adapters:{tenant}` lives in KV, matching the token + config pattern.
Rollout (when triggered)
- Trigger: next paid tenant onboarding OR explicit ADR-029 activation by Mishaal.
- Data export pipeline: extend context-worker to emit a training dump per tenant.
- Pick first LoRA task: ICP classifier (highest ROI, most labeled data available).
- Train on Together.ai → upload to Workers AI → add KV entry `lora_adapters:{tenant}` = `{ icp_classifier: "" }`.
- Wire gateway’s `llm_invoke` to read the KV entry and pass via the `lora` field on eligible calls.
- A/B: 50/50 split between base model + adapter vs. base model alone. Measure ICP prediction accuracy on held-out labeled set.
- Ship adapter-on if the accuracy uplift exceeds 3 percentage points; otherwise ship adapter-off, retrain, and keep trying.
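The A/B split and ship decision above can be sketched as follows. The assumptions are loud ones: assignment is keyed on some stable unit id (which unit is not specified in this ADR), the hash is FNV-1a computed over UTF-16 code units, and "3 pt" is read as 3 percentage points of accuracy. All function names are hypothetical.

```typescript
/** 32-bit FNV-1a over the string's UTF-16 code units; used only for stable bucketing. */
function fnv1a32(input: string): number {
  let hash = 0x811c9dc5; // FNV offset basis
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0; // multiply by the FNV prime, keep unsigned 32 bits
  }
  return hash;
}

type Arm = "adapter" | "base";

/** Deterministic 50/50 assignment: the same unit id always lands in the same arm. */
function assignArm(unitId: string): Arm {
  return fnv1a32(unitId) % 2 === 0 ? "adapter" : "base";
}

/** Fraction of predictions that match the held-out labels. */
function accuracy(predicted: string[], labels: string[]): number {
  if (predicted.length !== labels.length || labels.length === 0) {
    throw new Error("predicted and labels must be non-empty and the same length");
  }
  let correct = 0;
  for (let i = 0; i < labels.length; i++) {
    if (predicted[i] === labels[i]) correct++;
  }
  return correct / labels.length;
}

/** Ship adapter-on only when uplift is strictly greater than the threshold (percentage points). */
function shouldShipAdapter(adapterAcc: number, baseAcc: number, minUpliftPts = 3): boolean {
  return (adapterAcc - baseAcc) * 100 > minUpliftPts;
}
```

Hash-based assignment avoids storing per-unit state, and the comparison uses floating-point arithmetic, so results that sit exactly on the threshold deserve a manual look.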
Future-reversal trigger
If CF Workers AI deprecates LoRA support OR if our biggest tenant hits the 100-adapter account ceiling, revisit with a multi-account-per-tenant or dedicated-endpoint approach.