ADR-029: LoRA adapters for per-tenant tuning — design accepted, implementation deferred

Status: Accepted (design) / Deferred (implementation, until the next tenant onboarding)
Date: 2026-04-24
Deciders: Mishaal Murawala
Related: ADR-016 (Context Plane) · ADR-017 (llm_invoke DeepSeek default) · ADR-028 (bge-m3 embeddings upgrade) · Wave 4 Phase B of Cloud-Native v2 Engineering Plan

Context

The platform’s LLM surface (llm_invoke, claude, aws_bedrock_invoke, plus the context-worker’s embed + rerank calls) is provider-generic — same prompt, same model, same behavior across every tenant.

This works at N=1 tenant (Ascend). It starts to dilute as tenants diverge:

Kahuna cares about PE sector taxonomy, fund-cycle-aware scoring, and a specific ICP definition.
Point Field Partners cares about a different ICP, Microsoft 365 + Box + DealCloud data shapes, and UK/EU regulatory sensitivity.
Future clients will each have their own firmographic skew, tone-of-voice guardrails, and compliance posture.

Three known options to specialize model behavior per tenant:

Prompt engineering alone. Inject tenant-specific system prompts and few-shot examples. This scales to perhaps 5 tenants before the prompts exhaust the context budget and drift out of sync with each tenant’s actual data.
Full fine-tune per tenant. The quality ceiling is highest, but the cost floor is prohibitive: even at Together.ai rates, a tenant-scale fine-tune on Qwen-30B runs hundreds of dollars per training iteration, and we lose Cloudflare’s shared-model pricing because a dedicated endpoint bills at full per-token rates.
LoRA adapters per tenant. Train a small adapter (~1% of base-model parameters) on tenant data. The base model stays shared, so billing stays at Workers AI’s shared-model rates. Each training run takes minutes and costs cents.

Cloudflare Workers AI supports LoRAs natively as of 2026-04-24 (verified against live docs — https://developers.cloudflare.com/workers-ai/fine-tunes/loras/). Inference API:

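// BASE_MODEL must be one of the LoRA-capable models listed under "Compatible base models".
await env.AI.run(BASE_MODEL, {
  messages: [...],              // chat messages, elided here
  raw: true,                    // skip the default chat template
  lora: 'finetune-id-or-name',  // id or name of the uploaded adapter
});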

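Workers AI allows 100 adapters per account. At one adapter per tenant that covers ~100 tenants on a single account; at the three adapters per tenant named in the Decision below, it covers ~33.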

Decision

Adopt CF Workers AI LoRA adapters as the per-tenant tuning mechanism. Each paying tenant gets its own adapters, named as follows:

lora:{tenant_id}:icp_classifier
lora:{tenant_id}:tone_and_voice
lora:{tenant_id}:objection_mapper   # future
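
The naming scheme above could be enforced with a small helper. This is a minimal sketch, not part of the accepted design: `adapterName`, `parseAdapterName`, and the tenant-id pattern are hypothetical, and the sketch does not check whether Workers AI accepts colons in fine-tune names.

```typescript
// Hypothetical helper for the ADR's scheme lora:{tenant_id}:{task}.
// The task list mirrors the three adapter names in the Decision section.
const ADAPTER_TASKS = ["icp_classifier", "tone_and_voice", "objection_mapper"] as const;
type AdapterTask = (typeof ADAPTER_TASKS)[number];

// Assumption: tenant ids are lowercase slugs. The ADR does not define their format.
const TENANT_ID_PATTERN = /^[a-z0-9][a-z0-9_-]*$/;

// Build an adapter name, rejecting tenant ids that would break the scheme.
function adapterName(tenantId: string, task: AdapterTask): string {
  if (!TENANT_ID_PATTERN.test(tenantId)) {
    throw new Error(`invalid tenant id: ${tenantId}`);
  }
  return `lora:${tenantId}:${task}`;
}

// Parse a name back into its parts; returns null if it does not follow the scheme.
function parseAdapterName(name: string): { tenantId: string; task: AdapterTask } | null {
  const parts = name.split(":");
  if (parts.length !== 3 || parts[0] !== "lora") return null;
  const tenantId = parts[1] ?? "";
  const task = parts[2] ?? "";
  if (!TENANT_ID_PATTERN.test(tenantId)) return null;
  if ((ADAPTER_TASKS as readonly string[]).indexOf(task) === -1) return null;
  return { tenantId, task: task as AdapterTask };
}
```

For example, `adapterName("kahuna", "icp_classifier")` yields `lora:kahuna:icp_classifier`, and `parseAdapterName` returns `null` for any task outside the three listed.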

Implementation deferred until the next tenant onboarding (post-Kahuna-v6 + post-Point-Field-scoping). This ADR locks in the design so the onboarding path is clear when we get there.

Training pipeline (to-be-built, next tenant)

Zero dedicated GPU hardware:

Data collection. Exported Gong transcripts + SFDC fields + the tenant’s manually-labeled ICP judgments from the context-worker D1. We already have this schema — ADR-016 §source_authority covers the provenance.
Training. Use Together.ai or Hugging Face AutoTrain for initial runs (GPU-free for us). Fall back to Modal or RunPod if spot-GPU pricing breaks their way.
Adapter export. Outputs are adapter_model.safetensors + adapter_config.json.

Upload to CF Workers AI. Per the docs:

POST /accounts/{ACCOUNT_ID}/ai/finetunes            # create fine-tune
POST /accounts/{ACCOUNT_ID}/ai/finetunes/{ID}/finetune-assets/  # upload each file
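
The two calls above can be sketched as plain request descriptions. This is illustrative only: the helper names are hypothetical, and the JSON body fields (`model`, `name`) and the multipart field (`file_name`) are assumptions to confirm against the live CF docs before implementation. Nothing is sent over the network here.

```typescript
// Cloudflare's v4 REST API base URL.
const API_BASE = "https://api.cloudflare.com/client/v4";

interface CreateFinetuneRequest {
  method: "POST";
  url: string;
  jsonBody: { model: string; name: string }; // assumed field names
}

interface AssetUploadRequest {
  method: "POST";
  url: string;
  formFields: { file_name: string }; // assumed multipart field; file bytes are attached by the caller
}

// Step 1: create the fine-tune record against a LoRA-capable base model.
function createFinetuneRequest(accountId: string, baseModel: string, name: string): CreateFinetuneRequest {
  return {
    method: "POST",
    url: `${API_BASE}/accounts/${accountId}/ai/finetunes`,
    jsonBody: { model: baseModel, name },
  };
}

// The two files produced by the adapter export step.
const ADAPTER_FILES = ["adapter_model.safetensors", "adapter_config.json"] as const;

// Step 2: one upload request per adapter file, using the id returned by step 1.
function assetUploadRequests(accountId: string, finetuneId: string): AssetUploadRequest[] {
  return ADAPTER_FILES.map((fileName) => ({
    method: "POST" as const,
    url: `${API_BASE}/accounts/${accountId}/ai/finetunes/${finetuneId}/finetune-assets/`,
    formFields: { file_name: fileName },
  }));
}
```

A caller would send each description with an `Authorization: Bearer` API token header and attach the file contents to the multipart body. Keeping the descriptions pure makes the upload step easy to unit-test without network access.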

Runtime activation. Gateway’s llm_invoke reads the tenant context, looks up lora_adapters:{tenant} in KV, and passes the matching adapter id via the lora field on env.AI.run().
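
A minimal sketch of the runtime activation step follows, assuming the KV value is a JSON object mapping task name to adapter id (the shape used in the Rollout section). The binding name `LORA_KV`, the `KvLike` / `AiLike` interfaces, and the function names are hypothetical stand-ins so the sketch type-checks without Workers type packages.

```typescript
// Minimal stand-ins for the Workers KV and Workers AI bindings.
interface KvLike {
  get(key: string, type: "json"): Promise<unknown>;
}
interface AiLike {
  run(model: string, inputs: Record<string, unknown>): Promise<unknown>;
}
interface ChatMessage {
  role: "system" | "user" | "assistant";
  content: string;
}

// Assumed KV value shape: { icp_classifier: "<adapter id>", ... }.
type AdapterMap = Record<string, string>;

// KV key per the ADR: lora_adapters:{tenant}.
function kvKey(tenant: string): string {
  return `lora_adapters:${tenant}`;
}

// Pure helper: adds raw + lora only when the tenant has a non-empty adapter id
// for the task; otherwise the call goes to the base model unchanged.
function buildRunInputs(
  messages: ChatMessage[],
  adapters: AdapterMap | null,
  task: string,
): Record<string, unknown> {
  const adapterId = adapters ? adapters[task] : undefined;
  if (!adapterId) return { messages };
  return { messages, raw: true, lora: adapterId };
}

// Look up the tenant's adapters in KV, then invoke the base model.
async function invokeWithTenantAdapter(
  env: { LORA_KV: KvLike; AI: AiLike },
  baseModel: string,
  tenant: string,
  task: string,
  messages: ChatMessage[],
): Promise<unknown> {
  const stored = await env.LORA_KV.get(kvKey(tenant), "json");
  const adapters = stored && typeof stored === "object" ? (stored as AdapterMap) : null;
  return env.AI.run(baseModel, buildRunInputs(messages, adapters, task));
}
```

Falling back to the plain base model when no adapter id is stored keeps tenants without a trained adapter on the existing path, so the KV entry acts as the per-tenant switch.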

Compatible base models (live docs, 2026-04-24)

Verified against https://developers.cloudflare.com/workers-ai/fine-tunes/loras/ :

@cf/mistralai/mistral-7b-instruct-v0.2-lora
@cf/meta-llama/llama-2-7b-chat-hf-lora
@cf/google/gemma-*-lora (variants)

None of these are our primary Qwen3-30B. That’s fine — the LoRA path initially targets tasks where a 7B + tenant adapter beats a raw 30B (ICP scoring, tone enforcement, objection classification — all narrow supervised problems). The primary llm_invoke call path remains DeepSeek / Anthropic / Qwen per ADR-017 routing, unmodified.

If / when CF Workers AI ships Qwen-30B with LoRA support, we revisit and consider promoting some Qwen-LoRA flows into the main routing.

Alternatives considered, rejected

Alternative	Why rejected
Full fine-tune per tenant on dedicated endpoints	Cost 10–100× LoRA; fragmentation of the shared-model advantage
Prompt engineering alone	Doesn’t converge for structured-output tasks (ICP scoring); context-budget tax compounds linearly with tenant count
RAG-only (no tuning, better retrieval)	Already partially done via context-worker. Complements LoRA; doesn’t replace it. Tenant-specific judgment (e.g. “Acme is good ICP because X” learned from labels) doesn’t come out of pure retrieval.
Self-hosted open-source finetune (vLLM on a CF Worker-attached GPU)	Not a 2026 primitive on CF; would require leaving the CF-native path

Consequences

Positive

Per-tenant specialization at ~cents per training iteration.
Base model remains shared → Workers AI pricing stays competitive.
The 100-adapter ceiling per account (~33 tenants at three adapters each) comfortably supports our next-3-year tenant roadmap.
Sets up “onboard a tenant” playbook: connect data → label ICP → train LoRA → deploy adapter → measure uplift.

Negative

One more binding / KV key per tenant (lora_adapters:{tenant}) to manage.
LoRA drift: adapter quality decays as tenant data evolves. A retraining cadence policy is needed; the details are deferred to implementation time, with quarterly retraining as the starting cadence.
Not compatible with our primary Qwen-30B path today → LoRA flows run on smaller base models and must justify the model swap vs. staying on prompt-engineered Qwen. Case-by-case decision when we implement.
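
The quarterly starting cadence for drift could be checked with a helper like the one below. This is a sketch under stated assumptions: "quarterly" is approximated as 90 days, and `isRetrainDue` and the idea of storing a last-trained timestamp per adapter are hypothetical, not part of the accepted design.

```typescript
// Assumption: "quarterly" is approximated as a fixed 90-day interval.
const RETRAIN_INTERVAL_DAYS = 90;
const MS_PER_DAY = 24 * 60 * 60 * 1000;

// Returns true once the adapter is at least RETRAIN_INTERVAL_DAYS old.
function isRetrainDue(lastTrainedAt: Date, now: Date): boolean {
  const ageDays = (now.getTime() - lastTrainedAt.getTime()) / MS_PER_DAY;
  return ageDays >= RETRAIN_INTERVAL_DAYS;
}
```

A scheduled job could run this check per adapter and queue a retraining run when it returns true.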

Invariants preserved

No external vendor in the token path (V5 invariant #5): LoRA training runs on Together.ai / HF / Modal / RunPod, which are training-time providers, not runtime token brokers. Runtime inference is 100% Workers AI.
Research-first mandate: every LoRA base model + adapter upload step is cited to live CF docs in this ADR.
No D1 in interactive path (V5 invariant #2): lora_adapters:{tenant} lives in KV, matching the token + config pattern.

Rollout (when triggered)

Trigger: next paid tenant onboarding OR explicit ADR-029 activation by Mishaal.
Data export pipeline: extend context-worker to emit a training dump per tenant.
Pick first LoRA task: ICP classifier (highest ROI, most labeled data available).
Train on Together.ai → upload to Workers AI → add KV entry lora_adapters:{tenant} = { icp_classifier: "" }.
Wire gateway’s llm_invoke to read the KV entry and pass via the lora field on eligible calls.
A/B: 50/50 split between base model + adapter vs. base model alone. Measure ICP prediction accuracy on held-out labeled set.
Ship with the adapter on if accuracy improves by more than 3 percentage points; otherwise ship with the adapter off and keep retraining.
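
The A/B split and the ship decision in the last two steps can be sketched as follows. This is illustrative only: the function names are hypothetical, the hash-based bucketing is one possible way to get a stable 50/50 split, and "3 pt" is read as 3 percentage points of accuracy.

```typescript
interface LabeledOutcome {
  predicted: string; // ICP label returned by the model
  actual: string;    // human-assigned label from the held-out set
}

const ADAPTER_TRAFFIC_PERCENT = 50; // 50/50 split per the rollout step
const UPLIFT_THRESHOLD_POINTS = 3;  // ship only if uplift is strictly above this

// Stable arm assignment: the same key always lands in the same arm.
// Uses an FNV-1a style 32-bit hash over UTF-16 code units.
function assignArm(requestKey: string): "adapter" | "base" {
  let hash = 0x811c9dc5;
  for (let i = 0; i < requestKey.length; i++) {
    hash ^= requestKey.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193);
  }
  const bucket = (hash >>> 0) % 100;
  return bucket < ADAPTER_TRAFFIC_PERCENT ? "adapter" : "base";
}

// Fraction of held-out examples where the prediction matches the label.
function accuracy(outcomes: LabeledOutcome[]): number {
  if (outcomes.length === 0) throw new Error("no outcomes to score");
  const correct = outcomes.filter((o) => o.predicted === o.actual).length;
  return correct / outcomes.length;
}

// Accuracy difference between arms, in percentage points.
function upliftPoints(adapterArm: LabeledOutcome[], baseArm: LabeledOutcome[]): number {
  return (accuracy(adapterArm) - accuracy(baseArm)) * 100;
}

function shouldShipAdapter(adapterArm: LabeledOutcome[], baseArm: LabeledOutcome[]): boolean {
  return upliftPoints(adapterArm, baseArm) > UPLIFT_THRESHOLD_POINTS;
}
```

With small held-out sets a 3-point difference can fall within noise, so the implementation may also want a minimum sample size per arm before applying the threshold.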

Future-reversal trigger

If CF Workers AI deprecates LoRA support, or if adapters across all tenants (or for our biggest tenant alone) hit the 100-adapter account ceiling, revisit with a multi-account-per-tenant or dedicated-endpoint approach.