ADR-028: Upgrade context-worker embeddings from bge-small-en-v1.5 (384) to bge-m3 (1024)
Status: Accepted
Date: 2026-04-24
Deciders: Mishaal Murawala
Related: ADR-016 (Context Plane) · ADR-029 (LoRA adapters) · Wave 4 Phase B of Cloud-Native v2 Engineering Plan
Context
ADR-016 Phase 2 shipped the context worker with @cf/baai/bge-small-en-v1.5 — 384-dim, English-only, 512 input-token cap. At the time it was the right call: fast, cheap, and we had no non-English tenants.
Three pressures have since shifted the calculus:
- Tenant pipeline. Kahuna’s EU + LATAM expansion is on the 2026-Q3 roadmap. Point Field Partners has an EU pilot slated 2026-Q4. Both will ingest non-English Gong transcripts and SFDC records. bge-small-en-v1.5 is monolingual — running it on Spanish or German call text produces degenerate vectors.
- Retrieval quality. Wave 4 Phase B adds a bge-reranker-base cross-encoder (this ADR’s sibling change). The reranker’s ceiling is bounded by the candidate quality; a 384-dim embedding model is the weakest link in the pipeline. Moving to 1024-dim captures finer-grained predicate / subject distinctions (the canonical bge benchmarks show ~4–6 pt MTEB uplift from small → m3 on retrieval tasks).
- Token budget. bge-small caps inputs at 512 tokens. Our composeFactText() budgets 1800 chars (~450 tokens) to leave headroom. Under bge-m3’s 60k-token context window we can drop the artificial ceiling and let long Gong verbatim quotes flow through in a single embedding (now capped at 8k chars = ~2k tokens, which is a quality budget, not a model limit).
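The character budgets in the token-budget point can be illustrated with a minimal sketch. The constant names, the clampToBudget helper, and the 4 chars/token ratio are assumptions for illustration (the ratio mirrors this ADR's own estimates, not a tokenizer measurement); the real composeFactText() may work differently.

```typescript
// Hypothetical constants. The 4 chars/token ratio is a rough heuristic that
// matches the ADR's estimates (1800 chars ~ 450 tokens, 8k chars ~ 2k tokens).
const CHARS_PER_TOKEN = 4;
const LEGACY_CHAR_BUDGET = 1800; // bge-small era, kept under the 512-token cap
const M3_CHAR_BUDGET = 8000;     // bge-m3 era, a quality budget rather than a model limit

function estimateTokens(text: string): number {
  return Math.ceil(text.length / CHARS_PER_TOKEN);
}

// Clamp fact text to a character budget before embedding.
// Note: slice() counts UTF-16 code units, not grapheme clusters.
function clampToBudget(text: string, charBudget: number): string {
  return text.length <= charBudget ? text : text.slice(0, charBudget);
}

const longQuote = "x".repeat(5000);
console.log(estimateTokens(clampToBudget(longQuote, LEGACY_CHAR_BUDGET))); // 450
console.log(estimateTokens(clampToBudget(longQuote, M3_CHAR_BUDGET)));     // 1250
```

Under the old budget a 5000-character quote loses almost two thirds of its text; under the new one it is embedded whole.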
Decision
Move the context worker to @cf/baai/bge-m3 (1024-dim, cosine similarity). Ship in a single PR (Wave 4 Phase B) with:
- New Vectorize index ctx_v5_facts_1024 @ 1024-dim cosine.
- embedFact() / embedQuery() switched to @cf/baai/bge-m3 (see context-worker/src/lib/embeddings.ts).
- Old index ctx_v5_facts retained read-only for 30 days as the rollback path and as a reference for the backfill script.
- Re-embed migration script: scripts/migrate-vectorize-384-to-1024.ts. Reads every non-superseded row from D1, re-embeds with bge-m3, and upserts to the new index with the same fact_id + metadata.
- No dual-write during the transition: context-worker writes only to the new index from deploy-forward. The old index is frozen; any gaps are filled by the backfill script, which reads the D1 source-of-truth. D1 is the durable anchor per ADR-016 invariant #1.
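A minimal sketch of what the switched embedFact() path might look like. This is not the code in context-worker/src/lib/embeddings.ts. The AiBinding and VectorIndex interfaces are simplified stand-ins for the Workers AI and Vectorize bindings, the response shape ({ data: number[][] }) is an assumption, and the EmbeddingError class here only illustrates the error named in this ADR.

```typescript
// Simplified stand-ins for the Workers AI and Vectorize bindings; only the
// members this sketch touches are declared (assumed shapes, not the full API).
interface AiBinding {
  run(model: string, input: { text: string[] }): Promise<{ data: number[][] }>;
}
interface VectorIndex {
  upsert(
    vectors: { id: string; values: number[]; metadata?: Record<string, string> }[],
  ): Promise<unknown>;
}

const EMBED_MODEL = "@cf/baai/bge-m3";
const EMBED_DIM = 1024;

// Illustrative stand-in for the worker's EmbeddingError.
class EmbeddingError extends Error {}

// Embed one fact and write it to the 1024-dim index. No retries: a missing
// or wrongly sized embedding is raised as EmbeddingError.
async function embedFact(
  ai: AiBinding,
  index: VectorIndex,
  fact: { factId: string; text: string; tenantId: string },
): Promise<void> {
  const res = await ai.run(EMBED_MODEL, { text: [fact.text] });
  const values = res.data[0];
  if (!values || values.length !== EMBED_DIM) {
    throw new EmbeddingError(`expected ${EMBED_DIM} dims, got ${values ? values.length : 0}`);
  }
  await index.upsert([
    { id: fact.factId, values, metadata: { tenant_id: fact.tenantId } },
  ]);
}
```

The explicit dimension check is a design choice in this sketch: it fails fast if the model binding and the index ever drift apart, rather than letting the index reject the write.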
Why bge-m3, not bge-large-en-v1.5
bge-large-en-v1.5 is also 1024-dim, has slightly higher English-only retrieval scores on MTEB, and is more widely tested.
We picked bge-m3 anyway because:
| Dimension | bge-large-en-v1.5 | bge-m3 | Winner |
|---|---|---|---|
| Output dim | 1024 | 1024 | tie |
| Context window | 512 tokens | 60,000 tokens | m3 |
| Languages | English only | 100+ | m3 |
| Pricing (CF Workers AI, 2026-04) | $0.20 / M input tokens | $0.012 / M input tokens | m3 (~16× cheaper) |
| Multi-functionality (dense + sparse + ColBERT) | dense only | dense + sparse + multi-vector | m3 (future-proof) |
| Retrieval quality (MTEB-EN Retrieval) | ~0.54 | ~0.50 | bge-large (~4pt edge on English-only) |
The English-only MTEB gap is offset by Wave 4 Phase B’s reranker: bge-reranker-base on top of a slightly-lower-ceiling embedder still beats bge-large without a reranker on end-to-end benchmarks. We’re not leaving retrieval quality on the table; we’re buying multilingual support, ~16× cheaper embeds, and ~120× the context window for a ~4pt MTEB gap that the reranker covers.
Receipts:
- bge-m3: https://developers.cloudflare.com/workers-ai/models/bge-m3/
- bge-large-en-v1.5: https://developers.cloudflare.com/workers-ai/models/bge-large-en-v1.5/
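As a quick check, the price and context ratios quoted in this section follow from the table figures. The variable names below are illustrative only.

```typescript
// Figures copied from the comparison table in this section.
const largePricePerMTokens = 0.20; // USD per million input tokens
const m3PricePerMTokens = 0.012;
const largeContextTokens = 512;
const m3ContextTokens = 60000;

const priceRatio = largePricePerMTokens / m3PricePerMTokens; // ~16.7
const contextRatio = m3ContextTokens / largeContextTokens;   // ~117.2

console.log(priceRatio.toFixed(1), contextRatio.toFixed(1)); // 16.7 117.2
```

So "~16×" and "120×" are rounded forms of roughly 16.7× and 117×.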
Why a new index instead of re-embedding in place
Vectorize does not support changing an index’s dimension once created. The only choices are:
1. New index + re-embed (chosen). Durable, reversible, observable.
2. Recreate the old index at 1024-dim. Irreversible mid-migration; any read traffic during reindex would return empty or 384-dim stale results.
Option 1 also gives us a clean 30-day rollback by keeping both indexes live.
Migration sequence
1. Create ctx_v5_facts_1024 (1024-dim cosine) via wrangler vectorize create. Keep ctx_v5_facts live.
2. Deploy context-worker with both bindings: CONTEXT_VECTORIZE → new index, CONTEXT_VECTORIZE_LEGACY_384 → old index (read-only reference).
3. Run tsx scripts/migrate-vectorize-384-to-1024.ts --dry-run to confirm the fact count.
4. Run without --dry-run. Monitor the console; failures are logged per fact, and the final summary reports succeeded / failed counts.
5. Verify via context_query against a known-good tenant ({tenant_id, predicate_filter}): compare the semantic top-10 before/after. The overlap should be > 0 facts (sanity check), and the reranker-sorted order should look reasonable.
6. Keep both indexes for 30 days. After that window, the operator runs wrangler vectorize delete ctx_v5_facts and removes the CONTEXT_VECTORIZE_LEGACY_384 binding from wrangler.toml.
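The dry-run and live-run behaviour of the backfill can be sketched as below. This is an illustration under stated assumptions, not the contents of scripts/migrate-vectorize-384-to-1024.ts. FactRow, MigrationDeps, and migrate are hypothetical names, and the D1 read, the embed call, and the Vectorize upsert are injected as plain functions so the loop can run against fakes.

```typescript
interface FactRow { fact_id: string; text: string; tenant_id: string }

// Hypothetical dependencies, injected so the loop is testable without bindings.
interface MigrationDeps {
  listFacts(): Promise<FactRow[]>;        // non-superseded rows from D1
  embed(text: string): Promise<number[]>; // bge-m3, 1024-dim
  upsert(id: string, values: number[], metadata: Record<string, string>): Promise<void>;
}

interface MigrationSummary { planned: number; succeeded: number; failed: number }

async function migrate(deps: MigrationDeps, dryRun: boolean): Promise<MigrationSummary> {
  const facts = await deps.listFacts();
  const summary: MigrationSummary = { planned: facts.length, succeeded: 0, failed: 0 };
  if (dryRun) return summary; // count only: no embed calls, no writes

  for (const fact of facts) {
    try {
      const values = await deps.embed(fact.text);
      // Same fact_id as the old index, so a re-run overwrites instead of duplicating.
      await deps.upsert(fact.fact_id, values, { tenant_id: fact.tenant_id });
      summary.succeeded++;
    } catch (err) {
      // Log and continue so one bad fact does not abort the whole backfill.
      console.error(`fact ${fact.fact_id} failed:`, err);
      summary.failed++;
    }
  }
  return summary;
}
```

Because upserts are keyed by fact_id, re-running the script after a partial failure is safe in this sketch: facts that already succeeded are simply overwritten.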
Consequences
Positive
- Retrieval improves on English via reranker, improves substantially on non-English via m3.
- 60k-token context lets future fact-composition strategies embed richer text without truncation artifacts.
- 16× cheaper per embed token — relevant at the rate of Gong ingestion growth.
- Sets up ADR-029 (LoRA adapters) cleanly: per-tenant adapters work against the shared bge-m3 base without re-tuning embedding scale.
Negative
- ~2.7× storage per vector (384 → 1024 floats). At 2026-04 fact counts this is << $1/month on Vectorize; not material.
- Backfill runs through every fact once (~N Workers-AI embed calls). At $0.012 / M tokens and ~20 avg tokens per fact, a backfill of 1M facts costs ~$0.24. Not material.
- 30-day parallel-index window adds config surface. Acceptable for rollback safety.
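The storage and backfill figures in the list above can be reproduced with a few lines of arithmetic. The variable names are illustrative; the inputs are the ones stated in this section.

```typescript
const oldDims = 384;
const newDims = 1024;
const storageRatio = newDims / oldDims; // ~2.67

const factCount = 1000000;     // example backfill size from this section
const avgTokensPerFact = 20;   // average stated in this section
const pricePerMTokens = 0.012; // USD per million input tokens (bge-m3)

const totalMTokens = (factCount * avgTokensPerFact) / 1000000; // 20
const backfillCostUsd = totalMTokens * pricePerMTokens;        // ~0.24

console.log(storageRatio.toFixed(2), backfillCostUsd.toFixed(2)); // 2.67 0.24
```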
Invariants preserved
- ADR-016 invariant #1 (D1 is source of truth): unchanged. D1 is the backfill source.
- ADR-016 invariant #3 (source authority hierarchy): unchanged. Rerank score is an additional sort key, NOT a replacement for authority ordering.
- V5 invariant #3 (no retries in hot path): unchanged. Embed failures still surface as EmbeddingError, as before.
- V5 invariant #11 (30s AbortController): inherited by the migration script.
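One way to read the source-authority invariant is that rerank score only orders candidates within an authority tier. The sketch below assumes exactly that. The Candidate shape and the authorityRank field are hypothetical, and the actual hierarchy in ADR-016 may be encoded differently.

```typescript
interface Candidate {
  factId: string;
  authorityRank: number; // hypothetical: lower = more authoritative source
  rerankScore: number;   // cross-encoder score, higher = more relevant
}

// Authority stays the primary sort key; rerank score breaks ties within
// the same authority tier and never promotes a fact across tiers.
function sortCandidates(candidates: Candidate[]): Candidate[] {
  return [...candidates].sort(
    (a, b) => a.authorityRank - b.authorityRank || b.rerankScore - a.rerankScore,
  );
}

const ordered = sortCandidates([
  { factId: "a", authorityRank: 2, rerankScore: 0.9 },
  { factId: "b", authorityRank: 1, rerankScore: 0.2 },
  { factId: "c", authorityRank: 1, rerankScore: 0.7 },
]);
console.log(ordered.map((c) => c.factId).join(",")); // c,b,a
```

Fact "a" has the highest rerank score but still sorts last, because its source sits in a lower authority tier.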
Verification strategy
- Typecheck: cd context-worker && npx tsc --noEmit (clean).
- Unit tests: cd context-worker && npx vitest run (96/96 passing; includes new 1024-dim assertions, multilingual input test, AI Gateway routing tests).
- Dry-run migration: operator runs --dry-run --tenant=kahuna on ascend-context-db and reviews the planned count before the live run.
- Live cutover verification: operator hits context_query on 3 known-good tenant+predicate combos and compares top-5 result overlap between old and new indexes. Expected: ≥3 of top-5 overlap (different rerank scores OK).
- Rollback plan: if a regression is observed in production, revert the CONTEXT_VECTORIZE binding in wrangler.toml to ctx_v5_facts (together with the embedding model, since the old index accepts only 384-dim vectors) and redeploy. No data loss: both indexes are still populated during the 30-day window.
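The top-5 overlap check in the live cutover step can be made mechanical with a small helper. The function name and the sample fact IDs are illustrative, and it assumes fact IDs are unique within each result list.

```typescript
// Count how many of the first k IDs in `before` also appear in the first k of `after`.
function topKOverlap(before: string[], after: string[], k: number): number {
  const afterTop = new Set(after.slice(0, k));
  return before.slice(0, k).filter((id) => afterTop.has(id)).length;
}

const oldTop5 = ["f1", "f2", "f3", "f4", "f5"]; // results from ctx_v5_facts
const newTop5 = ["f3", "f1", "f9", "f2", "f8"]; // results from ctx_v5_facts_1024
const overlap = topKOverlap(oldTop5, newTop5, 5);
console.log(overlap, overlap >= 3 ? "pass" : "investigate"); // 3 pass
```

Order is deliberately ignored here, matching the note that different rerank scores are acceptable as long as the same facts surface.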
Future-reversal trigger
If bge-m3 English-only retrieval quality proves materially worse than bge-large on Ascend-specific workloads (measured via an A/B test running both indexes), revisit this decision. Re-evaluate 90 days post-cutover, when reranker telemetry is available.