
ADR-028: Upgrade context-worker embeddings from bge-small-en-v1.5 (384) to bge-m3 (1024)

Status: Accepted
Date: 2026-04-24
Deciders: Mishaal Murawala
Related: ADR-016: Context Plane · ADR-029: LoRA adapters · Wave 4 Phase B of Cloud-Native v2 Engineering Plan

Context

ADR-016 Phase 2 shipped the context worker with @cf/baai/bge-small-en-v1.5 — 384-dim, English-only, 512 input-token cap. At the time it was the right call: fast, cheap, and we had no non-English tenants.

Three pressures have since shifted the calculus:

Tenant pipeline. Kahuna’s EU + LATAM expansion is on the 2026-Q3 roadmap. Point Field Partners has an EU pilot slated 2026-Q4. Both will ingest non-English Gong transcripts and SFDC records. bge-small-en-v1.5 is monolingual — running it on Spanish or German call text produces degenerate vectors.
Retrieval quality. Wave 4 Phase B adds a bge-reranker-base cross-encoder (this ADR’s sibling change). The reranker’s ceiling is bounded by the candidate quality; a 384-dim embedding model is the weakest link in the pipeline. Moving to 1024-dim captures finer-grained predicate / subject distinctions (the canonical bge benchmarks show ~4–6 pt MTEB uplift from small → m3 on retrieval tasks).
Token budget. bge-small caps inputs at 512 tokens. Our composeFactText() budgets 1800 chars (~450 tokens) to leave headroom. Under bge-m3’s 60k-token context window we can drop the artificial ceiling and let long Gong verbatim quotes flow through in a single embedding (now capped at 8k chars = ~2k tokens, which is a quality budget not a model limit).
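The character and token budgets above follow a rough 4-characters-per-token rule (1800 chars ≈ 450 tokens, 8k chars ≈ 2k tokens). A minimal sketch of that arithmetic follows. The names `estimateTokens` and `clampToBudget` are hypothetical, the exact value 8000 for "8k chars" is an assumption, and the real composeFactText() may truncate differently.

```typescript
// Rough ratio implied by the figures above (1800 chars ~ 450 tokens).
const CHARS_PER_TOKEN = 4;
// Assumed exact value of the "8k chars" quality budget.
const FACT_CHAR_BUDGET = 8_000;

// Estimates the token count of a text from its character length.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / CHARS_PER_TOKEN);
}

// Truncates text to the character budget; shorter text passes through unchanged.
function clampToBudget(text: string, budget: number = FACT_CHAR_BUDGET): string {
  return text.length <= budget ? text : text.slice(0, budget);
}
```

A character-based estimate is only a heuristic; non-English text often tokenizes at a different ratio, which is one reason to treat the cap as a quality budget rather than a hard limit.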

Decision

Move the context worker to @cf/baai/bge-m3 (1024-dim, cosine similarity). Ship in a single PR (Wave 4 Phase B) with:

New Vectorize index ctx_v5_facts_1024 @ 1024-dim cosine.
embedFact() / embedQuery() switched to @cf/baai/bge-m3 — see context-worker/src/lib/embeddings.ts.
Old index ctx_v5_facts retained read-only for 30 days as the rollback path + as the source for the backfill script.
Re-embed migration script: scripts/migrate-vectorize-384-to-1024.ts. Reads every non-superseded row from D1, re-embeds with bge-m3, upserts to new index with same fact_id + metadata.
No dual-write during the transition: from the deploy forward, context-worker writes only to the new index. The old index is frozen; any gaps are filled by the backfill script, which reads the D1 source of truth. D1 is the durable anchor per ADR-016 invariant #1.
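To illustrate the model switch, here is a minimal sketch of an embed call that enforces the new 1024-dim contract. It is not the code in context-worker/src/lib/embeddings.ts. The `EmbeddingClient` interface, the `embedTexts` helper, and the `{ data: number[][] }` response shape are assumptions about how the Workers AI binding is used, and the `EmbeddingError` class here is a local stand-in.

```typescript
// Hypothetical sketch; the real code lives in context-worker/src/lib/embeddings.ts.
const EMBEDDING_MODEL = "@cf/baai/bge-m3";
const EMBEDDING_DIM = 1024;

// Assumed minimal shape of the AI binding: run(model, inputs) resolves to an
// object carrying one vector per input text under `data`.
interface EmbeddingClient {
  run(model: string, inputs: { text: string[] }): Promise<{ data?: number[][] }>;
}

// Local stand-in for the worker's EmbeddingError.
class EmbeddingError extends Error {}

// Embeds a batch of texts and rejects any response whose vectors are not
// exactly 1024-dim, so a misconfigured model fails before reaching Vectorize.
async function embedTexts(ai: EmbeddingClient, texts: string[]): Promise<number[][]> {
  const res = await ai.run(EMBEDDING_MODEL, { text: texts });
  const vectors = res.data ?? [];
  if (vectors.length !== texts.length) {
    throw new EmbeddingError(`expected ${texts.length} vectors, got ${vectors.length}`);
  }
  for (const v of vectors) {
    if (v.length !== EMBEDDING_DIM) {
      throw new EmbeddingError(`expected ${EMBEDDING_DIM} dims, got ${v.length}`);
    }
  }
  return vectors;
}
```

Injecting the client as an interface keeps the dimension check testable without a live Workers AI binding, and there is no retry logic, in line with the no-retries invariant.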

Why bge-m3, not bge-large-en-v1.5

bge-large-en-v1.5 is also 1024-dim, has slightly higher English-only retrieval scores on MTEB, and is more widely tested.

We picked bge-m3 anyway because:

Criterion	bge-large-en-v1.5	bge-m3	Winner
Output dim	1024	1024	tie
Context window	512 tokens	60,000 tokens	m3
Languages	English only	100+	m3
Pricing (CF Workers AI, 2026-04)	$0.20 / M input tokens	$0.012 / M input tokens	m3 (~16× cheaper)
Multi-functionality (dense + sparse + ColBERT)	dense only	dense + sparse + multi-vector	m3 (future-proof)
Retrieval quality (MTEB-EN Retrieval)	~0.54	~0.50	bge-large (~4pt edge on English-only)

The English-only MTEB delta is bounded by Wave 4 Phase B’s reranker. bge-reranker-base on top of a slightly-lower-ceiling embedder still beats bge-large without a reranker on end-to-end benchmarks. We’re not leaving retrieval quality on the table — we’re buying multilingual + 16× pricing + 120× context for a ~4pt MTEB gap that the reranker covers.
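The pricing and context multipliers quoted above are rounded. A quick check of the arithmetic, using only the figures from the comparison table (the variable names are illustrative):

```typescript
// Figures copied from the comparison table.
const largeUsdPerMTokens = 0.20;  // bge-large-en-v1.5
const m3UsdPerMTokens = 0.012;    // bge-m3
const largeContextTokens = 512;
const m3ContextTokens = 60_000;

// How many times cheaper bge-m3 is per input token (about 16.7).
const priceRatio = largeUsdPerMTokens / m3UsdPerMTokens;

// How many times larger the bge-m3 context window is (about 117).
const contextRatio = m3ContextTokens / largeContextTokens;
```

So "16×" rounds the price ratio down slightly and "120×" rounds the context ratio up slightly; neither rounding changes the conclusion.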

Receipts:

bge-m3: https://developers.cloudflare.com/workers-ai/models/bge-m3/
bge-large-en-v1.5: https://developers.cloudflare.com/workers-ai/models/bge-large-en-v1.5/

Why a new index instead of re-embedding in place

Vectorize does not support changing an index’s dimension once created. The only choices are:

1. New index + re-embed (chosen). Durable, reversible, observable.
2. Recreate the old index at 1024-dim. Irreversible mid-migration; any read traffic during reindex would return empty or 384-dim stale results.

Option 1 also gives us a clean 30-day rollback by keeping both indexes live.

Migration sequence

1. Create ctx_v5_facts_1024 (1024-dim cosine) via wrangler vectorize create. Keep ctx_v5_facts live.
2. Deploy context-worker with both bindings: CONTEXT_VECTORIZE → new index, CONTEXT_VECTORIZE_LEGACY_384 → old index (read-only reference).
3. Run tsx scripts/migrate-vectorize-384-to-1024.ts --dry-run to confirm the fact count.
4. Run without --dry-run. Monitor the console; failures are logged per fact, and the final summary reports succeeded / failed counts.
5. Verify via context_query against a known-good tenant ({tenant_id, predicate_filter}): compare the semantic top-10 before and after. The two result sets should share at least one fact (sanity check), and the reranker-sorted order should look reasonable.
6. Keep both indexes for 30 days. After that window, the operator runs wrangler vectorize delete ctx_v5_facts and removes the CONTEXT_VECTORIZE_LEGACY_384 binding from wrangler.toml.
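The dry-run and live-run steps above can be sketched as a single loop. This is an illustration only, not the contents of scripts/migrate-vectorize-384-to-1024.ts. The `FactRow`, `MigrationDeps`, and `MigrationSummary` types and the `migrateFacts` function are hypothetical, and reading rows from D1 is left to the caller.

```typescript
// Hypothetical sketch of the re-embed loop; the real script may differ.
interface FactRow {
  fact_id: string;
  text: string;
  metadata: Record<string, string>;
}

interface VectorRecord {
  id: string;
  values: number[];
  metadata: Record<string, string>;
}

// Assumed dependencies, injected so the loop can run without live bindings.
interface MigrationDeps {
  embed(text: string): Promise<number[]>;         // one bge-m3 embed call per fact
  upsert(records: VectorRecord[]): Promise<void>; // writes to the new 1024-dim index
}

interface MigrationSummary {
  planned: number;
  succeeded: number;
  failed: string[]; // fact_ids whose embed or upsert failed
}

async function migrateFacts(
  rows: FactRow[],
  deps: MigrationDeps,
  dryRun: boolean,
): Promise<MigrationSummary> {
  const summary: MigrationSummary = { planned: rows.length, succeeded: 0, failed: [] };
  if (dryRun) return summary; // a dry run only reports the planned count
  for (const row of rows) {
    try {
      const values = await deps.embed(row.text);
      // Reusing fact_id and metadata means a second run overwrites the same entries.
      await deps.upsert([{ id: row.fact_id, values, metadata: row.metadata }]);
      summary.succeeded += 1;
    } catch (e) {
      summary.failed.push(row.fact_id); // record the failure and continue, no retry
    }
  }
  return summary;
}
```

Because the upsert is keyed on fact_id, failed facts can be handled by running the script again rather than by retrying inside the loop.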

Consequences

Positive

Retrieval improves on English via the reranker, and improves substantially on non-English text via m3.
60k-token context lets future fact-composition strategies embed richer text without truncation artifacts.
16× cheaper per embed token — relevant at the rate of Gong ingestion growth.
Sets up ADR-029 (LoRA adapters) cleanly: per-tenant adapters work against the shared bge-m3 base without re-tuning embedding scale.

Negative

~2.7× storage per vector (384 → 1024 floats; 1024 / 384 ≈ 2.67). At 2026-04 fact counts this is << $1/month on Vectorize; not material.
Backfill runs through every fact once (~N Workers-AI embed calls). At $0.012 / M tokens and ~20 avg tokens per fact, a backfill of 1M facts costs ~$0.24. Not material.
30-day parallel-index window adds config surface. Acceptable for rollback safety.
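The storage and backfill figures above reduce to two small calculations. In this sketch, `backfillCostUsd` is a hypothetical helper, and the inputs are the numbers quoted in this section.

```typescript
// Per-vector storage growth when moving from 384 to 1024 floats (about 2.67).
const storageRatio = 1024 / 384;

// Estimated one-off backfill cost in USD for a given fact count.
function backfillCostUsd(
  facts: number,
  avgTokensPerFact: number,
  usdPerMillionTokens: number,
): number {
  const millionsOfTokens = (facts * avgTokensPerFact) / 1_000_000;
  return millionsOfTokens * usdPerMillionTokens;
}

// 1M facts at ~20 tokens each and $0.012 per million tokens: about $0.24.
const estimatedBackfillUsd = backfillCostUsd(1_000_000, 20, 0.012);
```

The cost scales linearly with average tokens per fact, so it stays small even if longer facts raise that average well above 20.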

Invariants preserved

ADR-016 invariant #1 (D1 is source of truth): unchanged. D1 is the backfill source.
ADR-016 invariant #3 (source authority hierarchy): unchanged. Rerank score is an additional sort key, NOT a replacement for authority ordering.
V5 invariant #3 (no retries in hot path): unchanged. Embed failures still surface as EmbeddingError, as before.
V5 invariant #11 (30s AbortController): inherited by the migration script.

Verification strategy

Typecheck: cd context-worker && npx tsc --noEmit — clean.
Unit tests: cd context-worker && npx vitest run — 96/96 passing (includes new 1024-dim assertions, multilingual input test, AI Gateway routing tests).
Dry-run migration: operator runs --dry-run --tenant=kahuna on ascend-context-db and reviews the planned count before live run.
Live cutover verification: operator hits context_query on 3 known-good tenant+predicate combos and compares top-5 result overlap between old and new indexes. Expected: ≥3 of top-5 overlap (different rerank scores OK).
Rollback plan: if regression observed in production, revert CONTEXT_VECTORIZE binding in wrangler.toml to ctx_v5_facts and redeploy. No data loss — both indexes are still populated during the 30-day window.
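The live cutover check (at least 3 of the top 5 results shared between the old and new indexes) can be expressed as a small helper. This is a sketch: `topKOverlap` and `meetsCutoverThreshold` are hypothetical names, and the inputs are assumed to be fact IDs already ordered by rank.

```typescript
// Counts how many distinct fact IDs appear in the top-k of both ranked lists.
function topKOverlap(oldIds: string[], newIds: string[], k: number): number {
  const oldTop = new Set(oldIds.slice(0, k));
  return newIds
    .slice(0, k)
    .filter((id, i, arr) => arr.indexOf(id) === i && oldTop.has(id)).length;
}

// Applies the threshold from the verification strategy: >= 3 of top-5 shared.
function meetsCutoverThreshold(oldIds: string[], newIds: string[]): boolean {
  return topKOverlap(oldIds, newIds, 5) >= 3;
}
```

The check compares membership only, not order, which matches the note that differing rerank scores are acceptable.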

Future-reversal trigger

If bge-m3 English-only retrieval quality proves materially worse than bge-large on Ascend-specific workloads (measured via an A/B test running both indexes), revisit this decision. Re-evaluate 90 days post-cutover, once reranker telemetry is available.