Vectorize Namespace Registry

ENGINEERING_STANDARD §OO-EngStd-002 — Every Vectorize index must appear here with its binding name, dimensions, metric, isolation model, and owner.

Generated from codebase audit 2026-05-09. Canonical source: this file.

Index summary

| Index name | Binding | Dimensions | Metric | Isolation | Created by | Owner |
| --- | --- | --- | --- | --- | --- | --- |
| capability_index | CAPABILITY_INDEX | 1024 | cosine | global | wrangler vectorize create capability_index --dimensions=1024 --metric=cosine | src/lib/capability-retrieval.ts |
| memory-index | MEMORY_INDEX | 1024 | cosine | per-tenant (metadata filter) | wrangler vectorize create memory-index --dimensions=1024 --metric=cosine | src/lib/memory-patterns.ts |
| pattern-bank | PATTERN_INDEX | 1024 | cosine | global | wrangler vectorize create pattern-bank --dimensions=1024 --metric=cosine | src/cron/seed-pattern-bank.ts |
| client-knowledge | VECTORIZE_INDEX | 1024 | cosine | per-tenant (metadata filter) | wrangler vectorize create client-knowledge --dimensions=1024 --metric=cosine | src/tools/search-knowledge.ts |

Index details

capability_index — Tool capability embeddings

  • Binding: CAPABILITY_INDEX
  • Dimensions: 1024 — model @cf/baai/bge-m3 (parity with memory-index)
  • Metric: cosine
  • Isolation: Global — single namespace, no per-tenant partitioning. All tenants query the same index.
  • Purpose: ADR-042 capability catalog. Stores semantic embeddings of every tool capability entry so the gateway can retrieve relevant tools at runtime via retrieveCapabilities(). Powers the unbounded tool catalog (no hard ceiling on registered capabilities).
  • Vector ID schema: {tool_slug}:{capability_slug} (e.g. hubspot_crm:search_contacts)
  • Metadata schema:
    {
      "toolName": "string",
      "action": "string",
      "description": "string",
      "category": "string",
      "phase": "number"
    }
  • Write path: scripts/embed-capabilities.ts (manual re-run after config/capabilities/registry.yaml changes). CI runs verify-capability-registry.mjs to detect drift.
  • Read path: src/lib/capability-retrieval.ts → retrieveCapabilities(query, topK). Called from the discover_apis and batch_execute tools.
  • Staging: Same global index reused for staging (wrangler.toml [[env.staging.vectorize]] points to capability_index).
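
The vector ID and metadata schemas above can be made concrete with a small sketch. The helper names capabilityVectorId and capabilityRecord are hypothetical and do not come from scripts/embed-capabilities.ts. Rejecting ":" inside a slug is an assumption that keeps the two-part ID unambiguous, and the metadata values in the usage lines are illustrative only.

```typescript
// Metadata fields as listed in the registry entry for capability_index.
interface CapabilityMetadata {
  toolName: string;
  action: string;
  description: string;
  category: string;
  phase: number;
}

// Hypothetical helper: builds an ID in the {tool_slug}:{capability_slug} form.
// Assumption: slugs are non-empty and never contain ":".
function capabilityVectorId(toolSlug: string, capabilitySlug: string): string {
  for (const slug of [toolSlug, capabilitySlug]) {
    if (slug.length === 0 || slug.includes(":")) {
      throw new Error(`invalid slug: "${slug}"`);
    }
  }
  return `${toolSlug}:${capabilitySlug}`;
}

// Hypothetical helper: pairs the ID with its metadata. The embedding values
// would be attached separately before the record is written to the index.
function capabilityRecord(
  toolSlug: string,
  capabilitySlug: string,
  metadata: CapabilityMetadata,
): { id: string; metadata: CapabilityMetadata } {
  return { id: capabilityVectorId(toolSlug, capabilitySlug), metadata };
}

// Usage with illustrative values.
const exampleRecord = capabilityRecord("hubspot_crm", "search_contacts", {
  toolName: "hubspot_crm",
  action: "search_contacts",
  description: "Search CRM contacts by name or email",
  category: "crm",
  phase: 1,
});
console.log(exampleRecord.id); // "hubspot_crm:search_contacts"
```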

memory-index — Per-tenant semantic memory

  • Binding: MEMORY_INDEX
  • Dimensions: 1024 — model @cf/baai/bge-m3
  • Metric: cosine
  • Isolation: Per-tenant via Vectorize metadata filter { tenant_id: "{tenantId}" } at query time. Shared physical namespace; logical isolation enforced in application layer.
  • Purpose: Long-term semantic memory for tenant conversations and learned patterns. Written by learnSemanticMemory() via ctx.waitUntil (non-blocking, cold path). Read by memory retrieval operations.
  • Vector ID schema: {tenant_id}:{timestamp_ms}:{hash8} — ensures no collisions across tenants or time.
  • Metadata schema:
    {
      "tenant_id": "string",
      "content_type": "fact | preference | pattern | entity",
      "source": "string",
      "created_at": "ISO8601"
    }
  • Write path: src/lib/memory-patterns.ts → learnSemanticMemory().
  • Read path: src/lib/memory-patterns.ts → recallSemanticMemory(tenantId, query, topK).
  • Tenant isolation invariant: Every query() call MUST include filter: { tenant_id: tenantId }. A missing filter exposes cross-tenant data. Enforced by recallSemanticMemory() wrapper — never call MEMORY_INDEX.query() directly.
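
The isolation invariant above lends itself to a wrapper whose types make the filter impossible to omit. What follows is a minimal sketch under stated assumptions, not the actual recallSemanticMemory() implementation. The interfaces are local stand-ins for the Vectorize binding types, the names tenantQueryOptions and queryForTenant are hypothetical, and the wrapper takes an already-computed query embedding rather than query text.

```typescript
// Local stand-ins for the Vectorize binding types so the sketch is
// self-contained. A real Worker would use the platform's own types.
interface TenantQueryOptions {
  topK: number;
  filter: { tenant_id: string }; // required, not optional
}

interface VectorMatch {
  id: string;
  score: number;
}

interface VectorIndexLike {
  query(
    vector: number[],
    options: TenantQueryOptions,
  ): Promise<{ matches: VectorMatch[] }>;
}

// Hypothetical helper: the single place where query options are built,
// so every query carries filter: { tenant_id: tenantId }.
function tenantQueryOptions(tenantId: string, topK: number): TenantQueryOptions {
  if (tenantId.trim() === "") {
    throw new Error("tenantId is required for tenant-scoped queries");
  }
  return { topK, filter: { tenant_id: tenantId } };
}

// Hypothetical wrapper: callers receive matches only through this function
// and never touch the raw index binding.
async function queryForTenant(
  index: VectorIndexLike,
  tenantId: string,
  queryVector: number[],
  topK: number,
): Promise<VectorMatch[]> {
  const result = await index.query(queryVector, tenantQueryOptions(tenantId, topK));
  return result.matches;
}
```

Because filter is a required field of TenantQueryOptions in this sketch, a call site that forgets the tenant fails to compile instead of silently querying across tenants.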

pattern-bank — Harness pattern embeddings

  • Binding: PATTERN_INDEX
  • Dimensions: 1024
  • Metric: cosine
  • Isolation: Global — single namespace. Patterns are tool-agnostic quality exemplars, not tenant data.
  • Purpose: Quality evaluation patterns for the harness. Stores embedded examples of good/bad tool responses that the harness uses for few-shot comparison during evals.
  • Vector ID schema: pattern:{run_id}:{seq} — run_id from seed job, seq for ordering within a run.
  • Metadata schema:
    {
      "tool_name": "string",
      "quality_label": "good | bad",
      "category": "string",
      "seeded_at": "ISO8601"
    }
  • Write path: src/cron/seed-pattern-bank.ts — runs daily at 0 4 * * * (multiplexed). Idempotent via pattern_bank:seeded:{run_id} KV guard.
  • Read path: src/workflows/harness-investigate.ts, src/workflows/harness-autofix.ts.
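
The KV guard and the vector ID schema for the seed job can be sketched as follows. This is an illustration under assumptions, not the code in src/cron/seed-pattern-bank.ts. KVLike is a local stand-in for a KV namespace binding, and seedGuardKey, patternVectorId, and seedOnce are hypothetical names.

```typescript
// Local stand-in for a KV namespace binding (only the two calls used here).
interface KVLike {
  get(key: string): Promise<string | null>;
  put(key: string, value: string): Promise<void>;
}

// Guard key in the pattern_bank:seeded:{run_id} form from the registry.
function seedGuardKey(runId: string): string {
  return `pattern_bank:seeded:${runId}`;
}

// Vector ID in the pattern:{run_id}:{seq} form from the registry.
function patternVectorId(runId: string, seq: number): string {
  if (!Number.isInteger(seq) || seq < 0) {
    throw new Error(`seq must be a non-negative integer, got ${seq}`);
  }
  return `pattern:${runId}:${seq}`;
}

// Hypothetical guard: runs seed() only if this run_id has not been marked
// as seeded. Returns true when seeding ran, false when it was skipped.
async function seedOnce(
  kv: KVLike,
  runId: string,
  seed: () => Promise<void>,
): Promise<boolean> {
  const key = seedGuardKey(runId);
  if ((await kv.get(key)) !== null) {
    return false; // already seeded for this run
  }
  await seed();
  // Mark as seeded only after seed() succeeds, so a failed run can retry.
  await kv.put(key, new Date().toISOString());
  return true;
}
```

Writing the guard key after seed() completes means a crashed run is retried on the next trigger. Since the vector IDs are deterministic per run_id and seq, a retry that upserts the same IDs should overwrite existing vectors instead of duplicating them.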

client-knowledge — Client-specific knowledge base

  • Binding: VECTORIZE_INDEX
  • Dimensions: 1024
  • Metric: cosine
  • Isolation: Per-tenant via metadata filter { tenant_id: "{tenantId}" }. Same pattern as memory-index.
  • Purpose: Client-uploaded / ingested knowledge documents (product specs, playbooks, ICP profiles, competitive intel). Used by the search_knowledge MCP tool.
  • Vector ID schema: {tenant_id}:{doc_id}:{chunk_seq} — doc_id from ingestion pipeline, chunk_seq for multi-chunk documents.
  • Metadata schema:
    {
      "tenant_id": "string",
      "doc_id": "string",
      "doc_title": "string",
      "chunk_seq": "number",
      "source": "upload | api | webhook",
      "created_at": "ISO8601"
    }
  • Write path: POST /admin/knowledge endpoint. An ingestion pipeline is planned as a second write path.
  • Read path: src/tools/search-knowledge.ts → search_knowledge MCP tool. Requires the VECTORIZE_INDEX binding to be set.
  • Tenant isolation invariant: Same as memory-index — every query() call MUST include filter: { tenant_id: tenantId }. The search_knowledge tool enforces this via ctx.tenantId (Invariant 3 — tenant from context, never from args).
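
The rule that the tenant comes from context and never from arguments can be illustrated with a short sketch. The ToolContext and SearchKnowledgeArgs shapes, the helper names, and the default topK of 5 are all assumptions made for this example. The real types in src/tools/search-knowledge.ts are not shown in this registry.

```typescript
// Hypothetical shape of the trusted, server-side tool context.
interface ToolContext {
  tenantId: string;
}

// Hypothetical shape of caller-supplied arguments. Extra keys are allowed
// here only to show that a caller-supplied tenant_id is ignored.
interface SearchKnowledgeArgs {
  query: string;
  topK?: number;
  [extra: string]: unknown;
}

// Chunk ID in the {tenant_id}:{doc_id}:{chunk_seq} form from the registry.
function knowledgeChunkId(tenantId: string, docId: string, chunkSeq: number): string {
  if (!Number.isInteger(chunkSeq) || chunkSeq < 0) {
    throw new Error(`chunkSeq must be a non-negative integer, got ${chunkSeq}`);
  }
  return `${tenantId}:${docId}:${chunkSeq}`;
}

// Builds query options. The filter reads ctx.tenantId only, and nothing in
// args can influence which tenant is queried.
function knowledgeQueryOptions(
  ctx: ToolContext,
  args: SearchKnowledgeArgs,
): { topK: number; filter: { tenant_id: string } } {
  return {
    topK: args.topK ?? 5, // default of 5 is an assumption for this sketch
    filter: { tenant_id: ctx.tenantId },
  };
}

// A caller trying to pass another tenant's ID has no effect on the filter.
const spoofed = knowledgeQueryOptions(
  { tenantId: "tenant-a" },
  { query: "pricing playbook", tenant_id: "tenant-b" },
);
console.log(spoofed.filter.tenant_id); // "tenant-a"
```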

Adding a new index

  1. Create the index: wrangler vectorize create {name} --dimensions=1024 --metric=cosine
  2. Add [[vectorize]] block to wrangler.toml with binding name.
  3. Add binding to Env interface in src/lib/types.ts.
  4. Add a row + detail section to this file.
  5. Decide isolation model:
    • Global: No filter at query time. Use for tool catalogs, config embeddings.
    • Per-tenant: Always filter { tenant_id: tenantId }. Use for any tenant data. Document the isolation invariant clearly and enforce via a wrapper function — never expose raw .query() to callers.
  6. If per-tenant, add an isolation invariant to .claude/rules/v5-invariants.md if the pattern is new.
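
Step 3 can be paired with a startup check that every registered binding is actually present. The sketch below is one way to do that under stated assumptions: VectorIndexBinding is a local stand-in for the platform's index type, VectorizeEnv covers only the Vectorize portion of Env, and missingVectorizeBindings is a hypothetical helper that does not exist in src/lib/types.ts.

```typescript
// Local stand-in for the Vectorize binding type, reduced to one method.
interface VectorIndexBinding {
  query(vector: number[], options: { topK: number }): Promise<unknown>;
}

// Binding names from the index summary in this registry. A new index would
// add its binding name here as part of step 3.
const VECTORIZE_BINDINGS = [
  "CAPABILITY_INDEX",
  "MEMORY_INDEX",
  "PATTERN_INDEX",
  "VECTORIZE_INDEX",
] as const;

type VectorizeBindingName = (typeof VECTORIZE_BINDINGS)[number];

// The Vectorize portion of Env, derived from the list above so the two
// cannot drift apart.
type VectorizeEnv = Record<VectorizeBindingName, VectorIndexBinding>;

// Hypothetical helper: reports which registered bindings are absent, for
// example when a [[vectorize]] block is missing from wrangler.toml.
function missingVectorizeBindings(env: Partial<VectorizeEnv>): VectorizeBindingName[] {
  return VECTORIZE_BINDINGS.filter((name) => env[name] === undefined);
}
```

Deriving the type from a single list means adding a binding name in one place updates both the compile-time Env shape and the runtime check.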