Ascend Operator OS — Engineering Standard

  • Status: Active (ADOPTED 2026-05-07)
  • Owner: Claude Code (engineering lead) + Mishaal Murawala
  • Companion to: ASCEND_OPERATOR_OS_VISION.md (strategic master plan)
  • Supersedes governance sections of: docs/platform-product-spec-v0.1.md, docs/platform-product-spec-v5-mapping.md (both archived 2026-05-07)
  • Authority: This document codifies the non-negotiable engineering rules for the Operator OS stack. The Vision says what we are building and why. This document says how every line of code, every Worker, every DO, every D1 row, every KV key, every Vectorize namespace, every R2 object MUST behave. If anything in this document conflicts with code, the code is wrong. If anything conflicts with the Vision, the Vision wins.

Why this document exists

Three convergent reviews (GPT-5.5 / Gemini 3.1 Pro / Kimi K2.6) of the Operator OS Vision flagged the same gap: the Vision is a strong strategic document, but several critical operating contracts are implied rather than enforced. The contracts in question are tenant isolation, data classification, backup/restore, idempotency, observability, and eval gates. Without enforceable contracts on these dimensions, the system will look correct in development and fail under multi-tenant production load.

This standard re-expresses Mishaal’s original platform-instruction list (scalable · flexible · multiclient · simple · fast · secure · config-driven · best-practices · up-to-date · backups · versioned · alerting · auto-fixing · documented · idempotent · observable · evaluated) as concrete, enforceable rules expressed in Operator OS primitives (Workers, Durable Objects, Workflows, Queues, D1, KV, Vectorize, R2, AI Gateway).

Every rule below has a clear enforcement point. If a rule cannot be enforced (linter, test, CI gate, runtime check, or documented review checklist), it is removed from this document until it can be.


Standing principles (read first)

  1. Tenant isolation is load-bearing. Every read, every write, every queue message, every Workflow step is scoped to a tenant_id derived from the auth context, never from a tool argument. A bug here is a P0 incident.
  2. Config-driven over hardcoded. Endpoints, models, scopes, ICPs, tool catalogs, eval thresholds, budget caps live in KV / D1 / Vectorize. Code reads them. We never ship a one-line PR to change a model name.
  3. Fail-fast in the hot path; durable retry off the hot path. Gateway returns the upstream error. Workflows, Queues, and DO alarms own retry semantics, with idempotency guarantees.
  4. Idempotency is a contract, not a hope. Every step that can be replayed (Workflow step.do, Queue handler, DO alarm fire) is keyed and deterministic so replay is safe.
  5. Observability is non-optional. Every code path emits a structured event (Langfuse trace, AI Gateway log, D1 audit row, or Slack alert) such that 7 days from now we can answer what happened without reading source.
  6. Evals gate behavioral changes. No new agent behavior, prompt change, model swap, or memory tier addition lands without a per-tenant eval pass and an A/B baseline.
  7. Backups exist before they are needed. Every D1 table, KV namespace, and Vectorize namespace has a documented restore procedure tested at least once before going to production.
  8. One canonical source of truth per data shape. KV (config), D1 (audit + cold facts), Vectorize (embeddings), R2 (artifacts), GitHub (code). Anything else needs an ADR.

1. Multi-tenant isolation

Tenant isolation is the most important property of the system. Violations are silent and devastating.

1.1 Every request carries a tenant_id from auth, never from input

  • Auth middleware on every Worker resolves tenant_id from the bearer token via KV tenant_auth:{sha256_hash}.
  • tenant_id is attached to the request context (Hono c.set('tenantId', ...)).
  • Every downstream function reads ctx.tenantId. Tools that accept tenant_id as an argument are a CI failure.
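A minimal, framework-agnostic sketch of rule 1.1 follows. It assumes the token hash has already been computed and models the KV lookup as a plain function; `RequestContext`, `resolveContext`, `assertNoTenantArg`, and `listProspects` are hypothetical names used only for illustration, not the project's real API.

```typescript
// Hypothetical shapes for illustration; not the project's real types.
interface RequestContext {
  readonly tenantId: string;
}

type ToolArgs = Record<string, unknown>;

// Stand-in for the KV lookup tenant_auth:{sha256_hash} -> tenant_id.
type TenantAuthLookup = (tokenHash: string) => string | undefined;

// Resolve the tenant from the bearer token only. Returns undefined when the
// token is unknown so the caller can reject the request.
function resolveContext(
  tokenHash: string,
  lookup: TenantAuthLookup,
): RequestContext | undefined {
  const tenantId = lookup(tokenHash);
  return tenantId === undefined ? undefined : { tenantId };
}

// Reject any tool call that tries to pass a tenant identifier as input.
function assertNoTenantArg(args: ToolArgs): void {
  for (const key of Object.keys(args)) {
    if (key === "tenant_id" || key === "tenantId") {
      throw new Error(`tool argument "${key}" is forbidden; tenant comes from auth`);
    }
  }
}

// A tool handler reads the tenant from ctx, never from args.
function listProspects(ctx: RequestContext, args: ToolArgs): string {
  assertNoTenantArg(args);
  return `prospects for ${ctx.tenantId}`;
}
```

In a Hono Worker the same idea would sit in middleware that calls `c.set('tenantId', ...)`; the runtime check here complements, and does not replace, the CI rule.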

1.2 Every D1 query is parameterized AND tenant-scoped

  • All queries go through db.prepare(sql).bind(...). No string interpolation into SQL, ever.
  • Every table containing tenant-scoped data has a tenant_id column with an index. Every query filters on it.
  • D1 views that aggregate across tenants (e.g., fund-level cross-portco views) are explicitly labeled and gated by a fund-tenant role. The default is per-tenant.
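One way to make rule 1.2 hard to forget is a small query builder that refuses SQL without a tenant filter. The sketch below is an assumption about how this could look: `tenantScoped` and `ScopedQuery` are hypothetical names, and the convention that the tenant is always bound as `?1` is chosen here for illustration.

```typescript
type SqlParam = string | number | null;

interface ScopedQuery {
  readonly sql: string;
  readonly params: readonly SqlParam[];
}

// Build a query whose first bound parameter is always the tenant id.
// Throws when the SQL text has no `tenant_id = ?1` filter, so an unscoped
// query fails loudly instead of silently reading across tenants.
function tenantScoped(sql: string, tenantId: string, ...rest: SqlParam[]): ScopedQuery {
  if (!/\btenant_id\s*=\s*\?1\b/.test(sql)) {
    throw new Error("query is not tenant-scoped: expected `tenant_id = ?1`");
  }
  return { sql, params: [tenantId, ...rest] };
}

// Example: values are bound, never interpolated into the SQL string.
const openProspects = tenantScoped(
  "SELECT id, name FROM prospects WHERE tenant_id = ?1 AND stage = ?2",
  "acme",
  "open",
);
```

With a D1 binding, the result would be executed as `db.prepare(openProspects.sql).bind(...openProspects.params)`. The regex is a coarse guard; it does not parse SQL and would not catch a filter placed in the wrong subquery.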

1.3 Every KV key is tenant-prefixed for tenant-scoped data

  • Tenant-scoped KV keys begin with tokens:{tenant}:, tenant_config:{tenant}:, agent:{tenant}:, oauth_token:{tenant}:, etc.
  • Cross-tenant config (e.g., api_config:{provider}, capability_index:{tool_name} priors) is explicitly cross-tenant and named accordingly.
  • Lint rule (CI): any new KV key prefix introduced in code must appear in docs/architecture/KV_KEY_REGISTRY.md with classification tenant-scoped or cross-tenant.
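The registry idea in 1.3 can also be mirrored in code so that keys are only ever built through classified prefixes. This is a sketch under assumptions: `KV_KEY_REGISTRY`, `tenantKey`, and `platformKey` are hypothetical names, and the entries simply repeat the prefixes listed above.

```typescript
type KeyClass = "tenant-scoped" | "cross-tenant";

// Hypothetical in-code mirror of docs/architecture/KV_KEY_REGISTRY.md.
const KV_KEY_REGISTRY: ReadonlyMap<string, KeyClass> = new Map<string, KeyClass>([
  ["tokens", "tenant-scoped"],
  ["tenant_config", "tenant-scoped"],
  ["agent", "tenant-scoped"],
  ["oauth_token", "tenant-scoped"],
  ["api_config", "cross-tenant"],
  ["capability_index", "cross-tenant"],
]);

// Build a tenant-scoped key: {prefix}:{tenant}:{...parts}.
function tenantKey(prefix: string, tenantId: string, ...parts: string[]): string {
  if (KV_KEY_REGISTRY.get(prefix) !== "tenant-scoped") {
    throw new Error(`prefix "${prefix}" is not registered as tenant-scoped`);
  }
  if (tenantId.length === 0 || tenantId.indexOf(":") !== -1) {
    throw new Error("tenant id must be non-empty and must not contain ':'");
  }
  return [prefix, tenantId, ...parts].join(":");
}

// Build a cross-tenant key: {prefix}:{...parts}. No tenant segment by design.
function platformKey(prefix: string, ...parts: string[]): string {
  if (KV_KEY_REGISTRY.get(prefix) !== "cross-tenant") {
    throw new Error(`prefix "${prefix}" is not registered as cross-tenant`);
  }
  return [prefix, ...parts].join(":");
}
```

For example, `tenantKey("tokens", "acme", "hubspot", "42")` yields `tokens:acme:hubspot:42`, while `tenantKey("api_config", "acme")` throws because that prefix is classified cross-tenant.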

1.4 Every Vectorize namespace is per-tenant or explicitly aggregated

  • Per-tenant namespaces: memory_semantic_{tenant}, agent_episodes_{tenant}.
  • Cross-tenant namespaces (e.g., capability_index, anonymized pattern banks) are named without a tenant prefix and explicitly classified cross-tenant in the registry.

1.5 Every R2 prefix is tenant-prefixed for tenant artifacts

  • r2://backups/{tenant}/..., r2://artifacts/{tenant}/..., r2://workflow-payloads/{tenant}/....
  • Cross-tenant artifacts (platform metrics dashboards, anonymized benchmarks) live under platform/.

1.6 Cross-tenant tests are mandatory

  • Every new agent surface ships with at least one test that proves tenant A cannot read tenant B’s data.
  • The test exercises the actual auth + storage path, not a mocked tenant resolver.

2. Data classification

Every datum the platform stores or moves is in one of four classes. The class governs where it can be stored, who can read it, and whether it can be reused for analytics or model training. Codified in ADR-045.

| Class | Description | Storage | Reuse |
|---|---|---|---|
| tenant_private | A single portco’s CRM, calls, emails, prospects, drafts, conversations | Per-tenant D1 + KV + Vectorize. Never copied across tenants. | Default. Used to serve that tenant only. |
| fund_private | A PE fund’s portfolio-level metrics + cross-portco aggregates for that fund | Fund-tenant D1 view + Vectorize namespace. Per-portco data behind it must be aggregated/anonymized before fund tenant sees it. | Fund-internal only. Never shared with other funds. |
| anonymized_benchmark | Aggregated patterns derived from many tenants, with no PII or identifying account names | Cross-tenant Vectorize namespace (pattern_bank) + D1 cold-path tables labeled anonymized_* | Reusable across tenants for retrieval. Cannot be reverse-mapped to source tenant. |
| ascend_platform_metric | Platform-level operating data (latency, error rates, agent run counts, capability priors) | Cross-tenant D1 + KV (capability_index:*, error_ledger, decision_log) | Internal Ascend operating only. Never surfaced to a tenant verbatim. |

Enforcement

  • Every D1 column carrying tenant data has its class documented in the migration file’s header comment.
  • Every Vectorize namespace declared in wrangler.toml has its class in the comment beside the binding.
  • Anonymization to anonymized_benchmark is a Workflow step, not an ad-hoc query. The Workflow has a named idempotency key and is auditable.

3. Configuration vs. code

We never ship a one-line PR to change a value that should have lived in config.

3.1 What MUST be config (KV / D1)

  • API endpoints, base URLs, version strings, scopes (per provider) → api_config:{provider} in KV.
  • Tenant-specific values: ICP rules, target accounts, thresholds, model preferences → tenant_config:{tenant} in KV.
  • Tool catalog (provider, tool name, scope, risk rating, owner) → src/config/providers.ts (compile-time) AND capability_index Vectorize (runtime, enriched with priors).
  • Eval thresholds, budget caps, A/B traffic splits → KV per tenant or per agent.

3.2 What MAY be code

  • Pure structural logic (request routing, auth verification flow, error response shape).
  • Pure type contracts (Zod schemas, TypeScript types).
  • The set of LLM providers we know how to call (the adapter code; the model name choice is config).

3.3 Lint rule

  • Any literal URL, model name, scope string, or threshold appearing twice in the codebase outside of src/config/* requires a config-extraction PR or a comment justifying the literal.

4. Idempotency and replay safety

Every operation that can be replayed must be safe to replay. This is a contract, not a hope.

4.1 Workflow steps

  • Every step.do(name, fn) uses a stable, deterministic name. Two runs with the same input produce the same step name sequence.
  • Side-effect operations (D1 inserts, R2 writes, KV puts, outbound API calls) are keyed by an idempotency key derived from {run_id, step_name, input_hash}.
  • D1 inserts use INSERT OR IGNORE on the idempotency key, or INSERT ... ON CONFLICT DO NOTHING on a unique index.
  • Outbound provider calls that don’t accept idempotency keys are wrapped: we record attempted_at in D1 BEFORE the call and check it on replay.
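The keying scheme in 4.1 can be sketched as follows. Everything here is illustrative: `idempotencyKey`, `canonical`, and `SideEffectLog` are hypothetical names, the FNV-1a hash is only a dependency-free stand-in for a real digest such as SHA-256, and the in-memory map stands in for a D1 table with a unique index on the key.

```typescript
// FNV-1a (32-bit). Stand-in only; a production key would use a real digest.
function fnv1a(text: string): string {
  let h = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    h ^= text.charCodeAt(i);
    h = Math.imul(h, 0x01000193) >>> 0;
  }
  return h.toString(16).padStart(8, "0");
}

// Serialize JSON-like input with sorted object keys so that logically equal
// inputs hash identically regardless of property order.
function canonical(value: unknown): string {
  if (Array.isArray(value)) {
    return `[${value.map((v) => canonical(v)).join(",")}]`;
  }
  if (value !== null && typeof value === "object") {
    const obj = value as Record<string, unknown>;
    const body = Object.keys(obj)
      .sort()
      .map((k) => `${JSON.stringify(k)}:${canonical(obj[k])}`)
      .join(",");
    return `{${body}}`;
  }
  return JSON.stringify(value);
}

// Key shape from the rule above: {run_id, step_name, input_hash}.
function idempotencyKey(runId: string, stepName: string, input: unknown): string {
  return `${runId}:${stepName}:${fnv1a(canonical(input))}`;
}

// In-memory stand-in for INSERT OR IGNORE on a unique idempotency key.
class SideEffectLog {
  private readonly rows = new Map<string, string>();

  // Returns true when the row was inserted, false when the key already existed.
  insertOrIgnore(key: string, payload: string): boolean {
    if (this.rows.has(key)) return false;
    this.rows.set(key, payload);
    return true;
  }

  get size(): number {
    return this.rows.size;
  }
}
```

On replay, the same run, step, and input produce the same key, so the second insert is a no-op and the side effect is not repeated.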

4.2 Queue consumers

  • Queue consumers are PRODUCER-ONLY for downstream work: they read the message, kick a Workflow with a deterministic instance ID, and ack. On duplicate-instance error, ack (idempotent redelivery). See ADR-026.
  • Business logic never lives in a Queue consumer. It lives in the Workflow.
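The consumer contract in 4.2 might look like the sketch below. The interfaces are simplified local declarations, not the real Cloudflare bindings: `WorkflowStarter`, `SdrJob`, and `instanceIdFor` are hypothetical, and duplicate detection is injected as `isDuplicate` because the exact error raised for an existing instance ID is not specified in this document.

```typescript
// Simplified local stand-ins for the Queue message and Workflow binding.
interface QueueMessage<T> {
  readonly body: T;
  ack(): void;
  retry(): void;
}

interface WorkflowStarter {
  create(options: { id: string; params: unknown }): Promise<void>;
}

// Hypothetical message body. The tenant was resolved from auth by the producer.
interface SdrJob {
  readonly tenantId: string;
  readonly jobId: string;
}

// Same message always maps to the same Workflow instance ID.
function instanceIdFor(job: SdrJob): string {
  return `sdr-${job.tenantId}-${job.jobId}`;
}

type Outcome = "started" | "duplicate" | "retry";

// The consumer only starts the Workflow and acks. No business logic lives here.
async function handleMessage(
  msg: QueueMessage<SdrJob>,
  workflow: WorkflowStarter,
  isDuplicate: (err: unknown) => boolean,
): Promise<Outcome> {
  try {
    await workflow.create({ id: instanceIdFor(msg.body), params: msg.body });
    msg.ack();
    return "started";
  } catch (err) {
    if (isDuplicate(err)) {
      // Redelivery of a message whose Workflow already exists: safe to ack.
      msg.ack();
      return "duplicate";
    }
    msg.retry();
    return "retry";
  }
}
```

Because the instance ID is derived only from the message, a redelivered message cannot start a second Workflow run.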

4.3 DO alarm fires

  • Every alarm-driven action (token refresh, cron-style cleanup) checks state before acting and is safe to fire twice in a row.

4.4 Cron-driven jobs

  • Cron handlers compute “what has not yet been processed for window W” rather than “process the last N hours.” A cron that fires twice in a window does not double-process.
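The difference between "process the last N hours" and "process what is still pending" can be shown with a small window calculation. This is a sketch: `pendingWindows` is a hypothetical helper, and the set of processed window starts is assumed to come from durable storage such as a D1 table.

```typescript
const HOUR_MS = 60 * 60 * 1000;

// Start of the fixed-size window that contains tsMs.
function windowStart(tsMs: number, windowMs: number): number {
  return Math.floor(tsMs / windowMs) * windowMs;
}

// Return the start timestamps of all completed windows since `sinceMs` that
// are not yet recorded as processed. The window containing `nowMs` is still
// open and is never returned.
function pendingWindows(
  processed: ReadonlySet<number>,
  sinceMs: number,
  nowMs: number,
  windowMs: number,
): number[] {
  const pending: number[] = [];
  const openWindow = windowStart(nowMs, windowMs);
  for (let w = windowStart(sinceMs, windowMs); w < openWindow; w += windowMs) {
    if (!processed.has(w)) pending.push(w);
  }
  return pending;
}
```

A cron that fires twice in the same window sees an empty list on the second fire, provided the first fire recorded its windows as processed before finishing.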

5. Hot path vs. cold path

Defined by latency budget, not by which Worker the code lives in.

5.1 Hot path = synchronous request the human or AI agent is waiting on

  • Gateway proxy requests (≤10 ms overhead per invariant 10).
  • Direct agent run synchronous responses (target ≤30 ms overhead; to be governed by ADR-041, which is pre-listed but not yet filed).
  • KV reads only. No D1 reads. No DO calls.
  • 30-second AbortController on every outbound fetch (invariant 11).
  • Fail-fast on upstream error: return the error, do not retry.

5.2 Cold path = anything off the request critical path

  • All D1 reads/writes (audit, episodic memory, eval datasets, cross-portco views).
  • All Workflow execution.
  • All Vectorize embedding writes.
  • All R2 writes (backups, artifacts, payload pointers).
  • All AI Gateway analytics post-processing.
  • All Slack alerts, all error_ledger writes.

5.3 ctx.executionCtx.waitUntil() is the bridge

  • Non-critical writes (kv_audit, episodic memory append, Slack notification, cost recording) happen in waitUntil() after the response is sent. They MUST be idempotent (section 4).

6. Workflows: payload-pointer pattern

Workers Workflows have payload size limits. We never pass large payloads through Workflow steps. Codified in ADR-046.

6.1 Rule

  • Any artifact larger than 64 KB or any binary blob is written to R2 (r2://workflow-payloads/{tenant}/{workflow_name}/{run_id}/{step_name}.{ext}, per the naming in 6.3).
  • Workflow steps pass only the R2 key + content hash + size, never the bytes.
  • Downstream steps fetch from R2 by key.

6.2 Why

  • Workflow state is durable and replayed on restart. Replaying a step that carries 5 MB of payload through state is wasteful and brittle.
  • R2 GET is cheap and lets us inspect the artifact 7 days later for debugging.

6.3 Naming

  • R2 keys follow workflow-payloads/{tenant}/{workflow_name}/{run_id}/{step_name}.{ext}.
  • The Workflow’s metadata (D1 workflow_runs table) records the R2 key for every step that produces an artifact.
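The rule and naming above can be sketched as a pure decision function. The names `StepRef`, `PayloadPointer`, and `toStepOutput` are hypothetical, the hash function is injected rather than chosen here, and the caller is assumed to write the bytes to R2 under the returned key before the step returns.

```typescript
const INLINE_LIMIT_BYTES = 64 * 1024;

interface StepRef {
  readonly tenant: string;
  readonly workflowName: string;
  readonly runId: string;
  readonly stepName: string;
  readonly ext: string;
}

interface PayloadPointer {
  readonly r2Key: string;
  readonly contentHash: string;
  readonly size: number;
}

type StepOutput =
  | { readonly kind: "inline"; readonly bytes: Uint8Array }
  | { readonly kind: "pointer"; readonly pointer: PayloadPointer };

// Key layout from 6.3.
function payloadKey(ref: StepRef): string {
  return `workflow-payloads/${ref.tenant}/${ref.workflowName}/${ref.runId}/${ref.stepName}.${ref.ext}`;
}

// Binary blobs and anything over 64 KB travel as a pointer; small text
// payloads may stay inline in Workflow state.
function toStepOutput(
  ref: StepRef,
  bytes: Uint8Array,
  isBinary: boolean,
  hash: (b: Uint8Array) => string,
): StepOutput {
  if (!isBinary && bytes.length <= INLINE_LIMIT_BYTES) {
    return { kind: "inline", bytes };
  }
  return {
    kind: "pointer",
    pointer: { r2Key: payloadKey(ref), contentHash: hash(bytes), size: bytes.length },
  };
}
```

Downstream steps receive only the pointer, fetch the object by `r2Key`, and can compare `contentHash` and `size` to detect a corrupted or replaced artifact.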

7. OAuth and re-auth lifecycle

Every external SaaS connection is an OAuth-or-equivalent token that will expire, get revoked, or get its scopes changed. The system is designed for re-auth as a normal event, not an exception. Codified in ADR-047.

7.1 Token storage

  • tokens:{tenant}:{provider}:{account_id} in KV per invariant 8.
  • The TokenManager DO refreshes each token and writes the new value to KV 10 minutes before expiry, per invariant 6. The request path reads KV only.

7.2 Failure modes

  1. Token refresh succeeds. Normal path. KV updated.
  2. Refresh fails (transient). DO retries with exponential backoff up to 3 attempts inside its alarm.
  3. Refresh fails (terminal — refresh token revoked / scopes changed / connection deleted). DO marks the token row status: needs_reauth, writes an entry to oauth_reauth_queue D1 table, and fires a Slack alert tagged with tenant + provider + account.
  4. Hot path encounters needs_reauth. Gateway returns TOKEN_EXPIRED immediately. Fail-fast, no retry. The error response includes a reauth_url field pointing at the V5 OAuth start endpoint for that provider.
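The four failure modes above can be modeled as a small state transition. This sketch makes two assumptions that the text does not state: the base backoff delay is 1 second, and exhausting the 3 attempts on transient failures escalates the same way a terminal failure does. `onRefreshResult` and `hotPathCheck` are hypothetical names.

```typescript
type RefreshOutcome = "success" | "transient_failure" | "terminal_failure";
type TokenStatus = "active" | "needs_reauth";

interface TokenState {
  readonly status: TokenStatus;
  readonly attempts: number; // failed attempts in the current refresh cycle
}

type Action =
  | { readonly kind: "write_kv" }
  | { readonly kind: "retry"; readonly delayMs: number }
  | { readonly kind: "escalate" }; // mark needs_reauth, queue row, Slack alert

const MAX_ATTEMPTS = 3;
const BASE_DELAY_MS = 1000; // assumption: not specified in this document

function onRefreshResult(
  state: TokenState,
  outcome: RefreshOutcome,
): { state: TokenState; action: Action } {
  if (outcome === "success") {
    return { state: { status: "active", attempts: 0 }, action: { kind: "write_kv" } };
  }
  const attempts = state.attempts + 1;
  if (outcome === "transient_failure" && attempts < MAX_ATTEMPTS) {
    // Exponential backoff: 1x, 2x, ... the base delay.
    const delayMs = BASE_DELAY_MS * Math.pow(2, attempts - 1);
    return { state: { status: state.status, attempts }, action: { kind: "retry", delayMs } };
  }
  // Terminal failure, or transient failures exhausted (assumption).
  return { state: { status: "needs_reauth", attempts }, action: { kind: "escalate" } };
}

// Hot path: fail fast with TOKEN_EXPIRED and a reauth_url, never retry.
function hotPathCheck(
  state: TokenState,
  reauthUrl: string,
): { code: "TOKEN_EXPIRED"; reauth_url: string } | undefined {
  return state.status === "needs_reauth"
    ? { code: "TOKEN_EXPIRED", reauth_url: reauthUrl }
    : undefined;
}
```

Keeping the transition pure makes it straightforward to unit-test each failure mode without a Durable Object or a live provider.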

7.3 Re-auth UX

  • Tenant admin (Mishaal during alpha; portco operators in production) sees the queue in the admin dashboard.
  • Re-auth is one click → OAuth flow → KV updated → D1 row marked status: active.
  • The admin dashboard surfaces the failed-since-timestamp so we can prove SLA.

7.4 What we do NOT do

  • Never silently retry forever in a DO alarm. After terminal failure we escalate to a human.
  • Never store refresh tokens outside KV / Wrangler-secrets-managed storage.
  • Never hand a tenant another tenant’s re-auth URL.

8. Backups and restore

Every durable storage layer has a documented backup and a tested restore.

8.1 D1

  • Daily wrangler d1 export per database to R2 backups/d1/{db_name}/{YYYY-MM-DD}.sql.gz. Cron: 0 2 * * *.
  • 30-day rolling retention.
  • Quarterly: pick a non-production database, restore to a new D1, run a smoke test.

8.2 KV

  • Daily snapshot of every namespace’s keys to R2 backups/kv/{namespace}/{YYYY-MM-DD}.jsonl.gz. Cron: 0 3 * * *.
  • Snapshot includes value bytes + metadata + TTL hint. Recovery script scripts/restore-kv.ts reads the JSONL and replays via KV.put.
  • Quarterly: restore one namespace to a staging KV namespace from yesterday’s snapshot, sample-verify 100 keys.

8.3 Vectorize

  • Embedding sources are themselves stored in D1 + R2, so Vectorize is reproducible by re-running the embed script.
  • The reproduction script for each namespace is checked into scripts/embed-*.ts and runnable from CI.

8.4 R2

  • R2 has versioning enabled on every bucket containing critical data.
  • Cross-region replication is not enabled in Q1; reversal trigger is the first paying-client SLA that requires it.

8.5 The restore drill

  • Once per quarter, the restore-drill.md runbook is executed. The drill produces a written, PR-style report covering the date, what was restored, time-to-restore, what was missing, and what to fix. Drills are mandatory; a missed drill is logged as a tech-debt row.

9. Versioning

Every shape that crosses a process boundary has a version.

9.1 What gets versioned

  • KV value shapes (tenant_auth, tokens, api_config, tenant_config, capability_index, bandit:*) — version field at the top of each value.
  • D1 table schemas — every migration file numbered sequentially under migrations/. Migrations are append-only; we never edit a landed migration.
  • Workflow definitions — version embedded in the Workflow class name (SdrAgentRunV2 etc.) when behavior changes incompatibly.
  • Public API endpoints — /v1/* is the current major. Breaking changes go to /v2/*.
  • Agent prompts and system messages — every prompt edit ships through a PR with an eval comparison; the prompt’s KV key includes a version field.

9.2 Migration discipline

  • New D1 migration → next sequential number.
  • Migrations are forward-only by default. Down migrations are documented but not run automatically.
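A sequencing check of the kind described here could be as small as the sketch below. It assumes migration files are named `NNNN_description.sql` and that numbering starts at 1 unless told otherwise; both are assumptions, and `checkMigrationSequence` is a hypothetical name.

```typescript
// Returns a list of problems; an empty list means the sequence is valid.
function checkMigrationSequence(files: readonly string[], first: number = 1): string[] {
  const errors: string[] = [];
  const numbers: number[] = [];

  for (const file of files) {
    const match = /^(\d+)_.+\.sql$/.exec(file);
    if (match === null) {
      errors.push(`unexpected migration file name: ${file}`);
      continue;
    }
    numbers.push(parseInt(match[1], 10));
  }

  // After sorting, position i must hold number first + i. Anything else is
  // either a gap or a duplicate; report the first deviation only.
  numbers.sort((a, b) => a - b);
  for (let i = 0; i < numbers.length; i++) {
    const expected = first + i;
    if (numbers[i] !== expected) {
      errors.push(`expected migration ${expected}, found ${numbers[i]}`);
      break;
    }
  }
  return errors;
}
```

Run against the directory listing of migrations/ in CI, a non-empty result would fail the build.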

10. Observability

Three layers, every code path covered.

10.1 Trace layer (per-request)

  • Every Worker request opens a Langfuse trace via AI Gateway for any LLM call (invariant 12).
  • Every agent run gets a run_id (UUID) attached to: D1 agent_runs row, KV working memory key suffix, every D1 memory_episodes row, every Slack alert, every error_ledger row.

10.2 Metric layer (aggregates)

  • Latency p50/p95/p99 per surface (gateway proxy, agent run, Workflow run).
  • Error rate per provider, per tool, per tenant.
  • Cost per agent run (Anthropic + Cerebras + AWS Bedrock + AI Gateway).
  • Backlog depth per Queue, per Workflow.
  • Token-refresh failure rate per provider.

10.3 Audit layer (durable)

  • kv_audit D1 table — every admin KV write (key, action, source, timestamp).
  • error_ledger D1 table — every failed request.
  • decision_log D1 table — every ADR-driven runtime decision (e.g., bandit weight selection, model fallback chain hop).
  • agent_runs and memory_episodes — every agent run and every turn.

10.4 Alerting

  • Slack alerts on: token-refresh terminal failure, Workflow step failure after 3 retries, gateway error rate > 1% over 5 min, agent run cost > $1, Anthropic daily budget cap hit.
  • Alerts include tenant_id, run_id (when relevant), and a link to the matching D1 row.
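As an illustration of one alert rule, the gateway error-rate threshold (more than 1% over 5 minutes) can be expressed as a pure function over recent request samples. The sample and alert shapes here are hypothetical, and in practice the threshold would be read from config rather than declared in code.

```typescript
interface RequestSample {
  readonly tsMs: number;
  readonly ok: boolean;
}

interface AlertRule {
  readonly windowMs: number;
  readonly maxErrorRate: number;
}

// Values from 10.4; shown as a constant only to keep the sketch self-contained.
const GATEWAY_ERROR_RULE: AlertRule = { windowMs: 5 * 60 * 1000, maxErrorRate: 0.01 };

interface Alert {
  readonly tenant_id: string;
  readonly run_id?: string;
  readonly d1_row_link: string;
  readonly message: string;
}

// Error rate over samples inside (nowMs - windowMs, nowMs]. Zero when empty.
function errorRate(samples: readonly RequestSample[], nowMs: number, windowMs: number): number {
  let total = 0;
  let errors = 0;
  for (const s of samples) {
    if (s.tsMs > nowMs - windowMs && s.tsMs <= nowMs) {
      total++;
      if (!s.ok) errors++;
    }
  }
  return total === 0 ? 0 : errors / total;
}

// Returns an alert when the rate exceeds the rule, otherwise undefined.
function gatewayAlert(
  tenantId: string,
  samples: readonly RequestSample[],
  nowMs: number,
  rule: AlertRule,
  d1RowLink: string,
): Alert | undefined {
  const rate = errorRate(samples, nowMs, rule.windowMs);
  if (rate <= rule.maxErrorRate) return undefined;
  return {
    tenant_id: tenantId,
    d1_row_link: d1RowLink,
    message: `gateway error rate ${(rate * 100).toFixed(2)}% exceeds ${(rule.maxErrorRate * 100).toFixed(2)}%`,
  };
}
```

The returned object carries the fields the second bullet requires, so the Slack sender does not need to look anything up.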

11. Eval gates

No agent behavior change ships without per-tenant evals + an A/B baseline. This is the gate the proving wedge agent must pass before we expand the 19-agent map.

11.1 What requires an eval

  • New agent type (e.g., SDR Agent → Pipeline Reviewer Agent).
  • New prompt or system message version.
  • Model swap (e.g., Anthropic Haiku → Sonnet for a given task).
  • New memory tier integration (e.g., wiring semantic recall into an existing agent).

11.2 The eval flow

  1. Author writes ≥20 graded examples for the change in eval_datasets D1 table per ADR-040 Track E.
  2. Tri-judge auto-grader (per ADR-034) scores baseline + candidate.
  3. Candidate must beat baseline on the relevant metric by a documented margin (default: ≥5% relative on success rate, no regression > 1% on cost or latency).
  4. PR description includes the eval table.
  5. Production rollout is a bandit weight shift in KV, not a code deploy. We can revert by adjusting weights.
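Step 3 of the flow can be sketched as a comparison function. The interpretation below is an assumption: both the 5% gain and the 1% regression limit are treated as relative changes against the baseline, and the minimum of 20 graded examples from step 1 is checked in the same place. `EvalSummary`, `GateConfig`, and `evaluateGate` are hypothetical names; the thresholds would normally come from config.

```typescript
interface EvalSummary {
  readonly examples: number;
  readonly successRate: number; // 0..1
  readonly costUsd: number;     // mean cost per run
  readonly latencyMs: number;   // mean latency per run
}

interface GateConfig {
  readonly minExamples: number;
  readonly minRelativeGain: number;
  readonly maxRelativeRegression: number;
}

const DEFAULT_GATE: GateConfig = {
  minExamples: 20,
  minRelativeGain: 0.05,
  maxRelativeRegression: 0.01,
};

// (candidate - baseline) / baseline, with a guard for a zero baseline.
function relativeChange(baseline: number, candidate: number): number {
  if (baseline === 0) return candidate === 0 ? 0 : Infinity;
  return (candidate - baseline) / baseline;
}

function evaluateGate(
  baseline: EvalSummary,
  candidate: EvalSummary,
  cfg: GateConfig = DEFAULT_GATE,
): { pass: boolean; reasons: string[] } {
  const reasons: string[] = [];
  if (candidate.examples < cfg.minExamples) {
    reasons.push(`only ${candidate.examples} graded examples, need ${cfg.minExamples}`);
  }
  if (relativeChange(baseline.successRate, candidate.successRate) < cfg.minRelativeGain) {
    reasons.push("success-rate gain is below the required margin");
  }
  if (relativeChange(baseline.costUsd, candidate.costUsd) > cfg.maxRelativeRegression) {
    reasons.push("cost regression exceeds the limit");
  }
  if (relativeChange(baseline.latencyMs, candidate.latencyMs) > cfg.maxRelativeRegression) {
    reasons.push("latency regression exceeds the limit");
  }
  return { pass: reasons.length === 0, reasons };
}
```

Returning the list of reasons, not just a boolean, gives the PR description something concrete to include alongside the eval table.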

11.3 Continuous evals

  • Every live agent has a per-tenant eval set that grows over time.
  • Weekly cron re-runs the full eval set against current prod prompts to detect regressions in upstream model behavior.

12. CI gates (the pre-commit and PR contract)

npm run check:pre-commit is the local gate. PR CI is the remote gate. They enforce:

  1. npm run typecheck — TypeScript strict, no any, no @ts-ignore without a comment.
  2. npm test — Vitest, all green. New surfaces require a test.
  3. Cross-tool readiness check (scripts/check_cross_tool_readiness.sh) — 13 V5 invariants enforced.
  4. Drift check against deployed Worker (provider config, KV schema, exposed endpoints).
  5. Secret scan (gitleaks + trufflehog).
  6. Lint: KV-key prefix registry, Vectorize namespace registry, ADR-numbering.
  7. Migration sequencing: new migration must be next sequential number with no gaps.
  8. ADR linting: status, date, decider, supersedes, related fields populated.

A red CI run means fix the failure and continue. Never bypass it with .skip or deploy:unsafe.


13. Documentation

Every primitive in the system is documented in one place that is easy to find from the top of the repo.

13.1 The required docs

13.2 Per-agent docs

  • Every agent class lives at src/agents/{name}/ and ships with README.md (purpose, inputs, outputs, eval baseline) + prompts/ + runner.ts + tests.

13.3 Per-Workflow docs

  • Every Workflow class ships with a header comment listing: trigger, idempotency key shape, expected duration, payload-pointer R2 prefix, failure-mode escalation.

14. Auto-fix: scope and limits

The platform self-heals where it is safe to do so, and escalates where it is not. The principle is “auto-fix the bug class, don’t auto-fix the symptom.”

14.1 Auto-fix is in scope

  • Token refresh on impending expiry (DO alarm).
  • Workflow step retry on transient error.
  • Bandit weight rebalancing on outcome feedback.
  • Cron re-runs of capability_index priors.
  • LLM fallback chain (AI Gateway: Anthropic → Bedrock → Cerebras) on primary failure.
  • Self-healing agent pipeline (per ADR-034) for grading regressions.

14.2 Auto-fix is OUT of scope (always escalate)

  • Token refresh terminal failure → human re-auth.
  • Cross-tenant data leak detected → freeze + page.
  • D1 migration failure mid-run → freeze deploys + page.
  • Eval regression > 5% on a live agent → freeze rollout + page.
  • Cost overrun on Anthropic daily budget → freeze inference + Slack.

15. The proving-wedge discipline

Vision ambition is high (19 agents, 7 motions). Engineering reality is that we ship one end-to-end agent first (fully evaluated, fully observed, fully backed up, fully isolated) before we replicate the pattern. This follows the multi-model review feedback synthesized on 2026-05-07.

15.1 The Q1 proving wedge

  • SDR Agent for tenant ascend (Track A in ADR-040).
  • Acceptance gate before declaring the wedge proven: every numbered rule in this document has at least one enforced manifestation in the SDR Agent’s surface.

15.2 Replication after wedge proves out

  • Pipeline Reviewer Agent and Op Partner Brief Agent are next, in that order, both replicating the SDR Agent’s compliance with this standard.
  • The 19-agent topology in the Vision document is a map, not a roadmap. We expand only when each prior wedge passes its eval gate.

Amendment process

Changes to this document require:

  1. An ADR proposing the change with rationale and rollback plan.
  2. A PR that updates this file AND any related code/CI gate in the same commit.
  3. Mishaal’s approval (engineering-lead-decides applies to how, not to what the rules are).

Drift between this document and ASCEND_OPERATOR_OS_VISION.md is a CI failure once the cross-doc lint lands (tracked as tech debt).


Open items (tracked, not blocking adoption)

  • docs/architecture/KV_KEY_REGISTRY.md — file does not yet exist. Created when the first Operator OS Q1 KV prefix lands. (Tech debt row: OO-EngStd-001.)
  • docs/architecture/VECTORIZE_NAMESPACE_REGISTRY.md — same. (Tech debt row: OO-EngStd-002.)
  • Cross-doc lint between Vision and this Standard. (Tech debt row: OO-EngStd-003.)
  • Workflow header-comment lint. (Tech debt row: OO-EngStd-004.)
  • KV-key-prefix lint. (Tech debt row: OO-EngStd-005.)
  • Quarterly restore-drill cron + report template. (Tech debt row: OO-EngStd-006.)