ADR-041 — Hot-Path Latency Budget Expansion: ≤10 ms → ≤30 ms

Status: Accepted
Date: 2026-05-07
Decider: Mishaal Murawala (engineering sequencing delegated to Claude Code per ADR-040)
Supersedes: Invariant #10 in .claude/rules/v5-invariants.md (≤10 ms gateway overhead)
Related: ADR-040, ADR-042, ASCEND_OPERATOR_OS_VISION.md §5 (AgentRuntime), §3.2 (Capability Index)
Invariant changed: #10

Context

Invariant #10 set a ≤10 ms gateway overhead budget covering: auth lookup (KV) + token read (KV) + config lookup (KV) + outbound dispatch + AI Gateway callback. This was appropriate for the V5 pure-proxy phase where the gateway was a stateless KV-backed request forwarder.

The Operator OS architecture (adopted 2026-05-07, ADR-040) introduces three latency contributors on the agent execution path that the 10 ms budget did not account for:

1. AgentRuntime Service Binding call (Track A). POST /v1/agents/:tenant/:agent_type/run routes through a Service Binding to the AgentRuntime DO. The Service Binding call itself takes ~2–5 ms (in-datacenter, same CF PoP). Existing gateway routes never touch the DO on their hot path; this overhead applies only to the new /v1/agents/ surface.

2. AI Gateway callback. The AI Gateway observability callback was assumed to take <1 ms when invariant #10 was set. In practice, under load with the Analytics Engine pipeline active, it adds ~3–7 ms. This was never measured when the original 10 ms budget was designed.

3. Capability-index retrieval (Vectorize, Track B). After Track B ships, the agent assembly path calls retrieveCapabilities(intent), which queries the Vectorize capability_index namespace. Vectorize p95 query latency is ~8–12 ms per Cloudflare documentation (2025-Q4 benchmarks). This call happens in the agent context assembly phase, before the agent turn begins, not in the existing gateway hot path for tool proxying.

Combined measured / projected overhead:

Existing KV auth + token + config: ~3–4 ms (unchanged)
AI Gateway callback (actual): ~3–7 ms (was assumed <1 ms)
AgentRuntime Service Binding (new): ~2–5 ms (agents path only)
Vectorize capability retrieval (new, post-Track B): ~8–12 ms (agents path only)
Total agent path p95: ~16–28 ms
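
The totals above can be reproduced by summing the low and high end of each component range. The following is a minimal sketch of that arithmetic; the type and function names are illustrative assumptions, and the figures are the ones listed above.

```typescript
// Illustrative only: names are hypothetical; figures are copied from the list above.
interface LatencyRange {
  name: string;
  lowMs: number;
  highMs: number;
  agentPathOnly: boolean;
}

const components: LatencyRange[] = [
  { name: "KV auth + token + config", lowMs: 3, highMs: 4, agentPathOnly: false },
  { name: "AI Gateway callback", lowMs: 3, highMs: 7, agentPathOnly: false },
  { name: "AgentRuntime Service Binding", lowMs: 2, highMs: 5, agentPathOnly: true },
  { name: "Vectorize capability retrieval", lowMs: 8, highMs: 12, agentPathOnly: true },
];

// Sum the low and high ends of every component that applies to the given path.
function totalRange(
  parts: LatencyRange[],
  agentPath: boolean,
): { lowMs: number; highMs: number } {
  return parts
    .filter((p) => agentPath || !p.agentPathOnly)
    .reduce(
      (acc, p) => ({ lowMs: acc.lowMs + p.lowMs, highMs: acc.highMs + p.highMs }),
      { lowMs: 0, highMs: 0 },
    );
}

const agentTotal = totalRange(components, true);  // { lowMs: 16, highMs: 28 }
const proxyTotal = totalRange(components, false); // { lowMs: 6, highMs: 11 }
```

Adding per-component ranges is only a rough estimate, since p95 values of separate phases do not add exactly. It does show that the existing tool-proxy path alone lands at ~6–11 ms, which already touches the old 10 ms limit.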

The original 10 ms budget is already exceeded on existing routes at the upper end of the observed range: KV lookups (~3–4 ms) plus the AI Gateway callback (~3–7 ms) come to ~6–11 ms before any agent component is added. Keeping the invariant at 10 ms would require either removing AI Gateway observability (violates invariant #12) or abandoning the Capability Index (violates the vision’s core architectural bet).

Decision

Expand the hot-path latency budget from ≤10 ms to ≤30 ms total (auth + token + route + capability retrieval + AI Gateway callback), measured as p95.

The 30 ms figure is chosen to:

Accommodate the projected agent assembly path (~16–28 ms at p95) with roughly 2 ms of headroom at the upper end.
Stay well below the 100 ms threshold where LLM users perceive “sluggishness” in the agent interface.
Remain strict enough that accidental D1 reads or synchronous DO calls would still breach the budget and be caught by p95 monitoring.

What changes

Invariant #10 (old):

Gateway overhead ≤10 ms. auth + token + route + AI Gateway callback included.

Invariant #10 (new):

Gateway overhead ≤30 ms p95. auth + token + route + capability retrieval + AI Gateway callback included. The existing tool-proxy path (non-agent routes) must remain ≤15 ms p95. Agent assembly path (including Vectorize retrieval) may use the full 30 ms budget.

The split budget (≤15 ms for tool proxy, ≤30 ms for agent assembly) ensures that adding the agent surface does not degrade the existing gateway’s responsiveness for Cursor, Codex, and direct tool calls.
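
As an illustration of the split budget, a check could select the limit from the request path. This is a sketch under assumptions: the /v1/agents/ prefix and the 15 ms / 30 ms limits come from this ADR, while the function and constant names are hypothetical.

```typescript
// Hypothetical helper names; only the path prefix and the two limits come from the ADR.
const TOOL_PROXY_BUDGET_MS = 15;
const AGENT_ASSEMBLY_BUDGET_MS = 30;

type RouteClass = "agent" | "tool-proxy";

// Anything under /v1/agents/ is treated as the agent assembly path.
function classifyRoute(pathname: string): RouteClass {
  return pathname.startsWith("/v1/agents/") ? "agent" : "tool-proxy";
}

// Return the p95 overhead budget (ms) that applies to a request path.
function budgetForPath(pathname: string): number {
  return classifyRoute(pathname) === "agent"
    ? AGENT_ASSEMBLY_BUDGET_MS
    : TOOL_PROXY_BUDGET_MS;
}

// True when a measured p95 overhead is within the budget for that path.
function withinBudget(pathname: string, p95OverheadMs: number): boolean {
  return p95OverheadMs <= budgetForPath(pathname);
}
```

Keeping the classification in one place means the two dashboards and alerts can share the same definition of which routes count as agent routes.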

What stays the same

Invariant #2 (KV-only hot path): Token and auth lookups remain KV-only. D1 is never added to the request path. This ADR does NOT relax invariant #2.
Invariant #3 (fail-fast): No retries in the request path. The extra latency budget is not a license to add retry loops.
Invariant #6 (no DO in request path): The existing gateway routes for tool proxying still never touch a DO. The AgentRuntime DO is only reachable via the /v1/agents/ admin surface (CF-Access gated), not from the hot path of tool calls.
Invariant #11 (30s AbortController): Unchanged. The budget expansion is about gateway overhead, not the upstream API call timeout.
Invariant #12 (AI Gateway): Unchanged. All LLM calls go through AI Gateway.
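
To make invariants #3 and #11 concrete, the sketch below issues a single upstream attempt that is aborted after a timeout and never retried. It is not the gateway's actual code: the function name, the FetchLike type, and the injected fetch implementation are assumptions made so the example is self-contained.

```typescript
// Hypothetical sketch: one attempt, aborted after a timeout, no retry loop.
type FetchLike = (
  url: string,
  init: { signal: AbortSignal },
) => Promise<{ status: number }>;

async function callUpstreamOnce(
  fetchImpl: FetchLike,
  url: string,
  timeoutMs = 30000, // 30 s, matching invariant #11
): Promise<{ status: number }> {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), timeoutMs);
  try {
    // A failure or abort propagates to the caller; nothing is retried here.
    return await fetchImpl(url, { signal: controller.signal });
  } finally {
    clearTimeout(timer);
  }
}
```

The timeout bounds the upstream call only; it is separate from the gateway overhead budget this ADR changes.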

Acceptance criteria

p95 gateway overhead for non-agent routes (tool proxy) stays ≤15 ms, measured via AI Gateway dashboard over a 24-hour window post-deploy.
p95 gateway overhead for agent assembly routes stays ≤30 ms, measured over the first 7 days of Track A live traffic.
The AI Gateway dashboard shows no p99 spikes above 50 ms on either path under normal load.
A monitoring alert is configured (Slack via existing alerting) to fire if p95 non-agent overhead exceeds 20 ms for >5 consecutive minutes.
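
The criteria above rely on a p95 figure and on a sustained-breach rule. A minimal sketch of both follows, assuming nearest-rank percentiles and one p95 value per one-minute bucket; neither assumption is specified by this ADR, and the function names are hypothetical.

```typescript
// Nearest-rank p95 over a window of overhead samples, in milliseconds.
function p95(samples: number[]): number {
  if (samples.length === 0) throw new Error("no samples");
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((95 * sorted.length) / 100); // 1-based nearest rank
  return sorted[rank - 1];
}

// Fires when the per-minute p95 stays above the threshold for MORE than
// `minConsecutive` consecutive one-minute buckets (">5 consecutive minutes").
function shouldAlert(
  perMinuteP95: number[],
  thresholdMs = 20,
  minConsecutive = 5,
): boolean {
  let run = 0;
  for (const value of perMinuteP95) {
    run = value > thresholdMs ? run + 1 : 0;
    if (run > minConsecutive) return true;
  }
  return false;
}
```

With these definitions, five consecutive minutes above 20 ms do not fire the alert, and a sixth does. A single minute back under the threshold resets the count.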

Implementation note

This ADR does NOT require a code change to ship. It formalizes the budget that the Track A and Track B implementations will be held to. The invariant files must be updated when this ADR merges (per the invariant change procedure in docs/engineering-standards/ASCEND_OPERATOR_OS_ENGINEERING_STANDARD.md §5).

The invariant update (.claude/rules/v5-invariants.md + docs/architecture/ASCEND-CLOUD-NATIVE-V2-ENGINEERING-PLAN.md) will land in the same commit that merges the Track A implementation PR, not in this plan-first PR. This is the standard sequencing: ADR accepted on main, invariant text updated when Track A merges.

Consequences

Positive:

Unblocks Track A (AgentRuntime DO) and Track B (Vectorize capability index) from being measured against an artificially tight budget that predates these components.
Makes the monitoring regime honest — 10 ms was already breached by AI Gateway callback overhead in production.
The split budget (≤15 ms proxy / ≤30 ms agent) protects existing Cursor/Codex/tool-proxy performance from regression.

Negative / accepted risk:

A 30 ms p95 budget for agent paths is still tight for some Vectorize deployments under spike load. If p95 exceeds 30 ms after Track B ships, the retrieval helper will need a KV cache layer (cache the top-20 tools for frequent intent patterns with a 5-minute TTL). This is not in Q1 scope but is the documented escalation path.
The split budget requires two separate monitoring dashboards / alerts. This is manageable but adds operational overhead.
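
The KV cache escalation path mentioned above could look roughly like the sketch below. It is a sketch under stated assumptions, not a design: retrieveCapabilities is named in this ADR, but its signature here (intent string in, tool names out) is assumed, and TtlStore, MemoryTtlStore, and withCapabilityCache are hypothetical names. The in-memory store stands in for a KV namespace so the example runs on its own.

```typescript
// Assumed signature for the retrieval helper: intent in, tool names out.
type Retrieve = (intent: string) => Promise<string[]>;

// Minimal key-value interface with a per-entry TTL (hypothetical).
interface TtlStore {
  get(key: string): Promise<string | null>;
  put(key: string, value: string, ttlSeconds: number): Promise<void>;
}

// In-memory stand-in for a KV namespace, with an injectable clock for testing.
class MemoryTtlStore implements TtlStore {
  private entries = new Map<string, { value: string; expiresAt: number }>();
  private now: () => number;

  constructor(now: () => number = () => Date.now()) {
    this.now = now;
  }

  async get(key: string): Promise<string | null> {
    const hit = this.entries.get(key);
    if (!hit || hit.expiresAt <= this.now()) return null;
    return hit.value;
  }

  async put(key: string, value: string, ttlSeconds: number): Promise<void> {
    this.entries.set(key, { value, expiresAt: this.now() + ttlSeconds * 1000 });
  }
}

const CACHE_TTL_SECONDS = 300; // 5-minute TTL from the escalation path
const TOP_N = 20;              // top-20 tools per intent pattern

// Wrap a retrieval function so repeated intents are served from the store.
function withCapabilityCache(retrieve: Retrieve, store: TtlStore): Retrieve {
  return async (intent) => {
    // Naive normalization of the intent into a cache key (an assumption).
    const key = `cap:${intent.trim().toLowerCase()}`;
    const cached = await store.get(key);
    if (cached !== null) return JSON.parse(cached) as string[];
    const tools = (await retrieve(intent)).slice(0, TOP_N);
    await store.put(key, JSON.stringify(tools), CACHE_TTL_SECONDS);
    return tools;
  };
}
```

On Workers KV the store would map to `get` and to `put` with the `expirationTtl` option. A cache hit still costs a KV read, so the saving is the difference between that read and the ~8–12 ms Vectorize query.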