Cloudflare Analytics Engine for Per-Tool Telemetry

ADR-020 — Cloudflare Analytics Engine for Per-Tool Telemetry

Status: Accepted
Date: 2026-04-23
Deciders: Mishaal Murawala
Relates to: ADR-018 (Phase 1 task 1.1), docs/platform-product-spec-v0.1.md §6, §9

Context

Phase 1 task 1.1 calls for per-tool call counters so we can do a data-driven tool audit in task 1.2 (cut 3-5 low-use tools). The handover’s suggested implementation was KV keys metrics:tool_calls:{tool}:{YYYY-MM-DD}.

That suggestion predates the Cloudflare Analytics Engine GA (April 2024). A research pass on 2026-04-23 across current Cloudflare docs confirmed three viable options and one clear winner.

Decision

Use Cloudflare Analytics Engine, not KV, for per-tool telemetry.

Binding added to wrangler.toml:

[[analytics_engine_datasets]]
binding = "TOOL_METRICS"
dataset = "ascend_v5_tool_metrics"

Write pattern on tool response path (hot path, via ctx.waitUntil):

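ctx.waitUntil(
  c.env.TOOL_METRICS.writeDataPoint({
    // index1: sampling key, also used for grouping (max 1 index, 96 bytes)
    indexes: [toolName],
    // blob1..blob3: string dimensions for filtering
    blobs: [tenantId, status, errorCode ?? ''],
    // double1..double3: numeric values for SUM/AVG
    doubles: [latencyMs, tokensIn ?? 0, tokensOut ?? 0],
  })
);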

Query via the Analytics Engine SQL API (/accounts/:account/analytics_engine/sql) in a scheduled weekly cron or on-demand from the admin endpoint.
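A minimal sketch of that query step, assuming the SQL API accepts the statement as the raw POST body with a bearer token. The names `SqlApiConfig`, `FetchLike`, `buildToolCountQuery`, and `queryToolCounts` are hypothetical; `index1`, `timestamp`, and `_sample_interval` are columns Analytics Engine exposes. Summing `_sample_interval` instead of using `COUNT(*)` keeps the counts correct if sampling ever applies.

```typescript
// Hypothetical helper names; only the endpoint path and column names come from Cloudflare.
interface SqlApiConfig {
  accountId: string;
  apiToken: string; // assumed: an API token allowed to read account analytics
  dataset: string;  // e.g. "ascend_v5_tool_metrics"
}

// Narrow structural type so the global fetch or a test double can be passed in.
type FetchLike = (
  url: string,
  init: { method: string; headers: Record<string, string>; body: string },
) => Promise<{ ok: boolean; status: number; json(): Promise<unknown> }>;

// Build the per-tool call-count query for the last `days` days, least-used tools first.
function buildToolCountQuery(dataset: string, days: number): string {
  if (!/^[A-Za-z0-9_]+$/.test(dataset)) throw new Error("invalid dataset name");
  if (!Number.isInteger(days) || days <= 0) throw new Error("days must be a positive integer");
  return [
    "SELECT index1 AS tool, SUM(_sample_interval) AS calls",
    `FROM ${dataset}`,
    `WHERE timestamp > NOW() - INTERVAL '${days}' DAY`,
    "GROUP BY tool",
    "ORDER BY calls ASC",
  ].join("\n");
}

// POST the statement to the SQL API and return the parsed JSON response.
async function queryToolCounts(
  cfg: SqlApiConfig,
  days: number,
  fetchImpl: FetchLike,
): Promise<unknown> {
  const url = `https://api.cloudflare.com/client/v4/accounts/${cfg.accountId}/analytics_engine/sql`;
  const res = await fetchImpl(url, {
    method: "POST",
    headers: { Authorization: `Bearer ${cfg.apiToken}` },
    body: buildToolCountQuery(cfg.dataset, days),
  });
  if (!res.ok) throw new Error(`SQL API returned ${res.status}`);
  return res.json();
}
```

In a Worker, the cron handler or admin endpoint would pass the global `fetch` as `fetchImpl`; injecting it keeps the function testable without network access.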

Rationale — the comparison

Dimension	Analytics Engine	KV counters (read-modify-write)	Durable Object counters
Atomicity under concurrent writes	✅ Native event stream (ClickHouse-backed)	❌ Last-write-wins. Two concurrent calls to the same tool → one increment lost	✅ Atomic
Cost at 500K writes/month (realistic V5 volume)	Free (10M writes/month included)	~$2.50 (KV writes are $5 per million)	~$1.50 (DO requests)
Write latency (hot path)	~1–5ms (via `waitUntil`, non-blocking)	<1ms (single KV PUT)	~30–50ms (RPC to DO)
Query ergonomics	SQL API: `SELECT index1 AS tool, COUNT(*) FROM ... GROUP BY tool`	Manual — list keys, fetch each, sum in app	SQL against DO SQLite, one DO per metric
GA status (April 2026)	✅ GA since 2024-04	✅	✅
Retention	90 days native (configurable via Workers Logs integration)	Infinite until deleted	Per-DO configuration
Spec alignment (platform-spec §6 backups, §9 observability)	✅ Purpose-built for this	⚠️ Hack	⚠️ Over-engineered for counters

Why not KV

KV is a key-value store with no atomic increment. “Increment a counter” is a read-modify-write sequence, which races under concurrent load. At our steady-state ~10-50 tool calls/min, any two calls to the same tool that overlap between the read and the write lose one increment. The counts would be systematically and silently low for exactly the high-traffic tools we most need accurate data on.
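The race can be reproduced without Cloudflare at all. The sketch below uses a hypothetical in-memory stand-in for a KV namespace (`FakeKV`, with asynchronous `get` and `put`) and a made-up tool name. It runs two read-modify-write increments concurrently; both read the same starting value, so one increment is lost.

```typescript
// In-memory stand-in for a KV namespace; hypothetical, for illustration only.
class FakeKV {
  private store = new Map<string, string>();

  async get(key: string): Promise<string | null> {
    await Promise.resolve(); // yield, as a real network read would
    return this.store.get(key) ?? null;
  }

  async put(key: string, value: string): Promise<void> {
    await Promise.resolve(); // yield, as a real network write would
    this.store.set(key, value);
  }
}

// Naive counter: read, add one, write back. Nothing makes the three steps atomic.
async function incrementCounter(kv: FakeKV, key: string): Promise<void> {
  const current = Number((await kv.get(key)) ?? "0");
  await kv.put(key, String(current + 1));
}

// Two concurrent increments on the same key.
async function demoLostUpdate(): Promise<number> {
  const kv = new FakeKV();
  const key = "metrics:tool_calls:search:2026-04-23";
  await Promise.all([incrementCounter(kv, key), incrementCounter(kv, key)]);
  return Number(await kv.get(key));
}

// demoLostUpdate() resolves to 1, not 2: both calls read "0" before either wrote.
```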

Why not Durable Objects

DOs give atomic increments but add 30-50ms of RPC latency to every hot-path call that writes telemetry, which violates the gateway overhead ≤10ms invariant. Moving the call into waitUntil(...) would hide the latency from the caller, but we would still pay the CPU cost and the DO invocation fee.

Why Analytics Engine wins

Purpose-built. This is literally what the service is for — write events at Worker request rate, query aggregations later.
Free at our scale. 10M writes/month included. At 500K writes/month that leaves room for 20× growth before we leave the free tier.
Non-blocking. Writes happen fire-and-forget inside waitUntil. The hot path sees ~0ms addition.
SQL. Query by tool, tenant, status, error, latency in one endpoint. Makes the tool audit (task 1.2) trivial.

Gotchas

Sampling. Analytics Engine samples only at high volumes (no sampling under ~25K writes/s, which we’ll never approach). Irrelevant at our scale, but noted.
One index only, 96 bytes max. Use toolName as the index; everything else goes in blobs (for filtering) or doubles (for sum/avg).
Datapoint ordering is not guaranteed. For per-request tracing use Workers Logs or console.log → Logpush. Analytics Engine is for aggregate signals, not individual-request replay.
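Query-side cost. Writes are free up to 10M/month; read queries are metered separately, per query, with their own included allowance. A weekly cron plus occasional admin lookups is a negligible volume.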

Consequences

Positive

Phase 1 task 1.1 unblocked — 2-hr implementation instead of the 4-hr estimate (less boilerplate than KV scheme).
Task 1.2 (tool audit) has a proper data source.
Sets precedent: future platform metrics (error rates, latency distributions, per-tenant cost) all land on the same binding.
Zero hot-path impact.

Negative

Adds one more CF resource to wrangler.toml (trivial).
Analytics Engine SQL syntax is ClickHouse-flavored, not PostgreSQL. One-time learning curve for the first query.

Neutral

The existing platform-spec §6 “scheduled backup workflow” pattern stays the same — Analytics Engine datasets are not “backed up” in the traditional sense (they’re the backup).

Implementation plan (inside Phase 1 task 1.1)

Add analytics_engine_datasets binding to wrangler.toml (dev + prod environments).
Create src/lib/telemetry.ts with recordToolCall({tool, tenantId, status, latencyMs, tokensIn, tokensOut, errorCode}) helper.
Wire into every tool’s response path via a single decorator in src/handlers/mcp.ts (don’t sprinkle it across 28 tools).
Test via typecheck + one live call (Analytics Engine has no vitest binding; integration check in staging).
Add admin endpoint GET /admin/metrics/tools?start=<iso>&end=<iso> that issues the SQL query + returns JSON summary.
Document query recipe in docs/runbooks/tool-metrics-audit.md.
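The helper and the single wrapper from the plan above might look roughly like the following sketch. `recordToolCall` and its fields come from the plan; `MetricsDataset`, `ToolCall`, `ToolHandler`, `withTelemetry`, and the `"ok" | "error"` status values are assumptions made for illustration. The structural `MetricsDataset` type stands in for the binding type that `@cloudflare/workers-types` provides.

```typescript
// Minimal structural type for the TOOL_METRICS binding (assumption, see lead-in).
interface MetricsDataset {
  writeDataPoint(point: { indexes?: string[]; blobs?: string[]; doubles?: number[] }): void;
}

interface ToolCall {
  tool: string;
  tenantId: string;
  status: "ok" | "error"; // assumed status vocabulary
  latencyMs: number;
  tokensIn?: number;
  tokensOut?: number;
  errorCode?: string;
}

// Same field layout as the write pattern in the Decision section.
function recordToolCall(dataset: MetricsDataset, call: ToolCall): void {
  try {
    dataset.writeDataPoint({
      indexes: [call.tool],
      blobs: [call.tenantId, call.status, call.errorCode ?? ""],
      doubles: [call.latencyMs, call.tokensIn ?? 0, call.tokensOut ?? 0],
    });
  } catch {
    // Telemetry must never break a tool call; drop the data point on failure.
  }
}

type ToolHandler<A, R> = (args: A) => Promise<R>;

// Hypothetical wrapper: times the handler, records one data point, rethrows errors.
function withTelemetry<A, R>(
  dataset: MetricsDataset,
  tool: string,
  tenantId: string,
  handler: ToolHandler<A, R>,
): ToolHandler<A, R> {
  return async (args: A): Promise<R> => {
    const start = Date.now();
    try {
      const result = await handler(args);
      recordToolCall(dataset, { tool, tenantId, status: "ok", latencyMs: Date.now() - start });
      return result;
    } catch (err) {
      const errorCode = err instanceof Error ? err.name : "unknown";
      recordToolCall(dataset, {
        tool,
        tenantId,
        status: "error",
        latencyMs: Date.now() - start,
        errorCode,
      });
      throw err;
    }
  };
}
```

Wrapping at the dispatch point means each tool handler stays unaware of telemetry, and a failed write can only lose a data point, never a tool response.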

Exit criterion: 7 days of data collected → task 1.2 reads the output and identifies bottom 5 tools by call count.

References

Analytics Engine Pricing
Analytics Engine SQL API
Workers KV Pricing — for the comparison
Durable Objects Pricing — for the comparison