Cloudflare Analytics Engine for Per-Tool Telemetry
ADR-020 — Cloudflare Analytics Engine for Per-Tool Telemetry
Status: Accepted
Date: 2026-04-23
Deciders: Mishaal Murawala
Relates to: ADR-018 (Phase 1 task 1.1), docs/platform-product-spec-v0.1.md §6, §9
Context
Phase 1 task 1.1 calls for per-tool call counters so we can do a data-driven tool audit in task 1.2 (cut 3-5 low-use tools). The handover’s suggested implementation was KV keys metrics:tool_calls:{tool}:{YYYY-MM-DD}.
That suggestion predates the Cloudflare Analytics Engine GA (April 2024). A research pass on 2026-04-23 across current Cloudflare docs confirmed three viable options and one clear winner.
Decision
Use Cloudflare Analytics Engine, not KV, for per-tool telemetry.
Binding added to wrangler.toml:
[[analytics_engine_datasets]]binding = "TOOL_METRICS"dataset = "ascend_v5_tool_metrics"Write pattern on tool response path (hot path, via ctx.waitUntil):
ctx.waitUntil( c.env.TOOL_METRICS.writeDataPoint({ indexes: [toolName], // sampled for grouping (limit 1 index, 32 bytes) blobs: [tenantId, status, errorCode ?? ''], doubles: [latencyMs, tokensIn ?? 0, tokensOut ?? 0], }));Query via the Analytics Engine SQL API (/accounts/:account/analytics_engine/sql) in a scheduled weekly cron or on-demand from the admin endpoint.
Rationale — the comparison
| Dimension | Analytics Engine | KV atomic writes | Durable Object counters |
|---|---|---|---|
| Atomicity under concurrent writes | ✅ Native event stream (ClickHouse-backed) | ❌ Last-write-wins. Two concurrent tool calls to same tool → one increment lost | ✅ Atomic |
| Cost at 500K writes/month (realistic V5 volume) | Free (10M writes/month included) | ~$2.50 (KV writes are $5 per million) | ~$1.50 (DO requests) |
| Write latency (hot path) | ~1–5ms (via waitUntil, non-blocking) | <1ms (single KV PUT) | ~30–50ms (RPC to DO) |
| Query ergonomics | SQL API: SELECT index1 AS tool, COUNT(*) FROM ... GROUP BY tool | Manual — list keys, fetch each, sum in app | SQL against DO SQLite, one DO per metric |
| GA status (April 2026) | ✅ GA since 2024-04 | ✅ | ✅ |
| Retention | 90 days native (configurable via Workers Logs integration) | Infinite until deleted | Per-DO configuration |
| Spec alignment (platform-spec §6 backups, §9 observability) | ✅ Purpose-built for this | ⚠️ Hack | ⚠️ Over-engineered for counters |
Why not KV
KV is key-value. “Increment a counter” is not a primitive — it’s read-modify-write, which races under concurrent load. At our steady-state ~10-50 tool calls/min, two simultaneous calls to the same tool in the same second will almost certainly lose one increment. The data would be systematically undercounted, silently, for exactly the high-traffic tools we most need accurate data on.
Why not Durable Objects
DOs give atomic increments but add 30-50ms of RPC latency to every hot-path call that wants to write telemetry. That violates the gateway overhead ≤10ms invariant. waitUntil(...) would hide the latency from the caller but still pay the CPU cost and the DO invocation fee.
Why Analytics Engine wins
- Purpose-built. This is literally what the service is for — write events at Worker request rate, query aggregations later.
- Free at our scale. 10M writes/month included. Even at 100× growth we stay in the free tier.
- Non-blocking. Writes happen fire-and-forget inside
waitUntil. The hot path sees ~0ms addition. - SQL. Query by tool, tenant, status, error, latency in one endpoint. Makes the tool audit (task 1.2) trivial.
Gotchas
- Sampling. Analytics Engine samples at high volumes (100% under ~25K writes/s, which we’ll never approach). Irrelevant for us but noted.
- One index only, 32 bytes. Use
toolNameas the index; everything else goes inblobs(for filtering) ordoubles(for sum/avg). - Datapoint ordering is not guaranteed. For per-request tracing use Workers Logs or
console.log→ Logpush. Analytics Engine is for aggregate signals, not individual-request replay. - Query-side cost. Writes are free up to 10M; queries cost per GB scanned (but our dataset is tiny).
Consequences
Positive
- Phase 1 task 1.1 unblocked — 2-hr implementation instead of the 4-hr estimate (less boilerplate than KV scheme).
- Task 1.2 (tool audit) has a proper data source.
- Sets precedent: future platform metrics (error rates, latency distributions, per-tenant cost) all land on the same binding.
- Zero hot-path impact.
Negative
- Adds one more CF resource to wrangler.toml (trivial).
- Analytics Engine SQL syntax is Clickhouse-flavored, not PostgreSQL. One-time learning curve for the first query.
Neutral
- The existing platform-spec §6 “scheduled backup workflow” pattern stays the same — Analytics Engine datasets are not “backed up” in the traditional sense (they’re the backup).
Implementation plan (inside Phase 1 task 1.1)
- Add
analytics_engine_datasetsbinding towrangler.toml(dev + prod environments). - Create
src/lib/telemetry.tswithrecordToolCall({tool, tenantId, status, latencyMs, tokensIn, tokensOut, errorCode})helper. - Wire into every tool’s response path via a single decorator in
src/handlers/mcp.ts(don’t sprinkle it across 28 tools). - Test via typecheck + one live call (Analytics Engine has no vitest binding; integration check in staging).
- Add admin endpoint
GET /admin/metrics/tools?start=<iso>&end=<iso>that issues the SQL query + returns JSON summary. - Document query recipe in
docs/runbooks/tool-metrics-audit.md.
Exit criterion: 7 days of data collected → task 1.2 reads the output and identifies bottom 5 tools by call count.
References
- Analytics Engine Pricing
- Analytics Engine SQL API
- Workers KV Pricing — for the comparison
- Durable Objects Pricing — for the comparison