Skip to content

Cloudflare Analytics Engine for Per-Tool Telemetry

ADR-020 — Cloudflare Analytics Engine for Per-Tool Telemetry

Status: Accepted Date: 2026-04-23 Deciders: Mishaal Murawala Relates to: ADR-018 (Phase 1 task 1.1), docs/platform-product-spec-v0.1.md §6, §9

Context

Phase 1 task 1.1 calls for per-tool call counters so we can do a data-driven tool audit in task 1.2 (cut 3-5 low-use tools). The handover’s suggested implementation was KV keys metrics:tool_calls:{tool}:{YYYY-MM-DD}.

That suggestion predates the Cloudflare Analytics Engine GA (April 2024). A research pass on 2026-04-23 across current Cloudflare docs confirmed three viable options and one clear winner.

Decision

Use Cloudflare Analytics Engine, not KV, for per-tool telemetry.

Binding added to wrangler.toml:

[[analytics_engine_datasets]]
binding = "TOOL_METRICS"
dataset = "ascend_v5_tool_metrics"

Write pattern on tool response path (hot path, via ctx.waitUntil):

ctx.waitUntil(
c.env.TOOL_METRICS.writeDataPoint({
indexes: [toolName], // sampled for grouping (limit 1 index, 32 bytes)
blobs: [tenantId, status, errorCode ?? ''],
doubles: [latencyMs, tokensIn ?? 0, tokensOut ?? 0],
})
);

Query via the Analytics Engine SQL API (/accounts/:account/analytics_engine/sql) in a scheduled weekly cron or on-demand from the admin endpoint.

Rationale — the comparison

DimensionAnalytics EngineKV atomic writesDurable Object counters
Atomicity under concurrent writes✅ Native event stream (ClickHouse-backed)❌ Last-write-wins. Two concurrent tool calls to same tool → one increment lost✅ Atomic
Cost at 500K writes/month (realistic V5 volume)Free (10M writes/month included)~$2.50 (KV writes are $5 per million)~$1.50 (DO requests)
Write latency (hot path)~1–5ms (via waitUntil, non-blocking)<1ms (single KV PUT)~30–50ms (RPC to DO)
Query ergonomicsSQL API: SELECT index1 AS tool, COUNT(*) FROM ... GROUP BY toolManual — list keys, fetch each, sum in appSQL against DO SQLite, one DO per metric
GA status (April 2026)✅ GA since 2024-04
Retention90 days native (configurable via Workers Logs integration)Infinite until deletedPer-DO configuration
Spec alignment (platform-spec §6 backups, §9 observability)✅ Purpose-built for this⚠️ Hack⚠️ Over-engineered for counters

Why not KV

KV is key-value. “Increment a counter” is not a primitive — it’s read-modify-write, which races under concurrent load. At our steady-state ~10-50 tool calls/min, two simultaneous calls to the same tool in the same second will almost certainly lose one increment. The data would be systematically undercounted, silently, for exactly the high-traffic tools we most need accurate data on.

Why not Durable Objects

DOs give atomic increments but add 30-50ms of RPC latency to every hot-path call that wants to write telemetry. That violates the gateway overhead ≤10ms invariant. waitUntil(...) would hide the latency from the caller but still pay the CPU cost and the DO invocation fee.

Why Analytics Engine wins

  • Purpose-built. This is literally what the service is for — write events at Worker request rate, query aggregations later.
  • Free at our scale. 10M writes/month included. Even at 100× growth we stay in the free tier.
  • Non-blocking. Writes happen fire-and-forget inside waitUntil. The hot path sees ~0ms addition.
  • SQL. Query by tool, tenant, status, error, latency in one endpoint. Makes the tool audit (task 1.2) trivial.

Gotchas

  1. Sampling. Analytics Engine samples at high volumes (100% under ~25K writes/s, which we’ll never approach). Irrelevant for us but noted.
  2. One index only, 32 bytes. Use toolName as the index; everything else goes in blobs (for filtering) or doubles (for sum/avg).
  3. Datapoint ordering is not guaranteed. For per-request tracing use Workers Logs or console.log → Logpush. Analytics Engine is for aggregate signals, not individual-request replay.
  4. Query-side cost. Writes are free up to 10M; queries cost per GB scanned (but our dataset is tiny).

Consequences

Positive

  • Phase 1 task 1.1 unblocked — 2-hr implementation instead of the 4-hr estimate (less boilerplate than KV scheme).
  • Task 1.2 (tool audit) has a proper data source.
  • Sets precedent: future platform metrics (error rates, latency distributions, per-tenant cost) all land on the same binding.
  • Zero hot-path impact.

Negative

  • Adds one more CF resource to wrangler.toml (trivial).
  • Analytics Engine SQL syntax is Clickhouse-flavored, not PostgreSQL. One-time learning curve for the first query.

Neutral

  • The existing platform-spec §6 “scheduled backup workflow” pattern stays the same — Analytics Engine datasets are not “backed up” in the traditional sense (they’re the backup).

Implementation plan (inside Phase 1 task 1.1)

  1. Add analytics_engine_datasets binding to wrangler.toml (dev + prod environments).
  2. Create src/lib/telemetry.ts with recordToolCall({tool, tenantId, status, latencyMs, tokensIn, tokensOut, errorCode}) helper.
  3. Wire into every tool’s response path via a single decorator in src/handlers/mcp.ts (don’t sprinkle it across 28 tools).
  4. Test via typecheck + one live call (Analytics Engine has no vitest binding; integration check in staging).
  5. Add admin endpoint GET /admin/metrics/tools?start=<iso>&end=<iso> that issues the SQL query + returns JSON summary.
  6. Document query recipe in docs/runbooks/tool-metrics-audit.md.

Exit criterion: 7 days of data collected → task 1.2 reads the output and identifies bottom 5 tools by call count.

References