Autoresearch severity logic: current-state, not trailing average

ADR-021 — Autoresearch severity logic: current-state, not trailing average

Status: Accepted
Date: 2026-04-23
Deciders: Mishaal Murawala
Relates to: src/cron/autoresearch.ts, ADR-018 (Phase 1 task 1.6 ops wounds)

Context

The autoresearch cron runs daily at 06:00 UTC, analyzes error_ledger D1 data, and posts a Slack digest. Since shipping 2026-04-13, the CRITICAL severity rule has been:

```ts
const avgDaily = total_7d_errors / 7;
if (avgDaily > 50) severity = 'critical';
else if (avgDaily > 10) severity = 'warning';
```

This produced daily CRITICAL alerts even when the system was completely healthy. Observed pattern:

| Date | Autoresearch: “X errors/day over 7 days” | Daily digest: “Errors (24h)” |
| --- | --- | --- |
| 2026-04-22 | CRITICAL — 72 errors/day | 0 errors |
| 2026-04-23 | CRITICAL — 73 errors/day | 9 errors |

On 2026-04-22 the system had zero errors in the past 24 hours. The 7-day trailing window still contained older errors, so the average stayed above 50 and fired CRITICAL. Had the user fixed every remaining error by 2026-04-23, the alert would have fired CRITICAL for five more days after the fix. By construction.

This is a lagging-indicator bug: the metric is mathematically incapable of reflecting same-day resolutions. The alert said “there’s a problem” when there wasn’t, and it said the same thing every day, which destroyed its signal value. The user’s lived experience was “I keep getting told this is fixed, then it happens again” — but the actual errors never came back; the average was just slowly decaying.
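
The lag can be reproduced with a minimal sketch of the old rule. The daily counts below are hypothetical, chosen only so that the average lands near the observed 72 errors/day; they are not the real ledger data, and `oldSeverity` is an illustrative name rather than the function in src/cron/autoresearch.ts.

```typescript
type OldSeverity = "critical" | "warning" | "info";

// Hypothetical daily error counts, oldest first: six bad days,
// then a most-recent day with zero errors (the "fixed" day).
const dailyCounts: number[] = [90, 85, 80, 95, 70, 84, 0];

// Old rule: severity is derived from the 7-day trailing average only.
function oldSeverity(last7Days: number[]): OldSeverity {
  const total = last7Days.reduce((sum, n) => sum + n, 0);
  const avgDaily = total / 7;
  if (avgDaily > 50) return "critical";
  if (avgDaily > 10) return "warning";
  return "info";
}

// The most recent day is clean, yet the average is 504 / 7 = 72.
console.log(oldSeverity(dailyCounts)); // prints "critical"
```

The zero on the last day contributes nothing to the sum, so the result is decided entirely by days that are already over.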

Decision

Drive severity from past-24h data only. Keep the 7-day window as historical context at info severity — visible via the admin endpoint, excluded from the alert digest.

New rules inside analyzeErrors:

  1. Query error_ledger with three windows in parallel — 24h (severity driver), 7d weekly aggregates (historical context + recurring-pattern detection), daily buckets (spike detection).
  2. CRITICAL fires if a single provider has ≥10 errors in the past 24h. “Actively failing now.”
  3. WARNING fires for 3-9 errors on a single provider in past 24h. “Degraded but not broken.”
  4. Spike detection (the most recent day’s count >3× the 7-day mean) still fires CRITICAL, because the spike IS current-day data.
  5. Historical rows (7-day total ≥50 but 24h count <3) become info with status: 'historical' — stored in KV for the admin endpoint, excluded from the Slack digest so they don’t page anyone.
  6. Recurring patterns (same error_type ≥5 times in 7 days, no current fire) become info. Information-dense context, not a trigger.
  7. Alerts include actual error samples: the top-3 error types for the flagged provider, with the most-recent error_message truncated to 120 chars. No more “check the status page” with zero context.
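
Rules 2, 3 and 5 can be sketched as a single per-provider classifier. This is an illustration under assumptions, not the code in analyzeErrors: the constant names come from this ADR, their values are read off the rules above, and `ProviderWindow`, `Classification` and `classifyProvider` are hypothetical names. Spike detection and recurring-pattern detection are left out.

```typescript
type Severity = "critical" | "warning" | "info";

// Constant names are from this ADR; values are taken from rules 2, 3 and 5.
const CRITICAL_24H_THRESHOLD = 10;
const WARNING_24H_THRESHOLD = 3;
const HISTORICAL_7D_THRESHOLD = 50;

// Hypothetical shape of the per-provider counts after the ledger queries.
interface ProviderWindow {
  provider: string;
  last24hTotal: number; // errors in the past 24h (severity driver)
  week7dTotal: number;  // errors in the past 7 days (context only)
}

interface Classification {
  severity: Severity;
  historical: boolean; // true maps to status: 'historical'
}

// Returns null when the provider needs no suggestion at all.
function classifyProvider(w: ProviderWindow): Classification | null {
  if (w.last24hTotal >= CRITICAL_24H_THRESHOLD) {
    return { severity: "critical", historical: false }; // actively failing now
  }
  if (w.last24hTotal >= WARNING_24H_THRESHOLD) {
    return { severity: "warning", historical: false }; // degraded but not broken
  }
  if (w.week7dTotal >= HISTORICAL_7D_THRESHOLD) {
    return { severity: "info", historical: true }; // old errors, quiet today
  }
  return null;
}

// A quiet day after a bad week is demoted to historical info.
console.log(classifyProvider({ provider: "google_ads", last24hTotal: 0, week7dTotal: 504 }));
```

The 24h checks run first, so a large 7-day total can never raise severity on its own; it only decides whether a quiet provider is kept as context.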

Additionally, the Slack digest logic changes:

  • Header count excludes config_cleanup (it’s an artifact of incomplete MCP usage tracking — not actionable) and excludes info (historical context). Previous header said “32 suggestions”; new header says “1 critical, 2 warning” only.
  • Overall severity is driven by actionable items only. A digest with zero critical/warning goes out as severity info (“All systems nominal”) regardless of how many info-level historical rows exist.
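
A minimal sketch of the header and overall-severity logic described in the two bullets above. The `Suggestion` shape and the `digestHeader` name are assumptions made for illustration; only the `config_cleanup` category and the severity names come from this ADR.

```typescript
type DigestSeverity = "critical" | "warning" | "info";

// Hypothetical suggestion shape; only the two fields needed here.
interface Suggestion {
  category: string; // e.g. "config_cleanup"
  severity: DigestSeverity;
}

// Actionable means: not config_cleanup and not info-level context.
function isActionable(s: Suggestion): boolean {
  return s.category !== "config_cleanup" && s.severity !== "info";
}

function digestHeader(suggestions: Suggestion[]): { severity: DigestSeverity; header: string } {
  const actionable = suggestions.filter(isActionable);
  const critical = actionable.filter((s) => s.severity === "critical").length;
  const warning = actionable.filter((s) => s.severity === "warning").length;

  // No actionable items: the digest goes out as info, however many
  // historical or config_cleanup rows exist.
  if (critical === 0 && warning === 0) {
    return { severity: "info", header: "All systems nominal" };
  }

  const parts: string[] = [];
  if (critical > 0) parts.push(`${critical} critical`);
  if (warning > 0) parts.push(`${warning} warning`);
  return { severity: critical > 0 ? "critical" : "warning", header: parts.join(", ") };
}
```

With one critical, two warnings and any number of `config_cleanup` or info rows, this yields the header "1 critical, 2 warning".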

Rationale

  • Alert fatigue destroys signal. A CRITICAL that fires every morning regardless of system health becomes wallpaper. Real critical events get ignored along with it.
  • 24h is the correct window for “is this fire active now?” Platform spec §9 (weekly review cadence) is a different question — for that, the historical trailing data is the right source, but it belongs in a weekly digest, not a daily pager-eligible alert.
  • Spike detection retains CRITICAL because a >3× mean spike on the most-recent day is current-day evidence of degradation. This is the ONE place the 7-day window contributes to critical severity — as a baseline, not as a count.
  • Error-message samples turn alerts into diagnostic events. Before: “google_ads averaging 73 errors/day, check status page.” After: “google_ads: 12 errors in past 24h — UPSTREAM_ERROR×8 ‘quota exceeded’; AUTH_FAILED×4 ‘invalid_grant’.” That’s actionable.
  • config_cleanup header count was misleading. 27-32 config_cleanup suggestions per day made every digest look like a firehose even though the top-3 excluded them. This category is a placeholder until MCP usage counters ship (currently only /api/ REST traffic is counted).
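
The "After" alert line in the error-message bullet above could be assembled roughly as follows. This is a sketch under assumptions: the ADR names a truncate() helper but does not give its signature, so the one here is a guess, and `ErrorSample` and `formatAlertLine` are hypothetical names. The separator in the output is also illustrative.

```typescript
// Hypothetical per-error-type aggregate for one provider.
interface ErrorSample {
  errorType: string;
  count: number;
  latestMessage: string; // most recent error_message for this type
}

const MAX_MESSAGE_CHARS = 120; // from rule 7 of the decision
const TOP_SAMPLE_COUNT = 3;    // top-3 error types, also from rule 7

// Assumed signature: cut to at most `max` characters, ending in an ellipsis.
function truncate(text: string, max: number): string {
  return text.length <= max ? text : text.slice(0, max - 1) + "…";
}

function formatAlertLine(provider: string, last24hTotal: number, samples: ErrorSample[]): string {
  // Copy before sorting so the caller's array is not mutated.
  const top = [...samples].sort((a, b) => b.count - a.count).slice(0, TOP_SAMPLE_COUNT);
  const details = top
    .map((s) => `${s.errorType}×${s.count} '${truncate(s.latestMessage, MAX_MESSAGE_CHARS)}'`)
    .join("; ");
  return `${provider}: ${last24hTotal} errors in past 24h - ${details}`;
}

console.log(
  formatAlertLine("google_ads", 12, [
    { errorType: "AUTH_FAILED", count: 4, latestMessage: "invalid_grant" },
    { errorType: "UPSTREAM_ERROR", count: 8, latestMessage: "quota exceeded" },
  ]),
);
// prints "google_ads: 12 errors in past 24h - UPSTREAM_ERROR×8 'quota exceeded'; AUTH_FAILED×4 'invalid_grant'"
```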

Alternatives considered

  1. Exponential-weighted moving average (EWMA) over 7 days. Would reduce lag but not eliminate it. Still produces false positives after a same-day fix. Rejected.
  2. Keep 7-day average but add a “trending” field so alerts say “decreasing.” The trend field IS added in this ADR, but as context inside a current-state alert — not as a way to keep the bad metric.
  3. Silence the cron entirely until telemetry (Phase 1 task 1.1) provides a proper source. Rejected — telemetry via Analytics Engine covers MCP tool calls, but the autoresearch cron reads error_ledger D1 (cold-path error records from ALL paths including REST + crons + webhooks). They’re different data sources; autoresearch is still the right home for error-ledger analysis.
  4. Page on any error. Rejected — normal API flakiness produces 1-2 errors/hour; paging on every one creates worse fatigue than the current bug.
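
The objection to alternative 1 can be checked numerically. The sketch below uses the standard EWMA recurrence s_t = α·x_t + (1 − α)·s_{t−1}, seeded with the first observation. The daily counts and the smoothing factor α = 0.3 are hypothetical values picked for illustration; the ADR does not say which α was considered.

```typescript
// Exponentially weighted moving average, oldest value first.
// Seeded with the first observation; returns 0 for an empty series.
function ewma(values: number[], alpha: number): number {
  let smoothed: number | undefined;
  for (const x of values) {
    smoothed = smoothed === undefined ? x : alpha * x + (1 - alpha) * smoothed;
  }
  return smoothed ?? 0;
}

// Hypothetical counts: six bad days, then a day with zero errors.
const ewmaCounts: number[] = [90, 85, 80, 95, 70, 84, 0];
const ALPHA = 0.3; // hypothetical smoothing factor

const plainAverage = ewmaCounts.reduce((sum, n) => sum + n, 0) / ewmaCounts.length;
const smoothedValue = ewma(ewmaCounts, ALPHA);

console.log(plainAverage.toFixed(1));  // prints "72.0"
console.log(smoothedValue.toFixed(1)); // prints "58.3"
```

With these inputs the EWMA reacts faster than the plain average (58.3 versus 72.0) but still sits above the old threshold of 50 on a day with zero errors, so the false positive survives.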

Consequences

Positive

  • Daily alert fatigue stops. If the system is healthy today, the digest says so.
  • CRITICAL fires only when CRITICAL action is required (≥10 errors on one provider in 24h).
  • Alerts contain actual error messages, not boilerplate “check the status page” text. Debugging starts immediately.
  • Header count stops inflating with config_cleanup noise.

Negative

  • If a real error goes from 15/24h to 2/24h overnight, the alert is demoted from CRITICAL to INFO immediately. That’s correct behavior (the fire is out), but a reviewer who wants “did this recur?” context needs to pull historical data from the admin endpoint. Documented in the runbook.
  • Backwards compatibility: any dashboards or tools reading the old avg_daily evidence field need to switch to last_24h_total + week_7d_total. Current consumers: none that I can identify — the evidence was only consumed by the Slack digest. Safe.

Neutral

  • The four specific suggestionsKey(date) payloads in KV change shape. Old payloads retain their format; new ones have the last_24h_total, week_7d_total, trend, and samples fields. Readers should accept both shapes for 30 days, until the 30-day TTL has aged out the old records.
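
A reader that accepts both payload shapes during the TTL window might look like the sketch below. The field names come from this ADR; the field types (in particular `trend` as a string and `samples` as an opaque array) and the names `LegacyEvidence`, `CurrentEvidence` and `describeEvidence` are assumptions, since the ADR does not spell out the schema.

```typescript
// Old shape: only the trailing average.
interface LegacyEvidence {
  avg_daily: number;
}

// New shape: field names from this ADR, types assumed.
interface CurrentEvidence {
  last_24h_total: number;
  week_7d_total: number;
  trend: string;      // assumed to be a label such as "decreasing"
  samples: unknown[]; // sample shape is not specified here
}

type Evidence = LegacyEvidence | CurrentEvidence;

// Type guard: the presence of last_24h_total marks a new-format record.
function isCurrentEvidence(e: Evidence): e is CurrentEvidence {
  return "last_24h_total" in e;
}

function describeEvidence(e: Evidence): string {
  if (isCurrentEvidence(e)) {
    return `${e.last_24h_total} in past 24h, ${e.week_7d_total} in past 7d (${e.trend})`;
  }
  return `legacy record: ${e.avg_daily.toFixed(1)} errors/day (7-day average)`;
}

console.log(describeEvidence({ avg_daily: 72 }));
// prints "legacy record: 72.0 errors/day (7-day average)"
```

Once the oldest records have expired, the legacy branch and the union type can be removed.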

Implementation

  • src/cron/autoresearch.ts: analyzeErrors rewritten (this ADR’s code change). sendDigest header + severity logic updated. Added truncate() + computeTrend() helpers.
  • test/cron/autoresearch-severity.test.ts: 4 new unit tests covering the core scenarios (quiet-24h-high-7d, critical-fire, warning-fire, all-nominal).
  • HANDOVER.md: unchanged (doesn’t reference the severity rule).
  • No config changes required. No infra changes required. Constants moved to CRITICAL_24H_THRESHOLD / WARNING_24H_THRESHOLD / HISTORICAL_7D_THRESHOLD.

Verification after deploy

Next autoresearch run (06:00 UTC after merge):

  • If 24h error counts are normal — digest is an INFO-severity “all systems nominal” one-liner. No CRITICAL, no wall of text, no daily wakeup.
  • If a real error is firing — the alert body will contain the actual error type, count, and a truncated sample of the latest error message from the ledger.
  • If a historical error from the week still shows up — it appears in the stored KV suggestions with severity: "info" and status: "historical", NOT in the Slack digest.

References

  • Source: src/cron/autoresearch.ts
  • Tests: test/cron/autoresearch-severity.test.ts
  • Related: ADR-018 Phase 1 task 1.6 (fix ops wounds) — this is one of those ops wounds.