Skip to content

Exclude dev/sandbox tenants from autoresearch alert math

ADR-022 — Exclude dev/sandbox tenants from autoresearch alert math

Status: Accepted Date: 2026-04-23 Deciders: Mishaal Murawala Relates to: ADR-021 (autoresearch current-state severity), src/cron/autoresearch.ts, 2026-04-23 error_ledger investigation

Context

The 2026-04-23 error_ledger investigation (using the new /admin/errors endpoint) revealed that all recent errors triggering autoresearch alerts were from the tenant=ascend sandbox, not from any production tenant (kahuna, pointfield, etc.).

7-day error breakdown:

Provider7d errorsTenantSource
google_ads510496 kahuna (2026-04-20 backfill, expected 90-day wall) + 14 ascend (dev queries)mix
salesforce34100% ascendADR-013 research traffic
hubspot27100% ascendHubSpot API exploration
gmail1100% ascendprobe misconfiguration

Production traffic accounted for zero errors in the rolling 24h window on any non-backfill day. Every CRITICAL autoresearch fire this week was driven by ascend-tenant dev traffic OR a one-time Kahuna backfill that correctly hit Google Ads’s 90-day data horizon.

The core insight: dev/sandbox tenant traffic is inherently noisy and should not drive production alerts. Exploration, curl-test iteration, ADR research, and API probing all produce error_ledger rows by design. They are not operational signals.

Decision

Exclude a configurable list of tenant_ids from autoresearch’s three error_ledger queries (24h, 7d, daily buckets). Priority (high → low):

  1. KV override — key autoresearch:excluded_tenants (JSON string[]). Runtime override, no deploy.
  2. Wrangler env varAUTORESEARCH_EXCLUDED_TENANTS in wrangler.toml [vars] (comma-separated). Config-as-code default. Current value: "ascend".
  3. Empty list — if neither is set, no exclusion.

The tenant list is NOT hardcoded in src/ — this enforces the platform-spec §2.3 “config-driven, no hardcoding” rule (pre-commit check 01 now passes).

Excluded tenants’ errors:

  • Remain in error_ledger for forensic retrieval via /admin/errors?provider=X (no data loss).
  • Are not counted in autoresearch’s severity thresholds.
  • Are not surfaced in the daily Slack digest.

Rationale

  • Evidence-based. The 2026-04-23 investigation proved that every “CRITICAL” alert this week was dev traffic or one-time backfill events. No production fires.
  • Separation of concerns. ascend is Mishaal’s own sandbox — it runs ad-hoc research queries, dev probes, half-finished experiments. Production tenants (Kahuna, PFP) run versioned code on stable configs. Alert math conflates the two.
  • Runtime configurable — KV override lets us add or remove dev tenants without a deploy. A new PE client can be whitelisted as “production-like” immediately. A new research sandbox can be added without touching code.
  • SQL-injection safe — excluded tenant names are filtered to [a-zA-Z0-9_-]+ before interpolation. Values come from our own KV, but defensive filtering costs nothing and closes the surface completely.

Alternatives considered

  1. Tag errors with a severity column in error_ledger itself. More invasive — requires schema migration, every log call updated, backfill. Rejected — filter-at-query-time is equivalent and cheaper.
  2. Separate error_ledger tables per tenant. Massive over-engineering for our scale. Rejected.
  3. Route dev-tenant errors to a different D1 table. Same effect but requires every logErrorToLedger call-site to branch. Rejected — filter at autoresearch read time is the single point of change.
  4. Leave it alone; rely on ADR-021’s 24h window. Even with current-state severity, two back-to-back dev research sessions of 10+ errors each in 24h will still fire CRITICAL. Dev noise would still defeat production signal. Rejected.

Consequences

Positive

  • Autoresearch alerts only reflect production state.
  • Dev iteration speed unchanged — no need to stop testing “to keep the alert clean.”
  • Forensic trail intact via /admin/errors endpoint — no data lost.
  • Any future dev tenant auto-excluded via KV edit (no deploy required).

Negative

  • If a real bug is introduced that only manifests under the ascend tenant (e.g., an admin endpoint regression), the autoresearch alert won’t catch it. Mitigation: admin endpoints are exercised by the pre-deploy + post-deploy gates in CI; we don’t rely on daily autoresearch for them.
  • The ascend tenant could be used to run real workloads in the future (Mishaal’s own GTM work for Ascend the agency). If that transition happens, flip the KV override to ['sandbox'] and recategorize. Decision documented here so the flip is auditable.

Neutral

  • error_ledger schema unchanged.
  • Existing suggestions stored in KV are not retroactively modified. 30-day TTL lets the new filter propagate naturally.

Implementation

  • src/cron/autoresearch.ts:
    • New constant DEV_TENANTS_DEFAULT = ['ascend'].
    • New resolveExcludedTenants(env) helper — reads KV override, falls back to default, swallows KV read errors.
    • New buildTenantFilter(excluded) helper — returns SQLite-safe AND tenant_id NOT IN (...) fragments; regex-filters tenant names to [a-zA-Z0-9_-]+.
    • All three error_ledger queries in analyzeErrors now include the filter.
  • test/cron/autoresearch-severity.test.ts:
    • Test: default exclusion filters ascend.
    • Test: KV override with multi-tenant list produces correct SQL.
    • Test: injection attempt is filtered — malformed tenant names never reach the SQL, DROP TABLE never appears.

Verification

After merge + deploy, next autoresearch run at 06:00 UTC:

  • Expected: *Autoresearch 2026-04-24*: No current issues. All systems nominal.
  • If Kahuna later has real errors: alert surfaces them with actual error_message samples (via the ADR-021 enrichment), no ascend-tenant dev noise diluting the signal.

Override the exclusion list at runtime (no deploy):

Terminal window
wrangler kv key put --namespace-id=$ASCEND_KV_ID \
'autoresearch:excluded_tenants' '["ascend","new-sandbox"]'

References

  • ADR-021: autoresearch current-state severity
  • 2026-04-23 investigation: GET /admin/errors?hours=168 output
  • Platform spec §7 (tenant isolation) — excluded list is per-tenant, derived at query time, never mutated