ADR-022 — Exclude dev/sandbox tenants from autoresearch alert math
Status: Accepted
Date: 2026-04-23
Deciders: Mishaal Murawala
Relates to: ADR-021 (autoresearch current-state severity), src/cron/autoresearch.ts, 2026-04-23 error_ledger investigation
Context
The 2026-04-23 error_ledger investigation (using the new /admin/errors endpoint) revealed that all recent errors triggering autoresearch alerts were from the tenant=ascend sandbox, not from any production tenant (kahuna, pointfield, etc.).
7-day error breakdown:
| Provider | 7d errors | Tenant | Source |
|---|---|---|---|
| google_ads | 510 | 496 kahuna (2026-04-20 backfill, expected 90-day wall) + 14 ascend (dev queries) | mix |
| salesforce | 34 | 100% ascend | ADR-013 research traffic |
| hubspot | 27 | 100% ascend | HubSpot API exploration |
| gmail | 1 | 100% ascend | probe misconfiguration |
Production traffic accounted for zero errors in the rolling 24h window on any non-backfill day. Every CRITICAL autoresearch fire this week was driven by ascend-tenant dev traffic OR a one-time Kahuna backfill that correctly hit Google Ads’s 90-day data horizon.
The core insight: dev/sandbox tenant traffic is inherently noisy and should not drive production alerts. Exploration, curl-test iteration, ADR research, and API probing all produce error_ledger rows by design. They are not operational signals.
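To make the breakdown concrete, the following sketch re-enters the table above as data and tallies it per tenant. The `ErrorCount` shape and the helper names are illustrative assumptions, not part of the codebase; only the numbers come from the investigation.

```typescript
// Illustrative row shape (hypothetical; not the real error_ledger schema).
interface ErrorCount {
  provider: string;
  tenant: string;
  errors: number;
}

// 7-day counts copied from the table above.
const sevenDay: ErrorCount[] = [
  { provider: "google_ads", tenant: "kahuna", errors: 496 },
  { provider: "google_ads", tenant: "ascend", errors: 14 },
  { provider: "salesforce", tenant: "ascend", errors: 34 },
  { provider: "hubspot", tenant: "ascend", errors: 27 },
  { provider: "gmail", tenant: "ascend", errors: 1 },
];

// Sum the errors attributed to one tenant.
function totalByTenant(rows: ErrorCount[], tenant: string): number {
  return rows
    .filter((r) => r.tenant === tenant)
    .reduce((sum, r) => sum + r.errors, 0);
}

// List the providers that still show errors once the excluded tenants are dropped.
function providersWithErrors(rows: ErrorCount[], excluded: string[]): string[] {
  const seen: string[] = [];
  for (const r of rows) {
    if (excluded.indexOf(r.tenant) !== -1) continue;
    if (r.errors > 0 && seen.indexOf(r.provider) === -1) seen.push(r.provider);
  }
  return seen;
}

console.log(totalByTenant(sevenDay, "ascend")); // 76
console.log(totalByTenant(sevenDay, "kahuna")); // 496
console.log(providersWithErrors(sevenDay, [])); // all four providers
console.log(providersWithErrors(sevenDay, ["ascend"])); // [ 'google_ads' ]
```

With ascend removed, three of the four providers drop out entirely, and the only remaining errors are the Kahuna backfill rows.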
Decision
Exclude a configurable list of tenant_ids from autoresearch’s three error_ledger queries (24h, 7d, daily buckets). Priority (high → low):
- KV override: key autoresearch:excluded_tenants (JSON string[]). Runtime override, no deploy.
- Wrangler env var: AUTORESEARCH_EXCLUDED_TENANTS in wrangler.toml [vars] (comma-separated). Config-as-code default. Current value: "ascend".
- Empty list: if neither is set, no exclusion.
The tenant list is NOT hardcoded in src/ — this enforces the platform-spec §2.3 “config-driven, no hardcoding” rule (pre-commit check 01 now passes).
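The priority order above could be resolved roughly as follows. This is a minimal sketch under stated assumptions: the binding name `ASCEND_KV`, the `Env` and `KvReader` shapes, and the two parse helpers are hypothetical, and treating a malformed KV value as "no override" is a choice made for this example rather than documented behaviour.

```typescript
// Minimal stand-in for the Workers KV binding; only the method used here.
interface KvReader {
  get(key: string): Promise<string | null>;
}

// Hypothetical env shape: the binding name ASCEND_KV is an assumption.
interface Env {
  ASCEND_KV: KvReader;
  AUTORESEARCH_EXCLUDED_TENANTS?: string;
}

const KV_KEY = "autoresearch:excluded_tenants";

// Parse the KV value; returns null when it is absent or not a JSON string[].
function parseKvOverride(raw: string | null): string[] | null {
  if (raw === null) return null;
  try {
    const parsed: unknown = JSON.parse(raw);
    if (Array.isArray(parsed) && parsed.every((t) => typeof t === "string")) {
      return parsed as string[];
    }
  } catch {
    // Malformed JSON: treat as "no override".
  }
  return null;
}

// Split the comma-separated env var, trimming whitespace and dropping blanks.
function parseEnvList(raw: string | undefined): string[] {
  return (raw ?? "")
    .split(",")
    .map((t) => t.trim())
    .filter((t) => t.length > 0);
}

// KV override first, then the wrangler.toml var, then an empty list.
async function resolveExcludedTenants(env: Env): Promise<string[]> {
  try {
    const override = parseKvOverride(await env.ASCEND_KV.get(KV_KEY));
    if (override !== null) return override;
  } catch {
    // KV read failed: fall through to the env var.
  }
  return parseEnvList(env.AUTORESEARCH_EXCLUDED_TENANTS);
}
```

In this sketch a KV value of `[]` counts as an explicit override meaning "exclude nothing", which differs from the key being absent.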
Excluded tenants’ errors:
- Remain in error_ledger for forensic retrieval via /admin/errors?provider=X (no data loss).
- Are not counted in autoresearch’s severity thresholds.
- Are not surfaced in the daily Slack digest.
Rationale
- Evidence-based. The 2026-04-23 investigation proved that every “CRITICAL” alert this week was dev traffic or one-time backfill events. No production fires.
- Separation of concerns. ascend is Mishaal’s own sandbox: it runs ad-hoc research queries, dev probes, and half-finished experiments. Production tenants (Kahuna, PFP) run versioned code on stable configs. Alert math conflates the two.
- Runtime configurable: the KV override lets us add or remove dev tenants without a deploy. A new PE client can be whitelisted as “production-like” immediately. A new research sandbox can be added without touching code.
- SQL-injection safe: excluded tenant names are filtered to [a-zA-Z0-9_-]+ before interpolation. Values come from our own KV, but defensive filtering costs nothing and closes the surface completely.
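The defensive filtering described in the last point might look like the sketch below. It assumes that a name failing the pattern is dropped whole rather than having its offending characters stripped; the ADR does not say which, so treat that as a guess. The constant name `SAFE_TENANT` is hypothetical.

```typescript
// Only names made entirely of these characters are allowed into the SQL text.
const SAFE_TENANT = /^[a-zA-Z0-9_-]+$/;

// Returns "" when nothing is excluded, otherwise a fragment that can be
// appended to an existing WHERE clause.
function buildTenantFilter(excluded: string[]): string {
  // Assumption: a name that fails the pattern is discarded entirely.
  const safe = excluded.filter((t) => SAFE_TENANT.test(t));
  if (safe.length === 0) return "";
  const quoted = safe.map((t) => `'${t}'`).join(", ");
  return ` AND tenant_id NOT IN (${quoted})`;
}

console.log(buildTenantFilter(["ascend"]));
// " AND tenant_id NOT IN ('ascend')"
console.log(buildTenantFilter(["ascend", "x'; DROP TABLE error_ledger; --"]));
// " AND tenant_id NOT IN ('ascend')"
```

Because the pattern admits no quotes, spaces, or semicolons, a surviving name cannot terminate the string literal it is placed in. Bound parameters would be another way to reach the same guarantee.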
Alternatives considered
- Tag errors with a severity column in error_ledger itself. More invasive: requires schema migration, every log call updated, backfill. Rejected: filter-at-query-time is equivalent and cheaper.
- Separate error_ledger tables per tenant. Massive over-engineering for our scale. Rejected.
- Route dev-tenant errors to a different D1 table. Same effect but requires every logErrorToLedger call-site to branch. Rejected — filter at autoresearch read time is the single point of change.
- Leave it alone; rely on ADR-021’s 24h window. Even with current-state severity, two back-to-back dev research sessions of 10+ errors each in 24h will still fire CRITICAL. Dev noise would still defeat production signal. Rejected.
Consequences
Positive
- Autoresearch alerts only reflect production state.
- Dev iteration speed unchanged — no need to stop testing “to keep the alert clean.”
- Forensic trail intact via the /admin/errors endpoint; no data lost.
- Any future dev tenant can be excluded via a KV edit (no deploy required).
Negative
- If a real bug is introduced that only manifests under the ascend tenant (e.g., an admin endpoint regression), the autoresearch alert won’t catch it. Mitigation: admin endpoints are exercised by the pre-deploy + post-deploy gates in CI; we don’t rely on daily autoresearch for them.
- The ascend tenant could be used to run real workloads in the future (Mishaal’s own GTM work for Ascend the agency). If that transition happens, flip the KV override to ['sandbox'] and recategorize. Decision documented here so the flip is auditable.
Neutral
- error_ledger schema unchanged.
- Existing suggestions stored in KV are not retroactively modified. 30-day TTL lets the new filter propagate naturally.
Implementation
- src/cron/autoresearch.ts:
  - New constant DEV_TENANTS_DEFAULT = ['ascend'].
  - New resolveExcludedTenants(env) helper: reads KV override, falls back to default, swallows KV read errors.
  - New buildTenantFilter(excluded) helper: returns SQLite-safe AND tenant_id NOT IN (...) fragments; regex-filters tenant names to [a-zA-Z0-9_-]+.
  - All three error_ledger queries in analyzeErrors now include the filter.
- test/cron/autoresearch-severity.test.ts:
  - Test: default exclusion filters ascend.
  - Test: KV override with multi-tenant list produces correct SQL.
  - Test: injection attempt is filtered: malformed tenant names never reach the SQL, DROP TABLE never appears.
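As a rough illustration of how one fragment can be shared by the 24h, 7d, and daily-bucket queries, the sketch below builds the three SQL strings from a filter string. The column names `provider` and `created_at`, the query shapes, and the function name `buildErrorQueries` are assumptions for this example; the real queries in analyzeErrors may differ.

```typescript
interface ErrorQueries {
  last24h: string;
  last7d: string;
  daily: string;
}

// tenantFilter is either "" or a fragment such as
// " AND tenant_id NOT IN ('ascend')". It starts with AND, so it must
// follow an existing WHERE condition.
function buildErrorQueries(tenantFilter: string): ErrorQueries {
  // Per-provider counts over a rolling window (hypothetical column names).
  const windowed = (modifier: string): string =>
    `SELECT provider, COUNT(*) AS errors FROM error_ledger ` +
    `WHERE created_at >= datetime('now', '${modifier}')${tenantFilter} ` +
    `GROUP BY provider`;

  return {
    last24h: windowed("-1 day"),
    last7d: windowed("-7 days"),
    // Daily buckets over the same 7-day span.
    daily:
      `SELECT date(created_at) AS day, provider, COUNT(*) AS errors FROM error_ledger ` +
      `WHERE created_at >= datetime('now', '-7 days')${tenantFilter} ` +
      `GROUP BY day, provider ORDER BY day`,
  };
}

const queries = buildErrorQueries(" AND tenant_id NOT IN ('ascend')");
console.log(queries.last24h);
```

Building all three strings in one place keeps the exclusion a single point of change, which is the property the ADR relies on when rejecting the write-side alternatives.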
Verification
After merge + deploy, next autoresearch run at 06:00 UTC:
- Expected: *Autoresearch 2026-04-24*: No current issues. All systems nominal.
- If Kahuna later has real errors: alert surfaces them with actual error_message samples (via the ADR-021 enrichment), no ascend-tenant dev noise diluting the signal.
Override the exclusion list at runtime (no deploy):

    wrangler kv key put --namespace-id=$ASCEND_KV_ID \
      'autoresearch:excluded_tenants' '["ascend","new-sandbox"]'

References
- ADR-021: autoresearch current-state severity
- 2026-04-23 investigation: GET /admin/errors?hours=168 output
- Platform spec §7 (tenant isolation): excluded list is per-tenant, derived at query time, never mutated