ADR-011: Adopt a lightweight frontier-eval tier (expect-to-fail tests)
Status: Accepted (2026-04-22)
Date: 2026-04-17 proposed, 2026-04-22 committed
Deciders: Mishaal Murawala
Context
From the Notion Custom Agents deep dive (Latent.Space, 2026-04-13): Notion treats eval writing as a distinct discipline and uses three tiers:
- Regression — must always pass. Catches breakage.
- Launch-quality — must pass before shipping a feature. Catches quality drops.
- Frontier/headroom — intentionally only passes ~30% today. Tells you when the platform is ready for the next bet.
V5’s current test/ suite is regression-only (~451 tests). We have no forward-looking signal for “is Claude/Opus/our MCP stack capable enough yet for feature X.”
Decision
Adopt a lightweight frontier-eval tier, NOT a new framework. Use existing vitest with three conventions:
- Mark frontier tests explicitly. Use describe.todo("frontier: ...") or it.fails("frontier: ...") with a comment explaining the capability bet the test represents.
- Ship with a ~30% pass-rate expectation. A frontier test that reliably passes gets promoted to launch-quality. One that reliably fails is kept until we make a capability bet that requires it.
- No CI gating. Frontier tests run locally and in the biweekly parallelization audit, but they do NOT block deploys. Pre-deploy gate only looks at regression + launch-quality tests.
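As a rough illustration of the marking convention, a frontier test could look like the sketch below. The file name, the `extractTaskIds` helper, and the tiny inline PRD are hypothetical stand-ins, not part of the existing suite; only `describe`, `it.fails`, and `expect` come from vitest.

```typescript
// test/frontier/prd-extraction.frontier.test.ts (hypothetical file name)
import { describe, expect, it } from "vitest";

// Hypothetical stand-in for the real extraction call. It throws until a real
// harness is wired in, so the test below counts as an expected failure.
async function extractTaskIds(_prd: string): Promise<string[]> {
  throw new Error("extraction harness not wired up");
}

describe("frontier: long-context extraction", () => {
  // Capability bet: long-context extraction fidelity.
  // it.fails inverts the result: vitest reports this test as passing while the
  // body throws or an assertion fails, and as failing once the body succeeds.
  it.fails("frontier: parses a PRD into structured tasks without hallucinating IDs", async () => {
    const prd = "TASK-1: build importer\nTASK-2: add retries"; // tiny stand-in for a 50k-token PRD
    const ids = await extractTaskIds(prd);

    expect(ids.length).toBeGreaterThan(0); // an empty result must not count as success
    for (const id of ids) {
      expect(prd).toContain(id); // every returned ID must appear in the source text
    }
  });
});
```

Because `it.fails` flips red once the capability arrives, a newly passing frontier test shows up as a vitest failure. That is the signal to review it for promotion, and it is one reason to keep this tier out of the deploy gate.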
Seed with three tests:
- frontier: Ascend Tech Lead can complete a 10-step Kahuna audit in <60s end-to-end (capability bet: long agent chains stay coherent)
- frontier: Opus 4.7 parses a 50k-token PRD into structured tasks without hallucinating IDs (capability bet: long-context extraction fidelity)
- frontier: llm_invoke with glm-4.7 matches Sonnet within 5% on bulk ICP classification (capability bet: economy tier is real)
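The "within 5%" criterion in the third seed test needs a concrete definition before it can be asserted. One possible reading, sketched below, is label agreement: the economy model must assign the same label as the baseline on at least 95% of items. The function names and the agreement-rate interpretation are assumptions; the ADR does not specify the metric.

```typescript
// Fraction of items on which two models assign the same label.
function agreementRate(candidate: readonly string[], baseline: readonly string[]): number {
  if (candidate.length !== baseline.length) {
    throw new Error("label lists must have the same length");
  }
  if (baseline.length === 0) {
    throw new Error("need at least one labeled item");
  }
  let same = 0;
  for (let i = 0; i < baseline.length; i++) {
    if (candidate[i] === baseline[i]) same++;
  }
  return same / baseline.length;
}

// True when the candidate disagrees with the baseline on at most `tolerance`
// of the items (0.05 is the assumed reading of "within 5%").
function matchesWithin(
  candidate: readonly string[],
  baseline: readonly string[],
  tolerance = 0.05,
): boolean {
  return agreementRate(candidate, baseline) >= 1 - tolerance;
}
```

If the bet is instead about accuracy against a gold-labeled set, the same shape applies with each model's accuracy compared rather than their mutual agreement.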
Non-goals
- No new test framework. Plain vitest.
- No separate CI run for frontier tier.
- No automated promotion from frontier → launch-quality. Manual, deliberate.
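Manual promotion still benefits from a quick summary of which tests look ready for review. The sketch below only reports; it promotes nothing. The `FrontierHistory` shape, the function names, and the "last three runs all succeeded" threshold for "reliably passes" are assumptions made for illustration.

```typescript
// Outcome history for one frontier test. `true` means the underlying capability
// check succeeded in that run (not vitest's inverted it.fails status).
interface FrontierHistory {
  name: string;
  outcomes: boolean[]; // oldest first
}

// Share of frontier tests whose most recent run succeeded. The ADR's working
// expectation for this number is roughly 0.3.
function latestPassRate(history: readonly FrontierHistory[]): number {
  const latest = history
    .filter((h) => h.outcomes.length > 0)
    .map((h) => h.outcomes[h.outcomes.length - 1]);
  if (latest.length === 0) return 0;
  return latest.filter(Boolean).length / latest.length;
}

// Names of tests that succeeded in each of their last `minConsecutive` runs.
// These are candidates for a human to review, nothing more.
function promotionCandidates(
  history: readonly FrontierHistory[],
  minConsecutive = 3,
): string[] {
  return history
    .filter(
      (h) =>
        h.outcomes.length >= minConsecutive &&
        h.outcomes.slice(-minConsecutive).every(Boolean),
    )
    .map((h) => h.name);
}
```

A pass rate drifting well above ~30% would suggest the tier has stopped measuring headroom and needs new capability bets.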
Consequences
Positive
- Forward-looking signal with near-zero infrastructure cost.
- When a frontier test starts passing, we know a feature just became viable.
- Codifies capability bets so they don’t live only in Mishaal’s head.
Negative
- Requires discipline to add frontier tests when making capability bets. If we stop writing them, this tier decays.
Implementation
One commit adding a test/frontier/ directory with the three seed tests described above. ~2 hours of work. Part of the 2-week consolidation sprint preceding ADR-016 Phase 1 (per the V5 engineering plan discussion on 2026-04-22).
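Since the seed tests live in their own directory, the pre-deploy gate could skip them with an exclude pattern instead of a separate CI run. The sketch below assumes a dedicated gate config; the file name `vitest.gate.config.ts` is hypothetical, while `defineConfig`, `configDefaults`, and `test.exclude` are standard vitest configuration.

```typescript
// vitest.gate.config.ts (hypothetical file name)
import { configDefaults, defineConfig } from "vitest/config";

export default defineConfig({
  test: {
    // Keep vitest's default excludes and additionally skip the frontier tier,
    // so only regression and launch-quality tests can block a deploy.
    exclude: [...configDefaults.exclude, "test/frontier/**"],
  },
});
```

Under this assumption the gate would run `vitest run --config vitest.gate.config.ts`, while `vitest run test/frontier` runs the frontier tier on its own locally or during the biweekly audit.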