Adopt a lightweight frontier-eval tier (expect-to-fail tests)

ADR-011: Adopt a lightweight frontier-eval tier (expect-to-fail tests)

Status: Accepted (2026-04-22)
Date: 2026-04-17 proposed, 2026-04-22 committed
Deciders: Mishaal Murawala

Context

From the Notion Custom Agents deep dive (Latent.Space, 2026-04-13): Notion treats eval writing as a distinct discipline and uses three tiers:

Regression — must always pass. Catches breakage.
Launch-quality — must pass before shipping a feature. Catches quality drops.
Frontier/headroom — intentionally only passes ~30% today. Tells you when the platform is ready for the next bet.

V5’s current test/ suite is regression-only (~451 tests). We have no forward-looking signal for “is Claude/Opus/our MCP stack capable enough yet for feature X?”

Decision

Adopt a lightweight frontier-eval tier, NOT a new framework. Use the existing vitest setup with three conventions:

Mark frontier tests explicitly. Use describe.todo("frontier: ...") or it.fails("frontier: ...") with a comment explaining the capability bet the test represents.
Ship with ~30% pass rate expectation. A frontier test that reliably passes gets promoted to launch-quality. One that reliably fails gets kept until we make a capability bet that requires it.
No CI gating. Frontier tests run locally and in the biweekly parallelization audit, but they do NOT block deploys. The pre-deploy gate looks only at regression and launch-quality tests.
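A minimal sketch of the marking convention, assuming a spec file under test/frontier/. The file name and the extractTaskIds helper are hypothetical stand-ins, not existing V5 code; the stub simulates a model that invents an ID. Note how vitest treats it.fails: the test is reported green while the body throws and red once the body stops throwing, so a red frontier test is the cue to consider promotion.

```typescript
// test/frontier/prd-extraction.test.ts (hypothetical file name)
import { describe, expect, it } from "vitest";

// Hypothetical stand-in for the real extraction call (assumption, not V5 code).
// It ignores its input and returns one ID that is absent from the PRD.
async function extractTaskIds(_prd: string): Promise<string[]> {
  return ["TASK-1", "TASK-999"];
}

describe("frontier: PRD extraction", () => {
  // Capability bet: long-context extraction fidelity.
  it.fails("frontier: parses a PRD into tasks without hallucinating IDs", async () => {
    const prd = "TASK-1: ship the audit. TASK-2: write the report.";
    const ids = await extractTaskIds(prd);
    // Every returned ID must literally appear in the source document.
    for (const id of ids) {
      expect(prd).toContain(id);
    }
  });
});

// Capability bet: long agent chains stay coherent.
// Not implemented yet, so it is only stubbed and shows up as "todo" in the report.
describe.todo("frontier: 10-step Kahuna audit completes in <60s");
```

Keeping the capability bet in a comment directly above the test keeps the bet and its check in one place.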

Seed with three tests:

frontier: Ascend Tech Lead can complete a 10-step Kahuna audit in <60s end-to-end (capability bet: long agent chains stay coherent)
frontier: Opus 4.7 parses a 50k-token PRD into structured tasks without hallucinating IDs (capability bet: long-context extraction fidelity)
frontier: llm_invoke with glm-4.7 matches Sonnet within 5% on bulk ICP classification (capability bet: economy tier is real)
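As an illustration of how the third seed test might be scored, here is a small sketch. It reads "within 5%" as "at most 5% of labels differ from the Sonnet labels", which is an assumption; the ADR does not define the metric. The names disagreementRate, withinTolerance, and classifyIcp are hypothetical.

```typescript
// Fraction of items on which two label lists disagree (0 = identical, 1 = all differ).
function disagreementRate(
  candidate: readonly string[],
  reference: readonly string[],
): number {
  if (candidate.length !== reference.length) {
    throw new Error("label lists must have the same length");
  }
  if (reference.length === 0) {
    throw new Error("label lists must not be empty");
  }
  let mismatches = 0;
  for (let i = 0; i < reference.length; i++) {
    if (candidate[i] !== reference[i]) mismatches++;
  }
  return mismatches / reference.length;
}

// True when the candidate disagrees with the reference on at most `tolerance` of the items.
// The 0.05 default encodes the assumed reading of "within 5%".
function withinTolerance(
  candidate: readonly string[],
  reference: readonly string[],
  tolerance = 0.05,
): boolean {
  return disagreementRate(candidate, reference) <= tolerance;
}

// Possible wiring inside a vitest spec (classifyIcp is a hypothetical helper):
//
//   it.fails("frontier: glm-4.7 matches Sonnet within 5% on bulk ICP classification", async () => {
//     const economy = await classifyIcp("glm-4.7", leads);
//     const baseline = await classifyIcp("sonnet", leads);
//     expect(withinTolerance(economy, baseline)).toBe(true);
//   });
```

Using the Sonnet labels as the reference, rather than hand-labelled ground truth, keeps the check cheap, but it measures agreement rather than accuracy.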

Non-goals

No new test framework. Plain vitest.
No separate CI run for frontier tier.
No automated promotion from frontier → launch-quality. Manual, deliberate.

Consequences

Positive

Forward-looking signal with near-zero infrastructure cost.
When a frontier test starts passing, we know a feature just became viable.
Codifies capability bets so they don’t live only in Mishaal’s head.

Negative

Requires discipline to add frontier tests when making capability bets. If we stop writing them, this tier decays.

Implementation

One commit adding a test/frontier/ directory with the three seed tests described above. ~2 hours of work. Part of the 2-week consolidation sprint preceding ADR-016 Phase 1 (per the V5 engineering plan discussion on 2026-04-22).
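One way to keep test/frontier/ out of the pre-deploy gate without a second CI run is an exclude glob in the vitest config. This is a sketch under assumptions: PREDEPLOY_GATE is a hypothetical environment variable the deploy script would set, and the repository is assumed to have a single vitest.config.ts.

```typescript
// vitest.config.ts (sketch)
import { configDefaults, defineConfig } from "vitest/config";

// PREDEPLOY_GATE is a hypothetical variable; the name is an assumption.
const isPredeployGate = process.env.PREDEPLOY_GATE === "1";

export default defineConfig({
  test: {
    // Local and audit runs see every tier; the gate skips the frontier directory.
    exclude: isPredeployGate
      ? [...configDefaults.exclude, "test/frontier/**"]
      : configDefaults.exclude,
  },
});
```

Passing the same glob on the command line in the gate script (vitest run --exclude "test/frontier/**") should have a similar effect without touching the config.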