ADR-011: Adopt a lightweight frontier-eval tier (expect-to-fail tests)

Status: Accepted (2026-04-22)
Date: 2026-04-17 proposed, 2026-04-22 committed
Deciders: Mishaal Murawala

Context

From the Notion Custom Agents deep dive (Latent.Space, 2026-04-13): Notion treats eval writing as a distinct discipline and uses three tiers:

  • Regression — must always pass. Catches breakage.
  • Launch-quality — must pass before shipping a feature. Catches quality drops.
  • Frontier/headroom — intentionally only passes ~30% today. Tells you when the platform is ready for the next bet.

V5’s current test/ suite is regression-only (~451 tests). We have no forward-looking signal for “is Claude/Opus/our MCP stack capable enough yet for feature X?”

Decision

Adopt a lightweight frontier-eval tier, NOT a new framework. Use existing vitest with three conventions:

  1. Mark frontier tests explicitly. Use describe.todo("frontier: ...") or it.fails("frontier: ...") with a comment explaining the capability bet the test represents.
  2. Expect a ~30% pass rate. A frontier test that reliably passes gets promoted to launch-quality. One that reliably fails stays in the tier until we make a capability bet that requires it.
  3. No CI gating. Frontier tests run locally and in the biweekly parallelization audit, but they do NOT block deploys. Pre-deploy gate only looks at regression + launch-quality tests.
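
The marking convention above could be sketched as follows. This is an illustration, not part of the ADR: `registerFrontier`, `frontierName`, and the `ItLike` interface are hypothetical names, and the runner is passed in as a parameter so the sketch runs without vitest installed. In a real test file you would pass vitest's own `it`, or skip the helper entirely and call `it.fails` directly with a comment.

```typescript
// Hypothetical helper: keeps the "frontier:" prefix and the capability bet
// in the test name so both show up in reporter output.
type TestBody = () => void | Promise<void>;

// Minimal slice of a vitest-style `it` that the helper needs (assumed shape).
interface ItLike {
  fails(name: string, body: TestBody): void;
}

function frontierName(claim: string, bet: string): string {
  return `frontier: ${claim} (capability bet: ${bet})`;
}

function registerFrontier(
  it: ItLike,
  claim: string,
  bet: string,
  body: TestBody,
): string {
  const name = frontierName(claim, bet);
  // With vitest, `it.fails` is reported green while `body` throws
  // and red once `body` stops throwing.
  it.fails(name, body);
  return name;
}

// Usage in a vitest file (assumes `import { it, expect } from "vitest"`):
// registerFrontier(it, "economy tier matches Sonnet within 5%", "economy tier is real", async () => {
//   expect(await accuracyGap()).toBeLessThanOrEqual(0.05); // accuracyGap is hypothetical
// });
```

Two vitest behaviors are worth keeping in mind when choosing between the two markers. `describe.todo` only registers a placeholder and never executes a body, so it cannot contribute to a pass rate. `it.fails` inverts the result: a frontier test whose body starts succeeding shows up as a failure in the report. Since frontier tests do not gate deploys, that red result can serve as the signal that a capability bet has paid off.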

Seed with three tests:

  • frontier: Ascend Tech Lead can complete a 10-step Kahuna audit in <60s end-to-end (capability bet: long agent chains stay coherent)
  • frontier: Opus 4.7 parses a 50k-token PRD into structured tasks without hallucinating IDs (capability bet: long-context extraction fidelity)
  • frontier: llm_invoke with glm-4.7 matches Sonnet within 5% on bulk ICP classification (capability bet: economy tier is real)
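
As an illustration of what the second seed test might assert, the check below flags task IDs that never appear in the source PRD. It is a sketch under stated assumptions: the ADR does not define an ID format, so `ID_PATTERN` (IDs shaped like `REQ-12`), the `ParsedTask` shape, and the `hallucinatedIds` name are all hypothetical.

```typescript
// Hypothetical ID shape (e.g. "REQ-12"); the ADR does not specify one.
const ID_PATTERN = /\b[A-Z]+-\d+\b/g;

// Assumed minimal shape of a task extracted from the PRD.
interface ParsedTask {
  id: string;
  title: string;
}

// Returns the IDs of tasks whose ID does not occur anywhere in the PRD text.
function hallucinatedIds(prdText: string, tasks: ParsedTask[]): string[] {
  // Collect whole ID tokens so "REQ-1" is not matched inside "REQ-12".
  const known = new Set<string>(prdText.match(ID_PATTERN) ?? []);
  return tasks.map((task) => task.id).filter((id) => !known.has(id));
}

const prd = "REQ-1: export CSV. REQ-12: add SSO.";
const tasks: ParsedTask[] = [
  { id: "REQ-1", title: "Export CSV" },
  { id: "REQ-2", title: "Add SSO" }, // not present in the PRD
];
console.log(hallucinatedIds(prd, tasks)); // → ["REQ-2"]
```

A frontier test built on this would assert that the returned array is empty for the model's output on the 50k-token PRD.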

Non-goals

  • No new test framework. Plain vitest.
  • No separate CI run for frontier tier.
  • No automated promotion from frontier → launch-quality. Manual, deliberate.

Consequences

Positive

  • Forward-looking signal with near-zero infrastructure cost.
  • When a frontier test starts passing, we know a feature just became viable.
  • Codifies capability bets so they don’t live only in Mishaal’s head.

Negative

  • Requires discipline to add frontier tests when making capability bets. If we stop writing them, this tier decays.

Implementation

One commit adding a test/frontier/ directory with the three seed tests described above. ~2 hours of work. Part of the 2-week consolidation sprint preceding ADR-016 Phase 1 (per the V5 engineering plan discussion on 2026-04-22).
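
One way to keep the test/frontier/ directory out of the pre-deploy gate while still allowing local runs is a toggle in the vitest config. The following is a sketch, not the project's actual config: the `FRONTIER` environment variable is a hypothetical name, and it assumes the gate invokes a plain `vitest run`.

```typescript
// vitest.config.ts (sketch; FRONTIER is a hypothetical environment variable)
import { configDefaults, defineConfig } from "vitest/config";

const includeFrontier = process.env.FRONTIER === "1";

export default defineConfig({
  test: {
    // Gate runs skip test/frontier/**; setting FRONTIER=1 runs everything.
    exclude: includeFrontier
      ? [...configDefaults.exclude]
      : [...configDefaults.exclude, "test/frontier/**"],
  },
});
```

Under these assumptions, `FRONTIER=1 npx vitest run test/frontier` would run only the frontier tier locally or during the audit, and the default invocation would leave it out without needing a separate CI job.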