Tool Utilization Framework (TUF)

ADR-062 — Tool Utilization Framework (TUF)

Status: Accepted
Date: 2026-05-19
Supersedes: none
INVARIANTS-UNCHANGED — TUF invariants I1–I7 below are TUF-scoped (measurement framework), not amendments to the v5 gateway invariants in .claude/rules/v5-invariants.md.
Related: ADR-042 (capability index), ADR-053 (lean stack), ADR-054 (V5 retirement), ADR-057 (cutover 2026-05-19), ~/.claude/rules/doctrine.md D3 (two-layer measurement separation)

Context

We keep installing tools and using a fraction of their surface area — Hermes V3 wired as a cron runner, SEMrush at 5 of 35 report types, HubSpot at ~20% API coverage. Each gap was caught by accident, not by a systematic process.

There is no measurement layer that answers two questions independently:

What % of a tool’s feature universe have we wired into the codebase?
What % of what we have wired do we actually invoke?

Conflating those two numbers (e.g. “we use HubSpot at 20%”) is the exact failure mode that hides whether the gap is implementation work or adoption work. Adoption work on top of incomplete implementation is premature optimization.

Decision

Build the Tool Utilization Framework (TUF) — a measurement system on the existing CF Workers stack that produces two separate scores per tool:

Layer 1 — Implementation Coverage = implemented_features / total_features_in_universe.
Layer 2 — Utilization = features_called_last_N_days / implemented_features. Computed only when Layer 1 = 100%. Otherwise Layer 2 = N/A (incomplete).

The framework is composed of:

Adapter interface locked at M1: discover() / checkImplementation() / fetchFeed(). Every integration conforms. Adding integration N+1 never modifies the interface.
D1 cold-path tables for measurement state: tools_register, feature_universe, feature_calls, change_log, triage_queue.
5 markdown artifacts per tool under docs/tools/<slug>/: profile.md, feature-universe.md, implementation-status.md, implementation-backlog.md, utilization-report.md.
CF Workflow + Cron drives scheduled audit runs (M4).
Vectorize semantic index over feature_universe (M6) — reuses the existing capability_index binding pattern from ADR-042.
Slack + GitHub sinks for drift digests + auto-filed issues (M7).
tool-audit skill at ~/.claude/skills/tool-audit/SKILL.md with subcommands onboard / run / review / score.

Out of scope

Issue #557 — lands as a single row in Mem0’s implementation-backlog.md, not as a top-level TUF requirement.
New vendors. TUF is built entirely on CF Workers + Workflows + Cron + D1 + Vectorize + KV + Slack + GitHub + Claude (via AI Gateway). Any new vendor needs its own ADR (doctrine D5).
Tool selection. TUF measures coverage and utilization; it does not decide which tools to install.

Invariants (I1–I7)

Two-layer scores are separate. Layer 2 reports N/A (incomplete) whenever Layer 1 < 100%. No blended single-number score.
Adapter interface frozen at M1. Changes require an amendment ADR.
Existing-stack-only. No new vendors, runtimes, or hosted services.
Feature universe is the source of truth for Layer 1 denominators. Hand-edited supplements go in implementation-backlog.md; the universe file regenerates from discover() and is overwritten on each run.
feature_calls writes are cold path. TUF never reads or writes D1 on the gateway hot path.
5 artifacts per tool, no more, no less. Adding a sixth needs an amendment ADR.
Drift is a CI gate, not a doc problem. Mismatch between feature_universe.md and the adapter’s discover() output fails pre-commit (doctrine D6).

Consequences

Positive: surfaces hidden gaps (Hermes-style underutilization, SEMrush-style implementation gaps) as separate, addressable backlogs. Forces every new integration through the same shape, so audits are cheap to repeat.
Positive: reuses CF Workflow / Cron / Vectorize patterns we already operate — zero new ops surface.
Negative: doubles the doc-maintenance burden per integration (5 artifacts). Mitigated by tool-audit run regenerating 3 of the 5 from telemetry + discover().
Negative: requires every existing integration to backfill a discover() implementation before its Layer 1 score is meaningful. Sequenced across M2–M7.

Reversal trigger

If after M7 + 30 days of runtime, fewer than 3 integrations have backfilled discover(), or no decision has been made on a flagged drift, the framework has failed adoption. Re-evaluate scope or sunset.

References

docs/projects/tool-utilization-framework/GOAL.md — frozen contract (problem, DONE checklist, OOS, invariants, adapter contract, milestones, stop conditions).
docs/projects/tool-utilization-framework/working-context.md — operational state.
docs/projects/tool-utilization-framework/status.md — engineering decision log.
~/.claude/rules/doctrine.md D3 — two-layer measurement separation rule that TUF instantiates.