ADR-051 — Aider + DeepSeek Architect/Editor for Spec-Driven Implementation
- Status: Accepted
- Date: 2026-05-14
- Decider: Mishaal Murawala (engineering delegated to Claude Code per ADR-040)
- Related: ADR-040 (Q1 sequencing), docs/plans/aider-deepseek-orchestration.md, .claude/CLAUDE.md → Model & Effort Configuration (20/80 Opus/Sonnet split)
- Invariant changed: none. This is dev-tooling only, with zero runtime impact on the gateway. No V5 invariant in .claude/rules/v5-invariants.md is affected.
Context
The Claude Code workflow in this repo today uses a 20/80 Opus/Sonnet split — Opus for architecture, Sonnet for implementation. Both are billed at Anthropic frontier rates. Post-hoc session analysis across May 2026 shows ~80% of tool calls are mechanical: spec-defined edits (seed YAMLs, tech-debt rows, single-file test additions, version bumps). For that subclass of work, Anthropic-tier spend is structurally over-priced.
Three convergent signals make a third lane viable now (and not 6 months ago):
- Aider’s architect/editor benchmark (Sept 2024 release, mature by May 2026): splitting “think” (frontier model) from “edit” (cheap model) raised its benchmark score from 77.4% to 80.5%. The pattern is headless, runs on one YAML config, and commits to git directly.
- DeepSeek V3.1 Terminus / V4 Flash at $0.21/$0.79 per M tokens (May 2026) scores 68.4% on SWE-bench Verified — close enough to Opus on coding while costing 30–50× less per task (DevTk May 2026 numbers: ~$0.42/task vs ~$22.50/task for the same spec-driven YAML edit).
- Cognition’s “Don’t Build Multi-Agents” essay (May 2025) + MindStudio’s orchestration findings (Q1 2026) — both independently arrive at the same shape: one driver, one writer, one reviewer; anything more complex bleeds context and reliability. MindStudio reports 80–90% of tokens shift to the cheap model in well-designed setups, yielding 5–10× cost reduction.
The prior CCS/TIDS work proved the “drop a YAML, cron picks it up” pattern works for adding tool coverage. Aider extends that pattern to “drop a spec, Aider writes the code.” Both are spec-driven, both keep Claude on the review side of the loop.
There is no commercial product solving this for our exact shape: solo operator + Claude Code + cost-sensitive on routine work. Aider is the closest off-the-shelf fit and it’s open-source.
Decision
Adopt Aider (Apache 2.0, paul-gauthier/aider) as a third coding lane for spec-driven implementation work, with DeepSeek V3.1 Terminus / V4 Flash as the editor model. Aider runs locally on Mishaal’s Mac, invoked headlessly by scripts/aider-run.sh against specs in docs/specs/<slug>.md. Claude Opus reviews the resulting diff in PR; the merge gate is unchanged.
The decision ships with six artifacts only — no code dependencies, no Cloudflare resources, no DB migrations, no runtime impact. Fully reversible by deleting six files.
What Aider owns vs existing lanes
| Concern | Owner |
|---|---|
| Architecture decisions, multi-file refactors, hard debugging, ambiguous tasks | Claude Opus — unchanged |
| Implementation, tests, config work, clear single-purpose edits | Claude Sonnet — unchanged (default lane) |
| Spec-driven mechanical edits (seed YAMLs, tech-debt rows, single-file tests, version bumps, doc updates from a clear spec) | Aider + DeepSeek — new lane |
| PR review | Claude Opus — unchanged |
| Merge gate | Mishaal + npm run check:pre-commit — unchanged |
| Multi-agent orchestration | N/A — Cognition’s essay says don’t, we don’t |
The Sonnet 80% lane is not displaced. Aider only takes work that already had a spec written for it. If the operator can write a 30-line spec with allowed_files + acceptance_criteria + test_commands, route to Aider. If the work needs exploration or judgment, stay in Claude Code.
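The routing rule above depends on a spec carrying allowed_files, acceptance_criteria, and test_commands. A minimal sketch of checking for them follows. It assumes the spec opens with YAML-style front matter between `---` lines; the real layout is defined by docs/specs/_template.md, which is not shown here, so the parser and the names `parse_front_matter` and `missing_fields` are hypothetical.

```python
REQUIRED_FIELDS = ("allowed_files", "acceptance_criteria", "test_commands")


def parse_front_matter(text: str) -> dict[str, list[str]]:
    """Parse a deliberately tiny subset of YAML front matter.

    Assumed layout (hypothetical): a '---' line, then 'key: value' or
    'key:' followed by '- item' lines, then a closing '---' line.
    """
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}  # no front matter at all
    fields: dict[str, list[str]] = {}
    current = None
    for line in lines[1:]:
        stripped = line.strip()
        if stripped == "---":
            break  # end of front matter
        if not stripped or stripped.startswith("#"):
            continue  # skip blank lines and comments
        if stripped.startswith("- ") and current is not None:
            # List item belonging to the most recent key.
            fields[current].append(stripped[2:].strip())
        elif ":" in stripped:
            key, _, value = stripped.partition(":")
            current = key.strip()
            fields[current] = [value.strip()] if value.strip() else []
    return fields


def missing_fields(fields: dict[str, list[str]]) -> list[str]:
    """Return the required contract fields that are absent or empty."""
    return [name for name in REQUIRED_FIELDS if not fields.get(name)]
```

A spec for which `missing_fields` returns anything is, by the rule above, a candidate for Claude Code rather than Aider.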
Architecture (summary — full design in docs/plans/aider-deepseek-orchestration.md)
    docs/specs/<slug>.md ──► scripts/aider-run.sh ──► aider (LiteLLM → DeepSeek API)
                                      │                          │
                                      │                          ▼
                                      │          Edits files + commits to current branch
                                      │
                                      ▼
                  npm run check:pre-commit (typecheck + tests + lint)
                                      │
                                      ▼
                  Pass: ready to push
                  Fail (<3 attempts): re-prompt Aider with check output
                  Fail (≥3 attempts): escalate to Claude Code

Single-threaded by design. No queue, no orchestrator, no parallel writers, per Cognition’s essay. The single thread is the wrapper script; Aider runs as a subprocess.
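The edit, check, re-prompt, escalate flow can be sketched as a single loop. This is an illustration in Python, not the contents of scripts/aider-run.sh. The function names, the return strings, and the wording of the re-prompt are assumptions. The `edit` callable stands in for invoking Aider as a subprocess.

```python
import subprocess
from typing import Callable

MAX_ATTEMPTS = 3  # retry cap stated in this ADR


def run_check() -> tuple[bool, str]:
    """Run the repo gate and return (passed, combined output)."""
    proc = subprocess.run(
        ["npm", "run", "check:pre-commit"],
        capture_output=True,
        text=True,
    )
    return proc.returncode == 0, proc.stdout + proc.stderr


def drive(
    spec_text: str,
    edit: Callable[[str], None],
    check: Callable[[], tuple[bool, str]] = run_check,
    max_attempts: int = MAX_ATTEMPTS,
) -> str:
    """Single-threaded loop: edit, check, re-prompt with the failure output."""
    prompt = spec_text
    for _ in range(max_attempts):
        edit(prompt)  # hypothetical hook, e.g. run aider with this prompt
        passed, output = check()
        if passed:
            return "ready-to-push"
        # Feed the failing check output back into the next attempt.
        prompt = f"{spec_text}\n\nThe pre-commit check failed:\n{output}"
    return "escalate-to-claude-code"
```

Passing `edit` and `check` as callables keeps the loop testable without network access or an npm install, and leaves exactly one writer active at any time.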
Model policy
| Stage | Model | Rate |
|---|---|---|
| Spec authoring | Human or Claude Sonnet (interactive Claude Code session) | Existing |
| Aider editor (single-model mode, default) | DeepSeek deepseek-chat (V3.1 Terminus, auto-promoted to V4 Flash) | $0.21/$0.79 per M |
| Aider architect (opt-in for hard specs via architect: true front-matter) | DeepSeek deepseek-reasoner | $0.55/$2.19 per M |
| Aider editor (when architect is on) | DeepSeek deepseek-chat | $0.21/$0.79 per M |
| PR review | Claude Opus | Existing |
No Claude / Anthropic spend inside Aider. This is the bright line: Aider never invokes Anthropic. If a spec needs Claude’s judgment, that’s the signal to do the work in Claude Code, not Aider.
Aligned with the LLM Model Policy in ~/.claude/CLAUDE.md and the project-level Model & Effort Configuration in .claude/CLAUDE.md. The eval/judge tri-judge panel is untouched — those are pro-tier only and exclude same-vendor judging.
Model IDs are not hardcoded in code. They live in .aider.conf.yml and can be swapped via PR. LiteLLM handles vendor routing transparently — if DeepSeek quality drifts, swapping to Qwen 3 Coder or Kimi K2 is a one-line config change.
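To illustrate how the model policy could translate into an Aider invocation, here is a hedged sketch that builds the command line for single-model and architect modes and refuses Anthropic models. The function name and the guard are hypothetical. The default model IDs stand in for values that would live in .aider.conf.yml, and the `deepseek/deepseek-reasoner` ID is an assumption by analogy with `deepseek/deepseek-chat`. The flags used (`--model`, `--architect`, `--editor-model`, `--message-file`, `--yes-always`) should be checked against the installed Aider version.

```python
def build_aider_argv(
    spec_path: str,
    allowed_files: list[str],
    architect: bool = False,
    editor_model: str = "deepseek/deepseek-chat",
    architect_model: str = "deepseek/deepseek-reasoner",  # assumed ID
) -> list[str]:
    """Build an aider command line that honors the no-Anthropic bright line."""
    for model in (editor_model, architect_model):
        lowered = model.lower()
        if lowered.startswith("anthropic/") or "claude" in lowered:
            raise ValueError(f"Aider lane must not invoke Anthropic: {model}")

    argv = ["aider", "--yes-always", "--message-file", spec_path]
    if architect:
        # Architect mode: the reasoner plans, the chat model applies the edits.
        argv += ["--architect", "--model", architect_model,
                 "--editor-model", editor_model]
    else:
        # Default: a single model both plans and edits.
        argv += ["--model", editor_model]
    # Positional arguments are the files aider is allowed to edit.
    return argv + list(allowed_files)
```

Swapping the editor to another vendor would then be a matter of passing a different `editor_model` string, which mirrors the one-line config change described above.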
Estimated steady-state monthly Aider spend: $15–40. Estimated Claude token savings on shifted work: 5–10× per shifted task. If 30% of monthly implementation specs migrate to Aider, net monthly savings vs status quo is in the low hundreds of dollars and quality stays equivalent because Opus still reviews.
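The arithmetic behind these estimates can be checked with a few lines. The sketch below assumes the two-part rates in the model policy table are input/output prices per million tokens. The token counts are hypothetical. The per-task figures are the ones quoted in the Context section.

```python
def task_cost_usd(input_tokens: int, output_tokens: int,
                  in_rate_per_m: float, out_rate_per_m: float) -> float:
    """Cost of one task given token counts and $/M-token rates."""
    return (input_tokens / 1_000_000 * in_rate_per_m
            + output_tokens / 1_000_000 * out_rate_per_m)


# Rates from the model policy table (assumed to mean input/output).
DEEPSEEK_CHAT = (0.21, 0.79)

# Hypothetical task: 1.5M input tokens, 0.1M output tokens.
example_cost = task_cost_usd(1_500_000, 100_000, *DEEPSEEK_CHAT)

# Per-task figures quoted in the Context section.
aider_per_task = 0.42
claude_per_task = 22.50
cost_ratio = claude_per_task / aider_per_task

# How many tasks the $15-40 monthly budget covers at the quoted rate.
tasks_low = int(15 / aider_per_task)
tasks_high = int(40 / aider_per_task)
```

At the quoted per-task figures the ratio works out to roughly 54×, slightly above the 30–50× range cited in the Context section, and the $15–40 budget corresponds to roughly 35 to 95 Aider tasks per month.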
Invariants preserved
This decision touches zero runtime invariants. All 15 invariants in .claude/rules/v5-invariants.md remain unchanged. Aider is a local development tool — it does not run on Cloudflare, does not call gateway endpoints in production, does not read or write KV / D1 / R2 / Vectorize. The only sources of truth it touches are git (which is invariant #15-compliant) and the local filesystem.
The 20/80 Opus/Sonnet model split in .claude/CLAUDE.md is extended, not replaced: Aider becomes a new lane for the subset of implementation work that is fully spec-determined. Opus still owns architecture and review. Sonnet still owns ambiguous implementation. Aider takes only what has a written spec.
Trade-offs
DeepSeek quality is ~92% of Opus on the SWE-bench Verified slice, not 100%. Accepted. The mitigation is the spec contract: allowed_files + acceptance_criteria + test_commands constrain the edit to a verifiable shape, and npm run check:pre-commit is the gate. If a spec is loose enough that DeepSeek’s quality matters, that’s a signal the work isn’t ready for Aider — write a better spec or do it in Claude Code.
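One way to enforce the allowed_files part of the spec contract is to compare the files touched by the diff against the spec's list. The sketch below assumes allowed_files entries may be shell-style glob patterns, which this ADR does not specify, and the function name is hypothetical.

```python
from fnmatch import fnmatchcase


def out_of_scope(changed_files: list[str], allowed_files: list[str]) -> list[str]:
    """Return changed paths that match none of the allowed patterns.

    Note: fnmatch-style '*' also matches '/' characters, so 'docs/*.md'
    covers nested directories too. Tighten the patterns if that is too broad.
    """
    return [
        path
        for path in changed_files
        if not any(fnmatchcase(path, pattern) for pattern in allowed_files)
    ]
```

The changed-file list could come from `git diff --name-only` against the branch base. A non-empty result would be grounds to reject the attempt before the pre-commit gate runs.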
Single-vendor risk on DeepSeek. Mitigated by LiteLLM routing under the hood — vendor swap is a one-line .aider.conf.yml change. Qwen 3 Coder, Kimi K2, GPT-4.1, and Gemini 2.5 Pro are all valid fallback editors.
Operator overhead of spec authoring. Real cost. The first 5 specs take 10–15 min each to author. Once the operator has a library of past specs to copy from, the per-spec cost drops to <3 min. Below that threshold, the Aider lane is net-faster than Claude Code for mechanical work.
Aider’s commit hygiene. Aider auto-commits with its own commit message format. Wrapper script forces --commit-prompt to align with the repo’s conventional-commit style. PR squash is the safety net on top.
Headless reliability. Aider exits non-zero on tool errors. Wrapper script captures stderr + retries up to 3 times before escalating. Three failures = the spec wasn’t right for Aider; manual takeover by Claude Code, no automation.
Reversal criteria
- DeepSeek API stability drops below 95% over a week → switch editor to Qwen 3 Coder via .aider.conf.yml. No code change needed.
- DeepSeek quality drops below acceptable on three consecutive specs (pre-commit fails twice in a row AND diff isn’t recoverable in <5 min Opus review) → write a successor ADR; delete the six artifacts; revert to status quo.
- A better headless coding tool ships and beats Aider on either reliability or token economics → successor ADR; one-week parallel run; switch on equivalent or better numbers.
- The architect/editor split itself stops paying off (likely never, but tracked) → switch to single-model Aider permanently.
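The first criterion above can be made mechanical. The sketch below assumes "stability" means the fraction of DeepSeek API calls that succeeded during the week. The ADR does not define the metric, so that reading and both function names are hypothetical.

```python
from typing import Iterable, Optional


def api_success_rate(outcomes: Iterable[bool]) -> Optional[float]:
    """Fraction of successful calls, or None when there is no data."""
    results = list(outcomes)
    if not results:
        return None
    return sum(1 for ok in results if ok) / len(results)


def should_switch_editor(outcomes: Iterable[bool], threshold: float = 0.95) -> bool:
    """True when the weekly success rate falls below the reversal threshold."""
    rate = api_success_rate(outcomes)
    return rate is not None and rate < threshold
```

With no recorded calls the check stays quiet rather than triggering a vendor switch on missing data.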
This ADR can be reversed without code unwinding: delete the six files; branch protections and CI are unchanged.
Acceptance criteria (this PR)
- ADR merged on main (this file).
- LEDGER row added in docs/projects/LEDGER.md → “Active projects”.
- Plan permalinked at docs/plans/aider-deepseek-orchestration.md.
- .aider.conf.yml at repo root with model: deepseek/deepseek-chat, auto-commits: true, yes-always: true.
- docs/specs/_template.md with structured front-matter contract.
- scripts/aider-run.sh executable, validates spec + runs aider + runs pre-commit + retries to cap.
- docs/agents/AIDER-WORKFLOW.md under 200 lines.
- Proof point: scripts/tids/seed/anthropic.yaml authored by Aider from a 30-line spec, pre-commit passes.
References
- Plan (in-repo permalink): docs/plans/aider-deepseek-orchestration.md
- Aider architect/editor benchmark: https://aider.chat/2024/09/26/architect.html
- Aider headless usage: https://aider.chat/docs/scripting.html
- Cognition “Don’t Build Multi-Agents”: https://cognition.ai/blog/dont-build-multi-agents
- Kiro spec-driven development: https://kiro.dev/docs/specs
- DeepSeek V3.1 Terminus model card: https://platform.deepseek.com/api-docs/news/news250922
- LiteLLM model routing: https://docs.litellm.ai/docs/providers/deepseek
- LLM Model Policy: ~/.claude/CLAUDE.md → “LLM Model Policy”
- Model & Effort Configuration (Opus/Sonnet/Haiku routing): .claude/CLAUDE.md → “Model & Effort Configuration”
- Plan-First PR Discipline: .claude/CLAUDE.md → “Plan-First PR Discipline (Anti-Orphan Rule)”