
Imprint token savings benchmark

Measured with the Claude Code CLI in print mode. Both modes use identical prompts and tool permissions; the ON run additionally loads the Imprint MCP server and injects the project's CLAUDE.md as the system prompt. Figures are medians across 5 runs per prompt. Raw JSON for every run is checked in at benchmark/results/raw/<sha>/.
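The only difference between the two modes is the invocation. A minimal sketch of how each run's command line might be assembled; the flag names (`-p`, `--output-format`, `--mcp-config`, `--append-system-prompt`) are assumptions about the Claude Code CLI, and `benchmark/mcp.json` is a hypothetical server config:

```python
def build_command(prompt: str, mode: str, system_prompt: str = "") -> list[str]:
    """Assemble the CLI invocation for one benchmark run.

    Flag names are assumptions about the Claude Code CLI;
    'benchmark/mcp.json' is a hypothetical MCP server config.
    """
    cmd = ["claude", "-p", prompt, "--output-format", "json"]
    if mode == "on":
        # ON additionally loads the Imprint MCP server and injects
        # the project's CLAUDE.md as the system prompt.
        cmd += ["--mcp-config", "benchmark/mcp.json",
                "--append-system-prompt", system_prompt]
    return cmd
```

Everything downstream (token counting, cost, judging) sees only the JSON output, so the two modes stay comparable.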

  • Tokens: -70.4% (10,283,635 → 3,045,992)
  • Cost: -31.7% ($2.8413 → $1.9412)
  • Quality: +29.5% (weighted avg · LLM-as-judge, vs. the OFF baseline; 0% = tie)
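The headline percentages are plain ratios over the summed totals above; a quick check of the arithmetic:

```python
def pct_delta(off: float, on: float) -> float:
    """Signed percentage change from the OFF total to the ON total."""
    return (on - off) / off * 100

tokens_delta = pct_delta(10_283_635, 3_045_992)  # ≈ -70.4
cost_delta = pct_delta(2.8413, 1.9412)           # ≈ -31.7
```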
Environment
  • OS: Ubuntu 24.04
  • Python: 3.11.5
  • Models: Sonnet + Haiku (OFF); Sonnet only (ON)
  • Imprint: 0.5.1
  • Memories: 5,427

Reading the numbers: cost vs. quality

The two categories where ON costs more — Session Summary and Creation — aren't regressions. They're the categories where Imprint returns substantially richer responses (+35% and +44% quality vs OFF, judged by Claude on completeness, accuracy, structure, specificity, and scope fidelity). ON pulls facts from many indexed sources in one go where OFF reads one or two files and paraphrases. The extra tokens buy a more complete first answer — and for creation prompts especially, a more complete first answer is what cuts the follow-up round trips that never show up in a single-prompt benchmark.

Information

tokens -87.2% · cost -42.6% · quality +3%

| Prompt | OFF tokens | ON tokens | Δ % | OFF cost | ON cost | Δ $ |
| --- | --- | --- | --- | --- | --- | --- |
| Chunking system | 326,056 | 42,430 | -87.0% | $0.1329 | $0.0880 | -$0.0450 |
| Embedding model | 590,607 | 43,238 | -92.7% | $0.1703 | $0.0602 | -$0.1100 |
| Workspace isolation | 1,512,190 | 163,335 | -89.2% | $0.3007 | $0.2093 | -$0.0914 |
| Metadata tagging | 482,231 | 77,036 | -84.0% | $0.1745 | $0.0745 | -$0.1000 |
| Qdrant lifecycle | 401,752 | 96,399 | -76.0% | $0.1583 | $0.1056 | -$0.0527 |

Decision Recall

tokens -78.8% · cost -46.1% · quality +42%

| Prompt | OFF tokens | ON tokens | Δ % | OFF cost | ON cost | Δ $ |
| --- | --- | --- | --- | --- | --- | --- |
| Qdrant vs embedded | 594,978 | 138,967 | -76.6% | $0.1616 | $0.1110 | -$0.0506 |
| LLM tagger default off | 261,359 | 42,253 | -83.8% | $0.1301 | $0.0461 | -$0.0841 |

Debugging

tokens -94.2% · cost -68.3% · quality +36%

| Prompt | OFF tokens | ON tokens | Δ % | OFF cost | ON cost | Δ $ |
| --- | --- | --- | --- | --- | --- | --- |
| Search returns nothing | 986,996 | 13,821 | -98.6% | $0.1917 | $0.0249 | -$0.1668 |
| Slow embedding on first run | 912,618 | 96,601 | -89.4% | $0.2431 | $0.1129 | -$0.1301 |

Cross-Project

tokens -90.6% · cost -46.9% · quality +52%

| Prompt | OFF tokens | ON tokens | Δ % | OFF cost | ON cost | Δ $ |
| --- | --- | --- | --- | --- | --- | --- |
| Async work pattern | 1,102,646 | 88,840 | -91.9% | $0.2564 | $0.1106 | -$0.1459 |
| Config precedence | 1,247,907 | 131,959 | -89.4% | $0.2390 | $0.1523 | -$0.0867 |

Session Summary

tokens +179.6% · cost +204.1% · quality +35%

| Prompt | OFF tokens | ON tokens | Δ % | OFF cost | ON cost | Δ $ |
| --- | --- | --- | --- | --- | --- | --- |
| Recent activity | 37,050 | 103,597 | +179.6% | $0.0316 | $0.0961 | +$0.0645 |

Why: higher cost buys richer recall

ON mode doesn't read a file and paraphrase; it searches memory, walks the graph, and returns a cross-project synthesis. The ON response runs ≈1.4× the length of the OFF baseline in characters, with sharper framing and granular implementation details. The token/cost delta is the price of pulling context from many indexed sources instead of one `git log` + Read.

Creation

tokens +9.9% · cost +15.1% · quality +44%

| Prompt | OFF tokens | ON tokens | Δ % | OFF cost | ON cost | Δ $ |
| --- | --- | --- | --- | --- | --- | --- |
| Qdrant connectivity test | 609,423 | 749,982 | +23.1% | $0.2298 | $0.3170 | +$0.0872 |
| Memory count utility | 735,046 | 536,466 | -27.0% | $0.2265 | $0.2057 | -$0.0207 |
| Workspace stats CLI | 482,776 | 721,068 | +49.4% | $0.1947 | $0.2269 | +$0.0321 |

Why: more complete first reply = fewer round trips

ON code answers include per-project breakdowns, call chains, docstrings, and config citations (≈1.5–2.2× the chars of OFF). A richer, more specific first response means fewer clarifying questions and retries downstream — so the +15% cost here avoids an often-larger spend across the follow-up turns you don't see in a single-prompt benchmark.

How to reproduce

```bash
# Full suite: 15 prompts × 5 runs × 2 modes = 150 runs
bash benchmark/run.sh

# Subset (one prompt per category)
bash benchmark/run.sh --subset

# Filter by category or prompt
bash benchmark/run.sh --category debug
bash benchmark/run.sh --prompt info-1

# Summarize existing results
python3 benchmark/summarize.py benchmark/results/raw/<sha>
```
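The summarize step boils down to: load one record per run, group by prompt and mode, and take medians. A sketch of that aggregation, not the actual summarize.py; the `prompt`, `mode`, and `total_tokens` field names are guesses at the raw-JSON schema:

```python
import json
from collections import defaultdict
from pathlib import Path
from statistics import median

def load_runs(raw_dir: str) -> list[dict]:
    """Load one record per run from a results directory
    (e.g. benchmark/results/raw/<sha>/)."""
    return [json.loads(p.read_text()) for p in sorted(Path(raw_dir).glob("*.json"))]

def summarize(records: list[dict]) -> dict:
    """Median total tokens per (prompt, mode) across runs."""
    by_key = defaultdict(list)
    for rec in records:
        by_key[(rec["prompt"], rec["mode"])].append(rec["total_tokens"])
    return {key: median(vals) for key, vals in by_key.items()}
```

Medians rather than means keep one slow or runaway run from skewing a 5-run sample.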

What we're measuring

  • Total tokens: input + output + cache_read + cache_create, summed across all models (not just the primary Sonnet — Haiku sub-agents count).
  • Cost: costUSD from Claude Code's JSON output, summed across all models.
  • Turns: number of LLM turns per run. ON mode often takes more turns (search → answer) but with drastically smaller payloads.
  • Quality: optional LLM-as-judge (--llm-quality) scores ON vs OFF responses on completeness, accuracy, and structure.
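Under these definitions, the token total folds across every model that appears in a run. A minimal sketch, assuming a per-model usage mapping whose field names mirror the first bullet above (the real schema of Claude Code's JSON output may differ):

```python
def total_tokens(usage_by_model: dict) -> int:
    """Sum input + output + cache_read + cache_create across all
    models, so Haiku sub-agents count alongside the primary Sonnet.
    Field names follow the definition above; they are an assumption
    about the actual Claude Code JSON schema."""
    fields = ("input", "output", "cache_read", "cache_create")
    return sum(usage.get(f, 0)
               for usage in usage_by_model.values()
               for f in fields)
```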