# Imprint token savings benchmark
Measured with the Claude Code CLI in print mode. Both modes use identical prompts and tool permissions; the ON run additionally loads the Imprint MCP server and injects the project's CLAUDE.md as the system prompt. All figures are the median across 5 runs per prompt. Raw JSON for every run is checked in at `benchmark/results/raw/<sha>/`.
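The per-prompt aggregation is a plain median over the run files. A minimal sketch, assuming one JSON file per run named `<prompt>-<mode>-<n>.json` with a top-level `total_tokens` field (the actual layout under `benchmark/results/raw/` may differ; check the checked-in files):

```python
import json
import statistics
from pathlib import Path

def median_tokens(run_dir: Path, prompt: str, mode: str) -> float:
    """Median total-token count across the runs of one prompt/mode pair.

    File naming and the "total_tokens" field are assumptions for
    illustration, not the guaranteed raw-file schema.
    """
    totals = [
        json.loads(p.read_text())["total_tokens"]
        for p in run_dir.glob(f"{prompt}-{mode}-*.json")
    ]
    return statistics.median(totals)
```

With 5 runs, `statistics.median` picks the middle value, so a single outlier run can't skew a cell.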
All Δ figures are vs the OFF baseline (0% = a tie).

- Python: 3.11.5
- Model: Sonnet + Haiku (OFF); Sonnet only (ON)
- Imprint: 0.5.1
- Memories: 5,427
## Reading the numbers: cost vs. quality
The two categories where ON costs more — Session Summary and Creation — aren't regressions. They're the categories where Imprint returns substantially richer responses (+35% and +44% quality vs OFF, judged by Claude on completeness, accuracy, structure, specificity, and scope fidelity). ON pulls facts from many indexed sources in one go where OFF reads one or two files and paraphrases. The extra tokens buy a more complete first answer — and for creation prompts especially, a more complete first answer is what cuts the follow-up round trips that never show up in a single-prompt benchmark.
## Information
| Prompt | OFF tokens | ON tokens | Δ % | OFF cost | ON cost | Δ $ |
|---|---|---|---|---|---|---|
| Chunking system | 326,056 | 42,430 | -87.0% | $0.1329 | $0.0880 | -$0.0450 |
| Embedding model | 590,607 | 43,238 | -92.7% | $0.1703 | $0.0602 | -$0.1100 |
| Workspace isolation | 1,512,190 | 163,335 | -89.2% | $0.3007 | $0.2093 | -$0.0914 |
| Metadata tagging | 482,231 | 77,036 | -84.0% | $0.1745 | $0.0745 | -$0.1000 |
| Qdrant lifecycle | 401,752 | 96,399 | -76.0% | $0.1583 | $0.1056 | -$0.0527 |
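The Δ % column in every table is the usual relative change of ON vs OFF. A one-line sketch, checked against the first Information row:

```python
def pct_delta(off: float, on: float) -> float:
    """Relative change of ON vs OFF, in percent; 0 means a tie."""
    return (on - off) / off * 100.0

# First Information row: 326,056 OFF tokens -> 42,430 ON tokens
print(round(pct_delta(326_056, 42_430), 1))  # -87.0
```

Negative values mean ON used fewer tokens (or dollars) than the OFF baseline; positive values mean it used more.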
## Decision Recall
| Prompt | OFF tokens | ON tokens | Δ % | OFF cost | ON cost | Δ $ |
|---|---|---|---|---|---|---|
| Qdrant vs embedded | 594,978 | 138,967 | -76.6% | $0.1616 | $0.1110 | -$0.0506 |
| LLM tagger default off | 261,359 | 42,253 | -83.8% | $0.1301 | $0.0461 | -$0.0841 |
## Debugging
| Prompt | OFF tokens | ON tokens | Δ % | OFF cost | ON cost | Δ $ |
|---|---|---|---|---|---|---|
| Search returns nothing | 986,996 | 13,821 | -98.6% | $0.1917 | $0.0249 | -$0.1668 |
| Slow embedding on first run | 912,618 | 96,601 | -89.4% | $0.2431 | $0.1129 | -$0.1301 |
## Cross-Project
| Prompt | OFF tokens | ON tokens | Δ % | OFF cost | ON cost | Δ $ |
|---|---|---|---|---|---|---|
| Async work pattern | 1,102,646 | 88,840 | -91.9% | $0.2564 | $0.1106 | -$0.1459 |
| Config precedence | 1,247,907 | 131,959 | -89.4% | $0.2390 | $0.1523 | -$0.0867 |
## Session Summary
| Prompt | OFF tokens | ON tokens | Δ % | OFF cost | ON cost | Δ $ |
|---|---|---|---|---|---|---|
| Recent activity | 37,050 | 103,597 | +179.6% | $0.0316 | $0.0961 | +$0.0645 |
ON mode doesn't read a file and paraphrase — it searches memory, walks the graph, and returns a cross-project synthesis. The ON response is ≈1.4× the character count of the OFF baseline, with sharper framing and granular implementation details. The token/cost delta is the price of pulling context from many indexed sources instead of a single `git log` + Read.
## Creation
| Prompt | OFF tokens | ON tokens | Δ % | OFF cost | ON cost | Δ $ |
|---|---|---|---|---|---|---|
| Qdrant connectivity test | 609,423 | 749,982 | +23.1% | $0.2298 | $0.3170 | +$0.0872 |
| Memory count utility | 735,046 | 536,466 | -27.0% | $0.2265 | $0.2057 | -$0.0207 |
| Workspace stats CLI | 482,776 | 721,068 | +49.4% | $0.1947 | $0.2269 | +$0.0321 |
ON code answers include per-project breakdowns, call chains, docstrings, and config citations (≈1.5–2.2× the chars of OFF). A richer, more specific first response means fewer clarifying questions and retries downstream — so the +15% cost here avoids an often-larger spend across the follow-up turns you don't see in a single-prompt benchmark.
## How to reproduce
```sh
# Full suite: 15 prompts × 5 runs × 2 modes = 150 runs
bash benchmark/run.sh

# Subset (one prompt per category)
bash benchmark/run.sh --subset

# Filter by category or prompt
bash benchmark/run.sh --category debug
bash benchmark/run.sh --prompt info-1

# Summarize existing results
python3 benchmark/summarize.py benchmark/results/raw/<sha>
```

## What we're measuring
- Total tokens: input + output + cache_read + cache_create, summed across all models (not just the primary Sonnet — Haiku sub-agents count too).
- Cost: `costUSD` from Claude Code's JSON output, summed across all models.
- Turns: number of LLM turns per run. ON mode often takes more turns (search → answer) but with drastically smaller payloads.
- Quality: optional LLM-as-judge (`--llm-quality`) scores ON vs OFF responses on completeness, accuracy, and structure.
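The token and cost definitions above reduce to a sum over each run's per-model usage records. A minimal sketch; the field names (`input_tokens`, `cache_read_input_tokens`, etc.) are assumptions modeled on Claude Code's JSON output, so check the raw files for the exact keys:

```python
def run_totals(usage_by_model: dict) -> tuple[int, float]:
    """Sum tokens and cost across every model in one run.

    Counts all four token classes -- input, output, cache read, cache
    create -- so Haiku sub-agent usage is included, not just Sonnet.
    Key names are illustrative assumptions, not a guaranteed schema.
    """
    tokens = sum(
        u.get("input_tokens", 0)
        + u.get("output_tokens", 0)
        + u.get("cache_read_input_tokens", 0)
        + u.get("cache_creation_input_tokens", 0)
        for u in usage_by_model.values()
    )
    cost = sum(u.get("costUSD", 0.0) for u in usage_by_model.values())
    return tokens, cost
```

Using `.get(..., 0)` keeps the sum correct when a model's record omits a token class, e.g. a Haiku sub-agent that never touched the cache.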