
Imprint token savings benchmark

Measured with the Claude Code CLI in print mode. Both modes use identical prompts and tool permissions; the ON run additionally loads the Imprint MCP server and injects the project's CLAUDE.md as the system prompt. Figures are medians across 5 runs per prompt. Raw JSON for every run is checked in at benchmark/results/raw/<sha>/.
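The only difference between the two modes is the invocation. A minimal sketch of how each run's command line might be assembled; the flag names (`-p`, `--output-format`, `--mcp-config`, `--append-system-prompt`) are assumptions about the Claude Code CLI, and `benchmark/mcp.json` is a hypothetical server config:

```python
def build_command(prompt: str, mode: str, system_prompt: str = "") -> list[str]:
    """Assemble the CLI invocation for one benchmark run.

    Flag names are assumptions about the Claude Code CLI;
    'benchmark/mcp.json' is a hypothetical MCP server config.
    """
    cmd = ["claude", "-p", prompt, "--output-format", "json"]
    if mode == "on":
        # ON additionally loads the Imprint MCP server and injects
        # the project's CLAUDE.md as the system prompt.
        cmd += ["--mcp-config", "benchmark/mcp.json",
                "--append-system-prompt", system_prompt]
    return cmd
```

Everything downstream (token counting, cost, judging) sees only the JSON output, so the two modes stay comparable.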

  • Tokens: -70.4% (10,283,635 → 3,045,992)
  • Cost: -31.7% ($2.8413 → $1.9412)
  • Quality: +29.5% (weighted avg · LLM-as-judge, vs. the OFF baseline; 0% = tie)
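The headline percentages are plain ratios over the summed totals above; a quick check of the arithmetic:

```python
def pct_delta(off: float, on: float) -> float:
    """Signed percentage change from the OFF total to the ON total."""
    return (on - off) / off * 100

tokens_delta = pct_delta(10_283_635, 3_045_992)  # ≈ -70.4
cost_delta = pct_delta(2.8413, 1.9412)           # ≈ -31.7
```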
Environment
  • OS: Ubuntu 24.04
  • Python: 3.11.5
  • Models: Sonnet + Haiku (OFF); Sonnet only (ON)
  • Imprint: 0.5.1
  • Memories: 5,427

Reading the numbers: cost vs. quality

The two categories where ON costs more — Session Summary and Creation — aren't regressions. They're the categories where Imprint returns substantially richer responses (+35% and +44% quality vs OFF, judged by Claude on completeness, accuracy, structure, specificity, and scope fidelity). ON pulls facts from many indexed sources in one go where OFF reads one or two files and paraphrases. The extra tokens buy a more complete first answer — and for creation prompts especially, a more complete first answer is what cuts the follow-up round trips that never show up in a single-prompt benchmark.

Information

tokens -87.2% · cost -42.6% · quality +3%

| Prompt | OFF tokens | ON tokens | Δ % | OFF cost | ON cost | Δ $ |
| --- | --- | --- | --- | --- | --- | --- |
| Chunking system | 326,056 | 42,430 | -87.0% | $0.1329 | $0.0880 | -$0.0450 |
| Embedding model | 590,607 | 43,238 | -92.7% | $0.1703 | $0.0602 | -$0.1100 |
| Workspace isolation | 1,512,190 | 163,335 | -89.2% | $0.3007 | $0.2093 | -$0.0914 |
| Metadata tagging | 482,231 | 77,036 | -84.0% | $0.1745 | $0.0745 | -$0.1000 |
| Qdrant lifecycle | 401,752 | 96,399 | -76.0% | $0.1583 | $0.1056 | -$0.0527 |

Decision Recall

tokens -78.8% · cost -46.1% · quality +42%

| Prompt | OFF tokens | ON tokens | Δ % | OFF cost | ON cost | Δ $ |
| --- | --- | --- | --- | --- | --- | --- |
| Qdrant vs embedded | 594,978 | 138,967 | -76.6% | $0.1616 | $0.1110 | -$0.0506 |
| LLM tagger default off | 261,359 | 42,253 | -83.8% | $0.1301 | $0.0461 | -$0.0841 |

Debugging

tokens -94.2% · cost -68.3% · quality +36%

| Prompt | OFF tokens | ON tokens | Δ % | OFF cost | ON cost | Δ $ |
| --- | --- | --- | --- | --- | --- | --- |
| Search returns nothing | 986,996 | 13,821 | -98.6% | $0.1917 | $0.0249 | -$0.1668 |
| Slow embedding on first run | 912,618 | 96,601 | -89.4% | $0.2431 | $0.1129 | -$0.1301 |

Cross-Project

tokens -90.6% · cost -46.9% · quality +52%

| Prompt | OFF tokens | ON tokens | Δ % | OFF cost | ON cost | Δ $ |
| --- | --- | --- | --- | --- | --- | --- |
| Async work pattern | 1,102,646 | 88,840 | -91.9% | $0.2564 | $0.1106 | -$0.1459 |
| Config precedence | 1,247,907 | 131,959 | -89.4% | $0.2390 | $0.1523 | -$0.0867 |

Session Summary

tokens +179.6% · cost +204.1% · quality +35%

| Prompt | OFF tokens | ON tokens | Δ % | OFF cost | ON cost | Δ $ |
| --- | --- | --- | --- | --- | --- | --- |
| Recent activity | 37,050 | 103,597 | +179.6% | $0.0316 | $0.0961 | +$0.0645 |

Why: higher cost buys richer recall

ON mode doesn't read a file and paraphrase; it searches memory, walks the graph, and returns a cross-project synthesis. The ON response runs ≈1.4× the length of the OFF baseline in characters, with sharper framing and granular implementation details. The token/cost delta is the price of pulling context from many indexed sources instead of one `git log` + Read.

Creation

tokens +9.9% · cost +15.1% · quality +44%

| Prompt | OFF tokens | ON tokens | Δ % | OFF cost | ON cost | Δ $ |
| --- | --- | --- | --- | --- | --- | --- |
| Qdrant connectivity test | 609,423 | 749,982 | +23.1% | $0.2298 | $0.3170 | +$0.0872 |
| Memory count utility | 735,046 | 536,466 | -27.0% | $0.2265 | $0.2057 | -$0.0207 |
| Workspace stats CLI | 482,776 | 721,068 | +49.4% | $0.1947 | $0.2269 | +$0.0321 |

Why: more complete first reply = fewer round trips

ON code answers include per-project breakdowns, call chains, docstrings, and config citations (≈1.5–2.2× the chars of OFF). A richer, more specific first response means fewer clarifying questions and retries downstream — so the +15% cost here avoids an often-larger spend across the follow-up turns you don't see in a single-prompt benchmark.

How to reproduce

```bash
# Full suite: 15 prompts × 5 runs × 2 modes = 150 runs
bash benchmark/run.sh

# Subset (one prompt per category)
bash benchmark/run.sh --subset

# Filter by category or prompt
bash benchmark/run.sh --category debug
bash benchmark/run.sh --prompt info-1

# Summarize existing results
python3 benchmark/summarize.py benchmark/results/raw/<sha>
```
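The summarize step boils down to: load one record per run, group by prompt and mode, and take medians. A sketch of that aggregation, not the actual summarize.py; the `prompt`, `mode`, and `total_tokens` field names are guesses at the raw-JSON schema:

```python
import json
from collections import defaultdict
from pathlib import Path
from statistics import median

def load_runs(raw_dir: str) -> list[dict]:
    """Load one record per run from a results directory
    (e.g. benchmark/results/raw/<sha>/)."""
    return [json.loads(p.read_text()) for p in sorted(Path(raw_dir).glob("*.json"))]

def summarize(records: list[dict]) -> dict:
    """Median total tokens per (prompt, mode) across runs."""
    by_key = defaultdict(list)
    for rec in records:
        by_key[(rec["prompt"], rec["mode"])].append(rec["total_tokens"])
    return {key: median(vals) for key, vals in by_key.items()}
```

Medians rather than means keep one slow or runaway run from skewing a 5-run sample.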

What we're measuring

  • Total tokens: input + output + cache_read + cache_create, summed across all models (not just the primary Sonnet — Haiku sub-agents count).
  • Cost: costUSD from Claude Code's JSON output, summed across all models.
  • Turns: number of LLM turns per run. ON mode often takes more turns (search → answer) but with drastically smaller payloads.
  • Quality: optional LLM-as-judge (--llm-quality) scores ON vs OFF responses on completeness, accuracy, and structure.
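Under these definitions, the token total folds across every model that appears in a run. A minimal sketch, assuming a per-model usage mapping whose field names mirror the first bullet above (the real schema of Claude Code's JSON output may differ):

```python
def total_tokens(usage_by_model: dict) -> int:
    """Sum input + output + cache_read + cache_create across all
    models, so Haiku sub-agents count alongside the primary Sonnet.
    Field names follow the definition above; they are an assumption
    about the actual Claude Code JSON schema."""
    fields = ("input", "output", "cache_read", "cache_create")
    return sum(usage.get(f, 0)
               for usage in usage_by_model.values()
               for f in fields)
```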