Chunking
Long files don’t embed well as a single vector — the model has limited context, and one giant vector blurs together too many concepts. The chunker splits text into focused, retrieval-friendly pieces.
    graph TB
        INPUT[file content + rel_path] --> EXT{file extension?}
        EXT -->|.py .ts .go .rs .java<br/>.php .rb .swift .sh ...| CODE[Chonkie CodeChunker<br/>tree-sitter AST boundaries]
        EXT -->|.md .txt .rst| PROSE[Chonkie SemanticChunker<br/>topic-shift detection]
        EXT -->|other| LEGACY[regex boundary fallback]

        CODE --> OVERLAP[apply sliding overlap<br/>tail of prev → head of next]
        PROSE --> OVERLAP
        LEGACY --> OVERLAP

        OVERLAP --> CAP[enforce HARD_MAX<br/>cut at line boundary]
        CAP --> WRAP["prefix '[rel_path]\n'<br/>so embedding sees source"]
        WRAP --> OUT[(chunks<br/>typed for downstream)]

        style CODE fill:#0d1117,stroke:#f472b6,color:#fff
        style PROSE fill:#0d1117,stroke:#a78bfa,color:#fff
        style LEGACY fill:#0d1117,stroke:#666,color:#fff

Chonkie hybrid dispatch in `chunker.py`:
- `CodeChunker`: tree-sitter parses the file into an AST and splits at function/class/method boundaries. It is language-aware (Python, TypeScript, Go, Rust, Java, PHP, Ruby, Swift, C/C++, etc.), so it never cuts mid-function.
- `SemanticChunker`: embeds sentences with a tiny Model2Vec static embedder, computes cosine similarity over a sliding window, and splits where similarity drops below `chunker.semantic_threshold` (default `0.5`). Lower = sharper splits / more chunks, higher = fewer, broader chunks. Chunk size is dynamic: a coherent topic stays together; a topic shift triggers a split.
- Sliding overlap: `IMPRINT_CHUNK_OVERLAP=400` chars from the tail of each chunk are prepended to the next. This preserves boundary context so retrieval doesn’t miss signal sitting right at a split.
- Semantic subsplit for code: oversized code chunks (>8000 chars) get secondary topic-shift splitting via `SemanticChunker`. Small focused functions stay whole; large functions split where the logic changes.
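
The routing and post-processing steps described above (extension dispatch, sliding overlap, hard cap at a line boundary, path prefix) can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the code in `chunker.py`: the function names, the shortened extension sets, and the choice to split (rather than truncate) at the hard cap are all hypothetical, and the Chonkie chunkers themselves are not called here.

```python
from pathlib import PurePosixPath

# Illustrative extension sets only; the real lists are longer.
CODE_EXTS = {".py", ".ts", ".go", ".rs", ".java", ".php", ".rb", ".swift", ".sh"}
PROSE_EXTS = {".md", ".txt", ".rst"}


def pick_strategy(rel_path: str) -> str:
    """Return which chunker family a file would be routed to."""
    ext = PurePosixPath(rel_path).suffix.lower()
    if ext in CODE_EXTS:
        return "code"
    if ext in PROSE_EXTS:
        return "prose"
    return "legacy"


def apply_overlap(chunks: list[str], overlap: int = 400) -> list[str]:
    """Prepend the last `overlap` chars of each chunk to the chunk after it."""
    if overlap <= 0:
        return list(chunks)
    out = []
    for i, chunk in enumerate(chunks):
        if i == 0:
            out.append(chunk)
        else:
            out.append(chunks[i - 1][-overlap:] + chunk)
    return out


def enforce_hard_max(chunk: str, hard_max: int = 8000) -> list[str]:
    """Split an oversized chunk, cutting just after the last newline under the cap."""
    pieces = []
    while len(chunk) > hard_max:
        cut = chunk.rfind("\n", 0, hard_max)
        # No usable newline inside the window: fall back to a cut at the cap.
        cut = cut + 1 if cut > 0 else hard_max
        pieces.append(chunk[:cut])
        chunk = chunk[cut:]
    if chunk:
        pieces.append(chunk)
    return pieces


def postprocess(chunks: list[str], rel_path: str,
                overlap: int = 400, hard_max: int = 8000) -> list[str]:
    """Overlap, then cap, then prefix each piece with its source path."""
    capped = []
    for chunk in apply_overlap(chunks, overlap):
        capped.extend(enforce_hard_max(chunk, hard_max))
    return [f"[{rel_path}]\n{piece}" for piece in capped]
```

In this sketch the `[rel_path]` prefix is added after the cap, following the order in the diagram, so the prefix length is not counted against the hard maximum.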
| Knob | Default | Effect |
|---|---|---|
| `IMPRINT_CHUNK_SIZE_CODE` | 4000 chars | Soft target for code (semantic subsplit above 2×) |
| `IMPRINT_CHUNK_SIZE_PROSE` | 6000 chars | Soft target for prose (topic-shift is primary boundary) |
| `IMPRINT_CHUNK_HARD_MAX` | 8000 chars | Absolute cap (~2k tokens, fits within 2048 token context) |
| `IMPRINT_CHUNK_OVERLAP` | 400 chars | Sliding window between chunks |
Full setting names for persistence via imprint config set: chunker.size_code, chunker.size_prose, chunker.hard_max, chunker.overlap, chunker.semantic_threshold. See configuration.md.
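
As a rough illustration of how the four environment knobs and their defaults fit together, the sketch below reads them into a settings object. The `ChunkerSettings` class and `load_settings` helper are hypothetical names, and how the real tool combines environment variables with values persisted through `imprint config set` is not covered here.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class ChunkerSettings:
    """Defaults mirror the table above; all values are in characters."""
    size_code: int = 4000
    size_prose: int = 6000
    hard_max: int = 8000
    overlap: int = 400


def load_settings(env=None) -> ChunkerSettings:
    """Read the IMPRINT_CHUNK_* variables, falling back to the defaults."""
    env = os.environ if env is None else env

    def read(name: str, default: int) -> int:
        raw = env.get(name)
        # Unset or empty variables keep the documented default.
        return int(raw) if raw not in (None, "") else default

    return ChunkerSettings(
        size_code=read("IMPRINT_CHUNK_SIZE_CODE", 4000),
        size_prose=read("IMPRINT_CHUNK_SIZE_PROSE", 6000),
        hard_max=read("IMPRINT_CHUNK_HARD_MAX", 8000),
        overlap=read("IMPRINT_CHUNK_OVERLAP", 400),
    )
```

Passing a plain dict instead of `os.environ`, for example `load_settings({"IMPRINT_CHUNK_OVERLAP": "200"})`, makes the override behavior easy to check in isolation.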