Embeddings
The system converts every chunk of text — code, markdown, conversation, decision — into a dense vector (768-dim by default) that captures its meaning. Semantically similar text lands close together in vector space, which is what makes semantic search possible.
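A toy sketch of that idea, with made-up 3-dimensional vectors standing in for real 768-dim embeddings (the corpus entries and query vector here are invented for illustration):

```python
import math

# Semantically similar text gets nearby vectors, so nearest-neighbor
# search by cosine similarity is semantic search.
def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(y * y for y in b)))

corpus = {
    "JWT token validation": [0.90, 0.10, 0.00],
    "refresh token expiry": [0.70, 0.50, 0.10],
    "CSS grid layout":      [0.00, 0.20, 0.90],
}
query = [0.85, 0.20, 0.05]  # pretend embedding of "auth token checks"
best = max(corpus, key=lambda k: cosine(query, corpus[k]))
# best → "JWT token validation"
```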
```mermaid
graph LR
    TXT["text chunk<br/>'JWT token validation'"] --> TOK[XLM-RoBERTa tokenizer<br/>truncate to MAX_SEQ_LENGTH]
    TOK --> ONNX[EmbeddingGemma ONNX<br/>300M params, fp32]
    ONNX --> POOL[Pooling<br/>auto: mean / cls / last]
    POOL --> NORM[L2 normalize<br/>unit-length vector]
    NORM --> VEC["[0.041, -0.012, 0.087, ...]<br/>768 floats"]
    VEC --> QDR[(Qdrant<br/>cosine distance)]
    style ONNX fill:#0d1117,stroke:#fbbf24,color:#fff
    style QDR fill:#1a1a3a,stroke:#60a5fa,color:#fff
```

Default model: Google EmbeddingGemma-300M — 300M params, 768-dim, 2048-token context, trained on 100+ languages, derived from Gemma 3. We use the ONNX Community export. Supports fp32, q8, and q4 variants (no fp16).
License note: EmbeddingGemma weights are governed by the Gemma Terms of Use and Prohibited Use Policy, not Apache/MIT. Imprint does not bundle the weights — they’re downloaded at runtime from HuggingFace, where you accept Gemma’s terms. If you switch to a different model (e.g. BGE-M3, MIT-licensed) you’re on that model’s license instead.
Alternative: BAAI BGE-M3 — 568M params, 1024-dim, 8192-token context. Switch via `imprint config set model.name Xenova/bge-m3 && imprint config set model.dim 1024`.

Swap to any HuggingFace ONNX model via `imprint config set model.name <repo>` — set `model.dim` and `model.seq_length` to match.

Pooling is configurable (`imprint config set model.pooling <strategy>`): `auto` (default — picks per model), `cls` (BGE-M3), `mean`, `last`. If the ONNX model returns pre-pooled 2D output, pooling is skipped automatically.
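The strategies can be sketched as follows; the `pool` helper and its list-of-lists input are illustrative assumptions, not imprint's actual implementation:

```python
# Pooling: collapse seq_len x dim token embeddings into one dim-vector.
# Strategy names match the config values above.
def pool(hidden, mask, strategy="mean"):
    """hidden: seq_len x dim token embeddings; mask: 1 = real token, 0 = padding."""
    if hidden and not isinstance(hidden[0], list):
        return hidden  # flat vector: model output was pre-pooled, skip pooling
    if strategy == "cls":
        return hidden[0]  # first token (BGE-M3 style)
    if strategy == "last":
        last = max(i for i, m in enumerate(mask) if m)
        return hidden[last]  # last non-padding token
    # "mean": average over non-padding tokens only
    n = sum(mask)
    dim = len(hidden[0])
    return [sum(h[d] for h, m in zip(hidden, mask) if m) / n
            for d in range(dim)]
```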
Memory & speed safeguards (set in `embeddings.py`):

- ONNX `enable_cpu_mem_arena=False` + `enable_mem_pattern=False` — releases activations between calls instead of pinning a worst-case arena. Keeps RSS bounded on WSL2/low-RAM boxes.
- Length-bucketed batching in `embed_documents_batch` — sorts chunks by length so each batch pads to the longest item in its bucket, not the global max. Critical because activation memory scales with `batch × seq_len`.
- Per-batch `gc.collect()` — drops intermediate tensors before the next iteration.
- GPU VRAM cap via `gpu_mem_limit` (default 2 GB) + `arena_extend_strategy=kSameAsRequested` — avoids the unbounded power-of-two arena growth that crashed WSL2 on long ingests.
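The bucketing idea can be sketched as below; `bucketed_batches` is an illustrative helper, not imprint's actual API. The onnxruntime option names in the leading comment are the real ORT settings named above, shown here only as comments so the sketch stays self-contained:

```python
# The ONNX safeguards above map to real onnxruntime settings, roughly:
#   so = onnxruntime.SessionOptions()
#   so.enable_cpu_mem_arena = False   # no pinned worst-case arena
#   so.enable_mem_pattern = False     # release activations between calls
#   providers = [("CUDAExecutionProvider", {
#       "device_id": 0,
#       "gpu_mem_limit": 2 * 1024 ** 3,              # 2 GB VRAM cap
#       "arena_extend_strategy": "kSameAsRequested",  # no power-of-two growth
#   })]

def bucketed_batches(chunks, batch_size):
    """Yield (indices, items) batches, sorted by length.

    Each batch pads to its own longest member rather than the global max,
    so activation memory (~ batch x seq_len) stays bounded. The indices let
    the caller restore the original chunk order after embedding.
    """
    order = sorted(range(len(chunks)), key=lambda i: len(chunks[i]))
    for start in range(0, len(order), batch_size):
        idx = order[start:start + batch_size]
        yield idx, [chunks[i] for i in idx]
```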
GPU Acceleration
Embedding throughput on CPU is sufficient for incremental refresh but slow for initial large ingests. GPU is ~20× faster.
```shell
# Force GPU
IMPRINT_DEVICE=gpu imprint ingest ~/code

# Force CPU (e.g. on a headless box without CUDA)
IMPRINT_DEVICE=cpu imprint ingest ~/code

# Auto-detect (default) — uses GPU if onnxruntime-gpu + CUDAExecutionProvider available
imprint ingest ~/code
```

Setup checklist (one-time):
```shell
.venv/bin/pip install onnxruntime-gpu \
  nvidia-cuda-runtime-cu12 nvidia-cublas-cu12 nvidia-cudnn-cu12 \
  nvidia-cufft-cu12 nvidia-curand-cu12
```

The runtime's `_preload_cuda_libs()` dlopens the pip-installed CUDA libraries before constructing the ORT session, so you don't need `LD_LIBRARY_PATH` set at process start.
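A minimal sketch of what such a pre-load can look like, assuming the usual pip layout of `site-packages/nvidia/*/lib/`; the function name and directory pattern are illustrative, not imprint's actual code:

```python
import ctypes
import glob
import os
import sysconfig

def preload_cuda_libs():
    """dlopen pip-installed CUDA shared libraries with RTLD_GLOBAL so a
    later onnxruntime-gpu session can resolve them without LD_LIBRARY_PATH."""
    site = sysconfig.get_paths()["purelib"]
    pattern = os.path.join(site, "nvidia", "*", "lib", "*.so*")
    loaded = []
    for path in sorted(glob.glob(pattern)):
        try:
            ctypes.CDLL(path, mode=ctypes.RTLD_GLOBAL)
            loaded.append(os.path.basename(path))
        except OSError:
            pass  # skip libs whose own deps aren't loadable yet
    return loaded
```

On a box without the nvidia wheels installed this simply returns an empty list, which is why auto-detect can fall back to CPU cleanly.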
Tunables (also configurable via `imprint config set model.*` — full list in `configuration.md`):
| Setting key | Default | Notes |
|---|---|---|
| `model.device` | `auto` | `auto` / `cpu` / `gpu` |
| `model.gpu_mem_mb` | `2048` | VRAM cap for the ORT CUDA arena (WSL2-safe; raise on dedicated GPUs) |
| `model.gpu_device` | `0` | CUDA device index |
| `model.threads` | `4` | CPU intra-op threads |
| `model.batch_size` | `0` (auto) | Embedding batch size. 0 = auto (32 on GPU, 16 on CPU). Raise on dedicated GPUs for faster ingest. |
| `model.seq_length` | `2048` | Token truncation cap |
| `model.name` | `onnx-community/embeddinggemma-300m-ONNX` | HF repo (any HuggingFace ONNX model) |
| `model.dim` | `768` | Embedding dimension (must match the model) |
| `model.file` | `auto` | Override variant pick |
| `model.pooling` | `auto` | Pooling: `auto` / `cls` / `mean` / `last` |
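As a worked example of the `0 = auto` rule for `model.batch_size` above (a hypothetical helper for illustration, not imprint's code):

```python
def resolve_batch_size(configured: int, device: str) -> int:
    """0 means auto: 32 on GPU, 16 on CPU; any positive value wins as-is."""
    if configured > 0:
        return configured
    return 32 if device == "gpu" else 16
```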