
# Embeddings

The system converts every chunk of text — code, markdown, conversation, decision — into a dense vector (768-dim by default) that captures its meaning. Semantically similar text lands close together in vector space, which is what makes semantic search possible.
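Because every vector is L2-normalized before storage, cosine similarity reduces to a plain dot product, which is what Qdrant's cosine distance exploits. A minimal NumPy sketch (the vectors here are random stand-ins, not real embeddings):

```python
import numpy as np

def l2_normalize(v: np.ndarray) -> np.ndarray:
    # Scale to unit length; after this, cosine similarity is just a dot product.
    return v / np.linalg.norm(v)

rng = np.random.default_rng(0)
a = l2_normalize(rng.standard_normal(768))  # stand-in for an embedded chunk
b = l2_normalize(rng.standard_normal(768))

cos_sim = float(a @ b)  # equals cosine similarity, since both are unit vectors
```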

```mermaid
graph LR
    TXT["text chunk<br/>'JWT token validation'"] --> TOK[XLM-RoBERTa tokenizer<br/>truncate to MAX_SEQ_LENGTH]
    TOK --> ONNX[EmbeddingGemma ONNX<br/>300M params, fp32]
    ONNX --> POOL[Pooling<br/>auto: mean / cls / last]
    POOL --> NORM[L2 normalize<br/>unit-length vector]
    NORM --> VEC["[0.041, -0.012, 0.087, ...]<br/>768 floats"]
    VEC --> QDR[(Qdrant<br/>cosine distance)]
    style ONNX fill:#0d1117,stroke:#fbbf24,color:#fff
    style QDR fill:#1a1a3a,stroke:#60a5fa,color:#fff
```

Default model: Google EmbeddingGemma-300M — 300M params, 768-dim, 2048-token context, multilingual (100+ languages), built from Gemma 3. We use the ONNX Community export, which ships fp32, q8, and q4 variants (no fp16).

License note: EmbeddingGemma weights are governed by the Gemma Terms of Use and Prohibited Use Policy, not Apache/MIT. Imprint does not bundle the weights — they’re downloaded at runtime from HuggingFace, where you accept Gemma’s terms. If you switch to a different model (e.g. BGE-M3, MIT-licensed) you’re on that model’s license instead.

Alternative: BAAI BGE-M3 — 568M params, 1024-dim, 8192 token context. Switch via imprint config set model.name Xenova/bge-m3 && imprint config set model.dim 1024.

Swap to any HuggingFace ONNX model via imprint config set model.name <repo> — set model.dim and model.seq_length to match.

Pooling is configurable (imprint config set model.pooling <strategy>): auto (default — picks per model), cls (BGE-M3), mean, last. If the ONNX model returns pre-pooled 2D output, pooling is skipped automatically.
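The strategies differ only in which token vectors survive the reduction. A per-sequence sketch (shapes and names are illustrative, not the actual embeddings.py code):

```python
import numpy as np

def pool(hidden: np.ndarray, mask: np.ndarray, strategy: str) -> np.ndarray:
    # hidden: (seq_len, dim) token embeddings; mask: (seq_len,) 1 = real token, 0 = padding.
    if hidden.ndim == 1:
        return hidden                       # already pooled by the model: skip
    if strategy == "cls":
        return hidden[0]                    # first token (BGE-M3 style)
    if strategy == "last":
        return hidden[int(mask.sum()) - 1]  # last non-padding token
    # "mean": average over non-padding tokens only
    return (hidden * mask[:, None]).sum(axis=0) / mask.sum()
```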

Memory & speed safeguards (set in embeddings.py):

- ONNX `enable_cpu_mem_arena=False` + `enable_mem_pattern=False` — releases activations between calls instead of pinning a worst-case arena. Keeps RSS bounded on WSL2/low-RAM boxes.
- Length-bucketed batching in `embed_documents_batch` — sorts chunks by length so each batch pads to the longest item in its bucket, not the global max. Critical because activation memory scales with batch × seq_len.
- Per-batch `gc.collect()` — drops intermediate tensors before the next iteration.
- GPU VRAM cap via `gpu_mem_limit` (default 2 GB) + `arena_extend_strategy=kSameAsRequested` — avoids the unbounded power-of-two arena growth that crashed WSL2 on long ingests.
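The length-bucketing idea can be sketched as follows (illustrative, not the actual embed_documents_batch; character counts stand in for token counts):

```python
def bucketed_batches(chunks: list[str], batch_size: int) -> list[list[int]]:
    # Sort indices by length so each batch pads only to its own longest member,
    # not the global maximum: activation memory scales with batch x seq_len.
    order = sorted(range(len(chunks)), key=lambda i: len(chunks[i]))
    return [order[s:s + batch_size] for s in range(0, len(chunks), batch_size)]

# Embed batch by batch, then scatter results back into the original order.
chunks = ["short", "a much longer chunk of text", "mid-size"]
batches = bucketed_batches(chunks, batch_size=2)
```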

Embedding throughput on CPU is sufficient for incremental refresh but slow for initial large ingests. GPU is ~20× faster.

```shell
# Force GPU
IMPRINT_DEVICE=gpu imprint ingest ~/code
# Force CPU (e.g. on a headless box without CUDA)
IMPRINT_DEVICE=cpu imprint ingest ~/code
# Auto-detect (default) — uses GPU if onnxruntime-gpu + CUDAExecutionProvider available
imprint ingest ~/code
```
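The auto-detect rule can be sketched like this (`pick_device` and the env-var handling are illustrative; `get_available_providers` is the real onnxruntime call):

```python
import importlib.util
import os

def pick_device() -> str:
    # Honor an explicit IMPRINT_DEVICE, otherwise probe for a usable CUDA provider.
    forced = os.environ.get("IMPRINT_DEVICE", "auto").lower()
    if forced in ("cpu", "gpu"):
        return forced
    if importlib.util.find_spec("onnxruntime") is not None:
        import onnxruntime as ort
        if "CUDAExecutionProvider" in ort.get_available_providers():
            return "gpu"
    return "cpu"
```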

GPU setup (one-time):

```shell
.venv/bin/pip install onnxruntime-gpu \
  nvidia-cuda-runtime-cu12 nvidia-cublas-cu12 nvidia-cudnn-cu12 \
  nvidia-cufft-cu12 nvidia-curand-cu12
```

At startup, _preload_cuda_libs() dlopens the pip-installed CUDA libraries before constructing the ORT session, so you don't need LD_LIBRARY_PATH set at process start.
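The mechanism is roughly this (a sketch of the idea, not the actual _preload_cuda_libs; the glob pattern is an assumption about the nvidia-* wheel layout):

```python
import ctypes
import glob
import os
import site

def preload_cuda_libs() -> None:
    # dlopen the CUDA shared libraries shipped inside the pip-installed
    # nvidia-* wheels with RTLD_GLOBAL, so the ORT CUDA provider can resolve
    # their symbols without LD_LIBRARY_PATH being set at process start.
    for pkg_dir in site.getsitepackages():
        pattern = os.path.join(pkg_dir, "nvidia", "*", "lib", "*.so*")
        for lib in sorted(glob.glob(pattern)):
            try:
                ctypes.CDLL(lib, mode=ctypes.RTLD_GLOBAL)
            except OSError:
                pass  # skip libraries whose own dependencies aren't loaded yet
```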

Tunables (also configurable via imprint config set model.* — full list in configuration.md):

| Setting key | Default | Notes |
| --- | --- | --- |
| `model.device` | `auto` | `auto` / `cpu` / `gpu` |
| `model.gpu_mem_mb` | `2048` | VRAM cap for ORT CUDA arena (WSL2-safe; raise on dedicated GPUs) |
| `model.gpu_device` | `0` | CUDA device index |
| `model.threads` | `4` | CPU intra-op threads |
| `model.batch_size` | `0` (auto) | Embedding batch size. `0` = auto (32 on GPU, 16 on CPU). Raise on dedicated GPUs for faster ingest. |
| `model.seq_length` | `2048` | Token truncation cap |
| `model.name` | `onnx-community/embeddinggemma-300m-ONNX` | HF repo (any HuggingFace ONNX model) |
| `model.dim` | `768` | Embedding dimension (must match model) |
| `model.file` | `auto` | Override variant pick |
| `model.pooling` | `auto` | Pooling: `auto` / `cls` / `mean` / `last` |
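For example, on a dedicated GPU you might loosen the WSL2-safe defaults (the values below are illustrative, not recommendations):

```shell
# Raise the CUDA arena cap and batch size on a card with VRAM headroom
imprint config set model.gpu_mem_mb 6144
imprint config set model.batch_size 64
imprint config set model.threads 8
```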