feat: embedding drift detection (canary check) #40

@rajkumarsakthivel

Summary

Detect when the embedding model has silently changed (version bump, ONNX update) and alert the user or trigger a reindex, preventing silent retrieval quality degradation.

Problem

If fastembed or the ONNX model updates between sessions (e.g., user runs uv tool upgrade), the stored vectors no longer align with new query vectors. Retrieval quality degrades silently. The user sees worse results but has no indication why.

Approach

  1. At index time, embed 3-5 fixed canary strings (deterministic, hardcoded)
  2. Store the canary vectors alongside the index metadata
  3. On startup (or first query), re-embed the same canary strings
  4. Compare each stored canary vector with its re-embedded counterpart via cosine similarity. If any similarity falls below a threshold (e.g., 0.99), warn the user and suggest cce index --full
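
The steps above could be sketched roughly as follows. This is a minimal illustration, not the project's implementation: `detect_drift`, `cosine_similarity`, and the `embed` callable are hypothetical names, and `embed` stands in for whatever function wraps the real embedding model. Treating a change in vector dimension as drift is also an assumption.

```python
import math
from typing import Callable, Dict, List, Sequence, Tuple

# Canary strings taken from the examples in this issue.
CANARY_STRINGS = [
    "function hello world",
    "class UserAuthentication",
    "import os sys path",
]
DRIFT_THRESHOLD = 0.99  # example threshold from this issue


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if they cannot be compared."""
    if len(a) != len(b):
        # Assumption: a dimension change is treated as maximal drift.
        return 0.0
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def detect_drift(
    stored: Dict[str, Sequence[float]],
    embed: Callable[[str], Sequence[float]],  # hypothetical model wrapper
    threshold: float = DRIFT_THRESHOLD,
) -> List[Tuple[str, float]]:
    """Re-embed each canary and return (text, similarity) pairs below threshold."""
    drifted = []
    for text, old_vector in stored.items():
        similarity = cosine_similarity(old_vector, embed(text))
        if similarity < threshold:
            drifted.append((text, similarity))
    return drifted
```

A caller would build `stored` at index time with `{text: embed(text) for text in CANARY_STRINGS}`, then on startup warn and suggest a full reindex whenever `detect_drift` returns a non-empty list.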

Implementation details

  • Canary strings should be short, diverse, and deterministic (e.g., "function hello world", "class UserAuthentication", "import os sys path")
  • Store in a canary table in the SQLite index or as a JSON file alongside the manifest
  • Check is cheap: 3-5 embeddings on startup, one cosine comparison each
  • Threshold: a cosine similarity below 0.99 is treated as a model change (the same model on the same input should give ~1.0)
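
For the SQLite storage option, a sketch along these lines might work. The table layout, the column names (`canary_text`, `vector`), and the helper names are assumptions made for illustration; vectors are serialized as JSON text here purely for simplicity.

```python
import json
import sqlite3
from typing import Dict, List, Sequence


def _ensure_table(conn: sqlite3.Connection) -> None:
    # Hypothetical schema: one row per canary string.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS canary ("
        "canary_text TEXT PRIMARY KEY, "
        "vector TEXT NOT NULL)"
    )


def save_canaries(
    conn: sqlite3.Connection, vectors: Dict[str, Sequence[float]]
) -> None:
    """Store (or overwrite) canary vectors, typically at index time."""
    _ensure_table(conn)
    conn.executemany(
        "INSERT OR REPLACE INTO canary (canary_text, vector) VALUES (?, ?)",
        [(text, json.dumps(list(vec))) for text, vec in vectors.items()],
    )
    conn.commit()


def load_canaries(conn: sqlite3.Connection) -> Dict[str, List[float]]:
    """Load stored canary vectors; returns an empty dict if none exist yet."""
    _ensure_table(conn)
    rows = conn.execute("SELECT canary_text, vector FROM canary").fetchall()
    return {text: json.loads(vec) for text, vec in rows}
```

An empty result from `load_canaries` would indicate an index built before canaries were introduced, in which case the check can be skipped and the canaries pinned on the next index run.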

Inspiration

jCodeMunch's "embedding-drift canary" feature pins canary embeddings on first index and re-checks periodically.
