feat: add local_bm25 sparse provider for hybrid retrieval#1857
feat: add local_bm25 sparse provider for hybrid retrieval#1857rocke2020 wants to merge 1 commit into
Conversation
PR Reviewer Guide 🔍Here are some key observations to aid the review process:
|
PR Code Suggestions ✨No code suggestions found for the PR. |
33ff666 to
1f4db25
Compare
1f4db25 to
cc811fc
Compare
|
Thanks. and there are some suggestions:
DEFAULT_TOKEN_PATTERN = r"\w+"
local_bm25_embedder.pydef init(
def _embed_document(self, token_hashes: List[int]) -> EmbedResult:
|
| self.total_tokens = raw.get("total_tokens", 0) | ||
| self.term_doc_freq = {int(k): v for k, v in raw.get("term_doc_freq", {}).items()} | ||
| except (json.JSONDecodeError, ValueError, OSError) as e: | ||
| logger.warning("bm25: failed to load stats from %s: %s", path, e) |
| if is_query: | ||
| return self._embed_query(token_hashes) | ||
| return self._embed_document(token_hashes) | ||
|
|
|
|
||
| def _hash_token(token: str) -> int: | ||
| """CRC32 hash of token, matching Milvus approach.""" | ||
| return zlib.crc32(token.encode("utf-8")[:128]) & 0xFFFFFFFF |
There was a problem hiding this comment.
可以不用照抄milvus的做法,crc32的空间有限,哈希碰撞概率大。这里python项目可以换用例如xxh64
| logger.warning("bm25: failed to load stats from %s: %s", path, e) | ||
|
|
||
|
|
||
| def _tokenize(text: str, pattern: str = DEFAULT_TOKEN_PATTERN) -> List[str]: |
There was a problem hiding this comment.
分词器可以拎出来一个配置项,默认至少可以用jieba之类,性能不会太差。当前这个分词器相当于对中文没分词,效果应该不太好。
e6f6a8d to
1487897
Compare
…ud dependencies Enables local BM25 lexical retrieval as a sparse provider, allowing users to combine any dense embedding (e.g., Ollama qwen3-embedding) with local BM25 for hybrid search. Uses Milvus-inspired TF/IDF split: documents store length-normalized TF, queries use live IDF from corpus stats. Dot product of the two produces correct BM25 ranking. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1487897 to
06c6edd
Compare
thanks for your advice. I realized them. |
|
@rocke2020 LGTM |
Summary
local_bm25as a new sparse embedding provider, enabling local BM25 lexical retrieval without cloud APIs or heavyweight ML dependenciesqwen3-embedding:0.6b) with local BM25 for hybrid search via the existingCompositeHybridEmbedderpathDesign
Follows the existing
SparseEmbedderBaseinterface. The C++ sparse engine only supports dot product, so full BM25 scoring is pre-baked into the vectors:tf / (tf + k1 * (1 - b + b * doclen/avgdl))per termidf(t) * (k1 + 1)per termConfig example:
Files changed
openviking/models/embedder/local_bm25_embedder.pyopenviking/models/embedder/__init__.pyLocalBM25Embedderopenviking_cli/utils/config/embedding_config.pylocal_bm25to provider validation + factory registrytests/unit/test_local_bm25_embedder.pyTest plan
local_bm25provider accepted, model defaults to "bm25"EmbeddingConfigwith dense+sparse createsCompositeHybridEmbedder🤖 Generated with Claude Code