feat: add local_bm25 sparse provider for hybrid retrieval by rocke2020 · Pull Request #1857 · volcengine/OpenViking

rocke2020 · 2026-05-05T08:09:01Z

Summary

Adds local_bm25 as a new sparse embedding provider, enabling local BM25 lexical retrieval without cloud APIs or heavyweight ML dependencies
Users can combine any dense embedding (e.g., Ollama qwen3-embedding:0.6b) with local BM25 for hybrid search via the existing CompositeHybridEmbedder path
Uses Milvus-inspired TF/IDF split architecture: documents store length-normalized TF at insert time, queries compute IDF-weighted vectors from live corpus stats at search time

Design

Follows the existing SparseEmbedderBase interface. The C++ sparse engine only supports dot product, so full BM25 scoring is pre-baked into the vectors:

Document vector: tf / (tf + k1 * (1 - b + b * doclen/avgdl)) per term
Query vector: idf(t) * (k1 + 1) per term
dot_product(query, doc) = BM25 score

Config example:

embedding:
  dense:
    provider: ollama
    model: qwen3-embedding:0.6b
    dimension: 1024
  sparse:
    provider: local_bm25

Files changed

File	Change
`openviking/models/embedder/local_bm25_embedder.py`	NEW: BM25 embedder (tokenizer, CRC32 hashing, corpus stats, scoring)
`openviking/models/embedder/__init__.py`	Export `LocalBM25Embedder`
`openviking_cli/utils/config/embedding_config.py`	Add `local_bm25` to provider validation + factory registry
`tests/unit/test_local_bm25_embedder.py`	28 unit tests

Test plan

Unit test tokenization and CRC32 hashing
Unit test BM25 stats persistence (save/load)
Unit test IDF: rare terms get higher weight than common terms
Unit test dot-product ranking: query "openviking" ranks doc containing "openviking" higher
Config validation: local_bm25 provider accepted, model defaults to "bm25"
Regression: dense-only configs still work unchanged
Integration: EmbeddingConfig with dense+sparse creates CompositeHybridEmbedder

🤖 Generated with Claude Code

CLAassistant · 2026-05-05T08:09:09Z

All committers have signed the CLA.

github-actions · 2026-05-05T08:10:15Z

PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

⏱️ Estimated effort to review: 2 🔵🔵⚪⚪⚪
🏅 Score: 90
🧪 PR contains tests
🔒 No security concerns identified
✅ No TODO sections
🔀 No multiple PR themes
⚡ Recommended focus areas for review Performance Concern Saving BM25 stats to disk after every document embed can cause excessive I/O and slow down bulk insertions. Consider adding a configurable flush interval or an explicit flush method instead. self.stats.add_document(token_hashes, doc_len) if self._stats_path: self.stats.save(self._stats_path)

github-actions · 2026-05-05T08:11:26Z

PR Code Suggestions ✨

No code suggestions found for the PR.

MaojiaSheng · 2026-05-14T12:15:20Z

Thanks. and there are some suggestions:

Tokenization Limitations

DEFAULT_TOKEN_PATTERN = r"\w+"

The regex \w+ doesn't handle CJK (Chinese/Japanese/Korean) text well - it will treat entire sentences as single tokens
Consider adding language-specific tokenization or using a more sophisticated tokenizer for multilingual support

Configuration Flexibility

local_bm25_embedder.py

def init(
self,
model_name: str = "bm25",
k1: float = DEFAULT_K1,
b: float = DEFAULT_B,
token_pattern: str = DEFAULT_TOKEN_PATTERN,
stats_path: Optional[str] = None,
config: Optional[Dict[str, Any]] = None,
):

The custom parameters (k1, b, token_pattern, stats_path) are accepted but not exposed in the config factory
Consider adding these to EmbeddingModelConfig so users can tune BM25 parameters via yaml config

Stats Persistence Granularity

def _embed_document(self, token_hashes: List[int]) -> EmbedResult:
# ...
self.stats.add_document(token_hashes, doc_len)
if self._stats_path:
self.stats.save(self._stats_path) # Saves after EVERY document

Saving stats after every document insertion could cause I/O bottlenecks for bulk inserts
Consider adding a batch mode or periodic autosave instead

ByteDanceLiuYang · 2026-05-14T12:40:53Z

+                self.total_tokens = raw.get("total_tokens", 0)
+                self.term_doc_freq = {int(k): v for k, v in raw.get("term_doc_freq", {}).items()}
+        except (json.JSONDecodeError, ValueError, OSError) as e:
+            logger.warning("bm25: failed to load stats from %s: %s", path, e)


如果有文件损坏等严重错误，最好是抛出异常

ByteDanceLiuYang · 2026-05-14T12:43:18Z

+        if is_query:
+            return self._embed_query(token_hashes)
+        return self._embed_document(token_hashes)
+


todo: 可以考虑实现下embed_batch

ByteDanceLiuYang · 2026-05-14T12:50:08Z

+
+def _hash_token(token: str) -> int:
+    """CRC32 hash of token, matching Milvus approach."""
+    return zlib.crc32(token.encode("utf-8")[:128]) & 0xFFFFFFFF


可以不用照抄milvus的做法，crc32的空间有限，哈希碰撞概率大。这里python项目可以换用例如xxh64

谢谢，已经修改为 xxhash

ByteDanceLiuYang · 2026-05-14T12:54:13Z

+            logger.warning("bm25: failed to load stats from %s: %s", path, e)
+
+
+def _tokenize(text: str, pattern: str = DEFAULT_TOKEN_PATTERN) -> List[str]:


分词器可以拎出来一个配置项，默认至少可以用jieba之类，性能不会太差。当前这个分词器相当于对中文没分词，效果应该不太好。

谢谢！已经默认改成 jieba

…ud dependencies Enables local BM25 lexical retrieval as a sparse provider, allowing users to combine any dense embedding (e.g., Ollama qwen3-embedding) with local BM25 for hybrid search. Uses Milvus-inspired TF/IDF split: documents store length-normalized TF, queries use live IDF from corpus stats. Dot product of the two produces correct BM25 ranking. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

rocke2020 · 2026-05-16T12:48:43Z

Thanks. and there are some suggestions:

Tokenization Limitations

DEFAULT_TOKEN_PATTERN = r"\w+"

The regex \w+ doesn't handle CJK (Chinese/Japanese/Korean) text well - it will treat entire sentences as single tokens

Consider adding language-specific tokenization or using a more sophisticated tokenizer for multilingual support

Configuration Flexibility

local_bm25_embedder.py

def init( self, model_name: str = "bm25", k1: float = DEFAULT_K1, b: float = DEFAULT_B, token_pattern: str = DEFAULT_TOKEN_PATTERN, stats_path: Optional[str] = None, config: Optional[Dict[str, Any]] = None, ):

The custom parameters (k1, b, token_pattern, stats_path) are accepted but not exposed in the config factory

Consider adding these to EmbeddingModelConfig so users can tune BM25 parameters via yaml config

Stats Persistence Granularity

def _embed_document(self, token_hashes: List[int]) -> EmbedResult: # ... self.stats.add_document(token_hashes, doc_len) if self._stats_path: self.stats.save(self._stats_path) # Saves after EVERY document

Saving stats after every document insertion could cause I/O bottlenecks for bulk inserts

Consider adding a batch mode or periodic autosave instead

thanks for your advice. I realized them.

ByteDanceLiuYang · 2026-05-18T03:26:36Z

@rocke2020 LGTM
cc @MaojiaSheng @zhoujh01

github-project-automation Bot added this to OpenViking project May 5, 2026

github-project-automation Bot moved this to Backlog in OpenViking project May 5, 2026

rocke2020 force-pushed the feat/local-bm25-sparse-provider branch from 33ff666 to 1f4db25 Compare May 5, 2026 13:28

qin-ctx requested a review from zhoujh01 May 6, 2026 03:13

qin-ctx assigned zhoujh01 May 6, 2026

rocke2020 force-pushed the feat/local-bm25-sparse-provider branch from 1f4db25 to cc811fc Compare May 12, 2026 22:44

ByteDanceLiuYang reviewed May 14, 2026

View reviewed changes

rocke2020 force-pushed the feat/local-bm25-sparse-provider branch 5 times, most recently from e6f6a8d to 1487897 Compare May 16, 2026 12:33

rocke2020 force-pushed the feat/local-bm25-sparse-provider branch from 1487897 to 06c6edd Compare May 16, 2026 12:36

		logger.warning("bm25: failed to load stats from %s: %s", path, e)


		def _tokenize(text: str, pattern: str = DEFAULT_TOKEN_PATTERN) -> List[str]:

Conversation

rocke2020 commented May 5, 2026

Summary

Design

Files changed

Test plan

Uh oh!

CLAassistant commented May 5, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

github-actions Bot commented May 5, 2026

PR Reviewer Guide 🔍

Uh oh!

github-actions Bot commented May 5, 2026

PR Code Suggestions ✨

Uh oh!

MaojiaSheng commented May 14, 2026

local_bm25_embedder.py

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

rocke2020 commented May 16, 2026

local_bm25_embedder.py

Uh oh!

ByteDanceLiuYang commented May 18, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

CLAassistant commented May 5, 2026 •

edited

Loading