Skip to content

feat: add local_bm25 sparse provider for hybrid retrieval#1857

Open
rocke2020 wants to merge 1 commit into
volcengine:mainfrom
rocke2020:feat/local-bm25-sparse-provider
Open

feat: add local_bm25 sparse provider for hybrid retrieval#1857
rocke2020 wants to merge 1 commit into
volcengine:mainfrom
rocke2020:feat/local-bm25-sparse-provider

Conversation

@rocke2020
Copy link
Copy Markdown

Summary

  • Adds local_bm25 as a new sparse embedding provider, enabling local BM25 lexical retrieval without cloud APIs or heavyweight ML dependencies
  • Users can combine any dense embedding (e.g., Ollama qwen3-embedding:0.6b) with local BM25 for hybrid search via the existing CompositeHybridEmbedder path
  • Uses Milvus-inspired TF/IDF split architecture: documents store length-normalized TF at insert time, queries compute IDF-weighted vectors from live corpus stats at search time

Design

Follows the existing SparseEmbedderBase interface. The C++ sparse engine only supports dot product, so full BM25 scoring is pre-baked into the vectors:

  • Document vector: tf / (tf + k1 * (1 - b + b * doclen/avgdl)) per term
  • Query vector: idf(t) * (k1 + 1) per term
  • dot_product(query, doc) = BM25 score

Config example:

embedding:
  dense:
    provider: ollama
    model: qwen3-embedding:0.6b
    dimension: 1024
  sparse:
    provider: local_bm25

Files changed

File Change
openviking/models/embedder/local_bm25_embedder.py NEW: BM25 embedder (tokenizer, CRC32 hashing, corpus stats, scoring)
openviking/models/embedder/__init__.py Export LocalBM25Embedder
openviking_cli/utils/config/embedding_config.py Add local_bm25 to provider validation + factory registry
tests/unit/test_local_bm25_embedder.py 28 unit tests

Test plan

  • Unit test tokenization and CRC32 hashing
  • Unit test BM25 stats persistence (save/load)
  • Unit test IDF: rare terms get higher weight than common terms
  • Unit test dot-product ranking: query "openviking" ranks doc containing "openviking" higher
  • Config validation: local_bm25 provider accepted, model defaults to "bm25"
  • Regression: dense-only configs still work unchanged
  • Integration: EmbeddingConfig with dense+sparse creates CompositeHybridEmbedder

🤖 Generated with Claude Code

@CLAassistant
Copy link
Copy Markdown

CLAassistant commented May 5, 2026

CLA assistant check
All committers have signed the CLA.

@github-actions
Copy link
Copy Markdown

github-actions Bot commented May 5, 2026

PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

⏱️ Estimated effort to review: 2 🔵🔵⚪⚪⚪
🏅 Score: 90
🧪 PR contains tests
🔒 No security concerns identified
✅ No TODO sections
🔀 No multiple PR themes
⚡ Recommended focus areas for review

Performance Concern

Saving BM25 stats to disk after every document embed can cause excessive I/O and slow down bulk insertions. Consider adding a configurable flush interval or an explicit flush method instead.

self.stats.add_document(token_hashes, doc_len)
if self._stats_path:
    self.stats.save(self._stats_path)

@github-actions
Copy link
Copy Markdown

github-actions Bot commented May 5, 2026

PR Code Suggestions ✨

No code suggestions found for the PR.

@rocke2020 rocke2020 force-pushed the feat/local-bm25-sparse-provider branch from 33ff666 to 1f4db25 Compare May 5, 2026 13:28
@qin-ctx qin-ctx requested a review from zhoujh01 May 6, 2026 03:13
@rocke2020 rocke2020 force-pushed the feat/local-bm25-sparse-provider branch from 1f4db25 to cc811fc Compare May 12, 2026 22:44
@MaojiaSheng
Copy link
Copy Markdown
Collaborator

Thanks. and there are some suggestions:

  1. Tokenization Limitations

DEFAULT_TOKEN_PATTERN = r"\w+"

  • The regex \w+ doesn't handle CJK (Chinese/Japanese/Korean) text well - it will treat entire sentences as single tokens
  • Consider adding language-specific tokenization or using a more sophisticated tokenizer for multilingual support
  1. Configuration Flexibility

local_bm25_embedder.py

def init(
self,
model_name: str = "bm25",
k1: float = DEFAULT_K1,
b: float = DEFAULT_B,
token_pattern: str = DEFAULT_TOKEN_PATTERN,
stats_path: Optional[str] = None,
config: Optional[Dict[str, Any]] = None,
):

  • The custom parameters (k1, b, token_pattern, stats_path) are accepted but not exposed in the config factory
  • Consider adding these to EmbeddingModelConfig so users can tune BM25 parameters via yaml config
  1. Stats Persistence Granularity

def _embed_document(self, token_hashes: List[int]) -> EmbedResult:
# ...
self.stats.add_document(token_hashes, doc_len)
if self._stats_path:
self.stats.save(self._stats_path) # Saves after EVERY document

  • Saving stats after every document insertion could cause I/O bottlenecks for bulk inserts
  • Consider adding a batch mode or periodic autosave instead

self.total_tokens = raw.get("total_tokens", 0)
self.term_doc_freq = {int(k): v for k, v in raw.get("term_doc_freq", {}).items()}
except (json.JSONDecodeError, ValueError, OSError) as e:
logger.warning("bm25: failed to load stats from %s: %s", path, e)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

如果有文件损坏等严重错误,最好是抛出异常

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes

if is_query:
return self._embed_query(token_hashes)
return self._embed_document(token_hashes)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

todo: 可以考虑实现下embed_batch

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes


def _hash_token(token: str) -> int:
"""CRC32 hash of token, matching Milvus approach."""
return zlib.crc32(token.encode("utf-8")[:128]) & 0xFFFFFFFF
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

可以不用照抄milvus的做法,crc32的空间有限,哈希碰撞概率大。这里python项目可以换用例如xxh64

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

谢谢,已经修改为 xxhash

logger.warning("bm25: failed to load stats from %s: %s", path, e)


def _tokenize(text: str, pattern: str = DEFAULT_TOKEN_PATTERN) -> List[str]:
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

分词器可以拎出来一个配置项,默认至少可以用jieba之类,性能不会太差。当前这个分词器相当于对中文没分词,效果应该不太好。

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

谢谢!已经默认改成 jieba

@rocke2020 rocke2020 force-pushed the feat/local-bm25-sparse-provider branch 5 times, most recently from e6f6a8d to 1487897 Compare May 16, 2026 12:33
…ud dependencies

Enables local BM25 lexical retrieval as a sparse provider, allowing users to combine
any dense embedding (e.g., Ollama qwen3-embedding) with local BM25 for hybrid search.
Uses Milvus-inspired TF/IDF split: documents store length-normalized TF, queries use
live IDF from corpus stats. Dot product of the two produces correct BM25 ranking.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@rocke2020 rocke2020 force-pushed the feat/local-bm25-sparse-provider branch from 1487897 to 06c6edd Compare May 16, 2026 12:36
@rocke2020
Copy link
Copy Markdown
Author

Thanks. and there are some suggestions:

  1. Tokenization Limitations

DEFAULT_TOKEN_PATTERN = r"\w+"

  • The regex \w+ doesn't handle CJK (Chinese/Japanese/Korean) text well - it will treat entire sentences as single tokens
  • Consider adding language-specific tokenization or using a more sophisticated tokenizer for multilingual support
  1. Configuration Flexibility

local_bm25_embedder.py

def init( self, model_name: str = "bm25", k1: float = DEFAULT_K1, b: float = DEFAULT_B, token_pattern: str = DEFAULT_TOKEN_PATTERN, stats_path: Optional[str] = None, config: Optional[Dict[str, Any]] = None, ):

  • The custom parameters (k1, b, token_pattern, stats_path) are accepted but not exposed in the config factory
  • Consider adding these to EmbeddingModelConfig so users can tune BM25 parameters via yaml config
  1. Stats Persistence Granularity

def _embed_document(self, token_hashes: List[int]) -> EmbedResult: # ... self.stats.add_document(token_hashes, doc_len) if self._stats_path: self.stats.save(self._stats_path) # Saves after EVERY document

  • Saving stats after every document insertion could cause I/O bottlenecks for bulk inserts
  • Consider adding a batch mode or periodic autosave instead

thanks for your advice. I realized them.

@ByteDanceLiuYang
Copy link
Copy Markdown

@rocke2020 LGTM
cc @MaojiaSheng @zhoujh01

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

Status: Backlog

Development

Successfully merging this pull request may close these issues.

5 participants