feat(search): add BM25 ranked text search #9652

Open
shaunpatterson wants to merge 12 commits into dgraph-io:main from shaunpatterson:sp/bm25

shaunpatterson commented Mar 4, 2026

Summary

  • Adds BM25 relevance-ranked text search to Dgraph as a new query function and filter
  • New @index(bm25) schema directive and bm25(predicate, "query") DQL syntax
  • Results sorted by BM25 relevance score (IDF-weighted term frequency with document length normalization)
  • Supports custom k/b tuning parameters: bm25(pred, "query", "1.5", "0.5")
  • Works as both root function and @filter

Changes

| File | Description |
| --- | --- |
| tok/tok.go | BM25Tokenizer with duplicate-preserving tokens and TokensWithFrequency |
| tok/tokens.go | GetBM25QueryTokens with deduplication and encoding |
| x/keys.go | BM25IndexKey, BM25DocLenKey, BM25StatsKey helpers |
| posting/index.go | addBM25IndexMutations, updateBM25Stats write path |
| worker/task.go | handleBM25Search with full BM25 scoring engine |
| worker/tokens.go | BM25 verification and token generation wiring |
| dql/parser.go | Register "bm25" as valid function name |
| tok/tok_test.go | 12 unit tests for tokenizer, frequencies, stemming, stopwords |
| query/query_bm25_test.go | 12 integration tests for queries, ordering, filters, pagination |
| query/common_test.go | BM25 test schema and 7 test documents |

BM25 Formula

score(doc, term) = IDF(term) * (k+1) * tf / (k * (1 - b + b * docLen/avgDL) + tf)
IDF(term) = log1p((N - df + 0.5) / (df + 0.5))

Default: k=1.2, b=0.75 (Lucene/Elasticsearch variant with non-negative IDF)

Test plan

  • Unit tests pass: go test ./tok/... -run TestBM25 -v (12 tests)
  • go vet clean on all modified packages
  • All existing tests continue to pass
  • Integration tests: go test -tags integration -run TestBM25 ./query/ -v (requires Docker cluster)

🤖 Generated with Claude Code

@shaunpatterson shaunpatterson requested a review from a team as a code owner March 4, 2026 21:30
shaunpatterson and others added 3 commits March 4, 2026 23:08
Add BM25 relevance-ranked text search to Dgraph, enabling users to query
text predicates and receive results ordered by relevance score instead of
boolean matching.

Implementation:
- New BM25 tokenizer using the fulltext pipeline (normalize, stopwords, stem)
  that preserves term frequencies for TF counting
- BM25-specific index storage: per-term TF posting lists, doc length lists,
  and corpus statistics (doc count, total terms)
- Query execution with full BM25 scoring:
  score = IDF * (k+1) * tf / (k * (1 - b + b * dl/avgDL) + tf)
  IDF = log1p((N - df + 0.5) / (df + 0.5))
- DQL syntax: bm25(predicate, "query" [, "k", "b"]) as root func or filter
- Schema syntax: @index(bm25)
- Parameter validation (k > 0, 0 <= b <= 1)
- Early UID intersection for filter-mode performance
- All-stopword document and query handling
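
A rough sketch of the parameter validation listed above (k > 0, 0 <= b <= 1), assuming the optional k and b arrive as strings as in bm25(pred, "query", "1.5", "0.5"). The helper name `parseBM25Params` and the rule that k and b must be given together are assumptions for illustration, not taken from the PR.

```go
package main

import (
	"fmt"
	"math"
	"strconv"
)

const (
	defaultK = 1.2  // defaults stated in the PR description
	defaultB = 0.75
)

// parseBM25Params is a hypothetical helper. args holds the optional
// tuning arguments that follow the query string: either none, or k and b.
func parseBM25Params(args []string) (k, b float64, err error) {
	if len(args) == 0 {
		return defaultK, defaultB, nil
	}
	if len(args) != 2 {
		return 0, 0, fmt.Errorf("bm25: expected k and b together, got %d argument(s)", len(args))
	}
	if k, err = strconv.ParseFloat(args[0], 64); err != nil {
		return 0, 0, fmt.Errorf("bm25: invalid k %q: %w", args[0], err)
	}
	if b, err = strconv.ParseFloat(args[1], 64); err != nil {
		return 0, 0, fmt.Errorf("bm25: invalid b %q: %w", args[1], err)
	}
	// ParseFloat accepts "NaN" and "Inf"; the negated comparisons reject NaN.
	if !(k > 0) || math.IsInf(k, 0) {
		return 0, 0, fmt.Errorf("bm25: k must be a finite number > 0, got %v", k)
	}
	if !(b >= 0 && b <= 1) {
		return 0, 0, fmt.Errorf("bm25: b must be in [0, 1], got %v", b)
	}
	return k, b, nil
}

func main() {
	k, b, err := parseBM25Params([]string{"1.5", "0.5"})
	fmt.Println(k, b, err)
}
```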

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Three critical bugs fixed:

1. REF postings lose Value during rollup: The posting list encode/rollup
   cycle strips the Value field from REF postings without facets (list.go:1630).
   BM25 term frequencies and doc lengths were stored in Value and lost.
   Fix: Store TF and doclen as facets on REF postings, which are preserved.

2. Missing function validation: query/query.go has a separate isValidFuncName
   check from dql/parser.go. "bm25" was only added to the parser, causing
   "Invalid function name: bm25" at query time.

3. Unsorted UIDs break query pipeline: BM25 returned UIDs sorted by score,
   but the query pipeline (algo.MergeSorted, child predicate fetching) requires
   UID-ascending order. Fix: Sort UIDs ascending in UidMatrix, apply
   first/offset pagination on score-sorted results before UID sorting.
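
The ordering fix in item 3 can be sketched as follows: paginate while the results are still ranked by score, then hand the surviving UIDs to the rest of the pipeline in ascending order. The `scored` type and `topPage` function are hypothetical names, and treating first <= 0 as "no limit" is a simplification made for this sketch.

```go
package main

import (
	"fmt"
	"sort"
)

// scored pairs a UID with its BM25 score (hypothetical type).
type scored struct {
	uid   uint64
	score float64
}

// topPage applies offset/first on score-ranked results, then returns
// the selected UIDs in ascending order for UID-sorted consumers.
func topPage(results []scored, offset, first int) []uint64 {
	ranked := append([]scored(nil), results...) // do not reorder the caller's slice
	sort.Slice(ranked, func(i, j int) bool {
		if ranked[i].score != ranked[j].score {
			return ranked[i].score > ranked[j].score // best score first
		}
		return ranked[i].uid < ranked[j].uid // deterministic tie-break
	})
	if offset < 0 {
		offset = 0
	}
	if offset > len(ranked) {
		offset = len(ranked)
	}
	ranked = ranked[offset:]
	if first > 0 && first < len(ranked) {
		ranked = ranked[:first]
	}
	uids := make([]uint64, len(ranked))
	for i, r := range ranked {
		uids[i] = r.uid
	}
	sort.Slice(uids, func(i, j int) bool { return uids[i] < uids[j] })
	return uids
}

func main() {
	res := []scored{{9, 3.0}, {2, 1.0}, {5, 2.0}, {7, 0.5}}
	fmt.Println(topPage(res, 0, 2)) // the two best-scoring UIDs, ascending
}
```

Doing the steps in the opposite order (sort by UID, then paginate) would return the lowest UIDs rather than the best-scoring ones.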

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Replace the facet-based BM25 storage (~40-50 bytes/posting) with compact
varint-encoded binary blobs stored as direct Badger KV entries (~4-6
bytes/posting, ~10x reduction). Add bm25_score pseudo-predicate for
variable-based score ordering following the similar_to pattern.

- Add posting/bm25enc package for compact binary encode/decode
- Rewrite write path in posting/index.go for direct Badger KV
- Add bm25Writes buffer to LocalCache with read-your-own-writes
- Flush BM25 blobs in CommitToDisk with BitBM25Data UserMeta
- Rewrite read path in worker/task.go with direct blob decoding
- Add bm25_score pseudo-predicate in query/query.go
- Add score ordering integration tests
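
To illustrate why varint blobs are so much smaller than facets, here is a minimal sketch of one possible layout: an entry count, then (UID delta, TF) pairs, all as unsigned varints. The layout, the `posting` type, and the function names are assumptions for this example; the actual posting/bm25enc format may differ.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

// posting is a hypothetical (uid, term frequency) pair.
type posting struct {
	uid uint64
	tf  uint32
}

var errCorrupt = errors.New("bm25: corrupt posting blob")

// encode writes count, then (uid delta, tf) pairs as uvarints.
// ps must be sorted by uid ascending so that deltas stay small.
func encode(ps []posting) []byte {
	buf := binary.AppendUvarint(nil, uint64(len(ps)))
	var prev uint64
	for _, p := range ps {
		buf = binary.AppendUvarint(buf, p.uid-prev)
		buf = binary.AppendUvarint(buf, uint64(p.tf))
		prev = p.uid
	}
	return buf
}

// decode reverses encode and rejects truncated or implausible input.
func decode(buf []byte) ([]posting, error) {
	n, sz := binary.Uvarint(buf)
	if sz <= 0 {
		return nil, errCorrupt
	}
	buf = buf[sz:]
	// Every posting takes at least two bytes, which bounds the allocation.
	if n > uint64(len(buf))/2 {
		return nil, errCorrupt
	}
	out := make([]posting, 0, n)
	var prev uint64
	for i := uint64(0); i < n; i++ {
		delta, sz := binary.Uvarint(buf)
		if sz <= 0 {
			return nil, errCorrupt
		}
		buf = buf[sz:]
		tf, sz := binary.Uvarint(buf)
		if sz <= 0 {
			return nil, errCorrupt
		}
		buf = buf[sz:]
		prev += delta
		out = append(out, posting{uid: prev, tf: uint32(tf)})
	}
	return out, nil
}

func main() {
	blob := encode([]posting{{1, 1}, {2, 3}, {130, 2}})
	got, err := decode(blob)
	fmt.Println(len(blob), got, err)
}
```

Because the count sits at the front of the blob in this layout, a reader that only needs the number of entries can stop after the first varint.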

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

shaunpatterson and others added 9 commits March 4, 2026 23:51
…cases

Cover incremental add/update/delete, IDF score stability as corpus
grows, large corpus pagination, unicode, stopwords, uid filtering,
score validation, and concurrent batch adds.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…c tests

Addresses test coverage gaps identified during code review against ArangoDB's
BM25 implementation:

- TestBM25ExactScoreValues: validates numerical correctness of BM25 formula
  using b=0 to enable hand-computed expected scores
- TestBM25BM15NoLengthNormalization: verifies b=0 disables length normalization
  and contrasts with default b=0.75 behavior
- TestBM25SingleMatchingDocument: covers df=1 edge case with high IDF

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Phase 1 of BM25 scaling plan. Introduces bm25block package with:
- BlockMeta/Dir types for block directory encoding/decoding
- SplitIntoBlocks: splits monolithic entry slices into 128-entry blocks
- MergeAllBlocks: compacts overlapping blocks with dedup and tombstone removal
- ComputeUBPre/SuffixMaxUBPre: WAND upper-bound precomputation
- New key functions: BM25TermDirKey, BM25TermBlockKey, BM25DocLenDirKey,
  BM25DocLenBlockKey for block-addressed Badger KV storage

17 unit tests and benchmarks for the block storage format.
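
A minimal sketch of the two ideas behind this phase: cutting a sorted posting list into fixed-size blocks with per-block metadata, and precomputing a suffix maximum over per-block upper bounds. The types and lowercase function names below are hypothetical and simpler than the bm25block API; the suffix-max helper assumes non-negative bounds.

```go
package main

import "fmt"

const blockSize = 128 // target entries per block, as in the commit message

// entry and block are hypothetical stand-ins for the bm25block types.
type entry struct {
	uid uint64
	tf  uint32
}

type block struct {
	firstUID, lastUID uint64
	maxTF             uint32 // lets a reader bound the best score in this block
	entries           []entry
}

// splitIntoBlocks cuts a uid-sorted slice into blocks of at most blockSize.
func splitIntoBlocks(es []entry) []block {
	var blocks []block
	for start := 0; start < len(es); start += blockSize {
		end := start + blockSize
		if end > len(es) {
			end = len(es)
		}
		chunk := es[start:end]
		b := block{
			firstUID: chunk[0].uid,
			lastUID:  chunk[len(chunk)-1].uid,
			entries:  chunk,
		}
		for _, e := range chunk {
			if e.tf > b.maxTF {
				b.maxTF = e.tf
			}
		}
		blocks = append(blocks, b)
	}
	return blocks
}

// suffixMax returns s with s[i] = max(ub[i:]), so "best possible score
// from block i onward" is a single lookup. Assumes ub values are >= 0.
func suffixMax(ub []float64) []float64 {
	s := make([]float64, len(ub))
	best := 0.0
	for i := len(ub) - 1; i >= 0; i-- {
		if ub[i] > best {
			best = ub[i]
		}
		s[i] = best
	}
	return s
}

func main() {
	es := make([]entry, 300)
	for i := range es {
		es[i] = entry{uid: uint64(i + 1), tf: uint32(i%5 + 1)}
	}
	fmt.Println(len(splitIntoBlocks(es)), suffixMax([]float64{1, 3, 2}))
}
```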

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Phases 2-4 of BM25 scaling plan:

Phase 2 - Segmented mutation path:
- addBM25IndexMutations now writes to block-based storage
- Each term's postings split into ~128-entry blocks with a directory
- Blocks automatically split when exceeding 256 entries
- Doc-length list also uses block-based storage
- Block removal and directory cleanup on deletes

Phase 3 - WAND top-k query path:
- New bm25wand.go with listIter for block-based posting list iteration
- WAND algorithm with min-heap for top-k early termination
- Per-block upper bounds (UBPre) computed from maxTF at query time
- Suffix-max UBPre for efficient threshold checking
- Falls back to scoring all docs when no first: limit or offset is used

Phase 4 - Block-Max WAND:
- skipToWithBMW skips entire blocks whose UB + other terms can't beat theta
- Avoids Badger reads for blocks that can't contribute to top-k
- Enabled by default in handleBM25Search
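
The top-k structure that WAND relies on can be sketched with container/heap: a min-heap of the k best hits whose root score serves as the threshold (theta) a candidate must beat. The types `hit`, `minHeap`, and `topK` are hypothetical, and the tie-breaking rule (the lower UID wins on equal scores) is an assumption for this example rather than the rule used in bm25wand.go.

```go
package main

import (
	"container/heap"
	"fmt"
)

type hit struct {
	uid   uint64
	score float64
}

// worse reports whether a ranks below b: lower score, or on a tie the higher UID.
func worse(a, b hit) bool {
	if a.score != b.score {
		return a.score < b.score
	}
	return a.uid > b.uid
}

// minHeap keeps the worst retained hit at index 0.
type minHeap []hit

func (h minHeap) Len() int           { return len(h) }
func (h minHeap) Less(i, j int) bool { return worse(h[i], h[j]) }
func (h minHeap) Swap(i, j int)      { h[i], h[j] = h[j], h[i] }
func (h *minHeap) Push(x any)        { *h = append(*h, x.(hit)) }
func (h *minHeap) Pop() any {
	old := *h
	n := len(old)
	x := old[n-1]
	*h = old[:n-1]
	return x
}

type topK struct {
	k int
	h minHeap
}

// threshold is the score a new candidate must beat; 0 until the heap is full.
func (t *topK) threshold() float64 {
	if len(t.h) < t.k {
		return 0
	}
	return t.h[0].score
}

// offer inserts c if it ranks above the current worst hit and reports
// whether it was kept.
func (t *topK) offer(c hit) bool {
	if t.k <= 0 {
		return false
	}
	if len(t.h) < t.k {
		heap.Push(&t.h, c)
		return true
	}
	if worse(t.h[0], c) {
		t.h[0] = c
		heap.Fix(&t.h, 0) // restore heap order after replacing the root
		return true
	}
	return false
}

func main() {
	tk := &topK{k: 2}
	for _, c := range []hit{{1, 1.0}, {2, 3.0}, {3, 2.0}, {4, 0.5}} {
		tk.offer(c)
	}
	fmt.Println(tk.threshold(), tk.h)
}
```

Once the heap is full, any document or block whose score upper bound cannot exceed `threshold()` can be skipped without being scored, which is what makes early termination possible.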

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Phase 5 - Migration support:
- newListIter falls back to legacy monolithic blob when no block directory exists
- lookupDocLen falls back to legacy BM25DocLenKey blob
- wandSearch falls back to legacy BM25IndexKey for df computation
- Legacy data transparently served through synthetic single-block directory
- New writes always use block format; old data works until overwritten

Unit tests for WAND components:
- TestTopKHeapBasic: heap operations, threshold, eviction
- TestTopKHeapTieBreaking: deterministic ordering on score ties
- TestBm25ScoreFunction: formula verification, tf/dl/b edge cases
- TestBm25ScoreNaN: no NaN/Inf for edge-case inputs

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Fixes critical bugs and performance issues identified by GPT-5 review:

- Fix negative inBlockPos panic: guard currentDoc/currentTF/skipTo against
  inBlockPos < 0 (possible before first next() call)
- Fix empty block pathological behavior: next()/skipTo()/skipToWithBMW() now
  skip empty blocks instead of leaving iterator in invalid state with
  MaxUint64 pivotDoc
- Fix legacy loadBlock: no longer resets inBlockPos to 0 (was moving pointer
  backwards, could cause re-scoring or infinite loops)
- Fix remainingUB panic: guard against blockIdx < 0 (before first next())
- Add docLenCache: caches doclen directory + block reads within a single
  query, avoiding repeated Badger reads per scored document
- Optimize BMW otherUB: compute as sumUB - thisUB (O(1)) instead of
  iterating all other terms (O(q^2) -> O(q))

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…UB underestimate

Three fixes:
1. CRITICAL: addBM25IndexMutations now checks if a UID already exists in doclen
   blocks before incrementing stats, preventing double-counting on SET when the
   document was already indexed (defensive guard for batch mutations).
2. HIGH: WAND sumUB now accumulates across ALL iterators (not just up to pivot),
   so BMW's otherUB calculation is correct and won't skip valid candidate blocks.
3. PERF: newListIter accepts pre-read Dir to eliminate duplicate Badger reads
   (directory was read once for df, then again inside newListIter).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…ength

Defensive hardening from GPT-5 review: if inBlockPos exceeds block length
after next() reaches end of block, the sort.Search span could go negative.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

- Add DecodeCount() to bm25enc for O(1) entry count reads without
  full decode, preventing OOM on legacy migration with large posting
  lists (e.g., common terms with millions of entries)
- Use DecodeCount in WAND search legacy DF calculation path
- Fix integer overflow in DecodeDir bounds check by using uint64
  arithmetic (prevents panic on corrupted data with MaxUint32 count)
- Pre-allocate shared score buffer in handleBM25Search with
  three-index slices to prevent accidental append corruption
- Document bm25Writes concurrency model and limitations
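
Two of the hardening techniques above can be illustrated in isolation. The sketch below shows a bounds check done in uint64 so that a corrupted 32-bit count cannot wrap around, and full (three-index) slice expressions that cap each sub-slice so a later append cannot overwrite its neighbour in a shared buffer. `metaSize`, `dirFits`, and `carve` are hypothetical names and values, not the ones used in DecodeDir or handleBM25Search.

```go
package main

import "fmt"

// metaSize is a hypothetical fixed size in bytes of one directory record.
const metaSize = 16

// dirFits reports whether count records fit in payloadLen bytes.
// Widening to uint64 before multiplying avoids 32-bit wraparound:
// in uint32 arithmetic, (1<<28)*16 would wrap to 0 and pass the check.
func dirFits(count uint32, payloadLen int) bool {
	return uint64(count)*metaSize <= uint64(payloadLen)
}

// carve splits buf into consecutive sub-slices of the given sizes.
// The third index sets cap == len, so append on one part reallocates
// instead of writing into the next part's region.
// It panics if the sizes add up to more than len(buf).
func carve(buf []float64, sizes []int) [][]float64 {
	out := make([][]float64, len(sizes))
	off := 0
	for i, n := range sizes {
		out[i] = buf[off : off+n : off+n]
		off += n
	}
	return out
}

func main() {
	fmt.Println(dirFits(4, 64), dirFits(1<<28, 64))

	buf := make([]float64, 4)
	parts := carve(buf, []int{2, 2})
	parts[0] = append(parts[0], 9) // reallocates; buf[2] is left untouched
	fmt.Println(buf, cap(parts[1]))
}
```

With a plain two-index slice `buf[off:off+n]`, the same append would have written 9 into `buf[2]`, silently corrupting the second part.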

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>