feat(grep): integrate VikingDB bm25 keyword search for grep engine by ByteDanceLiuYang · Pull Request #2144 · volcengine/OpenViking

ByteDanceLiuYang · 2026-05-20T08:36:14Z

Summary

The existing grep performs a full filesystem traversal: it walks the directory tree, reads every file, and applies the regex line by line. This becomes prohibitively slow on large codebases (e.g., tens of thousands of files, hundreds of MB), where a single grep can take minutes.

This PR introduces a two-phase grep strategy: use VikingDB search_by_keywords (bm25 mode) as a coarse-grained recall filter to narrow down candidate files, then perform precise local regex matching only on those candidates. The engine is configurable per-request (auto / fs). In auto mode, the system adaptively switches between pure fs and vikingdb+fs based on collection size and schema compatibility, with automatic fallback on any failure.
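
The two-phase flow described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the PR's implementation: `grep_in_files`, `grep_recall_then_fs`, the in-memory `files` dict (standing in for the filesystem), and the `recall` callable (standing in for the VikingDB bm25 call) are all hypothetical.

```python
import re
from typing import Callable, Dict, Iterable, List, Tuple

Match = Tuple[str, int, str]  # (uri, 1-based line number, line text)


def grep_in_files(pattern: str, files: Dict[str, str], uris: Iterable[str]) -> List[Match]:
    """Phase 2: precise regex match, line by line, over the given URIs only."""
    regex = re.compile(pattern)
    hits: List[Match] = []
    for uri in uris:
        for lineno, line in enumerate(files[uri].splitlines(), start=1):
            if regex.search(line):
                hits.append((uri, lineno, line))
    return hits


def grep_recall_then_fs(
    pattern: str,
    files: Dict[str, str],
    recall: Callable[[str], List[str]],
) -> List[Match]:
    """Phase 1: coarse recall of candidate URIs; then phase 2 on candidates.

    Assumption: `recall` raises on backend failure, in which case we fall
    back to scanning every file. An empty recall is treated as "no match".
    """
    try:
        candidates = recall(pattern)
    except Exception:
        return grep_in_files(pattern, files, files)  # full-scan fallback
    known = [uri for uri in candidates if uri in files]
    return grep_in_files(pattern, files, known)
```

The recall phase only has to be a superset filter: false positives are discarded by the local regex pass, so the quality of the final result depends on recall completeness rather than recall precision.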

Type of Change

Bug fix (non-breaking change that fixes an issue)
New feature (non-breaking change that adds functionality)
Breaking change (fix or feature that would cause existing functionality to not work as expected)
Documentation update
Refactoring (no functional changes)
Performance improvement
Test update

Feature Usage

API & Client

GrepRequest params: engine: Literal["auto", "fs"] = "auto", switch_to_remote_threshold: int = 1000, remote_return_limit: int = 100
All params threaded through FSService, all Python clients (base, http, sync_http, sync_client, async_client, local)
Rust CLI: --engine, --switch-to-remote-threshold, --remote-return-limit flags on ov grep command

API Parameters Added (in `grep` API)

| Parameter | Type | Default | Constraints | Description |
|---|---|---|---|---|
| `engine` | `Literal["auto", "fs"]` | `"auto"` | none | Search engine mode |
| `switch_to_remote_threshold` | `int` | `1000` | ≥0 | L2 record count threshold; 0 = always vikingdb |
| `remote_return_limit` | `int` | `100` | 1–100000 | Max files recalled by vikingdb bm25 |

These parameters are passed per-request on the grep API endpoint, alongside existing pattern, uri, etc. The count_cache_ttl is hardcoded to 60 seconds (not configurable).
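
As an illustration of the defaults and constraints listed above, here is a small sketch using a plain dataclass. `GrepParams` is a hypothetical name; the actual `GrepRequest` model in the PR may be defined differently (for example with a validation library).

```python
from dataclasses import dataclass
from typing import Literal


@dataclass(frozen=True)
class GrepParams:
    """Hypothetical container mirroring the three new grep parameters."""

    engine: Literal["auto", "fs"] = "auto"
    switch_to_remote_threshold: int = 1000  # 0 means "always use vikingdb"
    remote_return_limit: int = 100

    def __post_init__(self) -> None:
        # Enforce the constraints from the parameter table at construction time.
        if self.engine not in ("auto", "fs"):
            raise ValueError(f"engine must be 'auto' or 'fs', got {self.engine!r}")
        if self.switch_to_remote_threshold < 0:
            raise ValueError("switch_to_remote_threshold must be >= 0")
        if not 1 <= self.remote_return_limit <= 100_000:
            raise ValueError("remote_return_limit must be between 1 and 100000")
```

For example, `GrepParams(switch_to_remote_threshold=0, remote_return_limit=500)` corresponds to usage example 5 below, while `GrepParams(remote_return_limit=0)` raises `ValueError`.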

Usage Example

1. Configure `ov.conf` for VikingDB backend

The storage.vectordb section must use volcengine or vikingdb backend to enable bm25 recall. Example:

{
  "storage": {
    // ...
    "vectordb": {
      "backend": "volcengine",
      "volcengine": {
        "ak": "YOUR_AK",
        "sk": "YOUR_SK",
        "region": "cn-beijing"
      },
      "name": "my_collection_for_ov",
      "index_name": "my_index_1"
    }
  }
}

Note: The collection must be created by OV >= v0.3.18 (with content field + FullText config). Existing collections from older versions will automatically fall back to fs mode.

2. Basic grep (auto mode, default)

ov --account default --user default grep --uri viking://resources/code 'VikingDB'

This uses engine=auto by default. If the collection has ≥1000 L2 records and supports FullText, grep uses vikingdb bm25 recall followed by an fs precise match; otherwise it falls back to pure fs.

3. Force filesystem grep

ov --account default --user default grep --uri viking://resources/code --engine fs 'VikingDB'

4. Always use vikingdb (threshold=0)

ov --account default --user default grep --uri viking://resources/code \
  --switch-to-remote-threshold 0 'VikingDB'

5. Increase vikingdb recall limit

ov --account default --user default grep --uri viking://resources/code \
  --switch-to-remote-threshold 0 --remote-return-limit 500 'VikingDB'

This recalls up to 500 candidate files from vikingdb bm25 before doing local regex matching.

Changes Made

1. Grep Engine Modes

| Mode | Behavior |
|---|---|
| `auto` (default) | Adaptive: checks vector_store availability, backend type, schema compatibility (content + FullText), and L2 record count. If count ≥ `switch_to_remote_threshold`, uses vikingdb recall + fs match; otherwise falls back to pure fs. |
| `fs` | Forces traditional filesystem grep (original behavior). |

Auto mode decision chain:

vector_store available? → no → fs
backend supports vikingdb? → no → fs
collection has content field + FullText config? → no → fs (with warning)
L2 count ≥ switch_to_remote_threshold? → yes → vikingdb recall + fs; no → fs
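
The decision chain above can be written as a small pure function. This is only a sketch: `CollectionState`, `resolve_grep_engine`, and the `"vikingdb+fs"` return label are hypothetical, and the set of remote-capable backends is inferred from the `ov.conf` note in this description.

```python
import logging
from dataclasses import dataclass

logger = logging.getLogger(__name__)

# Assumption: these are the backend names that enable bm25 recall.
REMOTE_BACKENDS = frozenset({"volcengine", "vikingdb"})


@dataclass(frozen=True)
class CollectionState:
    """Hypothetical snapshot of the facts the auto mode inspects."""

    vector_store_available: bool
    backend: str
    has_content_field: bool
    has_fulltext_config: bool
    l2_count: int


def resolve_grep_engine(
    engine: str,
    state: CollectionState,
    switch_to_remote_threshold: int = 1000,
) -> str:
    """Return "fs" or "vikingdb+fs", following the chain step by step."""
    if engine == "fs":
        return "fs"
    if not state.vector_store_available:
        return "fs"
    if state.backend not in REMOTE_BACKENDS:
        return "fs"
    if not (state.has_content_field and state.has_fulltext_config):
        logger.warning("collection lacks content field or FullText config; using fs")
        return "fs"
    if state.l2_count >= switch_to_remote_threshold:
        return "vikingdb+fs"
    return "fs"
```

With a threshold of 0 the last comparison is always true, which matches the "0 = always vikingdb" note in the parameter table (provided the earlier checks pass).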

2. Collection Interface Layer

search_by_keywords gains mode: Optional[str] and fields: Optional[List[str]] params across all 6 collection implementations (vikingdb, volcengine, volcengine_api_key, http, local, mock)
None values are filtered before sending to API
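
Filtering out `None` values before the request is sent might look like the sketch below. The function name and the payload keys (`keywords`, `limit`, `mode`, `fields`) are illustrative assumptions, not the actual VikingDB request schema.

```python
from typing import Any, Dict, List, Optional


def build_keyword_search_payload(
    keywords: List[str],
    limit: int,
    mode: Optional[str] = None,
    fields: Optional[List[str]] = None,
) -> Dict[str, Any]:
    """Build a request body, omitting optional params the caller did not set."""
    payload = {"keywords": keywords, "limit": limit, "mode": mode, "fields": fields}
    # Drop unset optionals so the server applies its own defaults.
    return {key: value for key, value in payload.items() if value is not None}
```

Omitting the keys entirely, rather than sending explicit nulls, lets the backend fall back to its default behavior for those parameters.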

3. Schema & Config

New content text field in context collection schema for FullText indexing
FullText config: [{"Field": "content", "Analyzer": {"Tokenizer": "standard"}}]
schema_version: "0.3.18" added to collection embedding metadata for version-aware compatibility checks
Startup compatibility check: warns (does not block) if old collection lacks content/FullText
API parameters: engine (Literal["auto", "fs"]), switch_to_remote_threshold (int, default=1000, ≥0), remote_return_limit (int, default=100, 1–100000); count_cache_ttl hardcoded to 60s

4. Data Pipeline

embedding_msg_converter.py writes vectorization_text[:65536] to content field

5. Business Logic (`viking_fs.py`)

grep() refactored with engine dispatch: _resolve_grep_engine() → _grep_fs() or _grep_vikingdb_then_fs()
_grep_vikingdb_then_fs(): bm25 recall → _grep_in_files() precise regex match; auto-fallback to fs on vikingdb errors
_get_cached_count(): per-URI count cache with hardcoded TTL=60s
_collection_has_fulltext(): checks content field + FullText config in collection metadata
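
A per-URI count cache with a fixed TTL, as mentioned for `_get_cached_count()`, could be sketched like this. `CountCache`, its `max_size` default, and the injectable `clock` are assumptions made for illustration; only the 60-second TTL comes from this description.

```python
import time
from typing import Callable, Dict, Optional, Tuple


class CountCache:
    """Hypothetical TTL cache mapping a URI to its L2 record count."""

    def __init__(
        self,
        ttl: float = 60.0,
        max_size: int = 1024,  # assumed bound, not taken from the PR
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._ttl = ttl
        self._max_size = max_size
        self._clock = clock
        self._entries: Dict[str, Tuple[int, float]] = {}

    def get(self, uri: str) -> Optional[int]:
        """Return the cached count, or None if missing or expired."""
        entry = self._entries.get(uri)
        if entry is None:
            return None
        count, stored_at = entry
        if self._clock() - stored_at >= self._ttl:
            del self._entries[uri]
            return None
        return count

    def put(self, uri: str, count: int) -> None:
        """Store a count, evicting the oldest half when the cache is full."""
        if uri not in self._entries and len(self._entries) >= self._max_size:
            oldest = sorted(self._entries, key=lambda k: self._entries[k][1])
            for key in oldest[: max(1, len(oldest) // 2)]:
                del self._entries[key]
        self._entries[uri] = (count, self._clock())
```

Passing the clock in as a parameter keeps expiry testable without sleeping; in an async service, the check-then-evict sequence in `put` has no `await` inside it, so it runs without interleaving on a single event loop.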

6. Backend & Adapter

CollectionAdapter.search_by_keywords() delegates to collection, normalizes records
VikingVectorIndexBackend.search_by_keywords() and get_collection_meta() async methods

7. User-Agent Header

All VikingDB HTTP requests now include a User-Agent header with format openviking/{version} (e.g., openviking/0.3.18). This helps VikingDB server-side identify request sources for troubleshooting and traffic analytics.
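
For illustration, attaching a header in the `openviking/{version}` format to an outgoing request could look like the following standard-library sketch. `build_request` is a hypothetical helper and the URL is a placeholder; the PR's actual HTTP client code is not shown here.

```python
import urllib.request


def build_request(url: str, version: str) -> urllib.request.Request:
    """Create a request object carrying the openviking/{version} User-Agent."""
    return urllib.request.Request(url, headers={"User-Agent": f"openviking/{version}"})


# Constructing the request does not perform any network I/O.
request = build_request("https://example.invalid/api", "0.3.18")
```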

8. Schema Compatibility

Existing collections created before v0.3.18 will not have the content field or FullText config. On startup, init_context_collection() detects this via schema_version in the Description metadata and logs a warning. The grep engine=auto path automatically falls back to fs mode for such collections. To enable vikingdb-based grep, the collection must be recreated.
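
The compatibility check described above might be sketched as follows. The flat `meta` dict and its `schema_version`, `Fields`, `FieldName`, and `FullText` keys are assumptions about the metadata layout (the description only states that `schema_version` lives in the Description metadata), and both function names are hypothetical.

```python
import re
from typing import Any, Dict, Optional, Tuple

MIN_SCHEMA_VERSION = (0, 3, 18)  # first version with content + FullText, per this PR


def parse_release(version: str) -> Optional[Tuple[int, ...]]:
    """Extract the leading X.Y.Z release, ignoring any trailing suffix."""
    match = re.match(r"v?(\d+)\.(\d+)\.(\d+)", version.strip())
    if match is None:
        return None
    return tuple(int(part) for part in match.groups())


def collection_supports_bm25(meta: Dict[str, Any]) -> bool:
    """True if the collection is new enough and indexes `content` for FullText."""
    release = parse_release(str(meta.get("schema_version", "")))
    if release is None or release < MIN_SCHEMA_VERSION:
        return False
    field_names = {field.get("FieldName") for field in meta.get("Fields", [])}
    fulltext_fields = {cfg.get("Field") for cfg in meta.get("FullText", [])}
    return "content" in field_names and "content" in fulltext_fields
```

Comparing parsed integer tuples avoids the string-comparison pitfall where `"0.3.9"` would sort after `"0.3.18"`.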

Testing

113/113 storage tests pass
Updated test_init_context_collection_warns_on_mismatched_nonempty_collection — reflects current behavior (warn + return False instead of raising)
Updated test_context_collection_contains_content_field_for_fulltext — validates content field and FullText config presence
I have added tests that prove my fix is effective or that my feature works
New and existing unit tests pass locally with my changes
I have tested this on the following platforms:
- Linux
- macOS
- Windows

Benchmark

1. Step-by-step workflow

Benchmark scripts are located at benchmark/retrieval/grep/vikingdb_bm25/:

# Step 1: Generate 80K test files (~4GB)
python3 step1_generate.py

# Step 2: Quick upload (skip VLM+embedding, fast)
python3 step2_quick_add_resource.py

# Step 3: Build index (embedding for bm25)
python3 step3_build_index.py

# Step 4: Benchmark
python3 step4_benchmark.py

2. Results

Environment: Debian 10, 12c24m
Total Data: 80,000 files, ~4GB, 4-level directory tree

| Scenario | fs (ms) | auto (ms) | Speedup |
|---|---|---|---|
| Single keyword match | 10,857 | 1,093 | 9.9x |
| Single keyword match | 10,059 | 1,126 | 8.9x |
| 2-keyword match | 6,311 | 1,030 | 6.1x |
| 3-keyword match | 4,539 | 1,250 | 3.6x |
| Rare keyword match | 10,592 | 1,016 | 10.4x |
| No match - 1 keyword | 17,064 | 969 | 17.6x |
| No match - 2 keywords | 17,729 | 1,023 | 17.3x |
| No match - 3 keywords | 17,844 | 1,195 | 14.9x |
| Subdir match (~8K files) | 3,420 | 554 | 6.2x |
| Subdir no match (~8K files) | 3,376 | 602 | 5.6x |

Key Findings:

**Speedup ranges from 3.6x to 17.6x.** No-match scenarios benefit the most: fs must traverse all files to confirm zero results, while bm25 returns an empty result immediately.
**fs latency grows with file count**, from 3.4s (8K files) to 17.8s (80K files). Multi-keyword regex alternation (`|`) matches faster in fs because it hits more files early.
**auto latency stays within 0.55 to 1.25s across all scenarios.** The variance mainly comes from the second-stage local regex filtering on recalled candidates, not the bm25 recall phase itself.

github-actions · 2026-05-20T08:38:14Z

PR Reviewer Guide 🔍

(Review updated until commit `fbf4fea`)

Here are some key observations to aid the review process:

⏱️ Estimated effort to review: 4 🔵🔵🔵🔵⚪
🏅 Score: 80
🧪 No relevant tests
🔒 No security concerns identified
✅ No TODO sections
🔀 Multiple PR themes

Sub-PR theme: Grep Engine Integration (Core + CLI + Tests)
Relevant files:
- openviking/storage/viking_fs.py
- openviking/storage/viking_vector_index_backend.py
- openviking/storage/collection_schemas.py
- openviking/server/routers/search.py
- openviking/service/fs_service.py
- openviking/async_client.py
- openviking/client/local.py
- openviking/sync_client.py
- openviking_cli/client/sync_http.py
- openviking_cli/client/http.py
- openviking_cli/client/base.py
- openviking/storage/queuefs/embedding_msg_converter.py
- openviking/storage/ovpack/vectors.py
- crates/ov_cli/src/main.rs
- crates/ov_cli/src/client.rs
- crates/ov_cli/src/handlers.rs
- crates/ov_cli/src/commands/search.rs
- tests/storage/test_rebuild_schema.py
- tests/storage/test_collection_schemas.py

Sub-PR theme: Grep BM25 Benchmark Scripts
Relevant files:
- benchmark/retrieval/grep/vikingdb_bm25/step1_generate.py
- benchmark/retrieval/grep/vikingdb_bm25/step2_quick_add_resource.py
- benchmark/retrieval/grep/vikingdb_bm25/step3_build_index.py
- benchmark/retrieval/grep/vikingdb_bm25/step4_benchmark.py

Sub-PR theme: Vector DB Search by Keywords Extensions
Relevant files:
- openviking/storage/vectordb/collection/http_collection.py
- openviking/storage/vectordb/collection/collection.py
- openviking/storage/vectordb/collection/volcengine_collection.py
- openviking/storage/vectordb/collection/vikingdb_collection.py
- openviking/storage/vectordb/collection/volcengine_api_key_collection.py
- openviking/storage/vectordb/collection/local_collection.py
- openviking/storage/vectordb/collection/volcengine_clients.py
- openviking/storage/vectordb/collection/vikingdb_clients.py
- openviking/storage/vectordb_adapters/base.py
- openviking/storage/vectordb/utils/validation.py
- openviking/storage/vectordb/service/app_models.py
- tests/storage/mock_backend.py

⚡ Recommended focus areas for review

Naive Regex Splitting for BM25 Keywords
The code splits the regex pattern on '\|' to extract keywords for BM25 search. This will fail for complex regex patterns with groups, character classes, or escaped '\|' (e.g., 'error\|(warning\|fail)', 'a\|b', '[a\|b]'), leading to incorrect keywords and potential fallback to fs mode unnecessarily.

    # for bm25 search. Limit to 10 keywords per VikingDB API constraint.
    keywords = [kw.strip() for kw in pattern.split("\|") if kw.strip()][:10]

Count Cache Eviction Not Atomic
The _count_cache eviction logic (checking size then deleting keys) is not atomic in an async context. Multiple concurrent calls could lead to unexpected cache behavior, though the impact is low since it's a cache.

    if len(self._count_cache) >= self._count_cache_max_size:
        oldest_keys = sorted(self._count_cache, key=lambda k: self._count_cache[k][1])
        for k in oldest_keys[:len(oldest_keys) // 2]:
            del self._count_cache[k]
    self._count_cache[cache_key] = (count, now)

Broad Exception Swallowing in search_by_keywords
The search_by_keywords method catches all exceptions and returns an empty list, which could hide errors. Consider logging the exception and re-raising or falling back more explicitly.

    except Exception as e:
        logger.error("Error searching by keywords: %s", e)
        return []
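
One way to address the first focus area is to split only on alternation bars that are at the top level of the pattern. The sketch below is an assumption-laden illustration, not code from the PR: `split_top_level_alternation` is a hypothetical helper, it assumes Python `re` syntax where a bare `|` is alternation, and it does not handle a literal `]` placed first inside a character class.

```python
from typing import List


def split_top_level_alternation(pattern: str) -> List[str]:
    """Split a regex on '|' only outside groups, classes, and escapes."""
    parts: List[str] = []
    buf: List[str] = []
    depth = 0          # current parenthesis nesting level
    in_class = False   # inside a [...] character class
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "\\" and i + 1 < len(pattern):
            buf.append(pattern[i : i + 2])  # keep the escape pair intact
            i += 2
            continue
        if in_class:
            if ch == "]":
                in_class = False
        elif ch == "[":
            in_class = True
        elif ch == "(":
            depth += 1
        elif ch == ")":
            depth = max(depth - 1, 0)
        elif ch == "|" and depth == 0:
            parts.append("".join(buf))  # top-level bar: close this branch
            buf = []
            i += 1
            continue
        buf.append(ch)
        i += 1
    parts.append("".join(buf))
    return parts
```

Branches that still contain regex metacharacters after splitting (such as a nested group) are not plain keywords, so a caller would likely want to detect them and fall back to fs rather than send them to bm25.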

github-actions · 2026-05-20T08:41:02Z

PR Code Suggestions ✨

No code suggestions found for the PR.

github-actions · 2026-05-25T04:03:27Z

Persistent review updated to latest commit fbf4fea

github-actions · 2026-05-25T04:05:43Z

PR Code Suggestions ✨

No code suggestions found for the PR.

MaojiaSheng · 2026-05-25T13:03:26Z

    node_limit: i32,
    level_limit: i32,
+    engine: Option<String>,
+    switch_to_remote_threshold: Option<i32>,


Since these parameters are rarely used, consider putting them in ovcli.conf instead of exposing them as flags.

OK, adjusted.

MaojiaSheng · 2026-05-25T13:04:20Z

        offset: int = 0,
        filters: Optional[Dict[str, Any]] = None,
        output_fields: Optional[List[str]] = None,
+        mode: Optional[str] = None,


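The design of these two parameters still needs more thought.

OK, I have removed them for now. In OpenViking these two parameters do not actually need to be passed at present; the default behavior applies.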

github-project-automation Bot added this to OpenViking project May 20, 2026

github-project-automation Bot moved this to Backlog in OpenViking project May 20, 2026

github-actions Bot added the Review effort 3/5 label May 20, 2026

qin-ctx requested a review from zhoujh01 May 20, 2026 08:57

ByteDanceLiuYang marked this pull request as draft May 20, 2026 09:36

ByteDanceLiuYang force-pushed the grep_vikingdb branch 8 times, most recently from 92329f5 to fbf4fea Compare May 23, 2026 15:18

ByteDanceLiuYang marked this pull request as ready for review May 25, 2026 04:01

github-actions Bot added Review effort 4/5 and removed Review effort 3/5 labels May 25, 2026

ByteDanceLiuYang force-pushed the grep_vikingdb branch 6 times, most recently from 0672236 to 4d21a96 Compare May 25, 2026 07:59

ByteDanceLiuYang marked this pull request as draft May 25, 2026 12:51

MaojiaSheng reviewed May 25, 2026

feat(grep): integrate VikingDB bm25 keyword search for grep engine

a75798e

ByteDanceLiuYang added 9 commits May 26, 2026 17:35

fix(grep): address CI review feedback: max-size eviction to _count_ca…

89e911d

…che, use Literal, Split regex alternation into individual keywords for bm25 (max 10)

fix(schema): use dynamic __version__ for schema_version and handle de…

2b4c24c

…v suffixes in version comparison

fix(schema): upsert data to vikingdb lack of content

c69ddda

chore: add benchmark for retrieval

cfa4364

fix(grep): vikingdb return 200 and no results means no matching conte…

fe6c38b

…nt, not necessary to fallback to local fs

fix(benchmark): sub uri args; add report

90f7c24

refactor: code format by ruff

4ffe499

optimize: move grep config (engine and switch_to_remote_threshold) to…

ca53e7e

… ov.conf

optimize: auto adapt remote_return_limit by agg API; rm unnecessary p…

b80738d

…arams in keywords search

ByteDanceLiuYang force-pushed the grep_vikingdb branch from faaaf83 to b80738d Compare May 26, 2026 09:35

feat(grep): integrate VikingDB bm25 keyword search for grep engine #2144
ByteDanceLiuYang wants to merge 10 commits into volcengine:main from ByteDanceLiuYang:grep_vikingdb

2 participants
