feat(grep): integrate VikingDB bm25 keyword search for grep engine#2144

Draft
ByteDanceLiuYang wants to merge 10 commits into
volcengine:mainfrom
ByteDanceLiuYang:grep_vikingdb
ByteDanceLiuYang commented May 20, 2026

Summary

The existing grep performs a full filesystem traversal: it walks the directory tree, reads every file, and applies the regex line by line. This becomes prohibitively slow on large codebases (for example, tens of thousands of files totalling hundreds of MB), where a single grep can take minutes.

This PR introduces a two-phase grep strategy: use VikingDB search_by_keywords (bm25 mode) as a coarse-grained recall filter to narrow down candidate files, then perform precise local regex matching only on those candidates. The engine is configurable per-request (auto / fs). In auto mode, the system adaptively switches between pure fs and vikingdb+fs based on collection size and schema compatibility, with automatic fallback on any failure.
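
The two-phase idea can be illustrated with a minimal, self-contained sketch. It is not the PR's code: `two_phase_grep`, the in-memory `files` mapping (standing in for the filesystem), and the `recall` callable (standing in for the VikingDB bm25 call) are all hypothetical names chosen for illustration.

```python
import re
from typing import Callable, Dict, Iterable, List, Tuple


def two_phase_grep(
    pattern: str,
    files: Dict[str, str],
    recall: Callable[[str], Iterable[str]],
) -> List[Tuple[str, int, str]]:
    """Return (path, line_number, line) for each regex match in the candidate files."""
    regex = re.compile(pattern)
    try:
        # Phase 1: coarse recall narrows the search to a few candidate paths.
        candidates = list(recall(pattern))
    except Exception:
        # Assumed fallback: if recall fails, scan every file (pure fs behavior).
        candidates = list(files)

    hits: List[Tuple[str, int, str]] = []
    for path in candidates:
        text = files.get(path)
        if text is None:
            continue  # the index may mention a file that no longer exists
        # Phase 2: precise line-by-line regex match on candidates only.
        for lineno, line in enumerate(text.splitlines(), start=1):
            if regex.search(line):
                hits.append((path, lineno, line))
    return hits
```

The recall step only needs to be a superset filter that rarely misses relevant files; the regex pass decides the final result.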

Type of Change

  • Bug fix (non-breaking change that fixes an issue)
  • New feature (non-breaking change that adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation update
  • Refactoring (no functional changes)
  • Performance improvement
  • Test update

Feature Usage

API & Client

  • GrepRequest params: engine: Literal["auto", "fs"] = "auto", switch_to_remote_threshold: int = 1000, remote_return_limit: int = 100
  • All params threaded through FSService, all Python clients (base, http, sync_http, sync_client, async_client, local)
  • Rust CLI: --engine, --switch-to-remote-threshold, --remote-return-limit flags on ov grep command

API Parameters Added (grep API)

| Parameter | Type | Default | Constraints | Description |
|---|---|---|---|---|
| engine | Literal["auto", "fs"] | "auto" | - | Search engine mode |
| switch_to_remote_threshold | int | 1000 | ≥0 | L2 record count threshold; 0 = always vikingdb |
| remote_return_limit | int | 100 | 1–100000 | Max files recalled by vikingdb bm25 |

These parameters are passed per-request on the grep API endpoint, alongside existing pattern, uri, etc. The count_cache_ttl is hardcoded to 60 seconds (not configurable).
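
As a rough illustration of the constraints in the table above, the following sketch validates the three parameters with a plain dataclass. The class name `GrepOptions` is hypothetical; the PR's actual `GrepRequest` model may validate these fields differently.

```python
from dataclasses import dataclass

VALID_ENGINES = ("auto", "fs")


@dataclass
class GrepOptions:
    """Hypothetical container mirroring the documented defaults and constraints."""

    engine: str = "auto"
    switch_to_remote_threshold: int = 1000
    remote_return_limit: int = 100

    def __post_init__(self) -> None:
        if self.engine not in VALID_ENGINES:
            raise ValueError(f"engine must be one of {VALID_ENGINES}")
        if self.switch_to_remote_threshold < 0:
            raise ValueError("switch_to_remote_threshold must be >= 0")
        if not 1 <= self.remote_return_limit <= 100000:
            raise ValueError("remote_return_limit must be in 1..100000")
```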

Usage Example

1. Configure ov.conf for VikingDB backend

The storage.vectordb section must use volcengine or vikingdb backend to enable bm25 recall. Example:

{
  "storage": {
    // ...
    "vectordb": {
      "backend": "volcengine",
      "volcengine": {
        "ak": "YOUR_AK",
        "sk": "YOUR_SK",
        "region": "cn-beijing"
      },
      "name": "my_collection_for_ov",
      "index_name": "my_index_1"
    }
  }
}

Note: The collection must be created by OV >= v0.3.18 (with content field + FullText config). Existing collections from older versions will automatically fall back to fs mode.

2. Basic grep (auto mode, default)

ov --account default --user default grep --uri viking://resources/code 'VikingDB'

This uses engine=auto by default. If the collection has ≥1000 L2 records and supports FullText, it uses vikingdb bm25 recall + fs precise match; otherwise it falls back to pure fs.

3. Force filesystem grep

ov --account default --user default grep --uri viking://resources/code --engine fs 'VikingDB'

4. Always use vikingdb (threshold=0)

ov --account default --user default grep --uri viking://resources/code \
  --switch-to-remote-threshold 0 'VikingDB'

5. Increase vikingdb recall limit

ov --account default --user default grep --uri viking://resources/code \
  --switch-to-remote-threshold 0 --remote-return-limit 500 'VikingDB'

This recalls up to 500 candidate files from vikingdb bm25 before doing local regex matching.

Changes Made

1. Grep Engine Modes

| Mode | Behavior |
|---|---|
| auto (default) | Adaptive: checks vector_store availability, backend type, schema compatibility (content + FullText), and L2 record count. If count ≥ switch_to_remote_threshold, uses vikingdb recall + fs match; otherwise falls back to pure fs. |
| fs | Forces traditional filesystem grep (original behavior). |

Auto mode decision chain:

  1. vector_store available? → no → fs
  2. backend supports vikingdb? → no → fs
  3. collection has content field + FullText config? → no → fs (with warning)
  4. L2 count ≥ switch_to_remote_threshold? → yes → vikingdb recall + fs; no → fs
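
The decision chain above could be sketched as a single function. This is an illustrative reading of the four steps, not the PR's `_resolve_grep_engine()`; the `CollectionState` dataclass and the return labels `"fs"` and `"vikingdb+fs"` are assumptions.

```python
import logging
from dataclasses import dataclass

logger = logging.getLogger(__name__)


@dataclass
class CollectionState:
    """Hypothetical snapshot of the facts the auto mode inspects."""

    vector_store_available: bool
    backend_supports_vikingdb: bool
    has_fulltext: bool  # content field + FullText config present
    l2_count: int


def resolve_grep_engine(engine: str, state: CollectionState, threshold: int = 1000) -> str:
    """Return "fs" or "vikingdb+fs" following the documented decision chain."""
    if engine == "fs":
        return "fs"  # explicit request for the original behavior
    if not state.vector_store_available:
        return "fs"
    if not state.backend_supports_vikingdb:
        return "fs"
    if not state.has_fulltext:
        logger.warning("collection lacks content/FullText; falling back to fs")
        return "fs"
    # threshold == 0 means any count qualifies, i.e. always use vikingdb recall.
    return "vikingdb+fs" if state.l2_count >= threshold else "fs"
```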

2. Collection Interface Layer

  • search_by_keywords gains mode: Optional[str] and fields: Optional[List[str]] params across all 6 collection implementations (vikingdb, volcengine, volcengine_api_key, http, local, mock)
  • None values are filtered before sending to API
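
Filtering None values before a request is sent can be as small as one dict comprehension. The helper name `build_keyword_search_payload` and the parameter names in the usage line are hypothetical; the real payload keys are defined by the VikingDB API.

```python
from typing import Any, Dict


def build_keyword_search_payload(**params: Any) -> Dict[str, Any]:
    """Drop parameters whose value is None so the server applies its own defaults."""
    # Only None is removed; falsy values such as 0 or [] are kept on purpose.
    return {key: value for key, value in params.items() if value is not None}


payload = build_keyword_search_payload(keywords=["VikingDB"], limit=100, mode=None, fields=None)
```

Here `payload` contains only `keywords` and `limit`, so optional arguments left unset never reach the API.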

3. Schema & Config

  • New content text field in context collection schema for FullText indexing
  • FullText config: [{"Field": "content", "Analyzer": {"Tokenizer": "standard"}}]
  • schema_version: "0.3.18" added to collection embedding metadata for version-aware compatibility checks
  • Startup compatibility check: warns (does not block) if old collection lacks content/FullText
  • API parameters: engine (Literal["auto", "fs"]), switch_to_remote_threshold (int, default=1000, ≥0), remote_return_limit (int, default=100, 1–100000); count_cache_ttl hardcoded to 60s

4. Data Pipeline

  • embedding_msg_converter.py writes vectorization_text[:65536] to content field

5. Business Logic (viking_fs.py)

  • grep() refactored with engine dispatch: _resolve_grep_engine() → _grep_fs() or _grep_vikingdb_then_fs()
  • _grep_vikingdb_then_fs(): bm25 recall → _grep_in_files() precise regex match; auto-fallback to fs on vikingdb errors
  • _get_cached_count(): per-URI count cache with hardcoded TTL=60s
  • _collection_has_fulltext(): checks content field + FullText config in collection metadata
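
A per-URI count cache with a fixed TTL might look like the sketch below. It is a simplified stand-in for `_get_cached_count()`: the class name `CountCache`, the injectable `clock`, and the `fetch` callback are assumptions made so the example can run on its own.

```python
import time
from typing import Callable, Dict, Tuple


class CountCache:
    """Cache record counts per URI and refetch once an entry is older than the TTL."""

    def __init__(self, ttl_seconds: float = 60.0, clock: Callable[[], float] = time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: Dict[str, Tuple[int, float]] = {}  # uri -> (count, stored_at)

    def get(self, uri: str, fetch: Callable[[str], int]) -> int:
        now = self._clock()
        entry = self._entries.get(uri)
        if entry is not None and now - entry[1] < self._ttl:
            return entry[0]  # still fresh: skip the remote count query
        count = fetch(uri)
        self._entries[uri] = (count, now)
        return count
```

With a 60 second TTL, a burst of grep calls on the same URI triggers at most one count query per minute, at the cost of the threshold decision lagging behind recent writes.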

6. Backend & Adapter

  • CollectionAdapter.search_by_keywords() delegates to collection, normalizes records
  • VikingVectorIndexBackend.search_by_keywords() and get_collection_meta() async methods

7. User-Agent Header

All VikingDB HTTP requests now include a User-Agent header with format openviking/{version} (e.g., openviking/0.3.18). This helps VikingDB server-side identify request sources for troubleshooting and traffic analytics.
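
The header format is simple enough to show with the standard library. The helper names below are hypothetical, and the PR's clients may attach the header through a different HTTP library.

```python
import urllib.request


def build_user_agent(version: str) -> str:
    """Format the User-Agent value as openviking/{version}."""
    return f"openviking/{version}"


def make_request(url: str, version: str) -> urllib.request.Request:
    # The Request object is only constructed here; nothing is sent over the network.
    return urllib.request.Request(url, headers={"User-Agent": build_user_agent(version)})
```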

8. Schema Compatibility

Existing collections created before v0.3.18 will not have the content field or FullText config. On startup, init_context_collection() detects this via schema_version in the Description metadata and logs a warning. The grep engine=auto path automatically falls back to fs mode for such collections. To enable vikingdb-based grep, the collection must be recreated.
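
A compatibility check of this kind could be sketched as follows. The `FullText` entry shape comes from the config quoted earlier in this description; the `Fields` list with `FieldName` keys is an assumed metadata layout, and `collection_has_fulltext` is a hypothetical stand-in for `_collection_has_fulltext()`.

```python
from typing import Any, Dict


def collection_has_fulltext(meta: Dict[str, Any], field: str = "content") -> bool:
    """Return True only if the field exists and is covered by a FullText entry."""
    # Assumed layout: meta["Fields"] is a list of {"FieldName": ...} dicts.
    field_names = {f.get("FieldName") for f in (meta.get("Fields") or [])}
    # From the PR: FullText entries look like {"Field": "content", "Analyzer": {...}}.
    fulltext_fields = {f.get("Field") for f in (meta.get("FullText") or [])}
    return field in field_names and field in fulltext_fields
```

A collection created before v0.3.18 would fail both checks, which is what routes engine=auto to the fs path.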

Testing

  • 113/113 storage tests pass

  • Updated test_init_context_collection_warns_on_mismatched_nonempty_collection — reflects current behavior (warn + return False instead of raising)

  • Updated test_context_collection_contains_content_field_for_fulltext — validates content field and FullText config presence

  • I have added tests that prove my fix is effective or that my feature works

  • New and existing unit tests pass locally with my changes

  • I have tested this on the following platforms:

    • Linux
    • macOS
    • Windows

Benchmark

1. Step-by-step workflow

Benchmark scripts are located at benchmark/retrieval/grep/vikingdb_bm25/:

# Step 1: Generate 80K test files (~4GB)
python3 step1_generate.py

# Step 2: Quick upload (skip VLM+embedding, fast)
python3 step2_quick_add_resource.py

# Step 3: Build index (embedding for bm25)
python3 step3_build_index.py

# Step 4: Benchmark
python3 step4_benchmark.py

2. Results

Environment: Debian 10, 12c24m
Total Data: 80,000 files, ~4GB, 4-level directory tree

| Scenario | fs (ms) | auto (ms) | Speedup |
|---|---|---|---|
| Single keyword match | 10,857 | 1,093 | 9.9x |
| Single keyword match | 10,059 | 1,126 | 8.9x |
| 2-keyword match | 6,311 | 1,030 | 6.1x |
| 3-keyword match | 4,539 | 1,250 | 3.6x |
| Rare keyword match | 10,592 | 1,016 | 10.4x |
| No match - 1 keyword | 17,064 | 969 | 17.6x |
| No match - 2 keywords | 17,729 | 1,023 | 17.3x |
| No match - 3 keywords | 17,844 | 1,195 | 14.9x |
| Subdir match (~8K files) | 3,420 | 554 | 6.2x |
| Subdir no match (~8K files) | 3,376 | 602 | 5.6x |

Key Findings:

  • Speedup ranges from 3.6x to 17.6x. No-match scenarios benefit the most: fs must traverse all files to confirm zero results, while bm25 returns empty immediately.
  • fs latency grows with file count, from 3.4s (8K files) to 17.8s (80K files). Multi-keyword regex (using |) matches faster in fs because it hits more files early.
  • auto latency stays within 0.55 to 1.25s across all scenarios. The variance mainly comes from the second-stage local regex filtering on recalled candidates, not the bm25 recall phase itself.
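
The Speedup column is the fs latency divided by the auto latency, rounded to one decimal. A small check on three rows taken from the results above:

```python
def speedup(fs_ms: float, auto_ms: float) -> float:
    """Ratio of fs latency to auto latency, rounded to one decimal place."""
    return round(fs_ms / auto_ms, 1)


# (scenario, fs ms, auto ms) copied from the benchmark results.
rows = [
    ("Single keyword match", 10857, 1093),
    ("3-keyword match", 4539, 1250),
    ("No match - 1 keyword", 17064, 969),
]
ratios = {name: speedup(fs_ms, auto_ms) for name, fs_ms, auto_ms in rows}
```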

github-actions Bot commented May 20, 2026

PR Reviewer Guide 🔍

(Review updated until commit fbf4fea)

Here are some key observations to aid the review process:

⏱️ Estimated effort to review: 4 🔵🔵🔵🔵⚪
🏅 Score: 80
🧪 No relevant tests
🔒 No security concerns identified
✅ No TODO sections
🔀 Multiple PR themes

Sub-PR theme: Grep Engine Integration (Core + CLI + Tests)

Relevant files:

  • openviking/storage/viking_fs.py
  • openviking/storage/viking_vector_index_backend.py
  • openviking/storage/collection_schemas.py
  • openviking/server/routers/search.py
  • openviking/service/fs_service.py
  • openviking/async_client.py
  • openviking/client/local.py
  • openviking/sync_client.py
  • openviking_cli/client/sync_http.py
  • openviking_cli/client/http.py
  • openviking_cli/client/base.py
  • openviking/storage/queuefs/embedding_msg_converter.py
  • openviking/storage/ovpack/vectors.py
  • crates/ov_cli/src/main.rs
  • crates/ov_cli/src/client.rs
  • crates/ov_cli/src/handlers.rs
  • crates/ov_cli/src/commands/search.rs
  • tests/storage/test_rebuild_schema.py
  • tests/storage/test_collection_schemas.py

Sub-PR theme: Grep BM25 Benchmark Scripts

Relevant files:

  • benchmark/retrieval/grep/vikingdb_bm25/step1_generate.py
  • benchmark/retrieval/grep/vikingdb_bm25/step2_quick_add_resource.py
  • benchmark/retrieval/grep/vikingdb_bm25/step3_build_index.py
  • benchmark/retrieval/grep/vikingdb_bm25/step4_benchmark.py

Sub-PR theme: Vector DB Search by Keywords Extensions

Relevant files:

  • openviking/storage/vectordb/collection/http_collection.py
  • openviking/storage/vectordb/collection/collection.py
  • openviking/storage/vectordb/collection/volcengine_collection.py
  • openviking/storage/vectordb/collection/vikingdb_collection.py
  • openviking/storage/vectordb/collection/volcengine_api_key_collection.py
  • openviking/storage/vectordb/collection/local_collection.py
  • openviking/storage/vectordb/collection/volcengine_clients.py
  • openviking/storage/vectordb/collection/vikingdb_clients.py
  • openviking/storage/vectordb_adapters/base.py
  • openviking/storage/vectordb/utils/validation.py
  • openviking/storage/vectordb/service/app_models.py
  • tests/storage/mock_backend.py

⚡ Recommended focus areas for review

Naive Regex Splitting for BM25 Keywords

The code splits the regex pattern on '|' to extract keywords for BM25 search. This fails for complex regex patterns with groups, character classes, or an escaped '|' (e.g., 'error|(warning|fail)', 'a\|b', '[a|b]'), producing incorrect keywords and possibly an unnecessary fallback to fs mode.

# for bm25 search. Limit to 10 keywords per VikingDB API constraint.
keywords = [kw.strip() for kw in pattern.split("|") if kw.strip()][:10]
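
One conservative alternative, sketched here under assumptions rather than taken from the PR, is to extract keywords only when the pattern is a plain alternation of literals and otherwise signal that the caller should use fs. The function name `extract_bm25_keywords` and the choice to return None are illustrative.

```python
import re
from typing import List, Optional

# Any regex metacharacter other than '|' means the pattern is not a plain
# alternation of literal words, so splitting on '|' would be unsafe.
_META = re.compile(r"[\\()\[\]{}.*+?^$]")


def extract_bm25_keywords(pattern: str, limit: int = 10) -> Optional[List[str]]:
    """Return literal keywords, or None when the pattern needs a full regex engine."""
    if _META.search(pattern):
        return None  # caller should fall back to fs grep
    keywords = [kw.strip() for kw in pattern.split("|") if kw.strip()]
    return keywords[:limit] or None
```

This trades recall coverage for safety: a pattern such as `foo.bar` is rejected even though a keyword could be guessed from it.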

Count Cache Eviction Not Atomic

The _count_cache eviction logic (checking size then deleting keys) is not atomic in an async context. Multiple concurrent calls could lead to unexpected cache behavior, though the impact is low since it's a cache.

if len(self._count_cache) >= self._count_cache_max_size:
    oldest_keys = sorted(self._count_cache, key=lambda k: self._count_cache[k][1])
    for k in oldest_keys[:len(oldest_keys) // 2]:
        del self._count_cache[k]
self._count_cache[cache_key] = (count, now)
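
In a single-threaded asyncio event loop, a block with no await in it is not interleaved with other coroutines, so the concern mainly applies if the cache can be reached from more than one thread. Under that assumption, a lock around the check-evict-insert sequence is one option. The class below is a hypothetical sketch that keeps the same evict-oldest-half policy as the snippet above.

```python
import threading
import time
from typing import Dict, Optional, Tuple


class BoundedCountCache:
    """Size-bounded cache whose eviction and insert happen under one lock."""

    def __init__(self, max_size: int = 1024):
        self._max_size = max_size
        self._lock = threading.Lock()
        self._entries: Dict[str, Tuple[int, float]] = {}  # key -> (count, stored_at)

    def put(self, key: str, count: int, now: Optional[float] = None) -> None:
        stamp = time.monotonic() if now is None else now
        with self._lock:
            if key not in self._entries and len(self._entries) >= self._max_size:
                # Evict the older half, ordered by stored timestamp.
                oldest = sorted(self._entries, key=lambda k: self._entries[k][1])
                for stale in oldest[: max(1, len(oldest) // 2)]:
                    del self._entries[stale]
            self._entries[key] = (count, stamp)

    def keys(self) -> list:
        with self._lock:
            return list(self._entries)
```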

Broad Exception Swallowing in search_by_keywords

The search_by_keywords method catches all exceptions and returns an empty list, which could hide errors. Consider logging the exception and re-raising or falling back more explicitly.

except Exception as e:
    logger.error("Error searching by keywords: %s", e)
    return []
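
One way to make the failure explicit, sketched with hypothetical names, is to log and re-raise a dedicated exception so the caller can tell "no matches" apart from "search failed" and choose the fs fallback deliberately. `KeywordSearchError`, `client_call`, and `recall_or_none` are not part of the PR.

```python
import logging
from typing import Callable, Iterable, List, Optional

logger = logging.getLogger(__name__)


class KeywordSearchError(RuntimeError):
    """Raised when the keyword search backend fails (hypothetical exception type)."""


def search_by_keywords(client_call: Callable[[List[str]], Iterable[str]], keywords: List[str]) -> List[str]:
    try:
        return list(client_call(keywords))
    except Exception as exc:
        logger.error("Error searching by keywords: %s", exc)
        # Re-raise instead of returning [] so an outage is not mistaken for zero hits.
        raise KeywordSearchError(str(exc)) from exc


def recall_or_none(client_call: Callable[[List[str]], Iterable[str]], keywords: List[str]) -> Optional[List[str]]:
    """Return recalled paths, or None to tell the caller to fall back to fs grep."""
    try:
        return search_by_keywords(client_call, keywords)
    except KeywordSearchError:
        return None
```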

github-actions

PR Code Suggestions ✨

No code suggestions found for the PR.

qin-ctx requested a review from zhoujh01 May 20, 2026 08:57
ByteDanceLiuYang marked this pull request as draft May 20, 2026 09:36
ByteDanceLiuYang force-pushed the grep_vikingdb branch 8 times, most recently from 92329f5 to fbf4fea May 23, 2026 15:18
ByteDanceLiuYang marked this pull request as ready for review May 25, 2026 04:01
github-actions

Persistent review updated to latest commit fbf4fea

ByteDanceLiuYang force-pushed the grep_vikingdb branch 6 times, most recently from 0672236 to 4d21a96 May 25, 2026 07:59
ByteDanceLiuYang marked this pull request as draft May 25, 2026 12:51
Comment thread crates/ov_cli/src/commands/search.rs Outdated
node_limit: i32,
level_limit: i32,
engine: Option<String>,
switch_to_remote_threshold: Option<i32>,

Collaborator

Since these parameters are rarely used, consider moving them into ovcli.conf instead of exposing them as flags.

Author

OK, adjusted.

offset: int = 0,
filters: Optional[Dict[str, Any]] = None,
output_fields: Optional[List[str]] = None,
mode: Optional[str] = None,

Collaborator

The design of these two parameters still needs more thought.

Author

OK, I have removed them for now. OpenViking does not actually need to pass these two parameters at present; it simply relies on the default behavior.

Projects

Status: Backlog
