fix(retriever): applyMMRDiversity O(n³)→O(n²) Map lookup optimization (#763)#776
fix(retriever): applyMMRDiversity O(n³)→O(n²) Map lookup optimization (#763)#776jlin53882 wants to merge 11 commits into
Conversation
…issue CortexReach#692) Adds tree-sitter-based chunking to prevent splitting code mid-function. Changes: - detectCodeLanguage(): identify JS/TS/Python/Go/Rust from code content - astChunk(): split code at declaration boundaries (function/class/method) - smartChunk(): route through astChunk when file is detected as code - 20 unit tests covering all destructive split scenarios Test: node --test test/ast-code-chunking.test.mjs (20/20 pass) Config: astAwareCodeSplit defaults to true
…apture architecture, upgrade isCliMode→isCliRegistrationMode(api)
|
Review finding: P2: the new MMR tests do not exercise the production implementation
|
…I manifest Resolves PR CortexReach#776 reviewer comment: - test/mmr-tiny.test.mjs previously imported local copies of the MMR algorithm instead of exercising the production method, allowing the test to pass even if the actual retriever implementation diverged. Fix: rewrite test to call retriever.applyMMRDiversity() directly via jiti import, covering optimized path, fallback path (duplicate IDs), null/undefined vectors, and threshold boundaries. Also registers test/mmr-tiny.test.mjs in scripts/ci-test-manifest.mjs under core-regression so it runs in CI checks.
9cdc7cf to
c26117b
Compare
…nflicts - index.ts: keep upstream Phase 1 code, use isCliRegistrationMode(api) for logReg - verify-ci-test-manifest.mjs: use CI_TEST_MANIFEST variable (upstream refactor)
…CortexReach#763) Optimizations applied: 1. Pre-convert all vectors once at entry (eliminates O(n) Array.from() from inside the inner loop — was causing O(n³) total cost) 2. id→index Map for O(1) lookup — replaces O(n) selected.findIndex inside selected.some() 3. Duplicate ID detection — upfront Set check routes to applyMMRDiversity_Fallback() which preserves original semantics for duplicate-ID edge case applyMMRDiversity: O(n²) — pre-convert vectors + Map-based O(1) id lookup applyMMRDiversity_Fallback: O(n²) — findIndex-based (safe for small dup sets)
Merge of upstream/master introduced duplicate entry for test/issue-690-cross-call-batch.test.mjs. Remove second occurrence.
…Fallback Strengthen existing tests: - orthogonal vectors: add assert.deepStrictEqual for exact id order - similar vectors deferred: add explicit r1/r2/r3 position assertions + verify deferred portion contains r2 - null/undefined vector: assert returned id matches expected New cross-validation tests (3): - optimized path matches Fallback for diverse inputs (unique IDs) - optimized path matches Fallback for similar inputs (unique IDs) - optimized path routes to Fallback for duplicate IDs These tests directly verify that the O(n²) optimized path produces identical results to the O(n²) Fallback path, ensuring correctness of the Map-based optimization.
rwmjhb
left a comment
There was a problem hiding this comment.
PR #776 Review: fix(retriever): applyMMRDiversity O(n³)→O(n²) Map lookup optimization (#763)
Verdict: REQUEST-CHANGES | 6 rounds completed | Value: 42% | Size: XL | Author: jlin53882
Value Assessment
Problem: applyMMRDiversity in the retriever performs repeated vector conversion and selected-result lookup inside nested loops, making larger retrieval result sets slower than necessary. The PR claims to reduce the MMR diversity pass from O(n³) to O(n²) while preserving behavior for duplicate IDs.
| Dimension | Assessment |
|---|---|
| Value Score | 42% |
| Value Verdict | review |
| Issue Linked | true |
| Project Aligned | true |
| Duplicate | false |
| AI Slop Score | 1/6 |
| User Impact | medium |
| Urgency | medium |
Scope Drift: 3 flag(s)
- src/chunker.ts and test/ast-code-chunking.test.mjs implement Issue #692 AST-based chunking, which is not justified by the #763 MMR optimization problem statement
- package.json/package-lock.json add tree-sitter native dependencies for AST chunking, not for applyMMRDiversity
- index.ts CLI registration-mode changes are unrelated to the retriever MMR performance fix
AI Slop Signals:
- Claim/diff mismatch: PR title and body center on #763 applyMMRDiversity optimization, but the diff includes src/chunker.ts AST chunking and test/ast-code-chunking.test.mjs for #692.
Open Questions:
- Is Issue #763 maintainer-acknowledged? The provided issue context has no labels, assignment, or maintainer comments visible.
- Should the Issue #692 AST chunking changes and tree-sitter dependencies be split into a separate PR before reviewing #763?
- Is the P3 event-loop-blocking portion of Issue #763 intentionally out of scope for this PR?
Summary
applyMMRDiversity in the retriever performs repeated vector conversion and selected-result lookup inside nested loops, making larger retrieval result sets slower than necessary. The PR claims to reduce the MMR diversity pass from O(n³) to O(n²) while preserving behavior for duplicate IDs.
Evaluation Signals
| Signal | Value |
|---|---|
| Blockers | 0 |
| Warnings | 1 |
| PR Size | XL |
| Verdict Floor | approve |
| Risk Level | high |
| Value Model | codex |
| Primary Model | codex |
| Adversarial Model | claude |
Must Fix
- F3: Oversized code declarations violate chunk size limits
Nice to Have
- F1: New any types in AST chunker
- F2: JS and TS imports are classified as Python
- F4: AST loader uses require in an ESM module
- MR2: CLI mode detection semantic change rides along with no test coverage and no rationale tied to #763
- MR3: No benchmark validates the claimed O(n³)→O(n²) improvement
- MR4: Dead
subChunkhelper inside chunker.ts - MR5: detectCodeLanguage docstring/code mismatch (200 vs 400 chars)
Recommended Action
Author should address must-fix findings before merge.
Reviewed at 2026-05-12T17:31:38Z | 6 rounds | Value: codex | Primary: codex | Adversarial: claude
…R\n\n- Delete test/ast-code-chunking.test.mjs (not in upstream)\n- Remove ast-code-chunking.test.mjs from ci-test-manifest\n- Retain Issue CortexReach#763 MMR tests
Issue
Issue #763 P2 — applyMMRDiversity O(n³)→O(n²) optimization
Problem
applyMMRDiversity() was calling Array.from() inside the inner loop on every similarity comparison:
Fix
Three optimizations applied in
applyMMRDiversity():Map<id, number[]>; replaces O(n)selected.findIndex()with O(1)Map.get()When duplicate IDs are detected,
applyMMRDiversity_Fallback()(findIndex-based O(n²)) is called to preserve original semantics for the ambiguous duplicate-ID case.Complexity: O(n³) → O(n²)
Changes
applyMMRDiversity()optimized + newapplyMMRDiversity_Fallback()mmr-tiny.test.mjsin core-regression + deduplicate issue-690 entry from merge conflictindex.ts: merge conflict resolution (upstream Phase 1 code retained,isCliRegistrationMode(api)preserved)package.json,package-lock.json: upstream version updatesVerification
applyMMRDiversity— O(1) Map.get() for vector lookup, pre-conversion loop O(n²) totalapplyMMRDiversity_Fallback— called only when duplicate IDs detected; preserves original findIndex semantics