Skip to content

fix(retriever): applyMMRDiversity O(n³)→O(n²) Map lookup optimization (#763)#776

Open
jlin53882 wants to merge 11 commits into
CortexReach:masterfrom
jlin53882:fix/mmr-diversity-optimization
Open

fix(retriever): applyMMRDiversity O(n³)→O(n²) Map lookup optimization (#763)#776
jlin53882 wants to merge 11 commits into
CortexReach:masterfrom
jlin53882:fix/mmr-diversity-optimization

Conversation

@jlin53882
Copy link
Copy Markdown
Contributor

@jlin53882 jlin53882 commented May 9, 2026

Issue

Issue #763 P2 — applyMMRDiversity O(n³)→O(n²) optimization

Problem

applyMMRDiversity() was calling Array.from() inside the inner loop on every similarity comparison:

  • Outer loop: O(n) candidates
  • Inner loop: O(n) selected items × O(n) Array.from() twice per check = O(n²) per candidate
  • Total: O(n³)

Fix

Three optimizations applied in applyMMRDiversity():

  1. Pre-convert vectors once at entry — eliminates O(n) Array.from() cost from the inner loop; now O(n²) total for all conversions
  2. Duplicate ID upfront detection via Set — O(n) single-pass scan; detects duplicates and routes to fallback before any similarity computation
  3. Map-based O(1) id lookup — pre-converted vectors stored in Map<id, number[]>; replaces O(n) selected.findIndex() with O(1) Map.get()

When duplicate IDs are detected, applyMMRDiversity_Fallback() (findIndex-based O(n²)) is called to preserve original semantics for the ambiguous duplicate-ID case.

Complexity: O(n³) → O(n²)

Changes

  • src/retriever.ts (+81/-10): applyMMRDiversity() optimized + new applyMMRDiversity_Fallback()
  • test/mmr-tiny.test.mjs (+286/-0): 17 tests covering: edge cases (empty/single/null/undefined vectors), diversity filtering (orthogonal, identical, similar, mixed with null), Fallback routing for duplicate IDs, threshold boundaries (0.85/0.9/1.0), cross-validation of optimized vs Fallback paths, and performance sanity (n=50, n=100)
  • scripts/ci-test-manifest.mjs (+4/-3): registers mmr-tiny.test.mjs in core-regression + deduplicate issue-690 entry from merge conflict
  • index.ts: merge conflict resolution (upstream Phase 1 code retained, isCliRegistrationMode(api) preserved)
  • package.json, package-lock.json: upstream version updates

Verification

  • applyMMRDiversity — O(1) Map.get() for vector lookup, pre-conversion loop O(n²) total
  • applyMMRDiversity_Fallback — called only when duplicate IDs detected; preserves original findIndex semantics
  • Items without vectors (null/undefined) are always selected (no similarity comparison possible)
  • Cross-validation: optimized path produces identical results to Fallback for diverse, similar, and duplicate-ID inputs
  • Threshold boundaries tested: 0.85, 0.9, 1.0
  • Performance tests: n=50 < 1s, n=100 < 2s

jlin53882 added 3 commits May 5, 2026 00:34
…issue CortexReach#692)

Adds tree-sitter-based chunking to prevent splitting code mid-function.

Changes:
- detectCodeLanguage(): identify JS/TS/Python/Go/Rust from code content
- astChunk(): split code at declaration boundaries (function/class/method)
- smartChunk(): route through astChunk when file is detected as code
- 20 unit tests covering all destructive split scenarios

Test: node --test test/ast-code-chunking.test.mjs (20/20 pass)
Config: astAwareCodeSplit defaults to true
…apture architecture, upgrade isCliMode→isCliRegistrationMode(api)
@TurboTheTurtle
Copy link
Copy Markdown
Contributor

Review finding:

P2: the new MMR tests do not exercise the production implementation

test/mmr-tiny.test.mjs copies the old and new MMR algorithms into local helper functions (applyMMRDiversity_NEW / applyMMRDiversity_OLD) instead of importing or invoking the retriever method changed in src/retriever.ts. That means the test can pass even if the production applyMMRDiversity() implementation diverges, regresses, or is not called at all. The test file also is not added to scripts/ci-test-manifest.mjs, so the current PR checks will not run it. Please wire the test into CI and exercise the actual retriever behavior, or expose a small test seam around the production method rather than testing a duplicate copy.

…I manifest

Resolves PR CortexReach#776 reviewer comment:
- test/mmr-tiny.test.mjs previously imported local copies of the MMR
  algorithm instead of exercising the production method, allowing the
  test to pass even if the actual retriever implementation diverged.

Fix: rewrite test to call retriever.applyMMRDiversity() directly via
jiti import, covering optimized path, fallback path (duplicate IDs),
null/undefined vectors, and threshold boundaries.

Also registers test/mmr-tiny.test.mjs in scripts/ci-test-manifest.mjs
under core-regression so it runs in CI checks.
@jlin53882 jlin53882 force-pushed the fix/mmr-diversity-optimization branch from 9cdc7cf to c26117b Compare May 12, 2026 11:16
jlin53882 added 6 commits May 12, 2026 21:04
…nflicts

- index.ts: keep upstream Phase 1 code, use isCliRegistrationMode(api) for logReg
- verify-ci-test-manifest.mjs: use CI_TEST_MANIFEST variable (upstream refactor)
…CortexReach#763)

Optimizations applied:
1. Pre-convert all vectors once at entry (eliminates O(n) Array.from()
   from inside the inner loop — was causing O(n³) total cost)
2. id→index Map for O(1) lookup — replaces O(n) selected.findIndex
   inside selected.some()
3. Duplicate ID detection — upfront Set check routes to
   applyMMRDiversity_Fallback() which preserves original semantics
   for duplicate-ID edge case

applyMMRDiversity:  O(n²) — pre-convert vectors + Map-based O(1) id lookup
applyMMRDiversity_Fallback: O(n²) — findIndex-based (safe for small dup sets)
Merge of upstream/master introduced duplicate entry for
test/issue-690-cross-call-batch.test.mjs. Remove second occurrence.
…Fallback

Strengthen existing tests:
- orthogonal vectors: add assert.deepStrictEqual for exact id order
- similar vectors deferred: add explicit r1/r2/r3 position assertions +
  verify deferred portion contains r2
- null/undefined vector: assert returned id matches expected

New cross-validation tests (3):
- optimized path matches Fallback for diverse inputs (unique IDs)
- optimized path matches Fallback for similar inputs (unique IDs)
- optimized path routes to Fallback for duplicate IDs

These tests directly verify that the O(n²) optimized path produces
identical results to the O(n²) Fallback path, ensuring correctness
of the Map-based optimization.
Copy link
Copy Markdown
Collaborator

@rwmjhb rwmjhb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PR #776 Review: fix(retriever): applyMMRDiversity O(n³)→O(n²) Map lookup optimization (#763)

Verdict: REQUEST-CHANGES | 6 rounds completed | Value: 42% | Size: XL | Author: jlin53882

Value Assessment

Problem: applyMMRDiversity in the retriever performs repeated vector conversion and selected-result lookup inside nested loops, making larger retrieval result sets slower than necessary. The PR claims to reduce the MMR diversity pass from O(n³) to O(n²) while preserving behavior for duplicate IDs.

Dimension Assessment
Value Score 42%
Value Verdict review
Issue Linked true
Project Aligned true
Duplicate false
AI Slop Score 1/6
User Impact medium
Urgency medium

Scope Drift: 3 flag(s)

  • src/chunker.ts and test/ast-code-chunking.test.mjs implement Issue #692 AST-based chunking, which is not justified by the #763 MMR optimization problem statement
  • package.json/package-lock.json add tree-sitter native dependencies for AST chunking, not for applyMMRDiversity
  • index.ts CLI registration-mode changes are unrelated to the retriever MMR performance fix

AI Slop Signals:

  • Claim/diff mismatch: PR title and body center on #763 applyMMRDiversity optimization, but the diff includes src/chunker.ts AST chunking and test/ast-code-chunking.test.mjs for #692.

Open Questions:

  • Is Issue #763 maintainer-acknowledged? The provided issue context has no labels, assignment, or maintainer comments visible.
  • Should the Issue #692 AST chunking changes and tree-sitter dependencies be split into a separate PR before reviewing #763?
  • Is the P3 event-loop-blocking portion of Issue #763 intentionally out of scope for this PR?

Summary

applyMMRDiversity in the retriever performs repeated vector conversion and selected-result lookup inside nested loops, making larger retrieval result sets slower than necessary. The PR claims to reduce the MMR diversity pass from O(n³) to O(n²) while preserving behavior for duplicate IDs.

Evaluation Signals

Signal Value
Blockers 0
Warnings 1
PR Size XL
Verdict Floor approve
Risk Level high
Value Model codex
Primary Model codex
Adversarial Model claude

Must Fix

  • F3: Oversized code declarations violate chunk size limits

Nice to Have

  • F1: New any types in AST chunker
  • F2: JS and TS imports are classified as Python
  • F4: AST loader uses require in an ESM module
  • MR2: CLI mode detection semantic change rides along with no test coverage and no rationale tied to #763
  • MR3: No benchmark validates the claimed O(n³)→O(n²) improvement
  • MR4: Dead subChunk helper inside chunker.ts
  • MR5: detectCodeLanguage docstring/code mismatch (200 vs 400 chars)

Recommended Action

Author should address must-fix findings before merge.


Reviewed at 2026-05-12T17:31:38Z | 6 rounds | Value: codex | Primary: codex | Adversarial: claude

…R\n\n- Delete test/ast-code-chunking.test.mjs (not in upstream)\n- Remove ast-code-chunking.test.mjs from ci-test-manifest\n- Retain Issue CortexReach#763 MMR tests
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants