Conversation

@tnm (Contributor) commented on Jan 7, 2026

Summary

Two performance optimizations from Tier 2 (a third change, the module map rewrite, was dropped after benchmarking showed pathlib overhead made it slower; see Removed below):

1. Vector search collection reset (vector_searcher.py)

  • Track reset state to avoid redundant resets on multiple add() calls
  • Use delete_collection as primary fast path (avoids count() call)
  • Only fall back to expensive ID enumeration when necessary
  • ~2-5x faster build_index() restarts (see the reset sketch after this list)

2. Context extractor file caching (context_extractor.py)

  • Add mtime-based file content cache shared across methods
  • chunk_file_by_lines, chunk_file_by_symbols, extract_context_around_line now share cached file reads
  • Simple FIFO eviction when cache exceeds 100 files
  • 10-94x faster for repeated reads of the same file (see the cache sketch after this list)
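
A minimal sketch of the reset fast path from item 1, assuming a chromadb-style client; the names here (`VectorSearcher`, `_reset_done`) are illustrative, not the actual kit code:

```python
# A sketch of the reset fast path, assuming a chromadb-style client.
# VectorSearcher/_reset_done are illustrative names, not kit's actual code.
import chromadb


class VectorSearcher:
    def __init__(self, persist_dir: str, collection_name: str = "code"):
        self.client = chromadb.PersistentClient(path=persist_dir)
        self.collection_name = collection_name
        self.collection = self.client.get_or_create_collection(collection_name)
        self._reset_done = False  # reset state, tracked across add() calls

    def _reset_collection(self) -> None:
        if self._reset_done:
            return  # already reset during this build; skip redundant work
        try:
            # Fast path: drop and recreate the collection outright,
            # with no count() call and no ID enumeration.
            self.client.delete_collection(self.collection_name)
            self.collection = self.client.create_collection(self.collection_name)
        except Exception:
            # Fallback: enumerate existing IDs and delete them (expensive).
            existing = self.collection.get()
            if existing["ids"]:
                self.collection.delete(ids=existing["ids"])
        self._reset_done = True
```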
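
And a minimal sketch of the mtime-keyed cache from item 2, with the 100-file FIFO eviction described above; `_read_cached` and the module-level `_file_cache` are assumed names, not kit's actual internals:

```python
# A sketch of the mtime-keyed file cache with FIFO eviction.
# _read_cached and _file_cache are assumed names, not kit's internals.
import os
from collections import OrderedDict

_MAX_CACHED_FILES = 100
_file_cache: "OrderedDict[str, tuple[float, list[str]]]" = OrderedDict()


def _read_cached(path: str) -> list[str]:
    """Return the file's lines, re-reading only when its mtime changes."""
    mtime = os.path.getmtime(path)
    entry = _file_cache.get(path)
    if entry is not None and entry[0] == mtime:
        return entry[1]  # hit: same path, unchanged mtime
    with open(path, encoding="utf-8") as f:
        lines = f.readlines()
    _file_cache[path] = (mtime, lines)
    if len(_file_cache) > _MAX_CACHED_FILES:
        _file_cache.popitem(last=False)  # FIFO: evict the oldest entry
    return lines
```

With a shared helper like this, chunking and context-extraction methods all read through the same cache, so repeated operations on the same file skip disk entirely.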

Benchmark Results

File Caching (synthetic)

| Reads of same file | Uncached | Cached | Speedup |
| --- | --- | --- | --- |
| 10 | 0.78ms | 0.08ms | 10x |
| 50 | 3.69ms | 0.12ms | 30x |
| 100 | 7.49ms | 0.08ms | 94x |

Real-world (kit codebase)

| Operation | Result |
| --- | --- |
| Context extractor (180 calls on 20 files) | 0.88ms avg per call |

Removed

  • Module map pathlib optimization - Benchmarking showed pathlib object creation overhead (Path(), .parents, str()) made it 10x slower than the simple nested os.path.dirname() loops. The theoretical O(n²) → O(n) improvement was overwhelmed by constant factors.
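
For reference, a standalone micro-benchmark along these lines (illustrative only, not the benchmark run for this PR) reproduces the pathlib-vs-dirname comparison:

```python
# Illustrative micro-benchmark (not the benchmark run for this PR):
# collecting a path's ancestors with pathlib vs. an os.path.dirname() loop.
import os
import timeit
from pathlib import Path

SAMPLE = "src/pkg/subpkg/module/impl/file.py"

def ancestors_pathlib(path: str) -> list[str]:
    # Pays for Path() construction, .parents, and a str() per ancestor.
    return [str(parent) for parent in Path(path).parents]

def ancestors_dirname(path: str) -> list[str]:
    # Plain string work; stops at "" where Path.parents ends at ".".
    out = []
    d = os.path.dirname(path)
    while d:
        out.append(d)
        d = os.path.dirname(d)
    return out

print("pathlib:", timeit.timeit(lambda: ancestors_pathlib(SAMPLE), number=100_000))
print("dirname:", timeit.timeit(lambda: ancestors_dirname(SAMPLE), number=100_000))
```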

Test plan

  • All existing tests pass
  • Formatting/linting passes
  • Benchmark verification completed

@tnm force-pushed the perf/tier2-optimizations branch from f0eafce to 51b4dd0 on January 7, 2026 at 06:40
@tnm changed the title from "perf: Tier 2 optimizations - caching and algorithm improvements" to "perf: Tier 2 optimizations - vector reset and file caching" on Jan 7, 2026
@tnm merged commit 523acf6 into main on Jan 7, 2026
2 checks passed
tnm added a commit that referenced this pull request Jan 7, 2026
Add missing performance improvements from PRs #179 and #180:
- Tier 1: O(n²)→O(1) dependency graph, O(n)→O(log n) line lookup, regex precompile
- Tier 2: Vector search collection reset, context extractor file caching