perf: reduce memory use when splitting IVF partitions by wjones127 · Pull Request #6687 · lance-format/lance

wjones127 · 2026-05-05T17:35:33Z

We reduce memory use in two of the stages of splitting IVF partitions:

- Training new IVF centroids. Currently, all raw vectors of the partition being split are loaded into memory and used to train the centroids. This could be 400MB for a dataset of 3072-dimensional f32 vectors when partition size has reached 33k, triggering a split. We now sample just 512 of the vectors, which should be sufficient to train only 2 centroids.
- Shuffle. Currently, all vectors that will be moved across all partitions being split are held in memory simultaneously in `Vec<SplitPlan>`. This is the largest source of peak memory use. If many partitions are being split, this can exceed 100GB. We now instead stream these raw vectors through the partition assignment and quantization pipeline, just as we do when building new indices.
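To put rough numbers on the training change, here is a minimal sketch that computes the raw-vector footprint for the figures quoted above and picks a fixed-size sample of row offsets. The helper names and the evenly spaced sampling rule are assumptions for illustration only; the description does not say how the 512 rows are chosen. The constant mirrors the `SPLIT_SAMPLE_SIZE` name mentioned in a later commit, with the value 512 taken from the description.

```rust
/// Sample size used for training the two new centroids (value from the PR text).
const SPLIT_SAMPLE_SIZE: usize = 512;

/// Bytes needed to hold `rows` f32 vectors of dimension `dim` in memory.
fn raw_vector_bytes(rows: usize, dim: usize) -> usize {
    rows * dim * std::mem::size_of::<f32>()
}

/// Pick up to `sample_size` evenly spaced row offsets out of `num_rows`.
/// Assumption: stride sampling is a stand-in, not the PR's actual strategy.
fn sample_offsets(num_rows: usize, sample_size: usize) -> Vec<usize> {
    if num_rows <= sample_size {
        // Small partitions are used in full.
        return (0..num_rows).collect();
    }
    (0..sample_size)
        .map(|i| i * num_rows / sample_size)
        .collect()
}
```

With these definitions, `raw_vector_bytes(33_000, 3072)` is 405,504,000 bytes (about 400MB), while `raw_vector_bytes(SPLIT_SAMPLE_SIZE, 3072)` is 6,291,456 bytes (about 6MB).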

This PR also adds progress reporting to optimize_indices, to make this more observable.

Test Workload: IVF_PQ append on 560K base rows (16 partitions, 3072-dim float32 vectors) with 160K new rows — triggers partition splitting since each partition exceeds the 32K row threshold.

Peak RSS: 26.2 GB before, 4.1 GB after.
Runtime: 93s before, 16.5s after — 5.6x faster as well

Closes #6378

The optimize/append path created `IvfIndexBuilder` with `NoopIndexBuildProgress`, so progress callbacks were silently ignored. This adds a `progress` field to `OptimizeOptions` and passes it through to the builder in all index type variants of `optimize_vector_indices_v2`. Also adds shuffle stage reporting in `shuffle_data()`.

Ref #6378

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
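The plumbing described in this commit can be sketched roughly as follows. Every type and method name here (`ProgressSketch`, `OptimizeOptionsSketch`, `BuilderSketch`, and so on) is hypothetical and only mimics the shape of the change; none of it is the actual Lance API.

```rust
use std::sync::{Arc, Mutex};

/// Hypothetical stand-in for a progress callback trait.
trait ProgressSketch: Send + Sync {
    fn stage_start(&self, stage: &str);
    fn stage_complete(&self, stage: &str);
}

/// Default implementation that silently ignores every callback.
struct NoopProgressSketch;
impl ProgressSketch for NoopProgressSketch {
    fn stage_start(&self, _stage: &str) {}
    fn stage_complete(&self, _stage: &str) {}
}

/// Records events so a caller can observe which stages ran.
#[derive(Default)]
struct RecordingProgress {
    events: Mutex<Vec<String>>,
}
impl ProgressSketch for RecordingProgress {
    fn stage_start(&self, stage: &str) {
        self.events.lock().unwrap().push(format!("start:{stage}"));
    }
    fn stage_complete(&self, stage: &str) {
        self.events.lock().unwrap().push(format!("complete:{stage}"));
    }
}

/// Options carry the progress handle; the default stays a no-op.
struct OptimizeOptionsSketch {
    progress: Arc<dyn ProgressSketch>,
}
impl Default for OptimizeOptionsSketch {
    fn default() -> Self {
        Self { progress: Arc::new(NoopProgressSketch) }
    }
}

/// The builder takes the handle from the options instead of hard-coding a no-op.
struct BuilderSketch {
    progress: Arc<dyn ProgressSketch>,
}
impl BuilderSketch {
    fn from_options(options: &OptimizeOptionsSketch) -> Self {
        Self { progress: Arc::clone(&options.progress) }
    }

    /// Reports the shuffle stage around the (elided) shuffle work.
    fn shuffle_data(&self) {
        self.progress.stage_start("shuffle");
        // ... shuffle work would happen here ...
        self.progress.stage_complete("shuffle");
    }
}
```

Keeping the no-op as the default means existing callers see no behavior change, while callers that supply their own handle start receiving stage events.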

…signment

Previously, `build_split_plan` loaded all raw vectors for every partition being split, and ran up to `num_cpus` partitions in parallel. For high-dimensional vectors this caused OOM. Similarly, `collect_candidate_moves` loaded neighbor partitions in parallel.

This splits the work into two phases:

- Training (parallel, low memory): sample 512 row IDs per partition, load only those vectors, train kmeans.
- Assignment (sequential, high memory): load full raw vectors one partition at a time. Candidate moves also run sequentially.

Ref #6378

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
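The two phases above can be sketched with the standard library alone. This is a toy illustration under stated assumptions: the 2-means routine, the `load_full` loader closure, and all function names are hypothetical, and the real code trains kmeans through Lance's own machinery rather than this simplified Lloyd loop.

```rust
use std::thread;

type Vector = Vec<f32>;

fn squared_l2(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| (x - y) * (x - y)).sum()
}

/// Index of the centroid closest to `v`.
fn nearest(centroids: &[Vector], v: &[f32]) -> usize {
    let mut best = 0;
    for (i, c) in centroids.iter().enumerate() {
        if squared_l2(c, v) < squared_l2(&centroids[best], v) {
            best = i;
        }
    }
    best
}

/// Toy 2-means (Lloyd iterations) over a small, non-empty sample.
fn train_two_centroids(sample: &[Vector], iters: usize) -> [Vector; 2] {
    // Seed with the first vector and the vector farthest from it.
    let first = sample[0].clone();
    let far = sample
        .iter()
        .max_by(|a, b| squared_l2(a, &first).total_cmp(&squared_l2(b, &first)))
        .unwrap()
        .clone();
    let mut centroids = [first, far];
    for _ in 0..iters {
        let dim = centroids[0].len();
        let mut sums = [vec![0.0f32; dim], vec![0.0f32; dim]];
        let mut counts = [0usize; 2];
        for v in sample {
            let k = nearest(&centroids, v);
            counts[k] += 1;
            for (s, x) in sums[k].iter_mut().zip(v) {
                *s += x;
            }
        }
        for k in 0..2 {
            if counts[k] > 0 {
                centroids[k] = sums[k].iter().map(|s| s / counts[k] as f32).collect();
            }
        }
    }
    centroids
}

/// Phase 1: train in parallel, since each job only holds a small sample.
fn train_splits_in_parallel(samples: &[Vec<Vector>]) -> Vec<[Vector; 2]> {
    thread::scope(|scope| {
        let handles: Vec<_> = samples
            .iter()
            .map(|s| scope.spawn(move || train_two_centroids(s, 10)))
            .collect();
        handles
            .into_iter()
            .map(|h| h.join().expect("training thread panicked"))
            .collect()
    })
}

/// Phase 2: assign sequentially so only one full partition is resident at a time.
/// Returns, per partition, how many rows went to each of its two new centroids.
fn assign_sequentially<F>(
    num_partitions: usize,
    load_full: F,
    centroids: &[[Vector; 2]],
) -> Vec<[usize; 2]>
where
    F: Fn(usize) -> Vec<Vector>,
{
    let mut out = Vec::with_capacity(num_partitions);
    for p in 0..num_partitions {
        let full = load_full(p);
        let mut counts = [0usize; 2];
        for v in &full {
            counts[nearest(&centroids[p], v)] += 1;
        }
        out.push(counts);
        // `full` is dropped here, before the next partition is loaded.
    }
    out
}
```

The design point is that parallelism is spent where memory is cheap (samples) and withheld where it is expensive (full partitions).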

Previously, splitting oversized IVF partitions during index optimization loaded all raw vectors for every split partition and their neighbors into memory simultaneously (~11.5 GB for 30 partitions at 3072 dims).

This refactors the split path to reuse the existing streaming shuffle infrastructure: train new centroids from samples, then stream affected partition vectors through the IVF+quantizer transform pipeline into temp files on disk. Peak memory drops from O(all split vectors) to O(one batch).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
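As a rough illustration of why streaming bounds peak memory at one batch, the sketch below routes each incoming batch to per-partition sinks and then drops it. It is a simplification under stated assumptions: the quantizer step is omitted, the sinks are any `std::io::Write` (temp files in practice, in-memory buffers in a test), and `stream_shuffle` and its helpers are hypothetical names rather than the PR's functions.

```rust
use std::io::{self, Write};

fn dist2(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| (x - y) * (x - y)).sum()
}

/// Index of the centroid closest to `v`.
fn nearest_centroid(centroids: &[Vec<f32>], v: &[f32]) -> usize {
    let mut best = 0;
    for (i, c) in centroids.iter().enumerate() {
        if dist2(c, v) < dist2(&centroids[best], v) {
            best = i;
        }
    }
    best
}

/// Stream batches through partition assignment, appending each vector
/// (as little-endian f32 bytes) to the sink of its nearest centroid.
/// Only the current batch is held in memory; returns rows written per partition.
fn stream_shuffle<I, W>(
    batches: I,
    centroids: &[Vec<f32>],
    sinks: &mut [W],
) -> io::Result<Vec<usize>>
where
    I: IntoIterator<Item = Vec<Vec<f32>>>,
    W: Write,
{
    assert_eq!(centroids.len(), sinks.len(), "one sink per centroid");
    let mut rows_per_partition = vec![0usize; centroids.len()];
    for batch in batches {
        for v in &batch {
            let p = nearest_centroid(centroids, v);
            for x in v {
                sinks[p].write_all(&x.to_le_bytes())?;
            }
            rows_per_partition[p] += 1;
        }
        // `batch` is dropped here, so memory stays at O(one batch).
    }
    for sink in sinks.iter_mut() {
        sink.flush()?;
    }
    Ok(rows_per_partition)
}
```

Because `batches` is consumed as an iterator, the caller decides the batch size, and that choice alone sets the memory ceiling of the shuffle.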

- Simplify three-way partition routing by extracting split_reader once
- Remove dead AssignOp::Remove variant and simplify build_assign_batch
- Add Debug impl for PartitionAdjustment
- Add SPLIT_SAMPLE_SIZE constant for kmeans training sample size
- Include partition index in "centroid not found" error message

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

codecov · 2026-05-05T20:13:30Z

Codecov Report

❌ Patch coverage is 72.52396% with 86 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| rust/lance/src/index/vector/builder.rs | 73.44% | 55 Missing and 22 partials ⚠️ |
| rust/lance/src/index/vector/ivf.rs | 50.00% | 5 Missing ⚠️ |
| rust/lance-index/src/optimize.rs | 69.23% | 4 Missing ⚠️ |


Extract `apply_centroid_splits` from `compute_split_centroids` to make the centroid ordering logic directly testable. Add a unit test verifying that K simultaneous splits on N partitions produce N+K centroids with unchanged partitions at their original indices and centroid2s appended in split order.

Replaces the removed `finalize_split_plans_reassigns_filtered_centroid_ids` test. The other two removed tests' properties are now covered structurally (global nearest-centroid assignment) and by existing integration tests (`test_split_multiple_partitions_in_one_optimize`, `test_partition_split_on_append_multivec`).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
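The ordering property that the new unit test checks can be sketched like this. The function below is a hypothetical reconstruction, not the PR's `apply_centroid_splits`: its signature and the `SplitSketch` type are invented for illustration, and placing centroid1 in the split partition's original slot is an assumption that the commit message does not state explicitly.

```rust
/// Hypothetical description of one split: which partition, and its two new centroids.
struct SplitSketch {
    partition: usize,
    centroid1: Vec<f32>,
    centroid2: Vec<f32>,
}

/// Apply K splits to N centroids, producing N+K centroids.
/// Unchanged partitions keep their original indices; each `centroid2`
/// is appended at the end, in the order the splits are given.
fn apply_centroid_splits_sketch(
    centroids: &[Vec<f32>],
    splits: &[SplitSketch],
) -> Vec<Vec<f32>> {
    let mut out = centroids.to_vec();
    for split in splits {
        // Assumption: centroid1 takes over the split partition's slot.
        out[split.partition] = split.centroid1.clone();
    }
    for split in splits {
        out.push(split.centroid2.clone());
    }
    out
}
```

Appending rather than inserting keeps every existing partition ID stable, so rows in partitions that were not split need no re-labelling.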

github-actions Bot added the performance label May 5, 2026

wjones127 added the indexes Related to secondary index implementations label May 5, 2026

wjones127 and others added 5 commits May 5, 2026 12:38

lint

efc816e

wjones127 force-pushed the 6378-split-index-ram branch from 57cd64b to efc816e May 5, 2026 19:42

wjones127 commented May 5, 2026

Comment thread rust/lance/src/index/vector/builder.rs Outdated

wjones127 and others added 2 commits May 5, 2026 15:03

improve progress

dc39f19

wjones127 marked this pull request as ready for review May 5, 2026 23:52

claude Bot reviewed May 5, 2026

wjones127 wants to merge 7 commits into main from 6378-split-index-ram
