feat: add scanner.plan_splits function #5792
Open
+466
−1
This PR adds a `plan_splits` function to the `Scanner` struct. It is intended to serve as a single entry point that distributed compute frameworks can use to partition a Lance dataset for parallel processing. The main goals are:
(1) Prune fragments that do not satisfy the filter, if one exists: an index lookup determines which fragments contain matching rows, and fragments without any are pruned.
(2) Bin-pack fragments into splits: Distributed compute frameworks typically work best with a "sweet-spot" partition size. Within Lance, this means a partition should typically contain multiple fragments. We expose a user-configurable strategy, either a max row count or a max split size, and estimate row sizes from the schema to determine the size of each resulting split.
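To make the two goals concrete, here is a minimal sketch of prune-then-pack. It is not the code in this PR: `FragmentInfo`, `SplitStrategy`, `plan_splits_sketch`, and the `est_row_bytes` parameter are hypothetical names chosen for illustration, the index lookup is stood in for by a precomputed set of matching fragment IDs, and the packing is a simple greedy pass in fragment order rather than whatever the real implementation uses.

```rust
use std::collections::HashSet;

/// Hypothetical fragment summary; not Lance's real fragment type.
#[derive(Debug, Clone, PartialEq)]
pub struct FragmentInfo {
    pub id: u64,
    pub num_rows: u64,
}

/// Hypothetical sizing strategy: cap a split by rows or by estimated bytes.
#[derive(Debug, Clone, Copy)]
pub enum SplitStrategy {
    MaxRowCount(u64),
    MaxSizeBytes(u64),
}

/// Prune fragments, then greedily pack the survivors into splits.
///
/// `matching` stands in for the index lookup: when present, only fragments
/// whose ID is in the set are kept. `est_row_bytes` stands in for the
/// schema-based row size estimate.
pub fn plan_splits_sketch(
    fragments: &[FragmentInfo],
    matching: Option<&HashSet<u64>>,
    strategy: SplitStrategy,
    est_row_bytes: u64,
) -> Vec<Vec<u64>> {
    // Convert either strategy into a per-split row budget (at least 1).
    let budget = match strategy {
        SplitStrategy::MaxRowCount(rows) => rows,
        SplitStrategy::MaxSizeBytes(bytes) => bytes / est_row_bytes.max(1),
    };
    let max_rows = budget.max(1);

    let mut splits: Vec<Vec<u64>> = Vec::new();
    let mut current: Vec<u64> = Vec::new();
    let mut current_rows: u64 = 0;

    for frag in fragments {
        // Goal (1): skip fragments the filter cannot match.
        if let Some(ids) = matching {
            if !ids.contains(&frag.id) {
                continue;
            }
        }
        // Goal (2): close the current split if this fragment would overflow it.
        // A fragment larger than the budget still gets a split of its own.
        if !current.is_empty() && current_rows.saturating_add(frag.num_rows) > max_rows {
            splits.push(std::mem::take(&mut current));
            current_rows = 0;
        }
        current.push(frag.id);
        current_rows = current_rows.saturating_add(frag.num_rows);
    }
    if !current.is_empty() {
        splits.push(current);
    }
    splits
}
```

Each inner vector is the list of fragment IDs a single worker would scan. Fragments are never subdivided in this sketch, so a split can exceed the budget when a single fragment is larger than it.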