
Conversation


@hamersaw hamersaw commented Jan 22, 2026

This PR adds a plan_splits function to the Scanner struct. The goal is for this to serve as a single endpoint where distributed compute frameworks can effectively partition a Lance dataset for parallelized processing. The main goals are:
(1) Prune fragments that do not satisfy a filter (if one exists): we use an index lookup to determine which fragments contain matching rows (and which do not), so unnecessary fragments can be pruned.
(2) Bin-pack fragments into splits: distributed compute frameworks typically work best with a "sweet-spot" partition size. Within Lance, this means a partition should typically contain multiple fragments. We expose a user-configurable strategy, namely a max row count or a max split size, and estimate row sizes from the schema to determine the size of the resulting split. A rough usage sketch follows.
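To make the intended workflow concrete, here is a rough usage sketch from the Python side. The keyword arguments and the return shape (assumed here to be a list of fragment-id lists) are illustrative, not the final API:

import lance

ds = lance.dataset("s3://bucket/my_dataset.lance")
scanner = ds.scanner(filter="category = 'books'")

# Assumed return shape: each split is a list of fragment ids sized to the target.
splits = scanner.plan_splits(max_split_size_bytes=128 * 1024 * 1024)

# A distributed driver would hand one split to each task; a worker then scans
# only its assigned fragments.
for split in splits:
    fragments = [ds.get_fragment(fid) for fid in split]
    reader = ds.scanner(fragments=fragments, filter="category = 'books'").to_reader()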

@github-actions github-actions bot added enhancement New feature or request python labels Jan 22, 2026
@hamersaw hamersaw (Author) commented
Before I plumb this through to the Lance Spark connector, I just wanted to get some input from interested parties:

@majin1102 / @fangbo: I know you expressed interest in a solution in this thread. This does currently work with zone maps. @fangbo, you'll recognize a large chunk of code from your PR - thanks!

@Jay-ju: IIUC, your PR here is targeted at estimating row counts to achieve similar ends. I really like the idea of index hinting, as in my testing the filtering index choices were not always what I expected them to be.

        return self._scanner.analyze_plan()

    def plan_splits(
        self, max_split_size_bytes: Optional[int] = None
@hamersaw hamersaw (Author) commented

Will need to update this to include both max_split_size_bytes and max_row_count options, with one trumping the other if both are provided. I'm interested in whether people think this paradigm is useful. My intuition is that, since we estimate row sizes from the schema, we could be VERY wrong (we just use 64B for anything that is not a known fixed size - a string / blob could be anywhere from 1B to 1M+). In those scenarios a user will know their data better and can use max_row_count to target a partition size. So basically, hopefully for most use cases the estimate is close and works well, but there are knobs to fine-tune the other cases. A rough sketch of the idea is below.
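For reference, here is a rough Python sketch of the estimation and bin-packing idea described here. The real logic in this PR lives in Rust; the 64B fallback, the two knobs, and the precedence rule are illustrative and still open to feedback:

import pyarrow as pa

FALLBACK_WIDTH_BYTES = 64  # assumed width for variable-size types (string, blob, ...)

def estimate_row_size(schema: pa.Schema) -> int:
    size = 0
    for field in schema:
        try:
            size += field.type.bit_width // 8  # fixed-width types are exact
        except ValueError:
            size += FALLBACK_WIDTH_BYTES  # unknown width: this guess can be very wrong
    return size

def plan_splits(fragments, schema, max_split_size_bytes=None, max_row_count=None):
    # fragments: iterable of (fragment_id, physical_row_count) pairs.
    # Either limit closes the current split; which knob wins when both are set
    # is the open question raised above.
    row_size = estimate_row_size(schema)
    splits, current, current_rows = [], [], 0
    for frag_id, num_rows in fragments:
        over_rows = max_row_count is not None and current_rows + num_rows > max_row_count
        over_bytes = (
            max_split_size_bytes is not None
            and (current_rows + num_rows) * row_size > max_split_size_bytes
        )
        if current and (over_rows or over_bytes):
            splits.append(current)
            current, current_rows = [], 0
        current.append(frag_id)
        current_rows += num_rows
    if current:
        splits.append(current)
    return splits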

@hamersaw hamersaw marked this pull request as ready for review January 23, 2026 15:54