
Conversation


@hamersaw hamersaw commented Jan 22, 2026

This PR adds a plan_splits function to the Scanner struct. The goal is for this to serve as a single endpoint where distributed compute frameworks can effectively partition a Lance dataset for parallelized processing. The main goals are:
(1) Prune fragments that do not satisfy a filter (if one exists): we use an index lookup to determine which fragments contain matching rows (and which do not), so unnecessary fragments can be pruned.
(2) Bin-pack fragments into splits: distributed compute frameworks typically work best with a "sweet-spot" partition size. Within Lance, this means a partition should typically contain multiple fragments. We expose a user-configurable strategy, namely a max row count or a max split size, and estimate row sizes from the schema to determine the size of the resulting split. A rough usage sketch follows.
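To make the intended workflow concrete, here is a rough usage sketch from the Python side. The keyword arguments and the return shape (assumed here to be a list of fragment-id lists) are illustrative, not the final API:

import lance

ds = lance.dataset("s3://bucket/my_dataset.lance")
scanner = ds.scanner(filter="category = 'books'")

# Assumed return shape: each split is a list of fragment ids sized to the target.
splits = scanner.plan_splits(max_split_size_bytes=128 * 1024 * 1024)

# A distributed driver would hand one split to each task; a worker then scans
# only its assigned fragments.
for split in splits:
    fragments = [ds.get_fragment(fid) for fid in split]
    reader = ds.scanner(fragments=fragments, filter="category = 'books'").to_reader()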

@github-actions github-actions bot added enhancement New feature or request python labels Jan 22, 2026
@hamersaw hamersaw (Author) commented
Before I plumb this through to the Lance Spark connector, I just wanted to get some input from interested parties:

@majin1102 / @fangbo: I know you expressed interest in a solution in this thread. This does currently work with zone maps. @fangbo, you'll recognize a large chunk of code from your PR - thanks!

@Jay-ju: IIUC, your PR here is targeted at estimating row counts to achieve similar ends. I really like the idea of index hinting, as in my testing the filtering index choices were not always what I expected them to be.

        return self._scanner.analyze_plan()

    def plan_splits(
        self, max_split_size_bytes: Optional[int] = None
@hamersaw hamersaw (Author) commented

Will need to update this to include both max_split_size_bytes and max_row_count options, with one trumping the other if both are provided. I'm interested in whether people think this paradigm is useful. My intuition is that, since we estimate row sizes from the schema, we could be VERY wrong (we just use 64B for anything that is not a known fixed size - a string / blob could be anywhere from 1B to 1M+). In those scenarios a user will know their data better and can use max_row_count to target a partition size. So basically, hopefully for most use cases the estimate is close and works well, but there are knobs to fine-tune the other cases. A rough sketch of the idea is below.
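For reference, here is a rough Python sketch of the estimation and bin-packing idea described here. The real logic in this PR lives in Rust; the 64B fallback, the two knobs, and the precedence rule are illustrative and still open to feedback:

import pyarrow as pa

FALLBACK_WIDTH_BYTES = 64  # assumed width for variable-size types (string, blob, ...)

def estimate_row_size(schema: pa.Schema) -> int:
    size = 0
    for field in schema:
        try:
            size += field.type.bit_width // 8  # fixed-width types are exact
        except ValueError:
            size += FALLBACK_WIDTH_BYTES  # unknown width: this guess can be very wrong
    return size

def plan_splits(fragments, schema, max_split_size_bytes=None, max_row_count=None):
    # fragments: iterable of (fragment_id, physical_row_count) pairs.
    # Either limit closes the current split; which knob wins when both are set
    # is the open question raised above.
    row_size = estimate_row_size(schema)
    splits, current, current_rows = [], [], 0
    for frag_id, num_rows in fragments:
        over_rows = max_row_count is not None and current_rows + num_rows > max_row_count
        over_bytes = (
            max_split_size_bytes is not None
            and (current_rows + num_rows) * row_size > max_split_size_bytes
        )
        if current and (over_rows or over_bytes):
            splits.append(current)
            current, current_rows = [], 0
        current.append(frag_id)
        current_rows += num_rows
    if current:
        splits.append(current)
    return splits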

@hamersaw hamersaw marked this pull request as ready for review January 23, 2026 15:54