[spark] Add scan.max.records.per.partition config to split log table input partitions#3260

Open
Yohahaha wants to merge 1 commit into apache:main from Yohahaha:spark-split-partition

Conversation

Contributor

@Yohahaha Yohahaha commented May 7, 2026

Purpose

Linked issue: close #3215

Brief change log

  • Introduce the scan.max.records.per.partition config option for Spark log table reads. When set, each
    Fluss bucket whose offset range exceeds this value is split into multiple Spark input partitions,
    improving read parallelism for large offset ranges.
  • Update BucketOffsetsRetrieverImpl to support fetching real earliest offsets when needed.
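The splitting behavior described above can be sketched as follows. This is a minimal, hypothetical illustration of the idea, not the actual Fluss implementation: the class and method names (`PartitionSplitter`, `split`, `OffsetRange`) are invented for this example, and the real connector works in terms of Spark's DataSource V2 input partitions.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: split one bucket's offset range [startOffset, stopOffset)
// into chunks of at most maxRecordsPerPartition records each, so that each chunk
// can back its own Spark input partition.
public class PartitionSplitter {

    // One resulting Spark input partition, covering [startOffset, stopOffset).
    public record OffsetRange(long startOffset, long stopOffset) {}

    public static List<OffsetRange> split(
            long startOffset, long stopOffset, long maxRecordsPerPartition) {
        List<OffsetRange> ranges = new ArrayList<>();
        for (long s = startOffset; s < stopOffset; s += maxRecordsPerPartition) {
            // The last chunk may be shorter than maxRecordsPerPartition.
            ranges.add(new OffsetRange(s, Math.min(s + maxRecordsPerPartition, stopOffset)));
        }
        return ranges;
    }
}
```

For example, a bucket with offsets 0..10 and a limit of 4 records per partition would yield three input partitions: [0, 4), [4, 8), and [8, 10). When the option is unset, the bucket maps to a single partition as before.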

Tests

SparkLogTableReadTest: "Spark Read: split partition by config"

API and Format

Documentation

@Yohahaha Yohahaha marked this pull request as ready for review May 7, 2026 02:52
Contributor Author

Yohahaha commented May 7, 2026

@YannByron

Contributor Author

Yohahaha commented May 7, 2026

@luoyuxia @fresh-borzoni PTAL!



Development

Successfully merging this pull request may close these issues.

[spark] Add config to split input partition by input size
