Parquet: Add opt-in uncompressed row group size tracking #16327

Open

nssalian wants to merge 1 commit into apache:main from nssalian:fix-parquet-row-group-size

Conversation

nssalian (Contributor) commented May 14, 2026

Closes: #16325 (Parquet: Row group size limit not enforced when using GZIP or ZSTD compression)

Rationale for this Change

Adds write.parquet.row-group-size-check-uncompressed (default false) so that write.parquet.row-group-size-bytes is enforced accurately when a compression codec (GZIP, ZSTD, etc.) is in use.

ParquetWriter.checkSize() uses writeStore.getBufferedSize(), which reports compressed bytes for already-flushed pages. With an effective codec the writer never sees the target exceeded, because it compares compressed data against an uncompressed limit; for example, at a 10:1 compression ratio a 128 MB target lets roughly 1.28 GB of uncompressed data accumulate before the check fires, so row groups grow far beyond the configured limit.
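A minimal sketch of the failure mode (paraphrased, not exact Iceberg source; targetRowGroupSize and flushRowGroup() are illustrative names):

// Paraphrased sketch of the pre-existing size check.
// getBufferedSize() counts flushed pages at their compressed size, so the
// comparison against the uncompressed target fires far too late.
private void checkSize() {
  if (writeStore.getBufferedSize() >= targetRowGroupSize) {
    flushRowGroup();
  }
}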

What changes are included in this PR?

When write.parquet.row-group-size-check-uncompressed=true (see the sketch after this list):

  1. Measures getBufferedSize() before and after model.write() per record. Between these points, data is in uncompressed column buffers (no page flush occurs during model.write()). The delta is the exact uncompressed record size.
  2. Accumulates into rowGroupUncompressedSize. Flushes when it hits the target.
  3. Removes the 100-record minimum check interval floor for the uncompressed path.

Disabled by default.
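A sketch of that per-record measurement, assuming the names used in this description (model.write's exact signature and flushRowGroup() are illustrative, not exact Iceberg internals):

// Per-record measurement on the opt-in path (names follow the PR
// description; exact Iceberg internals may differ).
long before = writeStore.getBufferedSize();
model.write(0, record); // fills uncompressed column buffers; no page flush here
rowGroupUncompressedSize += writeStore.getBufferedSize() - before; // exact uncompressed delta
if (rowGroupUncompressedSize >= targetRowGroupSize) {
  flushRowGroup(); // close the row group at the uncompressed target
  rowGroupUncompressedSize = 0;
}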

When enabled, this adds two getBufferedSize() calls per record. Each call iterates the column writers, adding a few field reads per column. It's the same pattern parquet-mr uses in ColumnWriteStoreBase.sizeCheck().

Are these changes tested?

  • Parameterized test across all codecs (gzip, snappy, zstd, uncompressed)
  • Existing parquet tests pass locally

Are there any user-facing changes?

Yes. A new configuration property, write.parquet.row-group-size-check-uncompressed, which is set to false by default.
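For example, it could be enabled through the standard Iceberg table-properties API (the new property name comes from this PR; the size value here is illustrative):

// Opt in to uncompressed row-group size tracking; the default remains false.
table.updateProperties()
    .set("write.parquet.row-group-size-check-uncompressed", "true")
    .set("write.parquet.row-group-size-bytes", String.valueOf(128L * 1024 * 1024))
    .commit();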

nssalian (Contributor, Author) commented

CC: @pvary @steveloughran @huaxingao PTAL

private void checkSizeDefault() {
Contributor:

I'd give it a clearer name that makes it clear it's the size on the filesystem; "default" just says it's the default option, not what it does

nssalian (Contributor, Author):

Let me think of a better name.


@ParameterizedTest
@ValueSource(strings = {"gzip", "snappy", "zstd", "uncompressed"})
public void testRowGroupSizeEnforcedWhenCompressionEnabled(String codec) throws IOException {
Contributor:

Is there an equivalent test which verifies that, with the default setting, it's the compressed byte count that's used? That's critical for regression testing.
