Parquet: Add opt-in uncompressed row group size tracking #16327

Open

nssalian wants to merge 1 commit into apache:main from nssalian:fix-parquet-row-group-size

Conversation

nssalian (Contributor) commented May 14, 2026

Closes: #16325 (Parquet: Row group size limit not enforced when using GZIP or ZSTD compression)

Rationale for this Change

Adds write.parquet.row-group-size-check-uncompressed (default false) so that write.parquet.row-group-size-bytes is enforced accurately when a compression codec (GZIP, ZSTD, etc.) is in use.

ParquetWriter.checkSize() uses writeStore.getBufferedSize(), which reports compressed bytes for already-flushed pages. With an effective codec the writer never sees the target exceeded, because it compares compressed data against an uncompressed limit; for example, at a 10:1 compression ratio a 128 MB target lets roughly 1.28 GB of uncompressed data accumulate before the check fires, so row groups grow far beyond the configured limit.
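A minimal sketch of the failure mode (paraphrased, not exact Iceberg source; targetRowGroupSize and flushRowGroup() are illustrative names):

// Paraphrased sketch of the pre-existing size check.
// getBufferedSize() counts flushed pages at their compressed size, so the
// comparison against the uncompressed target fires far too late.
private void checkSize() {
  if (writeStore.getBufferedSize() >= targetRowGroupSize) {
    flushRowGroup();
  }
}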

What changes are included in this PR?

When write.parquet.row-group-size-check-uncompressed=true (see the sketch after this list):

  1. Measures getBufferedSize() before and after model.write() per record. Between these points, data is in uncompressed column buffers (no page flush occurs during model.write()). The delta is the exact uncompressed record size.
  2. Accumulates into rowGroupUncompressedSize. Flushes when it hits the target.
  3. Removes the 100-record minimum check interval floor for the uncompressed path.

Disabled by default.
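A sketch of that per-record measurement, assuming the names used in this description (model.write's exact signature and flushRowGroup() are illustrative, not exact Iceberg internals):

// Per-record measurement on the opt-in path (names follow the PR
// description; exact Iceberg internals may differ).
long before = writeStore.getBufferedSize();
model.write(0, record); // fills uncompressed column buffers; no page flush here
rowGroupUncompressedSize += writeStore.getBufferedSize() - before; // exact uncompressed delta
if (rowGroupUncompressedSize >= targetRowGroupSize) {
  flushRowGroup(); // close the row group at the uncompressed target
  rowGroupUncompressedSize = 0;
}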

When enabled, this adds two getBufferedSize() calls per record. Each call iterates the column writers, adding a few field reads per column. It's the same pattern parquet-mr uses in ColumnWriteStoreBase.sizeCheck().

Are these changes tested?

  • Parameterized test across all codecs (gzip, snappy, zstd, uncompressed)
  • Existing parquet tests pass locally

Are there any user-facing changes?

Yes. A new configuration property, write.parquet.row-group-size-check-uncompressed, which is set to false by default.
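For example, it could be enabled through the standard Iceberg table-properties API (the new property name comes from this PR; the size value here is illustrative):

// Opt in to uncompressed row-group size tracking; the default remains false.
table.updateProperties()
    .set("write.parquet.row-group-size-check-uncompressed", "true")
    .set("write.parquet.row-group-size-bytes", String.valueOf(128L * 1024 * 1024))
    .commit();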

nssalian (Contributor, Author) commented

CC: @pvary @steveloughran @huaxingao PTAL

private void checkSizeDefault() {
Contributor:

I'd give it a clearer name that makes it clear it's the size on the filesystem; "default" just says it's the default option, not what it does

nssalian (Contributor, Author):

Let me think of a better name.


@ParameterizedTest
@ValueSource(strings = {"gzip", "snappy", "zstd", "uncompressed"})
public void testRowGroupSizeEnforcedWhenCompressionEnabled(String codec) throws IOException {
Contributor:

Is there an equivalent test which verifies that, with the default setting, it's the compressed byte count that's used? That's critical for regression testing.
