Parquet: Enforce row group size limit with compression #16347
Open
yadavay-amzn wants to merge 1 commit into
When using GZIP or ZSTD compression, the row group size check uses compressed bytes, which are significantly smaller than the configured limit, causing row groups to grow unbounded. Fixes apache#16325.
Fixes #16325.
Problem
When using GZIP or ZSTD compression, the row group size check in `ParquetWriter` uses `writeStore.getBufferedSize()`, which reports compressed bytes after page flushes. Since the compressed size is significantly smaller than the configured `targetRowGroupSize`, the threshold is never reached and row groups grow unbounded.
Fix
Track uncompressed bytes by measuring the `getBufferedSize()` delta before and after each `model.write()` call (before `endRecord()` triggers the page flush and compression). Use this accumulated uncompressed size in `checkSize()` instead of the post-compression buffered size, and reset it on row group flush.
Testing
Added `testRowGroupSizeEnforcedWithCompression` in `TestParquet` -- writes 500 records of ~1KB each with GZIP compression and a 64KB row group target, and asserts that multiple row groups are created.
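The effect of checking compressed versus uncompressed bytes can be illustrated with a small standalone simulation. This is not the actual writer code: `rowGroupCount`, the fixed 10x compression ratio, and the flush logic here are hypothetical stand-ins for `checkSize()` and the write loop, using the same 500 x ~1KB / 64KB-target scenario as the test.

```java
// Toy model of the row-group sizing bug. With compression, the buffered
// (compressed) size stays far below the target, so the flush threshold is
// never hit unless uncompressed bytes are tracked separately.
public class RowGroupSizeCheck {

    /**
     * Simulates writing {@code records} records of {@code recordSize}
     * uncompressed bytes each, with an assumed 10x compression ratio,
     * flushing a row group whenever the size check exceeds
     * {@code targetRowGroupSize}. Returns the number of row groups written.
     */
    public static int rowGroupCount(int records, int recordSize,
                                    long targetRowGroupSize,
                                    boolean trackUncompressed) {
        long compressedBuffered = 0;  // what getBufferedSize() would report
        long uncompressedBytes = 0;   // delta-tracked before compression
        int rowGroups = 0;

        for (int i = 0; i < records; i++) {
            uncompressedBytes += recordSize;
            compressedBuffered += recordSize / 10;  // assumed 10x ratio

            long checked = trackUncompressed ? uncompressedBytes
                                             : compressedBuffered;
            if (checked >= targetRowGroupSize) {
                rowGroups++;               // flush the row group
                compressedBuffered = 0;
                uncompressedBytes = 0;     // reset on flush (the fix)
            }
        }
        if (compressedBuffered > 0) {
            rowGroups++;                   // final partial row group
        }
        return rowGroups;
    }

    public static void main(String[] args) {
        // Mirrors the test scenario: 500 records of ~1KB, 64KB target.
        System.out.println("compressed check:   "
            + rowGroupCount(500, 1024, 64 * 1024, false) + " row group(s)");
        System.out.println("uncompressed check: "
            + rowGroupCount(500, 1024, 64 * 1024, true) + " row group(s)");
    }
}
```

With the compressed-size check the simulated writer produces a single oversized row group, while the uncompressed-size check flushes multiple row groups at roughly the 64KB target, which is the behavior the new test asserts.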