
Parquet: Enforce row group size limit with compression #16347

Open
yadavay-amzn wants to merge 1 commit into apache:main from yadavay-amzn:fix/16325-row-group-size-enforcement

Conversation

@yadavay-amzn
Contributor

Fixes #16325.

Problem

When GZIP or ZSTD compression is enabled, the row group size check in ParquetWriter uses writeStore.getBufferedSize(), which reports compressed bytes once pages have been flushed. Because the compressed size is significantly smaller than the configured targetRowGroupSize, the threshold is never reached and row groups grow unbounded.
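For context, a minimal sketch of the pre-fix check. Only writeStore.getBufferedSize(), targetRowGroupSize, and checkSize() come from the description above; the surrounding ParquetWriter code (including the flushRowGroup helper) is paraphrased, not the actual source:

```java
// Simplified sketch of the size check before this PR.
private void checkSize() {
  // getBufferedSize() reports bytes *after* pages have been flushed and
  // compressed. With GZIP/ZSTD this value stays far below
  // targetRowGroupSize, so the flush below may never fire and the row
  // group grows unbounded.
  if (writeStore.getBufferedSize() >= targetRowGroupSize) {
    flushRowGroup(false); // illustrative flush helper
  }
}
```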

Fix

Track uncompressed bytes by measuring the getBufferedSize() delta before and after each model.write() call, i.e. before endRecord() triggers the page flush and compression. Use this accumulated uncompressed size in checkSize() instead of the post-compression buffered size, and reset it on row group flush.
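A sketch of the delta tracking described above. Apart from writeStore.getBufferedSize(), model.write(), and endRecord(), the names here (uncompressedRowGroupSize, recordCount, flushRowGroup) are illustrative rather than the PR's exact code:

```java
// Accumulated uncompressed bytes for the current row group; reset on flush.
private long uncompressedRowGroupSize = 0L;

@Override
public void add(T value) {
  // Measure the buffered-size delta around model.write(): at this point the
  // new bytes are still uncompressed, because it is endRecord() below that
  // can trigger a page flush (and therefore compression).
  long sizeBeforeWrite = writeStore.getBufferedSize();
  model.write(0, value);
  uncompressedRowGroupSize += writeStore.getBufferedSize() - sizeBeforeWrite;
  writeStore.endRecord();
  recordCount += 1;
  checkSize();
}

private void checkSize() {
  // Compare the accumulated uncompressed size, not the post-compression
  // buffered size, against the target.
  if (uncompressedRowGroupSize >= targetRowGroupSize) {
    flushRowGroup(false);
    uncompressedRowGroupSize = 0L; // reset on row group flush
  }
}
```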

Testing

Added testRowGroupSizeEnforcedWithCompression in TestParquet -- it writes 500 records of ~1 KB each with GZIP compression and a 64 KB row group target, then asserts that multiple row groups are created (a sketch of the test's shape follows the list below).

  • Without the fix: all 500 records land in a single row group (the compressed size never reaches the threshold)
  • With the fix: multiple row groups are created, respecting the 64 KB target
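A rough sketch of the test's shape, writing through Iceberg's Parquet.write builder and counting row groups with parquet-mr's ParquetFileReader. The schema, the recordWithPayload helper, and the exact builder options are assumptions, not the PR's verbatim test:

```java
import static org.apache.iceberg.types.Types.NestedField.required;

import java.io.File;
import java.io.IOException;
import org.apache.avro.generic.GenericData;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.iceberg.Files;
import org.apache.iceberg.Schema;
import org.apache.iceberg.TableProperties;
import org.apache.iceberg.io.FileAppender;
import org.apache.iceberg.parquet.Parquet;
import org.apache.iceberg.types.Types;
import org.apache.parquet.hadoop.ParquetFileReader;
import org.apache.parquet.hadoop.util.HadoopInputFile;
import org.junit.Assert;
import org.junit.Test;

public class TestRowGroupSizeSketch {
  @Test
  public void testRowGroupSizeEnforcedWithCompression() throws IOException {
    Schema schema = new Schema(
        required(1, "id", Types.IntegerType.get()),
        required(2, "payload", Types.StringType.get()));

    File file = File.createTempFile("row-group-size", ".parquet");
    Assert.assertTrue(file.delete());

    // 500 records of ~1 KB each, GZIP compression, 64 KB row group target.
    try (FileAppender<GenericData.Record> writer =
        Parquet.write(Files.localOutput(file))
            .schema(schema)
            .named("test")
            .set(TableProperties.PARQUET_COMPRESSION, "gzip")
            .set(TableProperties.PARQUET_ROW_GROUP_SIZE_BYTES, "65536")
            .build()) {
      for (int i = 0; i < 500; i++) {
        writer.add(recordWithPayload(schema, i, 1024)); // hypothetical ~1 KB record helper
      }
    }

    // Read the footer back and assert the 64 KB target produced multiple groups.
    try (ParquetFileReader reader = ParquetFileReader.open(
        HadoopInputFile.fromPath(new Path(file.toURI()), new Configuration()))) {
      Assert.assertTrue("expected more than one row group",
          reader.getRowGroups().size() > 1);
    }
  }
}
```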

When using GZIP or ZSTD compression, the row group size check
uses compressed bytes which are significantly smaller than the
configured limit, causing row groups to grow unbounded.

Fixes apache#16325
