Core: Implement LZ4 frame compression for Puffin #16348

Open: wombatu-kun wants to merge 1 commit into apache:main from wombatu-kun:core-puffin-lz4-frame

@wombatu-kun

Summary

Implements the LZ4 codec for Puffin, replacing the long-standing TODOs in PuffinFormat.compress / PuffinFormat.decompress that pointed at airlift/aircompressor#142.

Motivation

Puffin declared lz4 as a valid codec (used unconditionally for footer compression via Puffin.write(...).compressFooter()), but the implementation threw UnsupportedOperationException("Unsupported codec: LZ4"). The referenced aircompressor PR #142 never made it into the version Iceberg ships (io.airlift:aircompressor:2.0.3), which provides only raw LZ4 and Hadoop streams, not the standard LZ4 frame format the Puffin spec requires. As a result, footer compression was unusable and lz4 blob compression was unreachable.

Implementation

LZ4 frame support is provided by net.jpountz.lz4 (shipped as at.yawk.lz4:lz4-java, already pinned in this repo via a CVE resolutionStrategy substitution). It is promoted from a transitive-only dependency to a direct implementation dependency of iceberg-core.

  • compress: LZ4FrameOutputStream with BLOCKSIZE.SIZE_4MB, the known content length, and FLG.Bits.CONTENT_SIZE + FLG.Bits.BLOCK_INDEPENDENCE.
  • decompress: LZ4FrameInputStream drained via Guava ByteStreams.

This conforms to the Puffin spec: "Single LZ4 compression frame, with content size present". Content size is encoded in the frame descriptor. BLOCK_INDEPENDENCE is required by lz4-java (it only supports independent blocks) and is orthogonal to the spec — it is also the reference lz4 CLI default. aircompressor is retained for ZSTD.
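
For reviewers who want to check the "content size present" requirement against the fixture bytes, here is a minimal sketch that reads the start of an LZ4 frame using only the JDK. It is not code from this PR: the class name Lz4FrameHeaderSketch and its parse helper are hypothetical, the bit layout is taken from the public LZ4 frame format description, and the sketch does not verify the header checksum or decode any blocks.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical helper, not part of Iceberg or lz4-java.
final class Lz4FrameHeaderSketch {
  // LZ4 frame magic number; stored little-endian on disk as 04 22 4D 18.
  static final int MAGIC = 0x184D2204;

  final boolean blockIndependent;
  final boolean hasContentSize;
  final long contentSize; // -1 when the frame does not declare a content size
  final int blockMaxSizeCode; // 4 = 64 KB, 5 = 256 KB, 6 = 1 MB, 7 = 4 MB

  private Lz4FrameHeaderSketch(
      boolean blockIndependent, boolean hasContentSize, long contentSize, int blockMaxSizeCode) {
    this.blockIndependent = blockIndependent;
    this.hasContentSize = hasContentSize;
    this.contentSize = contentSize;
    this.blockMaxSizeCode = blockMaxSizeCode;
  }

  static Lz4FrameHeaderSketch parse(byte[] frame) {
    if (frame.length < 6) {
      throw new IllegalArgumentException("Too short to hold magic, FLG and BD bytes");
    }
    ByteBuffer buf = ByteBuffer.wrap(frame).order(ByteOrder.LITTLE_ENDIAN);
    if (buf.getInt() != MAGIC) {
      throw new IllegalArgumentException("Not an LZ4 frame");
    }
    int flg = buf.get() & 0xFF;
    int bd = buf.get() & 0xFF;
    if ((flg >>> 6) != 1) {
      throw new IllegalArgumentException("Unsupported frame version");
    }
    boolean blockIndependent = (flg & 0x20) != 0; // FLG bit 5
    boolean hasContentSize = (flg & 0x08) != 0; // FLG bit 3
    int blockMaxSizeCode = (bd >>> 4) & 0x07; // BD bits 6-4
    long contentSize = -1L;
    if (hasContentSize) {
      if (buf.remaining() < 8) {
        throw new IllegalArgumentException("Truncated content size field");
      }
      contentSize = buf.getLong(); // 8 bytes, little-endian
    }
    return new Lz4FrameHeaderSketch(
        blockIndependent, hasContentSize, contentSize, blockMaxSizeCode);
  }

  public static void main(String[] args) {
    // Hand-built header: magic, FLG = version 01 | block independence | content size,
    // BD = 4 MB blocks, content size = 1234.
    byte[] header = {0x04, 0x22, 0x4D, 0x18, 0x68, 0x70, (byte) 0xD2, 0x04, 0, 0, 0, 0, 0, 0};
    Lz4FrameHeaderSketch h = parse(header);
    System.out.println("contentSizePresent=" + h.hasContentSize + " contentSize=" + h.contentSize);
  }
}
```

Pointing parse at the first bytes of a compressed footer or blob payload should report hasContentSize as true if the writer behaves as described above.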

Tests

  • TestPuffinWriter.testEmptyFooterCompressed converted from a negative test (asserting the UnsupportedOperationException) to a positive round-trip + byte-fixture test.
  • Added testWriteMetricDataCompressedLz4 / testReadMetricDataCompressedLz4 and testValidateLz4FooterSizeValue, mirroring the existing ZSTD coverage, against two new committed fixtures (empty-puffin-compressed-footer.bin, sample-metric-data-compressed-lz4.bin).
  • Added codec-level round-trip + empty-input tests in TestPuffinFormat, parameterized over NONE / LZ4 / ZSTD.

Verified locally: :iceberg-core:build -x integrationTest green; checkRuntimeDeps green for the spark-4.1 / flink-2.1 / kafka-connect bundles.

Runtime deps & LICENSE

Making lz4-java a direct dependency of iceberg-core propagates it onto the runtime classpath of every shaded runtime bundle that ships iceberg-core. Accordingly:

  • runtime-deps.txt baselines updated for the affected bundles (spark v3.4/v3.5/v4.0/v4.1, flink v1.20/v2.0/v2.1, kafka-connect-runtime). Only the single new at.yawk.lz4:lz4-java line was added; unrelated patch-level baseline drift was intentionally left out.
  • Bundle LICENSE files updated with a "This product bundles lz4-java" stanza, mirroring the existing Airlift Aircompressor precedent. lz4-java ships no NOTICE file, so NOTICE was not modified.

Open item for maintainers: please sanity-check the LICENSE attribution wording / project URL for the at.yawk.lz4 fork against ASF policy — this is the documented manual step in runtime-deps.gradle.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>