[SPARK-54571][CORE][SQL] Use LZ4 safeDecompressor to mitigate perf regression #53454
pan3793 wants to merge 5 commits into apache:master
Conversation
the test failure is caused by
cc @dbtsai @huaxingao, I checked all versions available in Maven Central; all of them have the same issue.
Update: contacted the DB2 JDBC driver's author; a new release that bundles the latest lz4-java is a work in progress. Update: the DB2 team provides a special JDBC driver 12.1.3.0_special_74723 that bundles lz4-java 1.10.1, which addresses the issue.
pom.xml (Outdated)

```diff
     <postgresql.version>42.7.7</postgresql.version>
-    <db2.jcc.version>11.5.9.0</db2.jcc.version>
+    <!-- A special version that bundles lz4-java 1.10.1 -->
+    <db2.jcc.version>12.1.3.0_special_74723</db2.jcc.version>
```
This dependency is only used for testing.
Can we upgrade it first in a separate PR?
okay, let me upgrade it first
@LuciferYang FYI, I opened SPARK-55136 (#53920) for it
@viirya can you take a look? Thanks.
Hmm, so that said, in lz4-java 1.10.1, fastDecompressor is paradoxically much slower than safeDecompressor?

@viirya yeah, it was mentioned in https://sites.google.com/sonatype.com/vulnerabilities/cve-2025-12183
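For background on the two APIs being compared here, `LZ4FastDecompressor` trusts a caller-supplied decompressed length, while `LZ4SafeDecompressor` is additionally bounded by the compressed input length, so malformed input cannot make it read past the source buffer. A minimal round-trip sketch against the lz4-java API (illustrative only, not Spark code; the class name is made up):

```java
import java.nio.charset.StandardCharsets;

import net.jpountz.lz4.LZ4Compressor;
import net.jpountz.lz4.LZ4Factory;
import net.jpountz.lz4.LZ4SafeDecompressor;

public class SafeDecompressDemo {
    public static void main(String[] args) {
        byte[] data = "hello lz4-java safeDecompressor".getBytes(StandardCharsets.UTF_8);

        LZ4Factory factory = LZ4Factory.fastestInstance();
        LZ4Compressor compressor = factory.fastCompressor();
        byte[] compressed = compressor.compress(data);

        // safeDecompressor takes the *compressed* length, so decompression
        // is bounded by the actual input rather than a trusted length hint.
        LZ4SafeDecompressor safe = factory.safeDecompressor();
        byte[] restored = new byte[data.length];
        int n = safe.decompress(compressed, 0, compressed.length, restored, 0);

        System.out.println(n == data.length
                && new String(restored, StandardCharsets.UTF_8)
                        .equals("hello lz4-java safeDecompressor"));
    }
}
```

Requires `org.lz4:lz4-java` on the classpath; the round trip should print `true`.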
```scala
LZ4BlockInputStream.newBuilder()
  .withDecompressor(lz4Factory.safeDecompressor())
  .withChecksum(xxHashFactory.newStreamingHash32(defaultSeed).asChecksum)
  .withStopOnEmptyBlock(disableConcatenationOfByteStream)
```
Is this `.withStopOnEmptyBlock(disableConcatenationOfByteStream)` a correct mapping to the previous behavior?

Yes, exactly; you can check the source code:
```java
public LZ4BlockInputStream(InputStream in, LZ4FastDecompressor fastDecompressor, Checksum checksum, boolean stopOnEmptyBlock) {
  this(in, fastDecompressor, null, checksum, stopOnEmptyBlock);
}
```
pom.xml (Outdated)

```diff
+    <!-- A special version that bundles lz4-java 1.10.1 -->
+    <db2.jcc.version>12.1.3.0_special_74723</db2.jcc.version>
```
Does it mean we need to update this dependency when upgrading lz4-java?
@pan3793 thanks for the great effort! I was not aware of this PR and opened the other one due to the performance regression. Do we know what the current blockers are for merging this PR?
@Yicong-Huang, you can find all related issues and background in the PR description.
```
Compression:          Best Time(ms)   Avg Time(ms)   Stdev(ms)   Rate(M/s)   Per Row(ns)   Relative
----------------------------------------------------------------------------------------------------
Compression 4 times            2605           2611           8         0.0   651243236.8       1.0X
Compression 4 times            2613           2621          12         0.0   653232793.3       1.0X
```
Although safeDecompressor shows slower performance, it's within a marginal range. There might be some improvements in lz4-java 1.10.3. Given the above, could you update the PR description?
nvm. I forgot that this PR is about Decompression (lz4Factory.safeDecompressor).
BTW, I updated the PR description to include the following because the previous benchmark result was generated based on that.
@dongjoon-hyun, the benchmark results in this PR were generated with lz4-java 1.10.1 a few months ago; I will regenerate them once we resolve the DB2 JDBC driver issue.
What I meant (by the following) was about the result in the current PR description.
For the new result, yes. We need to update it, of course.
I see.
(Force-pushed d9e4e55 to 453e2ca)
After SPARK-55706, we don't need to worry about the DB2 JDBC driver issue. I refreshed the benchmark results; it's now ready for review. @Yicong-Huang @dongjoon-hyun @viirya @dbtsai
Yicong-Huang left a comment:
LGTM! Many thanks for the great effort!
thanks all! merged to master
… CVE‐2025‐12183 and CVE-2025-66566

### What changes were proposed in this pull request?

- Bump lz4-java version from 1.8.0 to 1.10.4 to resolve CVE‐2025‐12183 and CVE-2025-66566.
- `Lz4Decompressor` follows the suggestion (apache/spark#53290 (comment)) to move from `fastDecompressor` to `safeDecompressor` to mitigate the performance regression.

Backport:
- apache/spark#53327
- apache/spark#53347
- apache/spark#53971
- apache/spark#53454
- apache/spark#54585

### Why are the changes needed?

- [CVE‐2025‐12183](https://sites.google.com/sonatype.com/vulnerabilities/cve-2025-12183): Various lz4-java compression and decompression implementations do not guard against out-of-bounds memory access. Untrusted input may lead to denial of service and information disclosure. Vulnerable Maven coordinates: org.lz4:lz4-java up to and including 1.8.0.
- [CVE-2025-66566](GHSA-cmp6-m4wj-q63q): Insufficient clearing of the output buffer in Java-based decompressor implementations in lz4-java 1.10.0 and earlier allows remote attackers to read previous buffer contents via crafted compressed input. In applications where the output buffer is reused without being cleared, this may lead to disclosure of sensitive data. JNI-based implementations are not affected.

Therefore, lz4-java should be upgraded to 1.10.4.

### Does this PR resolve a correctness bug?

No.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

CI.

Closes #3555 from SteNicholas/CELEBORN-2218.

Lead-authored-by: SteNicholas <programgeek@163.com>
Co-authored-by: Cheng Pan <chengpan@apache.org>
Signed-off-by: SteNicholas <programgeek@163.com>
What changes were proposed in this pull request?
Previously, lz4-java was upgraded to 1.10.x to address CVEs:
- Bump lz4-java to 1.10.0 (#53327)
- Bump lz4-java to 1.10.1 (#53347)

While this causes a significant performance drop (see the benchmark report), this PR follows the suggestion to migrate to safeDecompressor.
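At the API level, the migration changes which length the caller supplies and which length the decompressor returns. A hedged sketch of the two call shapes in lz4-java (illustrative only; not the actual Spark codec code, and the class name is made up):

```java
import net.jpountz.lz4.LZ4Factory;
import net.jpountz.lz4.LZ4FastDecompressor;
import net.jpountz.lz4.LZ4SafeDecompressor;

public class DecompressorShapes {
    public static void main(String[] args) {
        LZ4Factory factory = LZ4Factory.fastestInstance();
        byte[] original = "spark".getBytes();
        byte[] compressed = factory.fastCompressor().compress(original);
        byte[] dest = new byte[original.length];

        // fastDecompressor: the caller supplies the exact decompressed
        // length; the return value is the number of compressed bytes read.
        LZ4FastDecompressor fast = factory.fastDecompressor();
        int compressedRead = fast.decompress(compressed, 0, dest, 0, original.length);

        // safeDecompressor: the caller supplies the compressed length; the
        // return value is the decompressed length, capped by dest's capacity.
        LZ4SafeDecompressor safe = factory.safeDecompressor();
        int decompressedLen = safe.decompress(compressed, 0, compressed.length, dest, 0);

        System.out.println(compressedRead > 0 && decompressedLen == original.length);
    }
}
```

Requires `org.lz4:lz4-java` on the classpath; both calls should succeed and the check prints `true`.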
Why are the changes needed?
Mitigate performance regression.
Does this PR introduce any user-facing change?
No, except for performance.
How was this patch tested?
GHA for functionality, benchmark for performance.
Was this patch authored or co-authored using generative AI tooling?
No.