Build: remove Hadoop 2 dependency #12348
Conversation
Now that the minimum Java version is 11, it is impossible for Iceberg to work on a Hadoop release older than 3.3.0. Removing the hadoop2 version and libraries forces all building and testing onto a compatible version, and permits follow-up work using modern Hadoop APIs.

Co-authored-by: Kristin Cowalcijk <kontinuation@apache.com>
Co-authored-by: Steve Loughran <stevel@cloudera.com>
@Before
public void resetSpecificConfigurations() {
  spark.conf().unset(COMPRESSION_CODEC);
  spark.conf().unset(COMPRESSION_LEVEL);
  spark.conf().unset(COMPRESSION_STRATEGY);
}
spark.sql.iceberg.compression-level = 1 is not a valid compression level setting for gzip. If we don't unset all these options before running each test, the old configs set by the previous run will be left over and make gzip tests fail.
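To illustrate the leakage, here is a minimal standalone sketch (the class, the local SparkSession, and the literal config key strings are assumptions for illustration, mirroring the COMPRESSION_* constants used in the test): session-level configs persist on the shared SparkSession, so a value set by one test is still visible to the next unless it is unset.

// Hypothetical example: a config set for one codec leaks into the next test.
import org.apache.spark.sql.SparkSession;

public class ConfLeakSketch {
  // Assumed key strings, based on the constant names in the diff above.
  private static final String COMPRESSION_CODEC = "spark.sql.iceberg.compression-codec";
  private static final String COMPRESSION_LEVEL = "spark.sql.iceberg.compression-level";

  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder().master("local[1]").getOrCreate();

    // "Test 1": zstd with an explicit level.
    spark.conf().set(COMPRESSION_CODEC, "zstd");
    spark.conf().set(COMPRESSION_LEVEL, "1");

    // "Test 2": switches the codec to gzip but never touches the level,
    // so compression-level=1 from the previous test is still in effect.
    spark.conf().set(COMPRESSION_CODEC, "gzip");
    System.out.println(spark.conf().get(COMPRESSION_LEVEL)); // prints 1

    // The fix in this PR: unset the leftover options before each test.
    spark.conf().unset(COMPRESSION_LEVEL);
    spark.stop();
  }
}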
For other reviewers: the same reset is already done in the Spark 3.5 version of the test; it was added by #11333.
    md.update(bytes, 0, numBytes);
  }
-  return new String(Hex.encodeHex(md.digest())).toUpperCase(Locale.ROOT);
+  return Hex.encodeHexString(md.digest(), false);
why is this changed?
org.apache.directory.api.util.Hex is not available after switching to Hadoop 3. Also, I think it is more reasonable to use a function from Apache Commons for this purpose.
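For reference, a minimal sketch showing that the two forms produce the same result (the class name and input string are made up; the original code took Hex from org.apache.directory.api.util, while this sketch uses commons-codec's Hex for both lines just to show the outputs agree):

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.Locale;
import org.apache.commons.codec.binary.Hex;

public class HexSketch {
  public static void main(String[] args) throws Exception {
    MessageDigest md = MessageDigest.getInstance("MD5");
    md.update("hello".getBytes(StandardCharsets.UTF_8));
    byte[] digest = md.digest();

    // Old style: encode to lowercase hex chars, then upper-case the string.
    String oldStyle = new String(Hex.encodeHex(digest)).toUpperCase(Locale.ROOT);
    // New style (commons-codec 1.11+): encode directly to uppercase hex.
    String newStyle = Hex.encodeHexString(digest, false);

    System.out.println(oldStyle.equals(newStyle)); // true
  }
}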
// remove the last commit to force Spark to reprocess batch #1
File lastCommitFile = new File(checkpoint + "/commits/1");
Assert.assertTrue("The commit file must be deleted", lastCommitFile.delete());
Files.deleteIfExists(Paths.get(checkpoint + "/commits/.1.crc"));
why do we need this now?
Since HADOOP-16255, the .crc file is renamed along with the main file, so deleting the main file without also deleting the .crc file results in a failure when renaming to the main file again:
org.apache.hadoop.fs.FileAlreadyExistsException: Rename destination file:/var/folders/jw/nz45tb550rbgjkndd37m8rrh0000gn/T/junit-12664551068658781194/parquet/checkpoint/commits/.1.crc already exists.
at org.apache.hadoop.fs.AbstractFileSystem.renameInternal(AbstractFileSystem.java:876)
at org.apache.hadoop.fs.AbstractFileSystem.rename(AbstractFileSystem.java:807)
at org.apache.hadoop.fs.ChecksumFs.renameInternal(ChecksumFs.java:519)
at org.apache.hadoop.fs.AbstractFileSystem.rename(AbstractFileSystem.java:807)
at org.apache.hadoop.fs.FileContext.rename(FileContext.java:1044)
at org.apache.spark.sql.execution.streaming.FileContextBasedCheckpointFileManager.renameTempFile(CheckpointFileManager.scala:372)
at org.apache.spark.sql.execution.streaming.CheckpointFileManager$RenameBasedFSDataOutputStream.close(CheckpointFileManager.scala:154)
at org.apache.spark.sql.execution.streaming.HDFSMetadataLog.write(HDFSMetadataLog.scala:204)
at org.apache.spark.sql.execution.streaming.HDFSMetadataLog.addNewBatchByStream(HDFSMetadataLog.scala:237)
at org.apache.spark.sql.execution.streaming.HDFSMetadataLog.add(HDFSMetadataLog.scala:130)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution.$anonfun$markMicroBatchEnd$1(MicroBatchExecution.scala:785)
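A minimal sketch of the resulting cleanup pattern (the checkpoint path is a placeholder and the class is made up; the ".1.crc" name follows Hadoop's local checksum convention of prefixing a dot and appending ".crc" to the main file name):

import java.io.File;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class CheckpointCleanupSketch {
  public static void main(String[] args) throws Exception {
    // Placeholder checkpoint location; in the test this is a temp folder.
    String checkpoint = "/tmp/parquet/checkpoint";

    // Delete the commit marker for batch #1 so Spark reprocesses that batch.
    File lastCommitFile = new File(checkpoint + "/commits/1");
    System.out.println("commit file deleted: " + lastCommitFile.delete());

    // Also delete the checksum sidecar: since HADOOP-16255 the local ChecksumFs
    // renames the .crc file together with the main file, so a leftover sidecar
    // makes the next rename fail with FileAlreadyExistsException.
    Path crc = Paths.get(checkpoint, "commits", ".1.crc");
    Files.deleteIfExists(crc);
  }
}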
The title sounds a bit strange to me. How about …?
Renamed the title as requested.
Looks good, thanks @Kontinuation for working on this, and thanks @manuzhang and @nastra for the review!
This is the continuation of #10932
The removal of Hadoop 2 was previously blocked by hive2. Now that the Hive runtime has been removed from the source tree in #11801, the Hadoop 2 removal is unblocked.
Removing Hadoop 2 is also required for upgrading the Parquet packages to the next version: as described in #12347 (comment), the latest parquet-hadoop code uses new FileSystem APIs introduced in Hadoop 3.
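As one illustration of those newer APIs, here is a minimal sketch (the path is a placeholder and this is not the Parquet code itself) of the FileSystem.openFile() builder introduced in Hadoop 3.3, which Hadoop 2 does not provide:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class OpenFileSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Path path = new Path("file:///tmp/example.parquet"); // placeholder path
    FileSystem fs = path.getFileSystem(conf);

    // Hadoop 3.3+ builder API: returns a future so object stores can open the
    // file asynchronously; Hadoop 2 era code had to call fs.open(path) directly.
    try (FSDataInputStream in = fs.openFile(path).build().get()) {
      byte[] header = new byte[4];
      in.readFully(0, header); // positioned read, e.g. the Parquet magic bytes
    }
  }
}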
Related issue: #10940