API, CORE, Flink, Spark: Deprecate Snapshot Change Methods#15241
RussellSpitzer wants to merge 8 commits into apache:main
Conversation
We currently offer several methods for getting the files changed in a snapshot, but they rely on the assumption that you can read the partition_spec from the manifest metadata. In advance of the move to Parquet manifests, we will no longer be able to rely on this part of the manifest read code. In this PR we deprecate those existing methods and create a new utility class which can do the same thing as the old Snapshot methods. The new utility class does not assume that the manifest read code can actually read the partition_spec info and instead takes it as an argument.

In production code there are only a small number of actual uses:
1. CherryPickOperation
2. MicroBatches

Within our other modules we also had a few usages:
- Flink: TableChange
- Spark: MicroBatchStream

Unfortunately there are also a huge number of test usages of these methods; the majority of this commit is cleaning those up.
@@ -112,7 +112,11 @@ public interface Snapshot extends Serializable {
 *
 * @param io a {@link FileIO} instance used for reading files from storage
 * @return all data files added to the table in this snapshot.
The Main Deprecations are Here
+1 for deprecating these methods. We've had issues with these in the past, like when we had to add FileIO to the signature. A utility class is the better option.
Should we also deprecate the equivalent manifest methods?
// Pick modifications from the snapshot
for (DataFile addedFile : cherrypickSnapshot.addedDataFiles(io)) {
SnapshotFileChanges changes =
    SnapshotFileChanges.builder(cherrypickSnapshot, io, specsById).build();
We technically don't have to use the cached utility here, but we use it in the other usage within this file so I thought it was a bit clearer this way.
This looks reasonable to me. Cached or not is the internal impl of SnapshotFileChanges
failMissingDeletePaths();
// copy adds from the picked snapshot
// copy adds and deletes from the picked snapshot

Suggested change:
- // copy adds and deletes from the picked snapshot
+ // copy adds from the picked snapshot
@@ -0,0 +1,276 @@
/*
This is the actual Utility we are switching to
With the V4 metadata change, change detection can be more complicated with manifest DVs. I like this direction of moving the change detection out of the Snapshot class, which can then just focus on core data structures.
This could be a good foundation for the change detection in the V4 adaptive tree.
In V4, if we are going to colocate DV (deleted old and added new) and data file, it might make sense to expose a combined result. Otherwise, the associations get split first and then need to be joined again.
I think this is a good way to think about this class for reviews.
 * @param specsById a map of partition spec IDs to partition specs
 * @return a new Builder
 */
public static Builder builder(
I think we may want to extend this in the future to take multiple snapshots, so it may make sense to break the API into "specs, io" and "snapshot" separately, but not now.
Why not pass everything in builder methods? Or do you want to keep those focused on just the snapshot configuration?
}

private void cacheDataFileChanges() {
  List<ManifestFile> changedManifests =
I changed the logic from the BaseSnapshot implementation to optionally use an ExecutorService. I want to save actually using that for a follow-up PR, but I think it was probably a mistake that this was single-threaded before.
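The pattern being described here, reading per-manifest changes sequentially unless an ExecutorService is configured, can be sketched with plain JDK types. This is only an illustration of the control flow; `readManifest`, the manifest names, and the method signatures are hypothetical stand-ins, not the Iceberg API:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class OptionalParallelism {
  // Hypothetical stand-in for reading one manifest's changed files
  static List<String> readManifest(String manifest) {
    return List.of(manifest + "-file1", manifest + "-file2");
  }

  // Reads all manifests, using the executor only if one was configured
  static List<String> readAll(List<String> manifests, ExecutorService executor) throws Exception {
    List<String> results = new ArrayList<>();
    if (executor == null) {
      // default: sequential, matching the old BaseSnapshot behavior
      for (String manifest : manifests) {
        results.addAll(readManifest(manifest));
      }
    } else {
      // parallel: submit one task per manifest, then collect in submission order
      List<Future<List<String>>> futures = new ArrayList<>();
      for (String manifest : manifests) {
        futures.add(executor.submit(() -> readManifest(manifest)));
      }
      for (Future<List<String>> future : futures) {
        results.addAll(future.get());
      }
    }
    return results;
  }

  public static void main(String[] args) throws Exception {
    List<String> manifests = List.of("m1", "m2");
    List<String> sequential = readAll(manifests, null);
    ExecutorService pool = Executors.newFixedThreadPool(2);
    List<String> parallel = readAll(manifests, pool);
    pool.shutdown();
    System.out.println(sequential.equals(parallel)); // prints true: same result either way
  }
}
```

Collecting futures in submission order keeps the result deterministic regardless of which path ran.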
}

private void cacheDeleteFileChanges() {
  List<ManifestFile> changedManifests =
Similar to the above, this differs from the BaseSnapshot impl by using an optional executor service.
}

public static class Builder {
  private final Snapshot snapshot;
I am attempting to hide some of the constructor details of the Changes class with this builder, but that may be premature.
I think this is a good idea, but I think you should choose what is required and what is used to configure the builder.
I might suggest passing a table in since this seems like a useful interface for callers that don't want to get lots of things from a table just to pass it in here.
SnapshotChanges changes = SnapshotChanges.builderFor(table)
    .snapshot(id) // maybe this can support refs, too?
    .executeWith(threadPool)
    .build();

SnapshotChanges changes = SnapshotChanges.builderFor(table)
    .startingSnapshot(start)
    .endingSnapshot(end)
    .executeWith(threadPool)
    .build();
return metadata.snapshot(ref.snapshotId());
}
A set of static methods which allow for getting just one element of the Changes class without actually constructing it. We can drop these as well but I think they make the test code refactor a bit smaller.
@@ -31,6 +31,7 @@
import org.apache.iceberg.relocated.com.google.common.collect.Iterables;
Another production code change here
assertThat(table.schema().asStruct()).isEqualTo(expected.asStruct());
assertThat(SnapshotUtil.schemaFor(table, tag).asStruct()).isEqualTo(initialSchema.asStruct());
}
Actual Tests for the Utility
flink/v1.20/flink/src/main/java/org/apache/iceberg/flink/maintenance/operator/TableChange.java
// If snapshotSummary doesn't have SnapshotSummary.ADDED_FILES_PROP,
// iterate through addedFiles iterator to find addedFilesCount.
return addedFilesCount == -1
    ? Iterables.size(snapshot.addedDataFiles(table.io()))
This is the only Spark usage; a pretty straightforward fix.
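The shape of that Spark fix, trusting the snapshot summary when the property is present and only falling back to counting the added files when it is missing, can be shown stand-alone. The property name and types below are simplified stand-ins for illustration, not the actual Iceberg/Spark code:

```java
import java.util.List;
import java.util.Map;

public class AddedFilesCount {
  // Hypothetical stand-in for SnapshotSummary.ADDED_FILES_PROP
  static final String ADDED_FILES_PROP = "added-data-files";

  // Returns the count from the summary if present, otherwise -1
  static long fromSummary(Map<String, String> summary) {
    String value = summary.get(ADDED_FILES_PROP);
    return value != null ? Long.parseLong(value) : -1;
  }

  // Only falls back to sizing the (potentially expensive) added-files
  // iterable when the summary property is absent
  static long addedFilesCount(Map<String, String> summary, List<String> addedFiles) {
    long count = fromSummary(summary);
    return count == -1 ? addedFiles.size() : count;
  }

  public static void main(String[] args) {
    // summary present: no iteration over files is needed
    System.out.println(addedFilesCount(Map.of(ADDED_FILES_PROP, "5"), List.of("a")));
    // summary missing: count by iterating
    System.out.println(addedFilesCount(Map.of(), List.of("a", "b", "c")));
  }
}
```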
If it's easier for reviewers, I can also split this into "deprecate and utility" and then do all the production and test updates in a follow-up PR. I just bundled this all up so that we would be able to avoid having a build that has active deprecation warnings.
 * query multiple file change types for the same snapshot. By default, manifests are read
 * sequentially. Use {@link Builder#executeWith(ExecutorService)} to enable parallel reading.
 */
public class SnapshotFileChanges {
Can we keep this class package-private to start with? SnapshotUtil is in a different package, which prevents this.
Not sure if ChangelogUtil is a better place. Or maybe add a new ChangeDetectionUtil?
core/src/test/java/org/apache/iceberg/util/TestSnapshotUtil.java
Iterables.filter(
    Iterables.filter(
        snapshot.allManifests(io),
        manifest -> manifest.content() == ManifestContent.DATA),
    manifest -> Objects.equals(manifest.snapshotId(), snapshot.snapshotId())));
nit: why two filters? Could we combine them?
Just being lazy and mimicking the BaseSnapshot code. Let me tighten that up.
}
return addedDataFiles;
We need a prompt for the AIs to add newlines 🤖
We can do an "agents.md"; I know a bunch of other projects have done this.
Although if we really want this to not be an issue we need to encode it in our style rules
}

/** Returns all data files removed from the table in this snapshot. */
public Iterable<DataFile> removedDataFiles() {
Do we want Iterable or Iterator?
Iterator could be a bit more flexible if we end up in a situation where the data doesn't fit into memory, but we might end up duplicating data.
Minimally, we should return immutable results.
Currently I'm just mimicking the old signature, but you're right, it may be nice to change to Iterator, although our current implementation is just Iterable. I'm wondering, for "iterator" invocations, if we should change the implementation of the helper methods I added to SnapshotUtil, since this class is basically for caching.
Iterable is the right class.
Iterable provides something that can be iterated over multiple times, each time producing a new Iterator. That gives us the most flexibility, instead of forcing the caller to go back to this API to process removed data files another time -- the result of this method can be passed around and used without exposing this.
Iterable also does not need to load anything into memory. A List is Iterable, but so is ManifestReader that loads chunks of data at a time.
Java's enhanced for syntax is another reason to use Iterable:

// this works
for (DataFile removed : changes.removedDataFiles()) { ... }

Otherwise you have to wrap to produce an Iterable:

// this has to create an Iterable using a lambda
for (DataFile removed : (Iterable<DataFile>) () -> changes.removedDataFiles()) { ... }
 *
 * @param io a {@link FileIO} instance used for reading files from storage
 * @return all data files added to the table in this snapshot.
 * @deprecated will be removed in 2.0.0; use org.apache.iceberg.SnapshotChanges#builder(Snapshot,
It's probably worth an import for SnapshotChanges to avoid the fully-qualified names in javadoc.
Iterables.filter(
    Iterables.filter(
        snapshot.allManifests(io),
        manifest -> manifest.content() == ManifestContent.DATA),
snapshot.dataManifests(io)?
.stopOnFailure()
.throwFailureWhenFinished()
.executeWith(executorService)
.run(manifest -> fileChangesByManifest.add(readDataFileChanges(manifest)));
You may want to use ParallelIterable instead of directly building on Tasks. The benefit of using ParallelIterable is that it manages the queue internally, doubles the number of running tasks (2x the size of the thread pool), handles closing for CloseableIterable, and allows you to consume the results while threads are running. The main benefit would be higher parallelism, but this would also be a little simpler because the original implementation that used ManifestGroup already supports passing a threadpool if you call planWith.
Was the intent to avoid complication by replacing ManifestGroup?
Oh, I see. The delete code used readDeleteManifest directly, so this is probably just being consistent.
I think I'd still recommend using ParallelIterable. Here's what I came up with:

private Iterable<Pair<ManifestEntry.Status, DeleteFile>> readDeleteFiles(ManifestFile manifest) {
  Iterable<ManifestEntry<DeleteFile>> entries =
      ManifestFiles.readDeleteManifest(manifest, fileIO, null).entries();
  Iterable<Pair<ManifestEntry.Status, DeleteFile>> copied =
      Iterables.transform(
          entries,
          entry ->
              switch (entry.status()) {
                case ADDED -> Pair.of(ManifestEntry.Status.ADDED, entry.file().copy());
                case DELETED -> Pair.of(ManifestEntry.Status.DELETED, entry.file().copyWithoutStats());
                default -> null;
              });
  return Iterables.filter(copied, java.util.Objects::nonNull);
}

private void cacheDeleteFileChanges() {
  ImmutableList.Builder<DeleteFile> adds = ImmutableList.builder();
  ImmutableList.Builder<DeleteFile> deletes = ImmutableList.builder();

  Iterable<ManifestFile> changedManifests =
      Iterables.filter(
          deleteManifests(fileIO), manifest -> Objects.equal(manifest.snapshotId(), snapshotId));
  Iterable<Iterable<Pair<ManifestEntry.Status, DeleteFile>>> changedDeletes =
      Iterables.transform(changedManifests, this::readDeleteFiles);

  try (CloseableIterable<Pair<ManifestEntry.Status, DeleteFile>> pairs =
      new ParallelIterable<>(changedDeletes, ThreadPools.getWorkerPool())) {
    for (Pair<ManifestEntry.Status, DeleteFile> delete : pairs) {
      switch (delete.first()) {
        case ADDED -> adds.add(delete.second());
        case DELETED -> deletes.add(delete.second());
      }
    }
  } catch (IOException e) {
    throw new UncheckedIOException("Failed to close manifest reader", e);
  }

  this.addedDeleteFiles = adds.build();
  this.removedDeleteFiles = deletes.build();
}

This uses Pair because the files need to be copied before returning them to ParallelIterable, but that would lose the added/deleted status. This is the kind of thing that inspired inverting the manifest_entry / delete_file relationship in v4 so that delete_file contains tracking_info.
Context
This is part of the write metadata with columnar formats change. When we start writing Parquet manifests, calls to ManifestReader() without passing through the partitionSpecByID will error out. The reason for the error is that we have no way of reading the metadata in the new file APIs, and we have decided we aren't going to add it in the future.
Snapshot.addedFiles and its friends are some of the main users of this ManifestRead(path, IO) path (that aren't test code), so we need to remove those methods and switch our usage to a version which passes through partitionSpecByID. Otherwise, switching to Parquet manifests will cause issues throughout the codebase.
See #13769
This PR
We currently offer several methods for getting the files changed in a snapshot, but they rely on the assumption that you can read the partition_spec from the manifest metadata. In advance of the move to Parquet manifests, we will no longer be able to rely on this part of the manifest read code.
In this PR we deprecate those existing methods and create a new utility class which can do the same thing as the old Snapshot methods. The new utility class does not assume that the manifest read code can actually read the partition_spec info and instead takes it as an argument.
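The caching shape described above, dependencies taken up front (including the specs-by-id map, rather than reading it from manifest metadata), with each change set computed lazily and cached, can be sketched with hypothetical stand-in types. This is not the PR's actual class, just the pattern it describes:

```java
import java.util.List;
import java.util.Map;
import java.util.function.Supplier;

public class SnapshotChangesSketch {
  // Hypothetical stand-ins for the Snapshot/FileIO/PartitionSpec inputs:
  // the spec map is passed in rather than read from manifest metadata,
  // which is the key difference from the deprecated Snapshot methods
  private final Map<Integer, String> specsById;
  private final Supplier<List<String>> manifestReader;
  private List<String> addedDataFiles = null; // computed once, then cached

  SnapshotChangesSketch(Map<Integer, String> specsById, Supplier<List<String>> manifestReader) {
    this.specsById = specsById;
    this.manifestReader = manifestReader;
  }

  // Lazily reads the manifests on first access and caches the result,
  // so repeated change-type queries don't re-read storage
  List<String> addedDataFiles() {
    if (addedDataFiles == null) {
      addedDataFiles = manifestReader.get();
    }
    return addedDataFiles;
  }

  public static void main(String[] args) {
    int[] reads = {0};
    SnapshotChangesSketch changes =
        new SnapshotChangesSketch(
            Map.of(0, "unpartitioned"),
            () -> {
              reads[0]++;
              return List.of("data-1.parquet", "data-2.parquet");
            });
    changes.addedDataFiles();
    changes.addedDataFiles(); // second call hits the cache
    System.out.println(reads[0]); // prints 1: manifests were read only once
  }
}
```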
Production Code Changes
Core
Flink
Spark
Test Changes
Unfortunately there are also a huge number of test usages of these methods; the majority of this commit is cleaning those up.
As a disclaimer, I did use Cursor and Claude Code when writing this PR. It did the majority of the test refactoring, although I have checked it all as well.