Conversation

@97harsh 97harsh commented Jan 5, 2026

Spark: Add branch support to rewrite_data_files procedure

This change enables the rewrite_data_files stored procedure to rewrite
data files on specific branches instead of only on the main branch.

Implementation:

  • Core: Extended RewriteDataFilesCommitManager to accept and use branch parameter
  • Action: Added toBranch() method to RewriteDataFilesSparkAction (v3.4, v3.5, v4.0, v4.1)
  • Procedure: Added optional branch parameter to RewriteDataFilesProcedure (all versions)
  • Tests: Added branch-specific test coverage for all Spark versions

Users can specify branches in two ways:

  1. Table identifier: CALL system.rewrite_data_files('db.table.branch_myBranch')
  2. Explicit parameter: CALL system.rewrite_data_files(table => 'db.table', branch => 'myBranch')

The implementation follows the existing pattern used by SparkWrite and other
branch-aware operations. The commit manager already had branch support built
in; this change wires it through the action and procedure layers.

Fixes #14813

@97harsh 97harsh marked this pull request as draft January 5, 2026 09:41
This commit fixes the compilation error by implementing the missing toBranch()
method in RewriteDataFilesSparkAction for Spark versions 3.4, 3.5, and 4.0.

Changes:
- Added toBranch(String targetBranch) method to RewriteDataFilesSparkAction
- Updated commitManager() to pass branch parameter to RewriteDataFilesCommitManager
- Added comprehensive branch tests to TestRewriteDataFilesProcedure (all versions)

The implementation follows the same pattern as v4.1 and matches how SparkWrite
handles branches. Integration tests passing: iceberg-delta-lake:check
- Add missing Table import in Spark 3.4 test file
- Fix branch names to use camelCase (testBranch, filteredBranch) to avoid SQL parsing errors
- Ensure files are actually rewritten by inserting multiple small files
- Add min-input-files option to force file compaction
- Remove incorrect snapshot ID ordering assertions
- Add explicit assertions to verify files are rewritten and snapshots change
@97harsh 97harsh marked this pull request as ready for review January 5, 2026 12:06
The previous implementation had the 2-arg and 3-arg constructors delegate
with an extra null argument, which made them call the wrong constructor
overload. This resulted in snapshotProperties being null, leading to a
NullPointerException when commitFileGroups() tried to iterate over the
properties with forEach().

The issue broke Flink maintenance API tests (TestRewriteDataFiles and
TestFlinkTableSinkCompaction) because the Flink DataFileRewriteCommitter
uses the 2-arg constructor. Files were not being rewritten as expected.

Changes:
- Line 51: Remove null parameter to call 3-arg constructor
- Line 56: Remove null parameter to call 4-arg constructor with Map

This ensures the constructor chain properly passes through existing
constructors without introducing null values, and branch parameter is
correctly passed only through the appropriate constructors.
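The overload pitfall described above can be sketched as follows. This is a hypothetical, simplified model (class and method names here are invented; the real class is RewriteDataFilesCommitManager), showing why an extra null in a delegating `this(...)` call leaves the properties map null:

```java
import java.util.HashMap;
import java.util.Map;

// A short constructor that delegates with an extra null argument binds a
// longer overload and skips the one that supplies default properties, so
// snapshotProperties stays null and a later forEach() throws NPE.
class CommitManagerSketch {
  private final Map<String, String> snapshotProperties;
  private final String branch;

  // FIXED: delegate without the extra null, so the default map is applied.
  // The buggy form was: this(owner, null, null) -> null properties map.
  CommitManagerSketch(String owner) {
    this(owner, new HashMap<>());
  }

  CommitManagerSketch(String owner, Map<String, String> properties) {
    this(owner, properties, null); // only this overload may pass a null branch
  }

  CommitManagerSketch(String owner, Map<String, String> properties, String branch) {
    this.snapshotProperties = properties;
    this.branch = branch;
  }

  int commitFileGroups() {
    int[] committed = {0};
    snapshotProperties.forEach((k, v) -> committed[0]++); // NPE if the map is null
    return committed[0];
  }
}
```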
@97harsh 97harsh marked this pull request as draft January 5, 2026 13:53
When rewrite_data_files was called with a branch parameter, the planner
incorrectly used the main branch's snapshot to scan for files to compact,
while the commit targeted the specified branch. This caused validation
failures when branches diverged.

The fix ensures RewriteDataFilesSparkAction.execute() uses the branch's
snapshot ID when a branch is specified, allowing the planner to correctly
identify and compact files from the branch.

This change applies to all Spark versions (3.4, 3.5, 4.0, 4.1) and fixes
all rewrite strategies (binpack, sort, z-order) since they all rely on
the snapshot ID passed from RewriteDataFilesSparkAction.
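The planning fix can be sketched in a self-contained form, using stubbed types rather than the real Iceberg API (a plain map stands in for the table's branch refs):

```java
import java.util.HashMap;
import java.util.Map;

// Stubbed sketch: the planner must scan from the specified branch's snapshot,
// not the table's current (main) snapshot, or validation fails once the
// branches diverge.
class BranchPlanning {
  private final Map<String, Long> refs = new HashMap<>(); // branch -> snapshot id

  BranchPlanning(long mainSnapshotId) {
    refs.put("main", mainSnapshotId);
  }

  void setBranchRef(String branch, long snapshotId) {
    refs.put(branch, snapshotId);
  }

  // Before the fix this effectively always returned refs.get("main");
  // now it resolves the requested branch.
  long startingSnapshotId(String branch) {
    Long id = refs.get(branch);
    if (id == null) {
      throw new IllegalArgumentException("Cannot find branch: " + branch);
    }
    return id;
  }
}
```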
Enhanced testBranchCompactionDoesNotAffectMain to verify that the new
snapshot created by rewrite_data_files is a child of the previous branch
snapshot. This ensures the compaction is committed to the branch's history
chain, not to main.

The assertion checks that:
  table.snapshot(branchSnapshotAfterCompaction).parentId() == branchSnapshotBeforeCompaction

This provides stronger validation that the rewrite operation correctly
targets and modifies the specified branch.
@97harsh 97harsh marked this pull request as ready for review January 5, 2026 15:20
Apply spotless formatting to multi-line sql() and assertThat() calls
across Spark 3.4, 3.5, and 4.0 modules.
Adds a precondition check in RewriteDataFilesSparkAction.execute() to
verify that the specified branch exists before attempting to access its
snapshot. This provides a clear error message instead of a cryptic
NullPointerException when a non-existent branch is specified.
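A minimal sketch of this fail-fast check, assuming a refs map in place of the real table metadata (the actual code likely uses Guava's Preconditions.checkArgument):

```java
import java.util.Map;

// Failing fast with a clear IllegalArgumentException replaces the cryptic
// NullPointerException from dereferencing a missing branch snapshot.
class BranchPrecondition {
  static long branchSnapshotId(Map<String, Long> refs, String branch) {
    Long id = refs.get(branch);
    if (id == null) {
      throw new IllegalArgumentException("Cannot find branch in table: " + branch);
    }
    return id;
  }
}
```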
@97harsh 97harsh requested a review from pvary January 6, 2026 03:47
pvary commented Jan 6, 2026

Left some comments.
Could you please remove the Spark 3.4, 3.5, 4.0 changes for now, so the PR is easier to review, and apply the changes requested by the reviewers?

In a next PR we will do the backport which should be easy and clean.

Thanks!

…thods

Change checkAndApplyFilter and checkAndApplyStrategy to accept and return
RewriteDataFilesSparkAction instead of the RewriteDataFiles interface,
eliminating unnecessary casts at call sites.
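The refactor above can be sketched with simplified, hypothetical names; the key point is the covariant return type, which lets call sites chain Spark-specific methods like toBranch() without casting:

```java
// Interface with a self-returning fluent method, and a concrete action that
// narrows the return type so Spark-specific chaining needs no cast.
interface RewriteDataFiles {
  RewriteDataFiles filter(String expression);
}

class RewriteAction implements RewriteDataFiles {
  String filter = "true";
  String branch = "main";

  @Override
  public RewriteAction filter(String expression) { // covariant return
    this.filter = expression;
    return this;
  }

  RewriteAction toBranch(String targetBranch) {
    this.branch = targetBranch;
    return this;
  }

  // Accepting and returning the concrete type (not the interface) removes
  // the casts that were previously needed at call sites.
  static RewriteAction checkAndApplyFilter(RewriteAction action, String where) {
    return where == null ? action : action.filter(where);
  }
}
```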
Remove rewrite_data_files branch support from older Spark versions,
keeping the feature only in Spark 4.1.
@singhpk234 singhpk234 left a comment


LGTM as well, just a minor / optional suggestion

    long startingSnapshotId,
    boolean useStartingSequenceNumber,
    Map<String, String> snapshotProperties) {
  this(table, startingSnapshotId, useStartingSequenceNumber, snapshotProperties, null);
Contributor

minor: I wonder if we could just set this to the MAIN branch as the default, so we don't have to do isNull checks

Contributor Author

I think it makes sense to use null to delegate to SnapshotProducer's default behavior rather than explicitly setting "main".
This avoids imposing an opinion: if the default branch behavior changes in the future, this will automatically follow. The null-check pattern is also consistent with how branch handling works elsewhere in the codebase (see the SnapshotUtil methods that treat null as "use default").

Contributor Author

Then again, it makes sense to remove a lot of isNull checks by making toBranch(null) a no-op. This should make the code cleaner.

Contributor Author

@pvary / @singhpk234 any suggestions?

Contributor

This is something I also considered during the review, but I don't have a strong opinion, so I left it as is.

@singhpk234?

Contributor

Leaning +1 on letting null mean "use default branch" and making toBranch(null) a no-op.

Contributor Author

Thank you for your review, done

  • toBranch(null) results in no-op
  • removed one condition check for branch!=null

@97harsh 97harsh requested a review from singhpk234 January 7, 2026 17:21
97harsh commented Jan 7, 2026

Are you good with leaving it as is, @singhpk234?
Can we merge if so?

private boolean removeDanglingDeletes;
private boolean useStartingSequenceNumber;
private boolean caseSensitive;
private String branch = null;
Contributor

Same as what Prashant suggested above: I think we can probably default branch to main and can always assert table.snapshot(branch) != null

Contributor Author

Hey @dramaticlly, per @huaxingao's recommendation I kept

  • null->default branch
  • toBranch(null) results in no-op

This change ensures that calling toBranch(null) does not modify the
internal branch state, treating null as "use default branch". This
simplifies the calling code and removes redundant null checks.
97harsh commented Jan 8, 2026

@huaxingao
Can we merge if good?

if (targetBranch != null) {
  this.branch = targetBranch;
}
return this;
Contributor

nit: newline - if we keep this code

Contributor

Why do we do this?

It would make sense if we would define the branch differently, like:

private String branch = SnapshotRef.MAIN_BRANCH;

If it is just null, then this code is just confusing. It is basically equivalent to

this.branch = targetBranch;
return this;

just a little more confusing, because you cannot reset the value back to null 😄

Contributor Author

Thank you, this makes sense. The only logical path out of this I see is to default it to MAIN_BRANCH.

Updated to align with the core Iceberg API behavior. SnapshotProducer.targetBranch() defaults to MAIN_BRANCH and rejects null with IllegalArgumentException; since RewriteDataFilesSparkAction eventually calls through to SnapshotProducer via rewrite.toBranch(), it makes sense to have consistent behavior at the action level.

Changes:

  • branch now defaults to SnapshotRef.MAIN_BRANCH instead of null
  • toBranch(null) now throws IllegalArgumentException (matching SnapshotProducer)
  • Removed null-check guards in execute() that are no longer needed
  • Added test testRewriteDataFilesToNullBranchFails

cc: @huaxingao
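A simplified sketch of this final behavior (not the exact Iceberg source): branch defaults to the main branch, and toBranch(null) is rejected, matching SnapshotProducer.targetBranch().

```java
// "main" stands in for SnapshotRef.MAIN_BRANCH.
class ActionBranchSketch {
  static final String MAIN_BRANCH = "main";
  private String branch = MAIN_BRANCH;

  ActionBranchSketch toBranch(String targetBranch) {
    if (targetBranch == null) {
      // Mirrors SnapshotProducer.targetBranch()'s rejection of null.
      throw new IllegalArgumentException("Invalid branch name: null");
    }
    this.branch = targetBranch;
    return this;
  }

  String branch() {
    return branch;
  }
}
```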

Contributor

Thanks for the update. The new change makes sense to me.

…ject null

This change aligns RewriteDataFilesSparkAction.toBranch() with the core
Iceberg API behavior in SnapshotProducer.targetBranch():
- Default branch to SnapshotRef.MAIN_BRANCH instead of null
- Reject null branch with IllegalArgumentException
- Remove null-check guards that are no longer needed

Since RewriteDataFilesSparkAction eventually calls rewrite.toBranch()
which invokes SnapshotProducer.targetBranch(), having consistent
behavior at the action level prevents confusion and potential runtime
errors.
Default to MAIN_BRANCH when branch parameter is not provided, preventing
IllegalArgumentException from toBranch() validation. Simplifies branch
resolution logic by using Objects.requireNonNullElse and removes unused
SparkTable import.
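That resolution order can be sketched as follows, with "main" standing in for SnapshotRef.MAIN_BRANCH and Objects.requireNonNullElse supplying the default (helper name and parameters are hypothetical):

```java
import java.util.Objects;

// Resolution order: explicit procedure parameter, then the branch parsed
// from the table identifier, then the main branch.
class BranchResolution {
  static String resolve(String explicitBranch, String identifierBranch) {
    String branch = explicitBranch != null ? explicitBranch : identifierBranch;
    return Objects.requireNonNullElse(branch, "main");
  }
}
```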
@97harsh 97harsh force-pushed the feature/rewrite-data-files-branch-support branch from 8f0b83c to 870cc50 Compare January 10, 2026 17:23
@97harsh 97harsh requested a review from pvary January 11, 2026 15:40
97harsh commented Jan 11, 2026

@pvary, @huaxingao What's the process to merge if this looks good?

@huaxingao

I will wait a day or two to see if there are any further comments. If not, I will merge it by the end of tomorrow.

if (targetBranch == null) {
  targetBranch = loadSparkTable(tableIdent).branch();
}
if (targetBranch == null) {
Contributor

nit: newline

}
if (targetBranch == null) {
  targetBranch = SnapshotRef.MAIN_BRANCH;
}
Contributor

nit: newline

Comment on lines 121 to 131
String explicitBranch = input.asString(BRANCH_PARAM, null);

// Determine target branch: explicit parameter > table branch > main branch
String targetBranch = explicitBranch;
if (targetBranch == null) {
  targetBranch = loadSparkTable(tableIdent).branch();
}
if (targetBranch == null) {
  targetBranch = SnapshotRef.MAIN_BRANCH;
}
String branch = targetBranch;
Contributor

Suggested change

Before:

  String explicitBranch = input.asString(BRANCH_PARAM, null);

  // Determine target branch: explicit parameter > table branch > main branch
  String targetBranch = explicitBranch;
  if (targetBranch == null) {
    targetBranch = loadSparkTable(tableIdent).branch();
  }
  if (targetBranch == null) {
    targetBranch = SnapshotRef.MAIN_BRANCH;
  }
  String branch = targetBranch;

After:

  // Determine target branch: explicit parameter > table branch > main branch
  String branch = input.asString(BRANCH_PARAM, null);
  if (branch == null) {
    branch = loadSparkTable(tableIdent).branch();
    if (branch == null) {
      branch = SnapshotRef.MAIN_BRANCH;
    }
  }

Contributor Author

Running into a Java compilation error with this, so reverting; will nest the ifs.

Contributor Author

The branch variable is used inside a lambda, and Java requires variables used in lambdas to be effectively final, so it needs a separate variable.

97harsh and others added 2 commits January 13, 2026 06:41
…ures/RewriteDataFilesProcedure.java

Co-authored-by: pvary <peter.vary.apache@gmail.com>
The branch variable was reassigned multiple times before being
used in a lambda, violating Java's effectively final requirement.
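The Java rule behind this fix can be illustrated with a small sketch (a hypothetical helper, not the procedure code): a local variable captured by a lambda must be effectively final, so a variable that is reassigned during branch resolution needs a separate final copy before the lambda.

```java
import java.util.function.Supplier;

class EffectivelyFinalSketch {
  static Supplier<String> branchSupplier(String input) {
    String branch = input;
    if (branch == null) {
      branch = "main"; // reassignment: 'branch' is no longer effectively final
    }
    final String resolved = branch; // effectively final copy for the lambda
    return () -> "branch=" + resolved; // capturing 'branch' here would not compile
  }
}
```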
97harsh commented Jan 16, 2026

@pvary @huaxingao Ready to merge now?

@pvary pvary merged commit b4ef17c into apache:main Jan 16, 2026
32 checks passed
pvary commented Jan 16, 2026

Merged to main.
Sorry for the delay.
Thanks @97harsh for the new feature, and @huaxingao for the review!

}

public RewriteDataFilesSparkAction toBranch(String targetBranch) {
  Preconditions.checkArgument(targetBranch != null, "Invalid branch name: null");
Member

Can we have a branch with name "null"? The error message looks like targetBranch.equals("null") to me.
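One possible follow-up, sketched here with hypothetical names and wording (not the merged code), is a message that cannot be read as a branch literally named "null":

```java
class NullBranchCheck {
  static void requireBranch(String targetBranch) {
    if (targetBranch == null) {
      // Clearer than "Invalid branch name: null", which could be read as
      // rejecting a branch named "null".
      throw new IllegalArgumentException("Branch name cannot be null");
    }
  }
}
```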

Contributor

@manuzhang: Sorry, I haven't seen your comment. Care to raise a PR?

Member

I commented after the merge. @97harsh Can you create a follow-up PR?

97harsh added a commit to 97harsh/iceberg that referenced this pull request Jan 16, 2026
….5, and 3.4

This backports the branch support feature from Spark 4.1 (PR apache#14964) to Spark
4.0, 3.5, and 3.4. The changes add a new `branch` parameter to the
`rewrite_data_files` procedure that allows users to specify which branch to
compact data files for.

Key changes:
- Add `toBranch()` method to RewriteDataFilesSparkAction
- Add `branch` parameter to RewriteDataFilesProcedure
- Use branch snapshot ID instead of current snapshot for compaction
- Pass branch parameter to RewriteDataFilesCommitManager
97harsh added a commit to 97harsh/iceberg that referenced this pull request Jan 16, 2026
….0, 3.5, and 3.4

Backport test cases from apache#14964:
- testRewriteDataFilesOnBranch
- testRewriteDataFilesToNullBranchFails
- testRewriteDataFilesOnBranchWithFilter
- testBranchCompactionDoesNotAffectMain
Successfully merging this pull request may close these issues.

RewriteDataFiles procedure does not support branch operations

7 participants