Commits
21 commits
7f2d5b0
Add ParquetFileMerger for efficient row-group level file merging
shangxinli Oct 13, 2025
fa1d073
Address feedbacks
shangxinli Nov 4, 2025
7a34353
Address feedbacks for second round
shangxinli Nov 12, 2025
c150887
Address comments round 3
shangxinli Nov 16, 2025
c593e9e
Trigger CI
shangxinli Nov 17, 2025
c71b419
Add row lineage preservation to ParquetFileMerger with binpack-compat…
shangxinli Nov 23, 2025
4ddb5b4
Address feedback of adding row_id support
shangxinli Nov 24, 2025
4130a79
Address feedback for another round
shangxinli Nov 26, 2025
4e4874e
Address comments for another round
shangxinli Nov 26, 2025
eabaa0d
Address another round of feedbacks
shangxinli Nov 27, 2025
55aa295
Merge branch 'main' into rewrite_data_files2
shangxinli Nov 27, 2025
047f9b6
Address feedback for another round
shangxinli Nov 28, 2025
0709582
Simplify ParquetFileMerger API to accept DataFile objects
shangxinli Dec 2, 2025
cdc322d
Initialize columnIndexTruncateLength internally in ParquetFileMerger
shangxinli Dec 3, 2025
853fd19
Address review feedback: refactor ParquetFileMerger API and validation
shangxinli Dec 5, 2025
4c4f2cb
Refactor ParquetFileMerger API to return MessageType
shangxinli Dec 6, 2025
aa5fc36
Address review feedback: optimize validation and file I/O
shangxinli Dec 8, 2025
5962e74
Address review feedback
shangxinli Dec 20, 2025
2eca995
Refactor SparkParquetFileMergeRunner to pass RewriteFileGroup to exec…
shangxinli Dec 21, 2025
45b0197
Inline ParquetFileReader in try-with-resources block
shangxinli Dec 28, 2025
3194f1e
Address pvary's review comments on ParquetFileMerger PR
shangxinli Jan 7, 2026
26 changes: 26 additions & 0 deletions api/src/main/java/org/apache/iceberg/actions/RewriteDataFiles.java
@@ -147,6 +147,32 @@ public interface RewriteDataFiles
*/
String OUTPUT_SPEC_ID = "output-spec-id";

/**
* Use Parquet row-group level merging during rewrite operations when applicable.
*
* <p>When enabled, Parquet files will be merged at the row-group level by directly copying row
* groups without deserialization and re-serialization. This provides significant performance
* improvements for compatible Parquet files.
*
* <p>Requirements for row-group merging:
*
* <ul>
* <li>All files must be in Parquet format
* <li>Files must have compatible schemas
* <li>Files must not be encrypted
* <li>Files must not have associated delete files or delete vectors
* <li>Table must not have a sort order (including z-ordered tables)
* </ul>
*
* <p>If these requirements are not met, the rewrite automatically falls back to the standard
* read-rewrite approach and logs a warning.
*
* <p>Defaults to false.
*/
String USE_PARQUET_ROW_GROUP_MERGE = "use-parquet-row-group-merge";

boolean USE_PARQUET_ROW_GROUP_MERGE_DEFAULT = false;

/**
* Choose BINPACK as a strategy for this rewrite operation
*
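For readers skimming the diff, here is a minimal sketch (not part of the change) of how this constant/default pair is typically consumed. It assumes the action parses options with `PropertyUtil.propertyAsBoolean`, as other `RewriteDataFiles` options are; the wrapper class and method names are hypothetical:

```java
import java.util.Map;
import org.apache.iceberg.actions.RewriteDataFiles;
import org.apache.iceberg.util.PropertyUtil;

// Hypothetical helper: reads the new option from the rewrite options map,
// falling back to the declared default when the option is not set.
class RowGroupMergeOptionExample {
  static boolean useRowGroupMerge(Map<String, String> options) {
    // Returns USE_PARQUET_ROW_GROUP_MERGE_DEFAULT (false) when the option is absent.
    return PropertyUtil.propertyAsBoolean(
        options,
        RewriteDataFiles.USE_PARQUET_ROW_GROUP_MERGE,
        RewriteDataFiles.USE_PARQUET_ROW_GROUP_MERGE_DEFAULT);
  }
}
```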
23 changes: 23 additions & 0 deletions docs/docs/maintenance.md
@@ -138,6 +138,29 @@ The `files` metadata table is useful for inspecting data file sizes and determin

See the [`RewriteDataFiles` Javadoc](../../javadoc/{{ icebergVersion }}/org/apache/iceberg/actions/RewriteDataFiles.html) to see more configuration options.

#### Parquet row-group level merging

For Parquet tables, `rewriteDataFiles` can use an optimized row-group level merge strategy that is significantly faster than the standard read-rewrite approach. This optimization directly copies row groups without deserialization and re-serialization.

```java
Table table = ...
SparkActions
.get()
.rewriteDataFiles(table)
.option(RewriteDataFiles.USE_PARQUET_ROW_GROUP_MERGE, "true")
.execute();
```

This optimization is applied when the following requirements are met:

* All files are in Parquet format
* Files have compatible schemas
* Files are not encrypted
* Files do not have associated delete files or delete vectors
* Table does not have a sort order (including z-ordered tables)

If the requirements are not met, the rewrite automatically falls back to the standard read-rewrite approach and logs a warning.
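
As a usage sketch beyond the minimal example above (assuming the standard `RewriteDataFiles` options and `Result` API; the target size value is illustrative):

```java
Table table = ...
RewriteDataFiles.Result result = SparkActions
    .get()
    .rewriteDataFiles(table)
    .binPack()  // bin packing; row-group merging requires the table to have no sort order
    .option(RewriteDataFiles.USE_PARQUET_ROW_GROUP_MERGE, "true")
    .option(RewriteDataFiles.TARGET_FILE_SIZE_BYTES, "536870912")  // 512 MB, illustrative
    .execute();

// Result counts come from the standard RewriteDataFiles.Result interface.
System.out.printf("Rewrote %d files into %d files%n",
    result.rewrittenDataFilesCount(), result.addedDataFilesCount());
```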

### Rewrite manifests

Iceberg uses metadata in its manifest list and manifest files to speed up query planning and to prune unnecessary data files. The metadata tree functions as an index over a table's data.