Optimization: Merge manifest in Overwrite operations and filter out Manifests with non-live data files #3103

Open
gabeiglio wants to merge 2 commits into apache:main from gabeiglio:merge-manifests

@gabeiglio gabeiglio commented Feb 26, 2026

Rationale for this change

While benchmarking overwrites as a follow-up to the earlier optimizations, I noticed that each full overwrite of a partition adds one more manifest file for that partition, even though only one of those manifests contains live data. As a result, each new overwrite takes slightly longer than the previous one.

For example, after 20 full overwrites of the same partition, the manifests metadata table looks like this:

> select partition_summaries from default.table.manifests where partition_summaries[0]['lower_bound'] = 20250101;

[{"contains_null":false,"contains_nan":false,"lower_bound":"20250101","upper_bound":"20250101"}]
...
[{"contains_null":false,"contains_nan":false,"lower_bound":"20250101","upper_bound":"20250101"}]
Time taken: 0.137 seconds, Fetched 21 row(s)

With merge enabled for overwrites, the same 20 iterations leave:

> select partition_summaries from default.table.manifests where partition_summaries[0]['lower_bound'] = 20250101;

[{"contains_null":false,"contains_nan":false,"lower_bound":"20250101","upper_bound":"20250101"}]
[{"contains_null":false,"contains_nan":false,"lower_bound":"20250101","upper_bound":"20250101"}]
Time taken: 0.137 seconds, Fetched 2 row(s)

So there are two changes being made in this PR:

  1. Make _OverwriteFiles merge manifests before committing
  2. While merging, filter out manifests that contain no live data files
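
The effect of the two changes can be illustrated with a toy model. This is only a sketch under stated assumptions, not the code in this PR: `Manifest`, `full_overwrite`, and `run` are hypothetical names, each overwrite is assumed to write exactly one data file, and a manifest is assumed to be droppable once it has no live (added or existing) entries and was not written by the current snapshot. Under those assumptions the model reproduces the counts reported above (21 manifests without merging, 2 with it).

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Manifest:
    """Hypothetical stand-in for a manifest file; not a PyIceberg class."""
    snapshot_id: int    # snapshot that wrote this manifest
    live_files: int     # number of ADDED + EXISTING entries
    deleted_files: int  # number of DELETED entries (tombstones)


def full_overwrite(manifests: List[Manifest], snapshot_id: int, merge: bool) -> List[Manifest]:
    """Simulate one full overwrite of a partition holding a single data file."""
    carried: List[Manifest] = []
    for m in manifests:
        if m.live_files > 0:
            # The overwrite replaces every live file, so the manifest is
            # rewritten with those entries marked as deleted in this snapshot.
            carried.append(Manifest(snapshot_id, 0, m.live_files))
        else:
            # Manifests holding only old tombstones are carried over as-is.
            carried.append(m)

    if merge:
        # Assumed filter rule: keep a manifest if it still has live files or
        # if its deletes belong to the snapshot being committed.
        carried = [
            m for m in carried
            if m.live_files > 0 or m.snapshot_id == snapshot_id
        ]

    # The overwrite also writes one new manifest for the new data file.
    return [Manifest(snapshot_id, 1, 0)] + carried


def run(iterations: int, merge: bool) -> int:
    """Return the manifest count after an append plus `iterations` overwrites."""
    manifests = [Manifest(0, 1, 0)]  # initial append
    for snapshot_id in range(1, iterations + 1):
        manifests = full_overwrite(manifests, snapshot_id, merge)
    return len(manifests)


print(run(20, merge=False))  # 21
print(run(20, merge=True))   # 2
```

In this model the unmerged count grows by one per overwrite because every iteration leaves behind another tombstone-only manifest, which is the linear growth described in the rationale.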

Are these changes tested?

Added tests/integration/test_writes/test_manifest_merging.py with three integration tests. They check data correctness and the number of manifests after overwrites and appends, with and without manifest merging enabled.

Are there any user-facing changes?

Users may see fewer manifests as a result of overwrite operations.