Optimization: Merge manifest in Overwrite operations and filter out Manifests with non-live data files #3103
Open
gabeiglio wants to merge 2 commits into apache:main from
Rationale for this change
Following up on the optimizations made while benchmarking overwrites, I noticed that each full overwrite of a partition linearly increased the number of manifest files for that partition, even though only one of those manifests contained live data. As a result, each new overwrite took a bit longer than the previous one.
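To make the growth pattern concrete, here is a toy model in plain Python. It is only a sketch: `ToyManifest`, `full_overwrite`, and `simulate` are hypothetical names, not PyIceberg code, and the model assumes that each overwrite rewrites every prior manifest into one holding only delete entries and then adds one manifest for the new files. It does not reproduce the counts measured in this PR.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ToyManifest:
    """Hypothetical stand-in for a manifest: tracks only live vs. deleted entries."""
    live_files: int
    deleted_files: int = 0


def full_overwrite(
    manifests: List[ToyManifest], new_files: int, drop_non_live: bool
) -> List[ToyManifest]:
    """Simulate one full overwrite of a single partition (toy model)."""
    result: List[ToyManifest] = []
    for manifest in manifests:
        # A full overwrite replaces every previously live file,
        # so all entries in the old manifest become delete entries.
        rewritten = ToyManifest(
            live_files=0,
            deleted_files=manifest.live_files + manifest.deleted_files,
        )
        # Assumed filter: do not carry forward manifests without live files.
        if drop_non_live and rewritten.live_files == 0:
            continue
        result.append(rewritten)
    # The newly written data lands in one fresh manifest.
    result.append(ToyManifest(live_files=new_files))
    return result


def simulate(iterations: int, drop_non_live: bool) -> List[ToyManifest]:
    """Run `iterations` full overwrites starting from an empty partition."""
    manifests: List[ToyManifest] = []
    for _ in range(iterations):
        manifests = full_overwrite(manifests, new_files=1, drop_non_live=drop_non_live)
    return manifests


if __name__ == "__main__":
    print("without filtering:", len(simulate(20, drop_non_live=False)))
    print("with filtering:   ", len(simulate(20, drop_non_live=True)))
```

In this model the unfiltered manifest list grows by one per overwrite while only the newest manifest has live files, which is the shape of the problem described above.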
For example, running 20 full-overwrite iterations against a partition would look like:
With merge enabled for overwrites, 20 iterations would leave:
So there are two changes being made in this PR:
Are these changes tested?
Created tests/integration/test_writes/test_manifest_merging.py with three integration tests that cover overwrites and appends, with and without manifest merging enabled, checking both data correctness and the number of manifests.
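For reviewers who want to check manifest counts by hand, a minimal sketch of the kind of helper such a test could use is below. The function names `count_manifests` and `count_live_manifests` are hypothetical and are not taken from the test file. The sketch assumes a PyIceberg-style table object exposing `current_snapshot()` and `io`, a snapshot exposing `manifests(io)`, and manifest entries carrying `added_files_count` and `existing_files_count`.

```python
def _may_have_files(count) -> bool:
    # A missing count is treated as "unknown", so the manifest is
    # conservatively assumed to contain files of that kind.
    return count is None or count > 0


def count_manifests(table) -> int:
    """Number of manifests referenced by the table's current snapshot."""
    snapshot = table.current_snapshot()
    if snapshot is None:
        # A table with no snapshots has no manifests.
        return 0
    return len(snapshot.manifests(table.io))


def count_live_manifests(table) -> int:
    """Number of manifests that may still reference live data files.

    "Live" here means added or existing entries; manifests holding
    only delete entries are not counted.
    """
    snapshot = table.current_snapshot()
    if snapshot is None:
        return 0
    return sum(
        1
        for manifest in snapshot.manifests(table.io)
        if _may_have_files(manifest.added_files_count)
        or _may_have_files(manifest.existing_files_count)
    )
```

Comparing the two counts after each overwrite shows how many manifests are carried along without contributing live data.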
Are there any user-facing changes?
Users may see fewer manifests as a result of overwrite operations.