-
Notifications
You must be signed in to change notification settings - Fork 3k
Core, API: Report metrics about deleted files in ExpireSnapshots #14921
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Core, API: Report metrics about deleted files in ExpireSnapshots #14921
Conversation
81eea28 to
da13abd
Compare
| import org.immutables.value.Value; | ||
|
|
||
| @Value.Immutable | ||
| public abstract class RemoveSnapshotsReport implements MetricsReport { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i wonder if there is a way to get this according to the contract defined here :
| interface Result { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wanted to use MetricsReporter as that is a more flexible approach and similar to org.apache.iceberg.Scan#metricsReporter
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm a bit skeptical on introducing a new report type. Don't we already have all of this information in the CommitReport, which carries a CommitMetricsResult ?
core/src/main/java/org/apache/iceberg/ReachableFileCleanup.java
Outdated
Show resolved
Hide resolved
da13abd to
e6c8161
Compare
6aafe27 to
37c631d
Compare
37c631d to
c21feaf
Compare
| } | ||
| } | ||
|
|
||
| public void deletedFile(String type) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can this call deletedFiles(type, 1) to avoid duplicate logic?
| protected static final String MANIFEST = "manifest"; | ||
| protected static final String MANIFEST_LIST = "manifest list"; | ||
| protected static final String STATISTICS_FILES = "statistics files"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can this be an enum?
| Map<FileContent, Set<String>> groupedFilesToDelete = | ||
| filesToDelete.stream() | ||
| .collect( | ||
| Collectors.groupingBy( | ||
| FileInfo::getContent, | ||
| Collectors.mapping(FileInfo::getPath, Collectors.toSet()))); | ||
|
|
||
| for (Map.Entry<FileContent, Set<String>> entry : groupedFilesToDelete.entrySet()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why group now? It seems orthogonal to collecting summary
| paths.forEach(filesToDelete::remove); | ||
| try (CloseableIterable<DataFile> entries = | ||
| ManifestFiles.readColumns( | ||
| manifest, fileIO, ImmutableList.of("content", "file_path"))) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should ImmutableList.of("content", "file_path") be a static constant in this class?
| return Sets.difference(statsFileLocationsBeforeExpiration, statsFileLocationsAfterExpiration); | ||
| } | ||
|
|
||
| protected static class FileInfo { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's a 3rd FileInfo class. Can this one have a more distinct name?
| this.content = content; | ||
| this.path = path; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While this project code style is generally null friendly, we don't need to adhere to this in new code.
I would use Preconditions.checkNotNull here (and then maybe put @Nonnull on the getters)
| for (ManifestEntry<? extends ContentFile<?>> entry : reader.entries()) { | ||
| // delete any ADDED file from manifests that were reverted | ||
| if (entry.status() == ManifestEntry.Status.ADDED) { | ||
| // use toString to ensure the path will not change (Utf8 is reused) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pre-existing. This is obsolete. Would be nice to remove. (same for same comment above)
| } | ||
|
|
||
| /** Report metrics about the ExpireSnapshots operation to the provided reporter */ | ||
| default ExpireSnapshots metricsReporter(MetricsReporter reporter) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
for SnapshotProducer we defined a protected method so that this doesn't have to be exposed in general APIs:
iceberg/core/src/main/java/org/apache/iceberg/SnapshotProducer.java
Lines 191 to 194 in 026ec35
| protected ThisT reportWith(MetricsReporter newReporter) { | |
| this.reporter = newReporter; | |
| return self(); | |
| } |
We might want to do the same in
RemoveSnapshots
| new: "method void org.apache.iceberg.PartitionStats::deletedEntry(org.apache.iceberg.Snapshot)" | ||
| justification: "Changing deprecated code" | ||
| - code: "java.method.removed" | ||
| old: "method org.apache.iceberg.io.CloseableIterable<java.lang.String> org.apache.iceberg.ManifestFiles::readPaths(org.apache.iceberg.ManifestFile, org.apache.iceberg.io.FileIO)" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this breaks an existing API, which we must avoid
| new: "class org.apache.iceberg.encryption.EncryptingFileIO" | ||
| justification: "New method for Manifest List reading" | ||
| - code: "java.method.addedToInterface" | ||
| new: "method org.apache.iceberg.ExpireSnapshots org.apache.iceberg.ExpireSnapshots::metricsReporter(org.apache.iceberg.metrics.MetricsReporter)" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See https://github.com/apache/iceberg/pull/14921/files#r2668035317 on how to avoid breaking this API
| } | ||
|
|
||
| static class DeleteSummary { | ||
| private final AtomicLong dataFilesCount = new AtomicLong(0L); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
don't we already have all of this in the SnapshotSummary when the snapshot is committed?
ExpireSnapshots currently does not expose metrics about the operations performed by it to the caller.
This change allows the caller to collect metrics about deleted data, position deletes, equality deletes, manifest lists, manifests and statistics files.
This is needed to allow Trino to show an output similar to https://iceberg.apache.org/docs/latest/spark-procedures/#output_6 when running expire_snaphots procedure in Trino.