Skip to content

Conversation

@raunaqmorarka
Copy link
Contributor

ExpireSnapshots currently does not expose metrics about the operations performed by it to the caller.
This change allows the caller to collect metrics about deleted data, position deletes, equality deletes, manifest lists, manifests and statistics files.
This is needed to allow Trino to show an output similar to https://iceberg.apache.org/docs/latest/spark-procedures/#output_6 when running expire_snaphots procedure in Trino.

@raunaqmorarka raunaqmorarka force-pushed the raunaq/expire-snapshots-report branch 6 times, most recently from 81eea28 to da13abd Compare December 24, 2025 08:56
import org.immutables.value.Value;

@Value.Immutable
public abstract class RemoveSnapshotsReport implements MetricsReport {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i wonder if there is a way to get this according to the contract defined here :

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wanted to use MetricsReporter as that is a more flexible approach and similar to org.apache.iceberg.Scan#metricsReporter

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm a bit skeptical on introducing a new report type. Don't we already have all of this information in the CommitReport, which carries a CommitMetricsResult ?

@singhpk234 singhpk234 requested a review from nastra December 26, 2025 17:24
@raunaqmorarka raunaqmorarka force-pushed the raunaq/expire-snapshots-report branch from da13abd to e6c8161 Compare December 29, 2025 05:54
@raunaqmorarka raunaqmorarka force-pushed the raunaq/expire-snapshots-report branch 4 times, most recently from 6aafe27 to 37c631d Compare December 29, 2025 06:43
@raunaqmorarka raunaqmorarka force-pushed the raunaq/expire-snapshots-report branch from 37c631d to c21feaf Compare December 31, 2025 06:36
}
}

public void deletedFile(String type) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this call deletedFiles(type, 1) to avoid duplicate logic?

Comment on lines +49 to +51
protected static final String MANIFEST = "manifest";
protected static final String MANIFEST_LIST = "manifest list";
protected static final String STATISTICS_FILES = "statistics files";
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this be an enum?

Comment on lines +267 to +274
Map<FileContent, Set<String>> groupedFilesToDelete =
filesToDelete.stream()
.collect(
Collectors.groupingBy(
FileInfo::getContent,
Collectors.mapping(FileInfo::getPath, Collectors.toSet())));

for (Map.Entry<FileContent, Set<String>> entry : groupedFilesToDelete.entrySet()) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why group now? It seems orthogonal to collecting summary

paths.forEach(filesToDelete::remove);
try (CloseableIterable<DataFile> entries =
ManifestFiles.readColumns(
manifest, fileIO, ImmutableList.of("content", "file_path"))) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should ImmutableList.of("content", "file_path") be a static constant in this class?

return Sets.difference(statsFileLocationsBeforeExpiration, statsFileLocationsAfterExpiration);
}

protected static class FileInfo {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a 3rd FileInfo class. Can this one have a more distinct name?

Comment on lines +244 to +245
this.content = content;
this.path = path;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While this project code style is generally null friendly, we don't need to adhere to this in new code.
I would use Preconditions.checkNotNull here (and then maybe put @Nonnull on the getters)

for (ManifestEntry<? extends ContentFile<?>> entry : reader.entries()) {
// delete any ADDED file from manifests that were reverted
if (entry.status() == ManifestEntry.Status.ADDED) {
// use toString to ensure the path will not change (Utf8 is reused)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pre-existing. This is obsolete. Would be nice to remove. (same for same comment above)

}

/** Report metrics about the ExpireSnapshots operation to the provided reporter */
default ExpireSnapshots metricsReporter(MetricsReporter reporter) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for SnapshotProducer we defined a protected method so that this doesn't have to be exposed in general APIs:

protected ThisT reportWith(MetricsReporter newReporter) {
this.reporter = newReporter;
return self();
}

We might want to do the same in RemoveSnapshots

new: "method void org.apache.iceberg.PartitionStats::deletedEntry(org.apache.iceberg.Snapshot)"
justification: "Changing deprecated code"
- code: "java.method.removed"
old: "method org.apache.iceberg.io.CloseableIterable<java.lang.String> org.apache.iceberg.ManifestFiles::readPaths(org.apache.iceberg.ManifestFile, org.apache.iceberg.io.FileIO)"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this breaks an existing API, which we must avoid

new: "class org.apache.iceberg.encryption.EncryptingFileIO"
justification: "New method for Manifest List reading"
- code: "java.method.addedToInterface"
new: "method org.apache.iceberg.ExpireSnapshots org.apache.iceberg.ExpireSnapshots::metricsReporter(org.apache.iceberg.metrics.MetricsReporter)"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

}

static class DeleteSummary {
private final AtomicLong dataFilesCount = new AtomicLong(0L);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

don't we already have all of this in the SnapshotSummary when the snapshot is committed?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants