-
Notifications
You must be signed in to change notification settings - Fork 414
Description
Is your feature request related to a problem or challenge?
Feature Request: Incremental Snapshot Scanning (from/to snapshot-id)
Is your feature request related to a problem? Please describe.
Currently, iceberg-rust only supports reading a table at a specific snapshot (time-travel) or the current snapshot. There is no way to read only the data that was added between two snapshots.
The Java Iceberg client supports this via IncrementalDataTableScan:
// Java API
TableScan scan = table.newIncrementalDataTableScan()
.fromSnapshotExclusive(100)
.toSnapshot(200);This is a critical feature for:
- Change Data Capture (CDC) - Reading only new/changed data for downstream systems
- Incremental ETL pipelines - Processing only new data since the last checkpoint
- Efficient data synchronization - Syncing only deltas between systems
- Streaming workloads - Reading appends as they happen
Describe the solution you'd like
Add incremental scan support to TableScanBuilder with methods similar to the Java client:
// Scan changes between two snapshots (from exclusive, to inclusive)
let scan = table.scan()
.from_snapshot_exclusive(from_id)
.to_snapshot(to_id)
.build()?;
// Scan with inclusive from
let scan = table.scan()
.from_snapshot_inclusive(from_id)
.to_snapshot(to_id)
.build()?;
// Convenience methods
let scan = table.scan().appends_after(from_id).build()?;
let scan = table.scan().appends_between(from_id, to_id).build()?;Additionally, expose this feature through the DataFusion integration:
// DataFusion integration
let provider = IcebergStaticTableProvider::try_new_incremental(table, from_id, to_id).await?;
ctx.register_table("changes", Arc::new(provider))?;
let df = ctx.sql("SELECT * FROM changes").await?;Implementation Notes
Based on the Java implementation (IncrementalDataTableScan.java):
-
Snapshot Range Validation - Walk the snapshot ancestry chain to validate that
from_snapshotis an ancestor ofto_snapshot -
Manifest Entry Filtering - Only include manifest entries where:
status == ADDED(not EXISTING or DELETED)snapshot_idis within the specified range
-
Operation Validation - Initially only support
APPENDoperations (same as Java).OVERWRITEandDELETEoperations require additional handling for delete files. -
Mutual Exclusivity - The
snapshot_id()method (for time-travel) should be mutually exclusive with incremental scan methods.
Describe alternatives you've considered
-
Post-filtering - Users could scan the full table and filter by
_snapshot_idmetadata column, but this is inefficient as it scans all data. -
Manual manifest parsing - Users could manually read manifests and filter entries, but this defeats the purpose of having a scan API.
Additional context
- Java implementation reference:
IncrementalDataTableScan.javain apache/iceberg - This feature is commonly requested for building CDC pipelines with Iceberg
- Spark's
incrementalAppendreader option provides similar functionality
Willingness to contribute
Yes