Skip to content

Support incremental reads between snapshot-ids #2152

@xanderbailey

Description

@xanderbailey

Is your feature request related to a problem or challenge?

Feature Request: Incremental Snapshot Scanning (from/to snapshot-id)

Is your feature request related to a problem? Please describe.

Currently, iceberg-rust only supports reading a table at a specific snapshot (time-travel) or the current snapshot. There is no way to read only the data that was added between two snapshots.

The Java Iceberg client supports this via IncrementalDataTableScan:

// Java API
TableScan scan = table.newIncrementalDataTableScan()
    .fromSnapshotExclusive(100)
    .toSnapshot(200);

This is a critical feature for:

  • Change Data Capture (CDC) - Reading only new/changed data for downstream systems
  • Incremental ETL pipelines - Processing only new data since the last checkpoint
  • Efficient data synchronization - Syncing only deltas between systems
  • Streaming workloads - Reading appends as they happen

Describe the solution you'd like

Add incremental scan support to TableScanBuilder with methods similar to the Java client:

// Scan changes between two snapshots (from exclusive, to inclusive)
let scan = table.scan()
    .from_snapshot_exclusive(from_id)
    .to_snapshot(to_id)
    .build()?;

// Scan with inclusive from
let scan = table.scan()
    .from_snapshot_inclusive(from_id)
    .to_snapshot(to_id)
    .build()?;

// Convenience methods
let scan = table.scan().appends_after(from_id).build()?;
let scan = table.scan().appends_between(from_id, to_id).build()?;

Additionally, expose this feature through the DataFusion integration:

// DataFusion integration
let provider = IcebergStaticTableProvider::try_new_incremental(table, from_id, to_id).await?;
ctx.register_table("changes", Arc::new(provider))?;
let df = ctx.sql("SELECT * FROM changes").await?;

Implementation Notes

Based on the Java implementation (IncrementalDataTableScan.java):

  1. Snapshot Range Validation - Walk the snapshot ancestry chain to validate that from_snapshot is an ancestor of to_snapshot

  2. Manifest Entry Filtering - Only include manifest entries where:

    • status == ADDED (not EXISTING or DELETED)
    • snapshot_id is within the specified range
  3. Operation Validation - Initially only support APPEND operations (same as Java). OVERWRITE and DELETE operations require additional handling for delete files.

  4. Mutual Exclusivity - The snapshot_id() method (for time-travel) should be mutually exclusive with incremental scan methods.

Describe alternatives you've considered

  1. Post-filtering - Users could scan the full table and filter by _snapshot_id metadata column, but this is inefficient as it scans all data.

  2. Manual manifest parsing - Users could manually read manifests and filter entries, but this defeats the purpose of having a scan API.

Additional context

  • Java implementation reference: IncrementalDataTableScan.java in apache/iceberg
  • This feature is commonly requested for building CDC pipelines with Iceberg
  • Spark's incrementalAppend reader option provides similar functionality

Willingness to contribute

Yes

Metadata

Metadata

Assignees

No one assigned

    Labels

    enhancementNew feature or request

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions