Table scan rejects current-schema column names after UpdateSchemaAction commit
Label: bug
Is your feature request related to a problem or challenge?
A default TableScanBuilder::build() validates caller-supplied column names against the snapshot's schema, not the table's current schema. After an UpdateSchemaAction commit changes the current schema (rename / add / delete column), pre-existing snapshots still point at the pre-evolution schema_id, so the scan rejects names that are valid against the post-evolution schema.
Reproducer
Setup: any Iceberg table with at least one snapshot. Apply a schema-evolution transaction (uses the action shipped in #2120 / UpdateSchemaAction):
let tx = Transaction::new(&table);
let action = tx.update_schema()
    .add_column(AddColumn::optional("note", Type::Primitive(PrimitiveType::String)));
let tx = action.apply(tx)?;
let table = tx.commit(&catalog).await?;
The catalog now reports the post-evolution schema (verified via catalog.load_table().metadata().current_schema()). But a scan over the same Table:
table.scan().select(["note"]).build()
returns:
DataInvalid => Column note not found in table. Schema: table {
1: id: optional long
2: name: optional string
3: tmp: optional double
}
The schema dump is the snapshot's schema — the column added a moment ago is missing.
Root cause
crates/iceberg/src/scan/mod.rs:221:
let schema = snapshot.schema(self.table.metadata())?;
snapshot.schema(metadata) resolves the snapshot's schema_id against metadata.schemas and returns the schema the snapshot was written under. For time-travel scans (.snapshot_id(...)) that's exactly right — the caller is asking for "the table as it existed at this snapshot." But for a default scan, the caller is asking for "the table as it is now," and the post-evolution columns are legitimately part of that vocabulary.
The downstream Parquet projection (crates/iceberg/src/arrow/reader/projection.rs::get_arrow_projection_mask_with_field_ids) already maps field IDs to on-disk column names via PARQUET:field_id metadata, so resolving names against the current schema is safe end-to-end — field IDs are stable across schema versions, and the file's original column names live in the parquet metadata until the file is rewritten. PyIceberg's reader (pyiceberg/io/pyarrow.py::_task_to_record_batches) implements exactly this pattern: project by field ID, rename the arrow batch on the way out.
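To make the field-ID argument concrete, here is a minimal self-contained sketch. It does not use the crate's types: ToySchema, ToyDataFile, resolve_on_disk and renamed_column_example are hypothetical names invented for illustration, and the rename of tmp to temperature is an assumed scenario. It only shows that a name resolved against the current schema still reaches the on-disk column, because the lookup goes through the field ID.

```rust
use std::collections::HashMap;

/// Toy stand-in for an Iceberg schema: column name -> field ID.
/// Hypothetical type, not the crate's `Schema`.
struct ToySchema {
    field_id_by_name: HashMap<String, i32>,
}

/// Toy stand-in for a Parquet footer: field ID -> on-disk column name.
struct ToyDataFile {
    column_by_field_id: HashMap<i32, String>,
}

/// Resolve a requested name to the column stored in the file.
/// Returns `None` if the schema does not know the name, or if the
/// file was written before the column existed.
fn resolve_on_disk<'a>(
    schema: &ToySchema,
    file: &'a ToyDataFile,
    name: &str,
) -> Option<&'a str> {
    let field_id = schema.field_id_by_name.get(name)?;
    file.column_by_field_id.get(field_id).map(String::as_str)
}

/// Assumed scenario: the file was written with `tmp` (field ID 3);
/// the current schema renamed it to `temperature` and added `note` (ID 4).
fn renamed_column_example() -> (ToySchema, ToyDataFile) {
    let current = ToySchema {
        field_id_by_name: HashMap::from([
            ("id".to_string(), 1),
            ("name".to_string(), 2),
            ("temperature".to_string(), 3),
            ("note".to_string(), 4),
        ]),
    };
    let file = ToyDataFile {
        column_by_field_id: HashMap::from([
            (1, "id".to_string()),
            (2, "name".to_string()),
            (3, "tmp".to_string()),
        ]),
    };
    (current, file)
}
```

With these toy values, asking for temperature resolves to the on-disk column tmp. Asking for note yields no on-disk column, which a real reader would surface as a null-filled column rather than an error.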
Why this wasn't caught upstream
UpdateSchemaAction (#2120) shipped with metadata-only tests in crates/catalog/loader/tests/schema_update_suite.rs; none of them call table.scan().select(...) after the schema commit. The pre-existing crates/integration_tests/tests/read_evolved_schema.rs only uses table.scan().build() with no select(...) call, which bypasses the column-name validation loop entirely (it falls through to column_names.unwrap_or_else(|| schema.as_struct().fields()...)).
So a column-name lookup combined with a schema-evolved table is the gap. Both add_column and delete_column (already in main) trigger it; rename_column (#2563) trips it even more cleanly because the old name continues to exist on disk.
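The reason a scan with no explicit column selection never hits the error can be illustrated with a small model of the resolution step. This is a simplified sketch and not the code in scan/mod.rs: resolve_columns is a hypothetical function, and the schema is reduced to a slice of column names.

```rust
/// Toy model of column resolution in a scan builder.
/// `schema_columns` stands in for whichever schema the builder resolved.
fn resolve_columns(
    schema_columns: &[&str],
    requested: Option<Vec<String>>,
) -> Result<Vec<String>, String> {
    // Validation only runs when the caller supplied names.
    if let Some(names) = &requested {
        for name in names {
            if !schema_columns.contains(&name.as_str()) {
                return Err(format!("Column {name} not found in table"));
            }
        }
    }
    // With no requested names, fall through to "all columns of the
    // schema", so a stale schema is never checked against anything.
    Ok(requested.unwrap_or_else(|| {
        schema_columns.iter().map(|c| c.to_string()).collect()
    }))
}
```

Calling resolve_columns with None against the stale three-column schema succeeds silently, while Some(vec!["note".to_string()]) fails. A regression test therefore needs both a schema commit and an explicit column selection.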
Describe the solution you'd like
Branch on whether the caller asked for a specific snapshot:
let schema = if self.snapshot_id.is_some() {
snapshot.schema(self.table.metadata())?
} else {
self.table.metadata().current_schema().clone()
};
- Explicit snapshot_id (time-travel): keep the snapshot-time vocabulary. A caller asking "what existed at snapshot N" should see schema N's columns.
- Default scan (no snapshot_id): use the table's current schema. Field IDs are stable across schemas, so the downstream projection still finds the right on-disk columns.
Both the column-name validation loop and the subsequent field_id_by_name lookup share the same schema variable, so the fix is one assignment.
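The intended behaviour of the branch can be checked in isolation with a toy model of table metadata. Everything here is hypothetical and only mirrors the shape of the proposal: ToyMetadata, scan_schema and evolved_table are not crate APIs, and the snapshot and schema IDs are made up.

```rust
use std::collections::HashMap;

/// Toy table metadata; schemas are reduced to lists of column names.
struct ToyMetadata {
    current_schema_id: i32,
    schemas: HashMap<i32, Vec<&'static str>>, // schema_id -> columns
    snapshot_schema_ids: HashMap<i64, i32>,   // snapshot_id -> schema_id
}

/// Pick the schema used for name validation:
/// - explicit snapshot ID: the schema that snapshot was written under
/// - no snapshot ID: the table's current schema
fn scan_schema(meta: &ToyMetadata, snapshot_id: Option<i64>) -> Option<&[&'static str]> {
    let schema_id = match snapshot_id {
        Some(id) => *meta.snapshot_schema_ids.get(&id)?,
        None => meta.current_schema_id,
    };
    meta.schemas.get(&schema_id).map(Vec::as_slice)
}

/// Assumed scenario: snapshot 1 was written under schema 0, then a
/// schema update added `note` and made schema 1 current.
fn evolved_table() -> ToyMetadata {
    ToyMetadata {
        current_schema_id: 1,
        schemas: HashMap::from([
            (0, vec!["id", "name", "tmp"]),
            (1, vec!["id", "name", "tmp", "note"]),
        ]),
        snapshot_schema_ids: HashMap::from([(1, 0)]),
    }
}
```

In this model, scan_schema(&meta, None) sees note, scan_schema(&meta, Some(1)) does not, and an unknown snapshot ID yields None. The first two cases correspond to the default-scan and time-travel cases described above.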
Willingness to contribute
I can contribute this independently. I have a working branch with the fix + three regression tests (rename-then-read works, old-name-after-rename errors, time-travel still uses snapshot schema), all 1299 iceberg lib tests passing, clippy + rustfmt clean. PR ready to open once this issue is filed for reference.