Apache Iceberg read/write support #942

@hrodmn

Description

I am unsure if this repo is the right place for any of this, but I just want to get the idea out there!

I am interested in building operational pipelines that can stream new STAC records into a stac-geoparquet store. So far I have built a pipeline that appends records via hive-partitioned parquet files, but that approach would not scale well for a continuously updated metadata archive. Apache Iceberg could be a good option: it is simple to append an Arrow table to an Iceberg table transactionally, and queries don't require potentially slow/expensive ListBucket requests.

Here is a rough pseudo-code Python example of how I have written STAC records to Iceberg:

```python
import pyarrow.parquet as pq
from pyiceberg.catalog import load_catalog
from pyiceberg.partitioning import PartitionSpec

catalog = load_catalog()

# get arrow table with stac records
# could probably do this with rustac.arrow, too
df = pq.read_table("/path/to/stac.parquet")

collection_id = "my-collection"  # example collection ID

table = catalog.create_table(
    f"stac.{collection_id}",
    schema=df.schema,  # need to convert to an actual Iceberg schema but you get the idea
    partition_spec=PartitionSpec(),
    properties={"geo": ...},
)

table.append(df)
```

Subsequent tasks can then write items with the same schema via additional `table.append(df)` calls, even in a distributed context, without worrying about manual partitioning or file locks.

In any case, it would be nice to be able to read from an Iceberg table using the DuckDB client. The only change I think you would need to make (compared to reading a parquet file) is `iceberg_scan('/path/to/iceberg.metadata.json')` instead of `read_parquet('/path/to/stac.parquet')`.
