I am unsure if this repo is the right place for any of this, but I just want to get the idea out there!
I am interested in building operational pipelines that can stream new STAC records into a stac-geoparquet store. So far I have just built a pipeline that appends records via Hive-partitioned parquet files, but that solution would not scale well for a continuously updated metadata archive. Apache Iceberg could be a good option because it is simple to append an Arrow table to an Iceberg table transactionally, and queries don't require potentially slow/expensive ListBucket requests.
Here is a rough pseudo-code Python example of how I have written STAC records to Iceberg:
```python
import pyarrow.parquet as pq
from pyiceberg.catalog import load_catalog
from pyiceberg.partitioning import PartitionSpec

# catalog name and connection details come from the pyiceberg config
# (e.g. ~/.pyiceberg.yaml or environment variables)
catalog = load_catalog()

# get an Arrow table with STAC records
# could probably do this with rustac.arrow, too
df = pq.read_table("/path/to/stac.parquet")

table = catalog.create_table(
    f"stac.{collection_id}",  # one table per STAC collection
    schema=df.schema,  # recent pyiceberg versions convert a pyarrow schema automatically
    partition_spec=PartitionSpec(),  # unpartitioned; Iceberg manages the file layout
    properties={"geo": ...},
)
table.append(df)
```
Then subsequent tasks can write items with the same schema to the table with more `table.append(df)` calls in some distributed context, without worrying about manual partitioning or file locks.
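For example, one of those follow-up tasks could look something like this minimal sketch (the `stac.sentinel-2-l2a` identifier and `new-items.parquet` path are hypothetical, and the catalog config is again assumed to come from the pyiceberg config file):

```python
import pyarrow.parquet as pq
from pyiceberg.catalog import load_catalog

catalog = load_catalog()

# hypothetical batch of new STAC items with the same schema as the table
new_items = pq.read_table("/path/to/new-items.parquet")

# each append is a transactional commit against the table metadata,
# so concurrent writers don't need manual partitioning or file locks
table = catalog.load_table("stac.sentinel-2-l2a")
table.append(new_items)
```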
In any case it could be nice to be able to read from an Iceberg table using the DuckDB client. The only change I think you need to make (compared to reading a parquet file) is `iceberg_scan('/path/to/iceberg.metadata.json')` instead of `read_parquet('/path/to/stac.parquet')`.
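For illustration, a minimal sketch of that query shape in plain DuckDB (not tied to this repo's client; the metadata path is the placeholder from above and the selected columns just assume the usual stac-geoparquet layout):

```python
import duckdb

con = duckdb.connect()

# the DuckDB iceberg extension provides iceberg_scan
con.execute("INSTALL iceberg")
con.execute("LOAD iceberg")

# same query shape as read_parquet, just a different scan function
con.sql(
    "SELECT id, collection, datetime "
    "FROM iceberg_scan('/path/to/iceberg.metadata.json') "
    "LIMIT 10"
).show()
```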