File Format API for PyIceberg

### Feature Request / Improvement

## Problem
The write path in `pyiceberg/io/pyarrow.py` is hardcoded to Parquet. The `write.format.default` table property exists but is never read, so adding a new format (ORC, Vortex, Lance) requires modifying the monolithic `write_file()` function. The read path already dispatches on file format; the write path should too.

## Proposal
Introduce a File Format API aligned with Java Iceberg's [File Format API](https://github.com/apache/iceberg/pull/12774) ([design doc](https://docs.google.com/document/d/1sF_d4tFxJsZWsZFCyCL9ZE7YuI7-P3VrzMLIrrTIxds)).

New module `pyiceberg/io/fileformat.py`:

- `FileFormatWriter` (ABC)
- `FileFormatModel` (ABC)
- `FormatRegistry`
- `DataFileStatistics` (currently in `pyarrow.py`; I think moving it here would be a good way to consolidate metrics)
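
To make the shape of these pieces concrete, here is a minimal sketch of how the ABCs and registry might look. The method names (`write`, `close`, `create_writer`, `register`, `get`), the trimmed `DataFileStatistics` fields, and the local `FileFormat` stand-in are assumptions for illustration, not a settled design.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from enum import Enum
from typing import Any, Dict, List


class FileFormat(str, Enum):
    # Stand-in for PyIceberg's FileFormat enum so the sketch runs on its own.
    PARQUET = "PARQUET"
    ORC = "ORC"


@dataclass
class DataFileStatistics:
    # Hypothetical, trimmed-down fields; the existing class tracks more.
    record_count: int = 0
    null_value_counts: Dict[str, int] = field(default_factory=dict)


class FileFormatWriter(ABC):
    """Writes rows to a single file and reports statistics when closed."""

    @abstractmethod
    def write(self, rows: List[Dict[str, Any]]) -> None: ...

    @abstractmethod
    def close(self) -> DataFileStatistics: ...


class FileFormatModel(ABC):
    """Describes one file format and creates writers for it."""

    @property
    @abstractmethod
    def format(self) -> FileFormat: ...

    @abstractmethod
    def create_writer(self, path: str, properties: Dict[str, str]) -> FileFormatWriter: ...


class FormatRegistry:
    """Maps a FileFormat to its registered model (keyed by format only)."""

    _models: Dict[FileFormat, FileFormatModel] = {}

    @classmethod
    def register(cls, model: FileFormatModel) -> None:
        cls._models[model.format] = model

    @classmethod
    def get(cls, file_format: FileFormat) -> FileFormatModel:
        try:
            return cls._models[file_format]
        except KeyError:
            raise ValueError(f"No format model registered for {file_format.value}") from None
```

With this shape, a new format would only need to subclass the two ABCs and call `FormatRegistry.register(...)`; nothing in the write path itself would change.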

Changes to `pyiceberg/io/pyarrow.py`:

- `ParquetFormatWriter` / `ParquetFormatModel`, extracted from the existing `write_parquet()` helper (currently nested inside `write_file()`).
- `write_file()` refactored to read `write.format.default`, look up the format model, and dispatch.
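
A rough sketch of the dispatch step, assuming the property falls back to `parquet` (the documented Iceberg default) when unset. `resolve_write_format`, `register_writer`, and the simplified `write_file` signature are hypothetical; the real function has a different signature and would look up a format model rather than a plain callable.

```python
from enum import Enum
from typing import Any, Callable, Dict, List, Mapping


class FileFormat(str, Enum):
    # Stand-in for PyIceberg's FileFormat enum so the sketch runs on its own.
    PARQUET = "PARQUET"
    ORC = "ORC"


WRITE_FORMAT_DEFAULT = "write.format.default"

# Hypothetical registry: format -> function that writes rows and returns a summary.
_WRITERS: Dict[FileFormat, Callable[[str, List[Any]], str]] = {}


def register_writer(fmt: FileFormat, fn: Callable[[str, List[Any]], str]) -> None:
    _WRITERS[fmt] = fn


def resolve_write_format(properties: Mapping[str, str]) -> FileFormat:
    """Read the table property, case-insensitively, defaulting to Parquet."""
    raw = properties.get(WRITE_FORMAT_DEFAULT, "parquet")
    try:
        return FileFormat(raw.upper())
    except ValueError:
        raise ValueError(f"Unsupported {WRITE_FORMAT_DEFAULT}: {raw!r}") from None


def write_file(properties: Mapping[str, str], path: str, rows: List[Any]) -> str:
    """Resolve the format from table properties, then dispatch to its writer."""
    fmt = resolve_write_format(properties)
    writer = _WRITERS.get(fmt)
    if writer is None:
        raise NotImplementedError(f"No writer registered for {fmt.value}")
    return writer(path, rows)
```

Failing loudly when the property names a format with no registered writer seems preferable to silently falling back to Parquet, since the table owner asked for something specific.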

Technology Compatibility Kit (TCK) in `tests/io/test_file_format_tck.py`:

- pytest-parametrized tests covering round-trips, statistics, type coverage, and null handling for every registered format.

Phased rollout:

1. ABCs and registry.
2. Parquet extraction, with TCK tests.
3. `write_file()` dispatch.

## Java ↔ Python Mapping

| Java | Python |
|---|---|
| `FormatModel<D, S>` | `FileFormatModel` (ABC, no type params) |
| `FileAppender<D>` / `ModelWriteBuilder` | `FileFormatWriter` (ABC) |
| `FormatModelRegistry` | `FormatRegistry` (keyed by `FileFormat` only) |
| `Metrics` | `DataFileStatistics` (existing) |
| TCK | `test_file_format_tck.py` |


## Scope

This proposal covers the abstraction layer and the Parquet extraction only. No new format writers are included; ORC write support ([#20](https://github.com/apache/iceberg-python/issues/20)) and any future formats (Avro, etc.) would be follow-ups once this lands.

## References

- Java File Format API: [apache/iceberg#12774](https://github.com/apache/iceberg/pull/12774)
- Design doc: [Google Doc](https://docs.google.com/document/d/1sF_d4tFxJsZWsZFCyCL9ZE7YuI7-P3VrzMLIrrTIxds)
- Format impls: Parquet [#15253](https://github.com/apache/iceberg/pull/15253), ORC [#15255](https://github.com/apache/iceberg/pull/15255), Avro [#15254](https://github.com/apache/iceberg/pull/15254)
- TCK: [apache/iceberg#15415](https://github.com/apache/iceberg/issues/15415)
- Prior pyiceberg ORC work: [#20](https://github.com/apache/iceberg-python/issues/20), [#790](https://github.com/apache/iceberg-python/pull/790), [#2236](https://github.com/apache/iceberg-python/pull/2236)
