File Format API for PyIceberg #3100

@nssalian

Feature Request / Improvement

Problem

The write path in pyiceberg/io/pyarrow.py is hardcoded to Parquet. The write.format.default table property exists but is never read. Adding a new format (ORC, Vortex, Lance) requires modifying the monolithic write_file() function. The read path already dispatches multiple formats; the write path should too.

Proposal

Introduce a File Format API aligned with Java Iceberg's File Format API (design doc).

New module pyiceberg/io/fileformat.py:

  • FileFormatWriter (ABC)
  • FileFormatModel (ABC)
  • FormatRegistry
  • DataFileStatistics (currently in pyarrow.py, but I think it would be good to consolidate it here for metrics)
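
A minimal sketch of how the pieces listed above might fit together. The class names come from the proposal; every method name and signature (`write`, `close`, `create_writer`, `register`, `get`), the simplified `DataFileStatistics`, the stand-in `FileFormat` enum, and the in-memory toy writer are assumptions made for illustration, not the proposed or existing PyIceberg API.

```python
from __future__ import annotations

from abc import ABC, abstractmethod
from dataclasses import dataclass
from enum import Enum
from typing import Any, Dict, List


class FileFormat(str, Enum):
    # Stand-in for PyIceberg's FileFormat enum (assumption for this sketch).
    PARQUET = "parquet"
    ORC = "orc"


@dataclass
class DataFileStatistics:
    # Simplified placeholder: only a record count is tracked here.
    record_count: int = 0


class FileFormatWriter(ABC):
    """Writes rows to a single data file and reports statistics on close."""

    @abstractmethod
    def write(self, rows: List[Dict[str, Any]]) -> None: ...

    @abstractmethod
    def close(self) -> DataFileStatistics: ...


class FileFormatModel(ABC):
    """Describes one file format and creates writers for it."""

    @property
    @abstractmethod
    def format(self) -> FileFormat: ...

    @abstractmethod
    def create_writer(self, path: str, properties: Dict[str, str]) -> FileFormatWriter: ...


class FormatRegistry:
    """Maps a FileFormat to its model; keyed by FileFormat only."""

    def __init__(self) -> None:
        self._models: Dict[FileFormat, FileFormatModel] = {}

    def register(self, model: FileFormatModel) -> None:
        self._models[model.format] = model

    def get(self, file_format: FileFormat) -> FileFormatModel:
        try:
            return self._models[file_format]
        except KeyError:
            raise ValueError(f"No format model registered for: {file_format.value}") from None


class InMemoryWriter(FileFormatWriter):
    # Toy writer that only counts rows, so the sketch runs without pyarrow.
    def __init__(self) -> None:
        self._count = 0

    def write(self, rows: List[Dict[str, Any]]) -> None:
        self._count += len(rows)

    def close(self) -> DataFileStatistics:
        return DataFileStatistics(record_count=self._count)


class InMemoryParquetModel(FileFormatModel):
    # Hypothetical model standing in for a real ParquetFormatModel.
    @property
    def format(self) -> FileFormat:
        return FileFormat.PARQUET

    def create_writer(self, path: str, properties: Dict[str, str]) -> FileFormatWriter:
        return InMemoryWriter()
```

Keying the registry on the format alone mirrors the "keyed by FileFormat only" note in the Java ↔ Python mapping; a lookup for an unregistered format fails loudly rather than silently falling back to Parquet.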

Changes to pyiceberg/io/pyarrow.py:

  • ParquetFormatWriter / ParquetFormatModel, extracted from the write_parquet() helper that currently lives inside write_file().
  • write_file() refactored to read write.format.default, look up the format model, and dispatch.
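
The dispatch step described above could look roughly like the following. This is a sketch under assumptions: the `write_file()` signature here is invented for illustration and does not match the real function in pyiceberg/io/pyarrow.py, the registry is a plain dict, and the Parquet writer is a stub that returns a string instead of writing a file. The only details taken as given are the property name `write.format.default` and its default of `parquet` from the Iceberg table properties.

```python
from typing import Callable, Dict, List, Mapping

WRITE_FORMAT_DEFAULT = "write.format.default"
WRITE_FORMAT_DEFAULT_VALUE = "parquet"  # Iceberg's default when the property is unset

# Hypothetical writer signature: (path, rows) -> description of what was written.
Writer = Callable[[str, List[dict]], str]

_WRITERS: Dict[str, Writer] = {}


def register_writer(name: str, writer: Writer) -> None:
    """Register a writer under a lower-cased format name."""
    _WRITERS[name.lower()] = writer


def _write_parquet_stub(path: str, rows: List[dict]) -> str:
    # Stub standing in for the extracted Parquet writer.
    return f"parquet:{path}:{len(rows)}"


register_writer("parquet", _write_parquet_stub)


def write_file(properties: Mapping[str, str], path: str, rows: List[dict]) -> str:
    """Read write.format.default, look up the writer, and dispatch to it."""
    name = properties.get(WRITE_FORMAT_DEFAULT, WRITE_FORMAT_DEFAULT_VALUE).lower()
    try:
        writer = _WRITERS[name]
    except KeyError:
        raise ValueError(f"Unsupported write format: {name!r}") from None
    return writer(path, rows)
```

With this shape, adding a format means calling `register_writer()` from the new format's module; `write_file()` itself does not change.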

TCK (technology compatibility kit) in tests/io/test_file_format_tck.py:

  • pytest-parametrized tests covering round-trip, statistics, type coverage, and null handling for every registered format.
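
To illustrate the idea of running one shared check suite over every registered format, here is a toy sketch. It does not use Iceberg or Arrow: the two "formats" are `json` and `pickle` from the standard library, and `CODECS`, `SAMPLE_ROWS`, and the `check_*` functions are hypothetical names. The trailing comment shows how the same checks could be wired to `pytest.mark.parametrize`.

```python
import json
import pickle
from typing import Any, Callable, Dict, List, Tuple

Rows = List[Dict[str, Any]]
Codec = Tuple[Callable[[Rows], bytes], Callable[[bytes], Rows]]

# Toy "registered formats": each entry is an (encode, decode) pair.
CODECS: Dict[str, Codec] = {
    "json": (
        lambda rows: json.dumps(rows).encode("utf-8"),
        lambda data: json.loads(data.decode("utf-8")),
    ),
    "pickle": (pickle.dumps, pickle.loads),
}

# Sample data mixing ints, strings, floats, and nulls.
SAMPLE_ROWS: Rows = [
    {"id": 1, "name": "a", "score": 1.5},
    {"id": 2, "name": None, "score": None},
]


def check_round_trip(fmt: str) -> None:
    """Rows written and read back must equal the input."""
    encode, decode = CODECS[fmt]
    assert decode(encode(SAMPLE_ROWS)) == SAMPLE_ROWS


def check_null_counts(fmt: str) -> None:
    """Nulls must survive the round trip and be counted per column."""
    encode, decode = CODECS[fmt]
    rows = decode(encode(SAMPLE_ROWS))
    null_counts = {col: sum(row[col] is None for row in rows) for col in SAMPLE_ROWS[0]}
    assert null_counts == {"id": 0, "name": 1, "score": 1}


def run_tck() -> List[str]:
    """Run every check against every registered format; return the formats that passed."""
    passed = []
    for fmt in sorted(CODECS):
        check_round_trip(fmt)
        check_null_counts(fmt)
        passed.append(fmt)
    return passed


# With pytest, the same checks could be parametrized over the registry:
#   @pytest.mark.parametrize("fmt", sorted(CODECS))
#   def test_round_trip(fmt):
#       check_round_trip(fmt)
```

Because the checks iterate over the registry rather than naming formats, a newly registered format is covered without editing the test file.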

Phased rollout:

  • ABCs and registry first, then Parquet extraction with TCK tests, then write_file() dispatch

Java ↔ Python Mapping

| Java | Python |
| --- | --- |
| FormatModel<D, S> | FileFormatModel (ABC, no type params) |
| FileAppender<D> / ModelWriteBuilder | FileFormatWriter (ABC) |
| FormatModelRegistry | FormatRegistry (keyed by FileFormat only) |
| Metrics | DataFileStatistics (existing) |
| TCK | test_file_format_tck.py |

Scope

This proposal covers the abstraction layer and the Parquet extraction only. No new format writers are included; ORC write support (#20) and any future formats (Avro, etc.) would be follow-ups once this lands.
