GH-49927: [Python][Parquet] Expose bloom_filter_offset and bloom_filter_length to Python in column chunk metadata #49926
Thanks for opening a pull request! If this is not a minor PR. Could you open an issue for this pull request on GitHub? https://github.com/apache/arrow/issues/new/choose Opening GitHub issues ahead of time contributes to the Openness of the Apache Arrow project. Then could you also rename the pull request title in the following format? or See also:
raulcd
left a comment
This seems reasonable to me. Could you update the title to something like the following, to make it clear that this is about exposing this to Python and not about a fix in C++?
GH-49927: [Python][Parquet] Expose bloom_filter_offset and bloom_filter_length to Python in column chunk metadata
@raulcd thanks for reviewing and for the feedback. I've updated the items accordingly 😄
@raulcd there were 2 github actions that failed; 1 is linting-related and another was |
Rationale for this change
The ColumnChunkMetaData.to_dict() method omits bloom_filter_offset and bloom_filter_length even when a bloom filter is written to the Parquet file. Users cannot programmatically verify bloom filter presence via the Python metadata API without resorting to file size comparison.
What changes are included in this PR?
python/pyarrow/includes/libparquet.pxd: Declare bloom_filter_offset() and bloom_filter_length() (both optional[int64_t]) on CColumnChunkMetaData. This exposes the existing C++ methods to Cython.
python/pyarrow/_parquet.pyx: Add bloom_filter_offset and bloom_filter_length properties to ColumnChunkMetaData (each returns an int when set, None otherwise). Add both fields to to_dict() and __repr__.
python/pyarrow/tests/parquet/test_metadata.py: Add test_bloom_filter_offset_in_metadata, verifying that columns with a bloom filter expose non-None integer values and that to_dict() contains the keys, while columns without a bloom filter return None.
Are these changes tested?
Yes.
test_bloom_filter_offset_in_metadata in test_metadata.py covers:
Here is a closer look at the logic output:
output:
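As an illustration of how the two properties described in this PR might be consumed, here is a minimal sketch of a presence check. It is not code from the PR: the helper name bloom_filter_location is hypothetical, and the stand-in objects with made-up values replace real ColumnChunkMetaData instances so the snippet runs without a Parquet file. It assumes only what the description states, namely that each property is an int when set and None otherwise.

```python
from types import SimpleNamespace
from typing import Any, Optional, Tuple


def bloom_filter_location(column_chunk: Any) -> Optional[Tuple[int, Optional[int]]]:
    """Return (offset, length) if the column chunk reports a bloom filter, else None.

    Hypothetical helper. It relies on the bloom_filter_offset and
    bloom_filter_length properties described in this PR. getattr with a
    default keeps it from raising on objects that lack those properties.
    """
    offset = getattr(column_chunk, "bloom_filter_offset", None)
    length = getattr(column_chunk, "bloom_filter_length", None)
    if offset is None:
        # No offset recorded: treat the column chunk as having no bloom filter.
        return None
    # The length is passed through unchanged and may itself be None.
    return (offset, length)


# Stand-ins with made-up values, used in place of real ColumnChunkMetaData.
with_filter = SimpleNamespace(bloom_filter_offset=4096, bloom_filter_length=47)
without_filter = SimpleNamespace(bloom_filter_offset=None, bloom_filter_length=None)

print(bloom_filter_location(with_filter))     # (4096, 47)
print(bloom_filter_location(without_filter))  # None
```

With a real file, the same helper could presumably be passed a column chunk obtained through the existing metadata accessors, for example pq.ParquetFile(path).metadata.row_group(0).column(0), on a pyarrow build that includes this change.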