Caching the metadata is essential so make sure you are doing that in some way.

Page size: Since the pyarrow reader does not support a page-level index, what is the point of having multiple pages per row group?

If your goal is to support pyarrow without using a page index, you are not going to get great results. Your read amplification will be controlled only by the row group size, and making the row groups too small leads to a metadata explosion: writes get slow, the initial file open becomes very slow, and you use more RAM for metadata. Completely disabling statistics mitigates this slightly.

I would highly recommend using a parquet reader that can s…

Answer selected by nlgranger