How to find the best write options for reads of file bytes? #48940
TL;DR: I am struggling to find the optimal configuration for the … The data is: …

What I tried: I noticed that datasets distributed as Parquet files are typically ill suited for fast random row reads, notably a few I have tested from Hugging Face. Going through the options of the …

Could you share some recommendations or guidelines to optimize random row reads? (Also, why are Dataset.take() and Table.take() so damn slow?)
Caching the metadata is essential, so make sure you are doing that in some way.
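One simple way to do this with pyarrow is to keep a single `ParquetFile` handle alive so the footer is parsed once and reused; a minimal sketch, assuming a local file called `data.parquet` (hypothetical path):

```python
import pyarrow.parquet as pq

# Open the file once and keep the handle: pq.ParquetFile parses the footer
# metadata a single time, so later reads reuse it instead of re-reading and
# re-decoding the footer on every lookup.
pf = pq.ParquetFile("data.parquet")

# The parsed metadata is available for inspection (row counts, row groups, ...).
print(pf.metadata.num_rows, pf.metadata.num_row_groups)

# Reads through the same handle do not touch the footer again.
first_group = pf.read_row_group(0)
```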
If your goal is to support pyarrow and not use a page index, you are not going to get great results. Your read amplification will be controlled only by the row group size, and making the row group size too small is going to lead to metadata explosion: slow write times, a very slow initial file open, and more RAM used for metadata. This can be mitigated slightly by completely disabling statistics. I would highly recommend using a parquet reader that supports the page index. If you do, then:

Group size: if you are using the page index then this shouldn't matter. Keep it large.
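For illustration, a hedged sketch of the two write configurations described above, using pyarrow; the file names, toy data, and parameter values are made up, and `write_page_index` assumes a reasonably recent pyarrow version:

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Toy data just to make the sketch runnable.
n = 100_000
table = pa.table({"id": list(range(n)), "payload": [f"row-{i}" for i in range(n)]})

# Reader supports the page index: keep row groups large and write the
# column/offset index so individual pages can be skipped.
pq.write_table(
    table,
    "indexed.parquet",
    row_group_size=n,
    write_page_index=True,
)

# Reader is plain pyarrow with no page index: read amplification is governed by
# row_group_size alone, and if you shrink it, disabling statistics limits the
# metadata blow-up described above.
pq.write_table(
    table,
    "small_groups.parquet",
    row_group_size=10_000,
    write_statistics=False,
)
```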
How are you implementing random access? Are you looking up the row offsets that you need to read in some other structure (an external index, etc.), or by "random access" do you mean "highly selective filter"? If it is the former, then you shouldn't care about the sorting / stats configuration of the file; stats can only get in the way. However, if it is the latter, then:
The filter columns should always be compressed. If the stats / bloom filter do not allow you to page skip then you will need to read the entire filter column, and a smaller column means a smaller read.
There are two ways to do page / group skipping. The first is column statistics (zone maps) and the second is bloom filters. Sorting on the filter columns is essential for column stats. It is not required (and unlikely to help much) if you are instead using bloom filters.
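A sketch of the "highly selective filter" path with column stats (zone maps): sort on the filter column before writing so min/max statistics become tight, keep the column compressed, and let the reader prune row groups via a filter. The column name `user_id`, file names, and toy data are hypothetical:

```python
import random
import pyarrow as pa
import pyarrow.parquet as pq

# Hypothetical data with a filter column.
n = 100_000
table = pa.table({
    "user_id": [random.randrange(10_000) for _ in range(n)],
    "payload": [f"row-{i}" for i in range(n)],
})

# Sort on the filter column so per-row-group (and per-page) min/max statistics
# become tight enough to allow skipping.
sorted_table = table.sort_by([("user_id", "ascending")])

pq.write_table(
    sorted_table,
    "sorted.parquet",
    compression="zstd",       # keep the filter column compressed
    write_page_index=True,    # page-level stats for readers that understand them
)

# A selective filter now lets the reader prune row groups using the statistics
# instead of scanning the whole file.
hits = pq.read_table("sorted.parquet", filters=[("user_id", "=", 1234)])
```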
If you are using an external index and looking up rows by row offset then you will want to consider disabling dictionary encoding, as it will force additional IOPS at read time. However, doing so can be a significant hit to compression.
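And a sketch of the external-index path: given a global row offset obtained elsewhere, use the row-group metadata to read only the containing row group; `use_dictionary=False` is the knob mentioned above. The file name, toy data, and `read_row` helper are hypothetical:

```python
import pyarrow as pa
import pyarrow.parquet as pq

n = 100_000
table = pa.table({"payload": [f"row-{i}" for i in range(n)]})

# Dictionary encoding disabled, as discussed, to avoid the extra dictionary-page
# reads at lookup time; expect the file to compress noticeably worse.
pq.write_table(table, "rows.parquet", use_dictionary=False, row_group_size=10_000)

def read_row(path: str, row_offset: int) -> pa.Table:
    """Fetch a single row by its global row offset using the row-group metadata."""
    pf = pq.ParquetFile(path)  # in real code, cache and reuse this handle
    remaining = row_offset
    for rg in range(pf.metadata.num_row_groups):
        rows_in_group = pf.metadata.row_group(rg).num_rows
        if remaining < rows_in_group:
            # Read only the containing row group, then slice out the one row.
            return pf.read_row_group(rg).slice(remaining, 1)
        remaining -= rows_in_group
    raise IndexError(f"row offset {row_offset} out of range")

print(read_row("rows.parquet", 12_345))
```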