How to find the best write options for reads of file bytes? #48940
TL;DR: I am struggling to find the optimal configuration for the … The data is: …

What I tried: I noticed that datasets distributed as Parquet files are typically ill suited for fast random row reads, notably a few I have tested from Hugging Face. Going through the options of the …

Could you share some recommendations or guidelines to optimize random row reads? (Also, why are Dataset.take() and Table.take() so damn slow?)
Caching the metadata is essential, so make sure you are doing that in some way.
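One simple way to do this with pyarrow is to keep a single `ParquetFile` handle alive so the footer is parsed once and reused; a minimal sketch, assuming a local file called `data.parquet` (hypothetical path):

```python
import pyarrow.parquet as pq

# Open the file once and keep the handle: pq.ParquetFile parses the footer
# metadata a single time, so later reads reuse it instead of re-reading and
# re-decoding the footer on every lookup.
pf = pq.ParquetFile("data.parquet")

# The parsed metadata is available for inspection (row counts, row groups, ...).
print(pf.metadata.num_rows, pf.metadata.num_row_groups)

# Reads through the same handle do not touch the footer again.
first_group = pf.read_row_group(0)
```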
If your goal is to support pyarrow and not use a page index, you are not going to get great results. Your read amplification will be controlled only by the row group size, and making the row group size too small is going to lead to metadata explosion: slow write times, a very slow initial file open, and more RAM used for metadata. This can be mitigated slightly by completely disabling statistics. I would highly recommend using a parquet reader that supports the page index. If you do, then:

Group size: if you are using the page index then this shouldn't matter. Keep it large.
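For illustration, a hedged sketch of the two write configurations described above, using pyarrow; the file names, toy data, and parameter values are made up, and `write_page_index` assumes a reasonably recent pyarrow version:

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Toy data just to make the sketch runnable.
n = 100_000
table = pa.table({"id": list(range(n)), "payload": [f"row-{i}" for i in range(n)]})

# Reader supports the page index: keep row groups large and write the
# column/offset index so individual pages can be skipped.
pq.write_table(
    table,
    "indexed.parquet",
    row_group_size=n,
    write_page_index=True,
)

# Reader is plain pyarrow with no page index: read amplification is governed by
# row_group_size alone, and if you shrink it, disabling statistics limits the
# metadata blow-up described above.
pq.write_table(
    table,
    "small_groups.parquet",
    row_group_size=10_000,
    write_statistics=False,
)
```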
How are you implementing random access? Are you looking up the row offsets that you need to read in some other structure (an external index, etc.), or by "random access" do you mean "highly selective filter"? If it is the former, then you shouldn't care about the sorting / stats configuration of the file; stats can only get in the way. However, if it is the latter, then:
The filter columns should always be compressed. If the stats / bloom filter do not allow you to page skip then you will need to read the entire filter column, and a smaller column means a smaller read.
There are two ways to do page / group skipping. The first is column statistics (zone maps) and the second is bloom filters. Sorting on the filter columns is essential for column stats. It is not required (and unlikely to help much) if you are instead using bloom filters.
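A sketch of the "highly selective filter" path with column stats (zone maps): sort on the filter column before writing so min/max statistics become tight, keep the column compressed, and let the reader prune row groups via a filter. The column name `user_id`, file names, and toy data are hypothetical:

```python
import random
import pyarrow as pa
import pyarrow.parquet as pq

# Hypothetical data with a filter column.
n = 100_000
table = pa.table({
    "user_id": [random.randrange(10_000) for _ in range(n)],
    "payload": [f"row-{i}" for i in range(n)],
})

# Sort on the filter column so per-row-group (and per-page) min/max statistics
# become tight enough to allow skipping.
sorted_table = table.sort_by([("user_id", "ascending")])

pq.write_table(
    sorted_table,
    "sorted.parquet",
    compression="zstd",       # keep the filter column compressed
    write_page_index=True,    # page-level stats for readers that understand them
)

# A selective filter now lets the reader prune row groups using the statistics
# instead of scanning the whole file.
hits = pq.read_table("sorted.parquet", filters=[("user_id", "=", 1234)])
```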
If you are using an external index and looking up rows by row offset then you will want to consider disabling dictionary encoding, as it will force additional IOPS at read time. However, doing so can be a significant hit to compression.
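And a sketch of the external-index path: given a global row offset obtained elsewhere, use the row-group metadata to read only the containing row group; `use_dictionary=False` is the knob mentioned above. The file name, toy data, and `read_row` helper are hypothetical:

```python
import pyarrow as pa
import pyarrow.parquet as pq

n = 100_000
table = pa.table({"payload": [f"row-{i}" for i in range(n)]})

# Dictionary encoding disabled, as discussed, to avoid the extra dictionary-page
# reads at lookup time; expect the file to compress noticeably worse.
pq.write_table(table, "rows.parquet", use_dictionary=False, row_group_size=10_000)

def read_row(path: str, row_offset: int) -> pa.Table:
    """Fetch a single row by its global row offset using the row-group metadata."""
    pf = pq.ParquetFile(path)  # in real code, cache and reuse this handle
    remaining = row_offset
    for rg in range(pf.metadata.num_row_groups):
        rows_in_group = pf.metadata.row_group(rg).num_rows
        if remaining < rows_in_group:
            # Read only the containing row group, then slice out the one row.
            return pf.read_row_group(rg).slice(remaining, 1)
        remaining -= rows_in_group
    raise IndexError(f"row offset {row_offset} out of range")

print(read_row("rows.parquet", 12_345))
```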