@alamb commented Dec 5, 2025

The initial draft was created using Codex with the following prompt, in case anyone is interested:

Please write a technical blog post about the new `ParquetPushDecoder` titled

"Push Decoder: Fine-Grained Control over IO and CPU when Reading Parquet Files"

It should have a publish date of December 17, 2025

It should have the same writing style and high level formatting as _posts/2025-10-23-rust-parquet-metadata.md

The blog post will be about the push Parquet decoder and how it can be used to offer more fine-grained control over IO and CPU work in the Parquet reader.

The blog post would cover:

* Motivation: why do we need a push decoder?
* Design: how does the push decoder work?
* Examples: how to use the push decoder in practice
* Performance: how does the push decoder perform compared to the existing parquet reader?
* Future work: what are the next steps for the push decoder?


In the background section, be sure to mention:
* arrow-rs already has push decoders for csv and json (include links to their documentation)
* we needed two distinct decoders for parquet already, sync and `async` which led to code duplication
* the existing decoders were hard to integrate with other IO sources (they needed "first party" support for object_store, but why not for other IO sources like OpenDAL?)

The motivation section should mention:
* how would we support more fine-grained prefetching (we can already prefetch at row-group granularity)
* This is the sans-IO approach (https://sans-io.readthedocs.io/) applied to columnar file formats
* Include a diagram showing the control flow for a standard "pull" decoder:
** the request for the next batch of data eventually results in an IO request issued by the decoder itself
* Include a diagram for how push decoders work

The examples section should include:
* examples from the documentation https://docs.rs/parquet/latest/parquet/arrow/push_decoder/struct.ParquetPushDecoder.html
* you can also find the examples here: https://github.com/apache/arrow-rs/blob/main/parquet/src/arrow/push_decoder/mod.rs

Please include details from the following github tickets:
* https://github.com/apache/arrow-rs/issues/8035
* Use the background, motivation, and high level design description on this Github ticket: https://github.com/apache/arrow-rs/issues/7983

@alamb alamb marked this pull request as draft December 5, 2025 14:59

github-actions bot commented Dec 5, 2025

Preview URL: https://alamb.github.io/arrow-site

If the preview URL doesn't work, you may have forgotten to configure your fork repository for preview.
See https://github.com/apache/arrow-site/blob/main/README.md#forks for how to configure it.

Development

Successfully merging this pull request may close these issues.

Blog post about parquet push decoder