Hi everyone,
I am working with the Apache Arrow C++ Dataset API to scan multiple Parquet files. My goal is to process RecordBatches in parallel using a callback function without materializing the entire table at once.
Following the Dataset Tutorial, I am using the Scan method. According to the documentation:
"If multiple threads are used (via use_threads), the visitor will be invoked from those threads and is responsible for any synchronization."
However, in my implementation the visitor function is called strictly in order (one call only begins after the previous one finishes), even though use_threads is set to true. I have also tried ScanBatchesUnordered, but I see the same serial behavior.
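A minimal sketch of my setup follows. The directory path is a placeholder, and the error handling assumes this runs inside a function returning arrow::Status:

```cpp
#include <iostream>
#include <memory>
#include <thread>

#include <arrow/api.h>
#include <arrow/dataset/api.h>
#include <arrow/filesystem/api.h>

arrow::Status ScanDataset() {
  // Discover all Parquet files under a directory (path is a placeholder).
  auto fs = std::make_shared<arrow::fs::LocalFileSystem>();
  arrow::fs::FileSelector selector;
  selector.base_dir = "/data/parquet";
  selector.recursive = true;

  auto format = std::make_shared<arrow::dataset::ParquetFileFormat>();
  ARROW_ASSIGN_OR_RAISE(
      auto factory, arrow::dataset::FileSystemDatasetFactory::Make(
                        fs, selector, format,
                        arrow::dataset::FileSystemFactoryOptions{}));
  ARROW_ASSIGN_OR_RAISE(auto dataset, factory->Finish());

  // Build the scanner with threading enabled.
  ARROW_ASSIGN_OR_RAISE(auto builder, dataset->NewScan());
  ARROW_RETURN_NOT_OK(builder->UseThreads(true));
  ARROW_ASSIGN_OR_RAISE(auto scanner, builder->Finish());

  // Scan() calls the visitor once per RecordBatch; per the docs the
  // visitor may be invoked from multiple threads.
  return scanner->Scan([](arrow::dataset::TaggedRecordBatch batch) {
    std::cout << "ThreadId " << std::this_thread::get_id()
              << " got batch with " << batch.record_batch->num_rows()
              << " rows" << std::endl;
    return arrow::Status::OK();
  });
}
```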
Observed behavior: when running this against 6 Parquet files (approx. 5 GiB total), the timestamps in the output show a consistent 5-second gap between batches, and every call lands on the same thread:
ThreadId 133083219080896 got batch with 122880 rows at 2026-03-20 09:25:04.164492317
ThreadId 133083219080896 got batch with 122880 rows at 2026-03-20 09:25:09.165123649
ThreadId 133083219080896 got batch with 122880 rows at 2026-03-20 09:25:14.167036139
Apache Arrow version: 22
Am I missing a configuration step in ScanOptions or ScannerBuilder to actually trigger parallel execution of the visitor? Is there a preferred way to handle parallel callbacks in the Dataset API?
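For reference, the ScanBatchesUnordered variant I tried looks roughly like this. This is a sketch: `scanner` is an arrow::dataset::Scanner built with use_threads enabled, and the field access on EnumeratedRecordBatch is my reading of the dataset headers:

```cpp
#include <iostream>
#include <memory>

#include <arrow/api.h>
#include <arrow/dataset/api.h>

// Drain the unordered batch iterator; batches arrive in whatever order
// the scan tasks complete, but Next() itself is still a pull-based call.
arrow::Status DrainUnordered(
    const std::shared_ptr<arrow::dataset::Scanner>& scanner) {
  ARROW_ASSIGN_OR_RAISE(auto batches, scanner->ScanBatchesUnordered());
  while (true) {
    ARROW_ASSIGN_OR_RAISE(auto enumerated, batches.Next());
    if (arrow::IsIterationEnd(enumerated)) break;
    std::shared_ptr<arrow::RecordBatch> rb = enumerated.record_batch.value;
    std::cout << "got batch with " << rb->num_rows() << " rows\n";
  }
  return arrow::Status::OK();
}
```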
Thanks for your help!