Add an in-repo benchmark harness for ArrowReader #2557

@viirya

Is your feature request related to a problem or challenge?

The ArrowReader has been the subject of a dedicated performance epic (#2172). Several optimizations from it have already landed (file-size passthrough #2175, single metadata load for migrated tables #2176, metadata size hint #2173, range coalescing #2181), and more have been proposed (operator caching #2177; same-file metadata caching #2100, since closed).

The problem is that there is currently no benchmark for the reader anywhere in the repo — no benches/, no criterion harness. Every performance claim in #2172 was measured against an external DataFusion Comet workload. That makes it hard for contributors and reviewers to:

  • reproduce the per-FileScanTask overhead the epic describes,
  • evaluate whether a proposed optimization actually helps, and on which scenario,
  • guard against regressions.

This gap had a concrete cost: #2100 (same-file metadata caching) was closed partly because the author could not demonstrate a benefit on their particular workload (a table with a ~1:1 task-to-file ratio, where same-file caching has nothing to hit). With a reproducible same-file-split benchmark in the repo, that kind of optimization could be evaluated objectively.
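
To make the task-to-file ratio argument concrete, here is a minimal sketch that counts how often a per-file metadata cache would be hit. The `Task` struct, `make_tasks`, and `simulate_metadata_cache` are hypothetical names made up for illustration; they are not part of the iceberg crate, and the model ignores everything except which file each task points at.

```rust
use std::collections::HashSet;

/// Hypothetical stand-in for a scan task: only the data file path matters here.
struct Task {
    file_path: String,
}

/// Builds `files * splits_per_file` tasks, with `splits_per_file` tasks per file.
fn make_tasks(files: usize, splits_per_file: usize) -> Vec<Task> {
    (0..files)
        .flat_map(|f| {
            (0..splits_per_file).map(move |_| Task {
                file_path: format!("data/file-{f}.parquet"),
            })
        })
        .collect()
}

/// Returns (metadata_loads, cache_hits), assuming metadata is cached per file
/// path and the cache is never evicted.
fn simulate_metadata_cache(tasks: &[Task]) -> (usize, usize) {
    let mut seen: HashSet<&str> = HashSet::new();
    let mut loads = 0;
    let mut hits = 0;
    for task in tasks {
        // `insert` returns true only the first time a path is seen.
        if seen.insert(task.file_path.as_str()) {
            loads += 1;
        } else {
            hits += 1;
        }
    }
    (loads, hits)
}
```

Under this model, `simulate_metadata_cache(&make_tasks(100, 1))` yields `(100, 0)`: with one task per file the cache never hits. With `make_tasks(10, 10)` the same 100 tasks yield `(10, 90)`, which is the same-file-split shape where such a cache could pay off.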

Describe the solution you'd like

A criterion-based benchmark harness (crates/iceberg/benches/arrow_reader.rs) that writes Parquet files to a local temp dir and reads them back through the normal FileIO path, measuring per-task overhead rather than network latency. Proposed scenarios, chosen to map onto the epic's code paths:

These run on the local FS, so they isolate CPU and per-task work. They are not a substitute for object-store latency benchmarks, but they give a reproducible baseline that any of the remaining #2172 optimizations can be measured against.
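
The write-then-read-back shape of such a harness can be sketched with the standard library alone. This is not the proposed harness: plain byte files stand in for Parquet, `std::time::Instant` stands in for criterion, and direct `std::fs` calls stand in for FileIO. `Task`, `write_fixture`, and `run_tasks` are hypothetical names used only for this illustration.

```rust
use std::fs::{self, File};
use std::io::{Read, Seek, SeekFrom, Write};
use std::path::{Path, PathBuf};
use std::time::{Duration, Instant};

/// Hypothetical stand-in for a scan task: a byte range of one local file.
struct Task {
    path: PathBuf,
    start: u64,
    len: u64,
}

/// Writes `files` files of `file_len` bytes under `dir` and returns `splits`
/// equal-sized tasks per file. Assumes `splits > 0` and that `file_len` is
/// divisible by `splits`.
fn write_fixture(
    dir: &Path,
    files: usize,
    file_len: u64,
    splits: u64,
) -> std::io::Result<Vec<Task>> {
    fs::create_dir_all(dir)?;
    let split_len = file_len / splits;
    let mut tasks = Vec::new();
    for i in 0..files {
        let path = dir.join(format!("file-{i}.bin"));
        File::create(&path)?.write_all(&vec![0u8; file_len as usize])?;
        for s in 0..splits {
            tasks.push(Task {
                path: path.clone(),
                start: s * split_len,
                len: split_len,
            });
        }
    }
    Ok(tasks)
}

/// Opens, seeks, and reads each task independently, so the fixed per-task
/// cost is paid once per task. Returns total bytes read and elapsed time.
fn run_tasks(tasks: &[Task]) -> std::io::Result<(u64, Duration)> {
    let started = Instant::now();
    let mut total = 0u64;
    for task in tasks {
        let mut file = File::open(&task.path)?;
        file.seek(SeekFrom::Start(task.start))?;
        let mut buf = vec![0u8; task.len as usize];
        file.read_exact(&mut buf)?;
        total += buf.len() as u64;
    }
    Ok((total, started.elapsed()))
}
```

Calling `write_fixture` with the same total byte count but different `files` and `splits` values, then timing `run_tasks` on each result, varies the number of tasks while holding the data volume fixed. A real harness would leave warm-up, repetition, and statistics to criterion rather than a single `Instant` measurement.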

Willingness to contribute

I have a branch ready and will open a PR.
