Add an in-repo benchmark harness for ArrowReader

Is your feature request related to a problem or challenge?

The `ArrowReader` has been the subject of a dedicated performance epic (#2172), and several optimizations have already landed from it (file-size passthrough #2175, single metadata load for migrated tables #2176, metadata size hint #2173, range coalescing #2181), with more proposed (operator caching #2177; same-file metadata caching #2100, since closed).

The problem is that **there is currently no benchmark for the reader anywhere in the repo** — no `benches/`, no criterion harness. Every performance claim in #2172 was measured against an external DataFusion Comet workload. That makes it hard for contributors and reviewers to:

- reproduce the per-`FileScanTask` overhead the epic describes,
- evaluate whether a proposed optimization actually helps, and on which scenario,
- guard against regressions.

This gap had a concrete cost: #2100 (same-file metadata caching) was closed partly because the author could not demonstrate a benefit on their particular workload (a table with a ~1:1 task-to-file ratio, where same-file caching has nothing to hit). With a reproducible same-file-split benchmark in the repo, that kind of optimization could be evaluated objectively.
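To make the task-to-file ratio argument concrete, here is a minimal sketch that counts metadata loads with and without a same-file cache. It is a toy model, not the design from #2100: `metadata_loads` is a hypothetical helper, tasks are identified only by file path, and the cache is assumed to be unbounded and keyed by path.

```rust
use std::collections::HashSet;

/// Counts metadata loads for a sequence of scan tasks, each identified only
/// by the path of the data file it reads (hypothetical model).
///
/// Returns `(loads_without_cache, loads_with_cache)`. The cached figure
/// assumes an unbounded cache keyed by file path, which is an assumption of
/// this sketch rather than the design proposed in #2100.
fn metadata_loads(task_paths: &[&str]) -> (usize, usize) {
    // Without a cache, every task loads the metadata of its file.
    let uncached = task_paths.len();
    // With a per-path cache, only the first task for each file loads it.
    let distinct: HashSet<&str> = task_paths.iter().copied().collect();
    (uncached, distinct.len())
}
```

Under this model, three tasks over three distinct files give `(3, 3)`, so the cache saves nothing, while 32 tasks over one file give `(32, 1)`. That is the difference a same-file-split benchmark would expose and a ~1:1 workload cannot.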

Describe the solution you'd like

A criterion-based benchmark harness (`crates/iceberg/benches/arrow_reader.rs`) that writes Parquet files to a local temp dir and reads them back through the normal `FileIO` path, measuring per-task overhead rather than network latency. Proposed scenarios, chosen to map onto the epic's code paths:

- **many_small_files** — scans of 16/64/256 small files; per-file overhead in files/sec.
- **concurrency** — a fixed corpus at concurrency 1/4/16 (single-concurrency fast path vs buffered/flattened path).
- **migrated_table** — files without embedded field IDs, read via name mapping (the #2176 path).
- **same_file_splits** — one multi-row-group file read as 1/8/32 byte-range tasks (the #2100 / item-5 path).
- **with_predicate** — scans with a bound predicate, row-group filtering and row selection enabled.

These run on the local FS, so they isolate CPU and per-task work. They are not a substitute for object-store latency benchmarks, but they give a reproducible baseline that any of the remaining #2172 optimizations can be measured against.
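As an illustration of how the same_file_splits scenario could derive its 1/8/32 tasks, the sketch below divides a file into contiguous byte ranges. It assumes each task is described by a start offset and a length. `split_byte_ranges` is a hypothetical helper, not an existing function in the crate, and it splits evenly by bytes without aligning to row-group boundaries.

```rust
/// Splits a file of `file_len` bytes into `n` contiguous, non-overlapping
/// `(start, length)` byte ranges that together cover the whole file.
/// Hypothetical helper for illustration only.
fn split_byte_ranges(file_len: u64, n: u64) -> Vec<(u64, u64)> {
    assert!(n > 0, "need at least one split");
    let base = file_len / n;
    let rem = file_len % n;
    let mut ranges = Vec::with_capacity(n as usize);
    let mut start = 0;
    for i in 0..n {
        // The first `rem` ranges absorb one extra byte each, so the
        // lengths always sum to `file_len`.
        let len = base + u64::from(i < rem);
        ranges.push((start, len));
        start += len;
    }
    ranges
}
```

Calling it with `n` set to 1, 8 and 32 on the same file length yields the three task lists for the scenario. Every task points at the same file, which is the case where per-task metadata loading is repeated.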

Willingness to contribute

I have a branch ready and will open a PR.

Add an in-repo benchmark harness for ArrowReader #2557

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Add an in-repo benchmark harness for ArrowReader #2557

Description

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions