bench(parquet): add short and large string `arrow_writer` benchmarks by adriangb · Pull Request #10021 · apache/arrow-rs

adriangb · 2026-05-26T22:57:22Z

Which issue does this PR close?

Split out of #9972 per this review comment.

Rationale for this change

#9972 makes the parquet writer's mini-batch sizing byte-budget aware so large variable-width values don't produce oversized data pages. To measure that change against a stable baseline — and in particular to see the difference in the large-string case — these benchmarks belong on main first.
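To make "byte-budget aware" concrete, here is a minimal, standard-library-only sketch of one way such sizing could work. It is not the code from #9972: the function `sub_batch_sizes`, its parameters, and the numbers in `main` are all hypothetical and chosen only for illustration.

```rust
/// Hypothetical helper (not part of the parquet crate): split a run of values,
/// described only by their byte lengths, into sub-batches that hold at most
/// `max_values` values and at most `byte_budget` bytes each. A sub-batch always
/// takes at least one value, so a value larger than the budget gets a
/// sub-batch of its own.
fn sub_batch_sizes(value_lens: &[usize], max_values: usize, byte_budget: usize) -> Vec<usize> {
    let mut sizes = Vec::new();
    let mut count = 0usize; // values in the current sub-batch
    let mut bytes = 0usize; // bytes in the current sub-batch

    for &len in value_lens {
        // Close the current sub-batch if this value would break either limit.
        if count > 0 && (count == max_values || bytes + len > byte_budget) {
            sizes.push(count);
            count = 0;
            bytes = 0;
        }
        count += 1;
        bytes += len;
    }
    if count > 0 {
        sizes.push(count);
    }
    sizes
}

fn main() {
    let budget = 1 << 20; // illustrative 1 MiB byte budget

    // Small values: the value-count limit is hit long before the byte budget,
    // so every sub-batch except the last is a full chunk.
    let small = vec![8usize; 2500];
    println!("{:?}", sub_batch_sizes(&small, 1024, budget)); // [1024, 1024, 452]

    // Values larger than the budget (illustrative 2 MiB each): one per sub-batch.
    let big = vec![2usize << 20; 5];
    println!("{:?}", sub_batch_sizes(&big, 1024, budget)); // [1, 1, 1, 1, 1]
}
```

Under this rule, small values never trigger the byte check, which is the "no granular splitting, no regression" property the short-string benchmark is meant to guard.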

What changes are included in this PR?

Adds two BYTE_ARRAY write cases to the arrow_writer criterion bench:

- `short_string_non_null`: 1M fixed-width 8-byte strings. The small-value hot path, where byte-budget-based sub-batch sizing should always resolve to the full chunk (no granular splitting, no regression).
- `large_string_non_null`: 1024 × 256 KiB strings (256 MiB total). The large-value case: with the default 1 MiB page byte limit each value needs its own page, and a write_batch_size of 1024 would otherwise buffer all 256 MiB before the post-write size check runs.
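The shape of the two inputs can be sketched with the standard library alone. The helpers `make_strings` and `total_bytes` below are hypothetical stand-ins, not the generators used by the `arrow_writer` bench, and `main` uses scaled-down counts so the sketch runs quickly.

```rust
/// Hypothetical generator: `count` ASCII strings, each exactly `len` bytes long.
fn make_strings(count: usize, len: usize) -> Vec<String> {
    (0..count)
        .map(|i| {
            // Vary the fill character per value so the values are not all identical.
            let fill = (b'a' + (i % 26) as u8) as char;
            std::iter::repeat(fill).take(len).collect()
        })
        .collect()
}

/// Total payload size of the values in bytes.
fn total_bytes(values: &[String]) -> usize {
    values.iter().map(|s| s.len()).sum()
}

fn main() {
    // Scaled-down stand-in for the short-string case: many 8-byte values.
    let short = make_strings(1_000, 8);
    println!("short: {} values, {} bytes", short.len(), total_bytes(&short));

    // Scaled-down stand-in for the large-string case: a few 256 KiB values.
    let large = make_strings(4, 256 * 1024);
    println!("large: {} values, {} bytes", large.len(), total_bytes(&large));

    // Full-size arithmetic from the description: 1024 values of 256 KiB is 256 MiB.
    assert_eq!(1024usize * 256 * 1024, 256 * 1024 * 1024);
}
```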

No library code changes — benchmarks only.

Are there any user-facing changes?

No.

🤖 Generated with Claude Code

Adds two BYTE_ARRAY write benchmarks to `arrow_writer`:

- `short_string_non_null`: 1M fixed-width 8-byte strings, the small-value hot path where mini-batch sizing should resolve to the full chunk.
- `large_string_non_null`: 1024 × 256 KiB strings, the large-value case where a single value far exceeds the page byte limit.

Split out of apache#9972 so the page-size fix can be measured against a stable baseline (in particular the large-string case).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

etseidl · 2026-05-26T23:39:36Z

Thanks @adriangb 🚀

github-actions bot added the `parquet` label (Changes to the parquet crate) May 26, 2026

adriangb marked this pull request as ready for review May 26, 2026 22:59

adriangb mentioned this pull request May 26, 2026

fix(parquet): bound data page byte size for large variable-width values #9972

Open

etseidl approved these changes May 26, 2026

etseidl merged commit bbbe8a6 into apache:main May 26, 2026
16 checks passed

etseidl merged 1 commit into apache:main from pydantic:bench-arrow-writer-string-page-size
