bench(parquet): add short and large string arrow_writer benchmarks#10021

Merged
etseidl merged 1 commit into apache:main from pydantic:bench-arrow-writer-string-page-size on May 26, 2026
@adriangb
Contributor

Which issue does this PR close?

Split out of #9972 per this review comment.

Rationale for this change

#9972 makes the parquet writer's mini-batch sizing byte-budget aware so that large variable-width values don't produce oversized data pages. To measure that change against a stable baseline, and in particular to see the difference in the large-string case, these benchmarks need to land on main first.

What changes are included in this PR?

Adds two BYTE_ARRAY write cases to the arrow_writer criterion bench:

  • short_string_non_null — 1M fixed-width 8-byte strings. The small-value hot path, where byte-budget-based sub-batch sizing should always resolve to the full chunk (no granular splitting, no regression).
  • large_string_non_null — 1024 × 256 KiB strings (256 MiB total). The large-value case: with the default 1 MiB page byte limit each value needs its own page, and a write_batch_size of 1024 would otherwise buffer all 256 MiB before the post-write size check runs.
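
To make the arithmetic behind the two cases concrete, here is a minimal sketch of byte-budget-based sub-batch sizing. It is an illustration only: `sub_batch_len` and the constants are hypothetical names, not parquet crate APIs, and the rule it uses (stop before the running byte total would exceed the budget, but always take at least one value) is an assumed simplification rather than the logic in #9972. The real writer's page accounting may differ.

```rust
// Hypothetical sketch: these names are illustrative, not parquet crate APIs.

const KIB: usize = 1024;
const MIB: usize = 1024 * KIB;

/// Number of leading values to write before the next page-size check:
/// at most `max_values`, and no more than fit within `byte_budget`.
/// When `max_values >= 1` and the input is non-empty, at least one value
/// is always taken (even if it alone exceeds the budget) so progress is made.
fn sub_batch_len(value_lens: &[usize], max_values: usize, byte_budget: usize) -> usize {
    let mut bytes = 0usize;
    let mut n = 0usize;
    for &len in value_lens.iter().take(max_values) {
        // Stop once adding the next value would push past the budget.
        if n > 0 && bytes + len > byte_budget {
            break;
        }
        bytes += len;
        n += 1;
    }
    n
}

fn main() {
    let write_batch_size = 1024; // batch size named in the description
    let page_byte_limit = MIB; // 1 MiB page byte limit named in the description

    // Value lengths only; the payload bytes do not matter for the sizing.
    let short = vec![8usize; 1_000_000]; // short_string_non_null
    let large = vec![256 * KIB; 1024]; // large_string_non_null

    for (name, lens) in [("short", &short), ("large", &large)] {
        // Bytes buffered if a full write_batch_size batch is written blindly.
        let full_batch_bytes: usize = lens.iter().take(write_batch_size).sum();
        let n = sub_batch_len(lens, write_batch_size, page_byte_limit);
        println!("{name}: full batch = {full_batch_bytes} bytes, budgeted sub-batch = {n} values");
    }
}
```

In the short-string case a full 1024-value batch is only 8 KiB, so the budget never bites and the sub-batch stays at the full chunk. In the large-string case the same 1024-value batch is the entire 256 MiB input, which is why a byte budget has to cut the sub-batch down long before the value-count limit is reached.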

No library code changes — benchmarks only.

Are there any user-facing changes?

No.

🤖 Generated with Claude Code

Adds two BYTE_ARRAY write benchmarks to `arrow_writer`:

- `short_string_non_null`: 1M fixed-width 8-byte strings, the small-value
  hot path where mini-batch sizing should resolve to the full chunk.
- `large_string_non_null`: 1024 × 256 KiB strings, the large-value case
  where a single value far exceeds the page byte limit.

Split out of apache#9972 so the page-size fix can be measured against a stable
baseline (in particular the large-string case).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions github-actions Bot added the parquet Changes to the parquet crate label May 26, 2026
@adriangb adriangb marked this pull request as ready for review May 26, 2026 22:59
@etseidl
Contributor

etseidl commented May 26, 2026

Thanks @adriangb 🚀

@etseidl etseidl merged commit bbbe8a6 into apache:main May 26, 2026
16 checks passed