Skip to content

Optimize parquet row filter auto strategy with adaptive fallback#9956

Open
hhhizzz wants to merge 39 commits into
apache:mainfrom
hhhizzz:codex/parquet-reader-auto-fallback-pr
Open

Optimize parquet row filter auto strategy with adaptive fallback#9956
hhhizzz wants to merge 39 commits into
apache:mainfrom
hhhizzz:codex/parquet-reader-auto-fallback-pr

Conversation

@hhhizzz
Copy link
Copy Markdown
Contributor

@hhhizzz hhhizzz commented May 10, 2026

Which issue does this PR close?

Rationale for this change

RowFilter can be much slower than a full scan when predicate pushdown produces a highly fragmented RowSelection. In that shape, the reader spends substantial time repeatedly skipping and decoding tiny row runs. #8565 showed an extreme case where row-filter pushdown was around 10x slower than scanning and filtering afterwards.

This PR makes RowSelectionPolicy::Auto more cost-aware. Instead of treating predicate pushdown as always beneficial once planned, Auto now chooses among:

  • selector-backed pushdown when it can skip useful work;
  • mask-backed execution when fragmented selections are better represented as a dense bitmap;
  • adaptive post-filter execution when pushdown is unlikely to save enough decoding work.

This is not intended to disable predicate evaluation. It changes where the predicate is evaluated when the observed row-selection shape suggests that pushdown overhead is likely to dominate.

This PR also fixes a correctness issue in the explicit Mask path. With sparse page-loaded ranges, a mask-backed read plan could previously try to consume selected rows outside the pages that were actually loaded, causing decoding failures. Loaded row ranges are now represented explicitly, and explicit Mask can safely execute over sparse page-loaded data.

What changes are included in this PR?

Auto strategy and cost model

  • Adds structured row-selection shape analysis for Auto.
  • Adds CostModelObservation / decision reasons so the reader can explain why it chose pushdown or post-filter execution.
  • Adds an adaptive post-filter cost model for row groups:
    • observe early pushdown output shape;
    • keep pushdown for sparse / low-selectivity cases where it still saves output decoding;
    • switch later row groups to post-filter execution for high-selectivity or fragmented moderate/high-selectivity cases.
  • Starts directly with post-filter execution for selected cheap cases where predicate columns are already projected and pushdown cannot avoid decoding them.

Mask / selector planning

  • Splits row-selection strategy resolution into a dedicated planning layer.
  • Keeps explicit Mask and Selectors behavior intact.
  • Makes Auto conservative around sparse page-loaded ranges.
  • Adds sparse loaded-range tracking so explicit Mask no longer assumes all page data is dense.
  • Replaces loaded-range intersection with a linear two-pointer merge.

Post-filter execution

  • Adds a post-filter reader path that decodes the required output and predicate columns once, evaluates the RowFilter, and then projects back to the caller-requested output columns.
  • Handles nested projection conservatively by requiring whole-root batch projection support.
  • Avoids reusing sparse predicate chunks when rebuilding a base/full read.
  • Avoids evaluating the same row-group predicate twice when caller-provided row selection is present.
  • Disables adaptive post-filter execution for try_next_reader handoff paths.

Benchmarks and tests

  • Extends arrow_reader_row_filter benchmarks with strategy-sensitive cases.
  • Adds focused coverage for:
    • sparse loaded page ranges;
    • explicit Mask correctness;
    • Auto strategy decisions;
    • adaptive post-filter decisions;
    • caller-provided row selections;
    • predicate cache behavior;
    • async reader snapshots.

Are these changes tested?

Yes.

Unit / integration validation:

  • cargo fmt -p parquet -- --check
  • git diff --check
  • cargo test -p parquet --lib arrow::push_decoder
  • cargo test -p parquet --lib arrow::arrow_reader::read_plan
  • cargo test -p parquet --lib arrow::arrow_reader::selection
  • cargo test -p parquet --lib
  • cargo test -p parquet --test arrow_reader --all-features
  • cargo bench -p parquet --bench arrow_reader_row_filter --features arrow,async --no-run
  • cargo clippy -p parquet --all-targets --all-features -- -D warnings
  • cargo +nightly doc --document-private-items --no-deps --workspace --all-features

Benchmark evidence:

Lower current/main is better.

arrow_reader_row_filter

Summary across the arrow_reader_row_filter benchmark cases:

group current/main delta
all 0.9685x -3.15%
async 0.9353x -6.47%
sync 1.0028x +0.28%

The improvement is concentrated in async row-filter cases where fragmented pushdown previously paid extra planning/selection overhead. Sync cases are broadly neutral.

Most notable async improvements:

mode filter projection main current current/main delta
async utf8View <> '' all columns 8.652 ms 6.365 ms 0.7357x -26.43%
async int64 > 90 all columns 7.958 ms 5.924 ms 0.7445x -25.55%
async int64 > 90 exclude filter column 7.603 ms 5.893 ms 0.7751x -22.49%
async utf8View <> '' exclude filter column 7.610 ms 6.003 ms 0.7889x -21.11%

The remaining cases are mostly within noise or small regressions. The largest sync regressions in this run were around +1.5%, while the aggregate sync result was +0.28%.

TPC-DS with predicate pushdown enabled (SF10 on a AMD64 machine)

Against main, with predicate pushdown enabled, aggregate current speedup was:

suite aggregate speedup
TPC-DS +3.271%

Largest median improvements:

query main median current median current/main time change speedup
q9 613.776 ms 294.409 ms 0.4797x -52.03% +108.48%
q59 190.115 ms 133.384 ms 0.7016x -29.84% +42.53%
q70 207.927 ms 150.346 ms 0.7231x -27.69% +38.30%
q65 342.422 ms 249.613 ms 0.7290x -27.10% +37.18%
q26 128.938 ms 107.054 ms 0.8303x -16.97% +20.44%

q9 was the largest improvement and was stable: current was faster in 10/10 rounds.

Largest median regressions:

query main median current median current/main time change slow rounds
q2 68.829 ms 76.223 ms 1.1074x +10.74% 9/10
q92 45.358 ms 47.347 ms 1.0439x +4.39% 5/10
q68 177.821 ms 185.364 ms 1.0424x +4.24% 8/10
q77 87.460 ms 91.013 ms 1.0406x +4.06% 8/10
q19 115.416 ms 119.864 ms 1.0385x +3.85% 7/10

Overall, the microbenchmark shows the intended improvement in async fragmented-row-filter cases, while sync behavior remains approximately neutral. The TPC-DS run shows a positive aggregate result with several large stable wins, but also identifies q2 as the main remaining regression to investigate.

Are there any user-facing changes?

No intended breaking API changes.

RowSelectionPolicy::Auto may choose different internal execution strategies than before. Explicit Mask and Selectors policies remain available for callers that want fixed behavior.

The Mask execution path is now more robust for sparse page-loaded ranges, which makes future use of Mask safer in page-index / fragmented-selection cases.

@github-actions github-actions Bot added the parquet Changes to the parquet crate label May 10, 2026
@hhhizzz hhhizzz marked this pull request as ready for review May 11, 2026 10:06
@hhhizzz hhhizzz marked this pull request as draft May 12, 2026 08:47
@alamb
Copy link
Copy Markdown
Contributor

alamb commented May 14, 2026

👋

@hhhizzz
Copy link
Copy Markdown
Contributor Author

hhhizzz commented May 15, 2026

👋

👋🏻 I find the result is still not stable next day I publish it, I can repro some regression in some rare case, still working on it.

@alamb
Copy link
Copy Markdown
Contributor

alamb commented May 15, 2026

Yeah, we are at the point where the code is already pretty fast, so additional optimizations get harder and harder

@hhhizzz hhhizzz marked this pull request as ready for review May 22, 2026 14:16
@hhhizzz
Copy link
Copy Markdown
Contributor Author

hhhizzz commented May 22, 2026

A bit of context on how this PR evolved.

The initial motivation was #8565: predicate pushdown is not always cheaper than scanning when the produced RowSelection becomes highly fragmented. After my last PR(#8733), I think there would be more improvement. My first attempt was to make Auto prefer Mask more often for fragmented selections, because a dense bitmap can avoid some of the tiny select/skip run overhead.

That turned out to be incomplete. Page pruning means the loaded pages may be sparse, and the previous mask path could assume rows were available even when their pages had not been loaded. So the work split into two parts:

  1. make explicit Mask safe by tracking loaded row ranges;
  2. keep Auto free to choose Selectors, Mask, or post-filter execution based on the shape of the actual read.

I also tried several purely static rules based on selectivity, projection shape, and data type. Some helped, but the results were fragile: rules that improved one fragmented case could regress sparse output reads or cacheable predicate cases. In particular, string / variable-width cases were easy to overfit. That is why the final design moved toward a small adaptive cost model instead of a larger pile of static heuristics.

The current implementation is roughly:

  • use row-selection shape analysis to decide between Mask and Selectors;
  • represent sparse loaded ranges explicitly so Mask does not fail on page-pruned data;
  • observe early row-group pushdown behavior;
  • switch later row groups to post-filter execution only when pushdown appears unlikely to pay for itself;
  • keep explicit Mask / Selectors policies as caller intent and only apply this adaptive behavior to Auto.

I also added focused benchmark cases after seeing that the original benchmark suite did not clearly expose the cost-model-sensitive cases. The goal was to cover both the original fragmented-selection cliff and the cases where an overly aggressive rule could regress.

So the PR is larger than a single heuristic change because the important part was separating the concepts:

  • what rows are selected;
  • which pages are actually loaded;
  • how that selection should be represented;
  • whether pushdown or post-filter execution is cheaper for Auto.

That separation is what makes the Mask correctness fix and the adaptive Auto behavior fit together.

@alamb
Copy link
Copy Markdown
Contributor

alamb commented May 27, 2026

Thank you @hhhizzz -- I will try and review this later today or tomorrw

@alamb
Copy link
Copy Markdown
Contributor

alamb commented May 27, 2026

(I am not likely going to be able to review 8k lines in detail, however, so I will probably look at the high level first)

@hhhizzz hhhizzz force-pushed the codex/parquet-reader-auto-fallback-pr branch from c472348 to 4d59bcd Compare May 28, 2026 00:54
@hhhizzz
Copy link
Copy Markdown
Contributor Author

hhhizzz commented May 28, 2026

(I am not likely going to be able to review 8k lines in detail, however, so I will probably look at the high level first)

Thanks for taking a look at this PR! I completely understand that an 8k-line diff is daunting to review in detail.

To help make the review process easier, I wanted to clarify only about 3,450 lines are production code, while the remaining 4,800+ lines are benchmarks and extensive unit/integration tests.

If you still feel this is too large to review as a single PR, I would be more than happy to split this into smaller , incremental PRs.😄 Here is how we can cleanly divide the work:

  • PR 1 (Infrastructure & Metrics): Expose ArrowReaderMetrics + refactor/extract strategy.rs and its isolated test file selection/tests.rs (No functional changes to the reader).
  • PR 2 (Post-Decode Filter State): Introduce post_filter.rs with its internal unit tests (Laying the groundwork for the fallback path).
  • PR 3 (Selection Policy & Cost Model): Add cost_model.rs, selection_policy.rs, integrate them into the push decoder state machine, and add the main benchmarks.

@etseidl
Copy link
Copy Markdown
Contributor

etseidl commented May 28, 2026

If you still feel this is too large to review as a single PR, I would be more than happy to split this into smaller , incremental PRs.😄 Here is how we can cleanly divide the work:

Please do...that would help me, and would provide a baseline to compare the improvements against.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

parquet Changes to the parquet crate

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Parquet] Better heuristics to pick between RowSelection and Mask filter representation

3 participants