Skip to content

[POC] Add statistics-based thresholds for deferring RowFilter selection in ReadPlanBuilder#9659

Draft
sdf-jkl wants to merge 23 commits intoapache:mainfrom
sdf-jkl:selectivity_threshold_with_statistics
Draft

[POC] Add statistics-based thresholds for deferring RowFilter selection in ReadPlanBuilder#9659
sdf-jkl wants to merge 23 commits intoapache:mainfrom
sdf-jkl:selectivity_threshold_with_statistics

Conversation

@sdf-jkl
Copy link
Copy Markdown
Contributor

@sdf-jkl sdf-jkl commented Apr 3, 2026

Which issue does this PR close?

Shows my idea on implementing this issue from #9414 (comment)

Rationale for this change

Check issue

What changes are included in this PR?

  • Calculate (non-average) metrics about selector density and selectivity to defer/not defer applying filter selection to the ReadPlanBuilder.
  • Add some selectivity metrics for observability (idea from Collect filter selectivity #8723)

Are these changes tested?

Are there any user-facing changes?

Dandandan and others added 21 commits February 14, 2026 17:05
Previously, every predicate in the RowFilter received the same
ProjectionMask containing ALL filter columns. This caused unnecessary
decoding of expensive string columns when evaluating cheap integer
predicates. Now each predicate receives a mask with only the single
column it needs.

Key sync improvements (vs baseline):
- Q37: 63.7ms -> 7.3ms  (-88.6%, Title LIKE with CounterID=62 filter)
- Q36: 117ms -> 24ms    (-79.5%, URL <> '' with CounterID=62 filter)
- Q40: 17.9ms -> 5.1ms  (-71.5%, multi-pred with RefererHash eq)
- Q41: 17.3ms -> 5.5ms  (-68.1%, multi-pred with URLHash eq)
- Q22: 303ms -> 127ms   (-58.2%, 3 string predicates)
- Q42: 7.6ms -> 3.9ms   (-48.5%, int-only multi-predicate)
- Q38: 19.1ms -> 12.4ms (-34.9%, 5 int predicates)
- Q21: 159ms -> 98ms     (-38.5%, URL LIKE + SearchPhrase)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Use page-level min/max statistics (via StatisticsConverter) to compute
a RowSelection that skips pages where equality predicates cannot match.
For each equality predicate with an integer literal, we check if the
literal falls within each page's [min, max] range and skip pages where
it doesn't.

Impact is data-dependent - most effective when data is sorted/clustered
by the filter column. For this particular 100K-row sample file the data
isn't sorted by filter columns, so improvements are modest (~5% for
some CounterID=62 queries). Would show larger gains on sorted datasets.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Put the cheapest/most selective predicate first: SearchPhrase <> ''
filters ~87% of rows before expensive LIKE predicates run. This
reduces string column decoding for Title and URL significantly.

Q22 sync: ~6% improvement, Q22 async: ~13% improvement.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
… full clone

- Measure selectivity against absolute row count (after and_then) instead
  of relative to current selection, making predicates comparable regardless
  of prior filtering
- Avoid cloning the full ReadPlanBuilder (including deferred_selection) by
  constructing a minimal builder for the predicate reader

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Measuring selectivity against the absolute result makes the threshold
less intuitive since it becomes more scattered after and_then. Revert
to measuring against the raw predicate result (relative to current
selection).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Instead of measuring selectivity (fraction of rows passing), measure
scattering: how much applying a predicate would fragment the selection.
A predicate is deferred if it would increase the selector count beyond
`current_selectors * scatter_threshold`.

This directly targets what makes fragmented selections expensive:
many small skip/read transitions during decoding.

- Rename selectivity_threshold -> scatter_threshold
- Add RowSelection::selector_count() (O(1) via Vec::len)
- Use selector count ratio instead of row selectivity ratio

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Instead of comparing selector count ratios, measure selector density:
selectors / total_rows. A density of 0.25 means at most 25 selectors
per 100 rows — anything more fragmented gets deferred.

This is more intuitive and directly proportional to the per-row cost
of skip/read transitions during decoding.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Store total row count in RowSelection at construction time, enabling
O(1) total_row_count() instead of iterating all selectors. Also add
selector_count() for O(1) fragmentation measurement.

Update split_off() and limit() to maintain the cached row_count.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The total row count needed for scatter density calculation is already
available at both call sites (sync reader sums row group sizes, async
reader has row_count in scope). Pass it as a parameter instead of
storing it in RowSelection.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Based on ClickBench profiling, scattering predicates have densities
of 0.008-0.054 while clean predicates are <0.001. A threshold of 0.01
defers the scattering ones while applying the clean ones.

Also removes the eprintln debug instrumentation.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Don't defer a predicate if applying it would reduce the selector count
(make the selection less fragmented). Only defer when the predicate
both increases selectors AND exceeds the density threshold.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@github-actions github-actions bot added the parquet Changes to the parquet crate label Apr 3, 2026
@sdf-jkl
Copy link
Copy Markdown
Contributor Author

sdf-jkl commented Apr 3, 2026

@Dandandan @adriangb Please take a look when available! The benchmarks looked good locally (I did 30 runs per query)

@Dandandan
Copy link
Copy Markdown
Contributor

run benchmarks arrow_reader_clickbench

@adriangbot
Copy link
Copy Markdown

🤖 Arrow criterion benchmark running (GKE) | trigger
Instance: c4a-highmem-16 (12 vCPU / 65 GiB) | Linux bench-c4184177300-758-qlv68 6.12.55+ #1 SMP Sun Feb 1 08:59:41 UTC 2026 aarch64 GNU/Linux

CPU Details (lscpu)
Architecture:                            aarch64
CPU op-mode(s):                          64-bit
Byte Order:                              Little Endian
CPU(s):                                  16
On-line CPU(s) list:                     0-15
Vendor ID:                               ARM
Model name:                              Neoverse-V2
Model:                                   1
Thread(s) per core:                      1
Core(s) per cluster:                     16
Socket(s):                               -
Cluster(s):                              1
Stepping:                                r0p1
BogoMIPS:                                2000.00
Flags:                                   fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 sm3 sm4 asimddp sha512 sve asimdfhm dit uscat ilrcpc flagm sb paca pacg dcpodp sve2 sveaes svepmull svebitperm svesha3 svesm4 flagm2 frint svei8mm svebf16 i8mm bf16 dgh rng bti
L1d cache:                               1 MiB (16 instances)
L1i cache:                               1 MiB (16 instances)
L2 cache:                                32 MiB (16 instances)
L3 cache:                                80 MiB (1 instance)
NUMA node(s):                            1
NUMA node0 CPU(s):                       0-15
Vulnerability Gather data sampling:      Not affected
Vulnerability Indirect target selection: Not affected
Vulnerability Itlb multihit:             Not affected
Vulnerability L1tf:                      Not affected
Vulnerability Mds:                       Not affected
Vulnerability Meltdown:                  Not affected
Vulnerability Mmio stale data:           Not affected
Vulnerability Reg file data sampling:    Not affected
Vulnerability Retbleed:                  Not affected
Vulnerability Spec rstack overflow:      Not affected
Vulnerability Spec store bypass:         Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1:                Mitigation; __user pointer sanitization
Vulnerability Spectre v2:                Mitigation; CSV2, BHB
Vulnerability Srbds:                     Not affected
Vulnerability Tsa:                       Not affected
Vulnerability Tsx async abort:           Not affected
Vulnerability Vmscape:                   Not affected

Comparing selectivity_threshold_with_statistics (2c4787d) to d53df60 (merge-base) diff
BENCH_NAME=arrow_reader_clickbench
BENCH_COMMAND=cargo bench --features=arrow,async,test_common,experimental,object_store --bench arrow_reader_clickbench
BENCH_FILTER=
Results will be posted here when complete


File an issue against this benchmark runner

@adriangbot
Copy link
Copy Markdown

🤖 Arrow criterion benchmark completed (GKE) | trigger

Instance: c4a-highmem-16 (12 vCPU / 65 GiB)

CPU Details (lscpu)
Architecture:                            aarch64
CPU op-mode(s):                          64-bit
Byte Order:                              Little Endian
CPU(s):                                  16
On-line CPU(s) list:                     0-15
Vendor ID:                               ARM
Model name:                              Neoverse-V2
Model:                                   1
Thread(s) per core:                      1
Core(s) per cluster:                     16
Socket(s):                               -
Cluster(s):                              1
Stepping:                                r0p1
BogoMIPS:                                2000.00
Flags:                                   fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 sm3 sm4 asimddp sha512 sve asimdfhm dit uscat ilrcpc flagm sb paca pacg dcpodp sve2 sveaes svepmull svebitperm svesha3 svesm4 flagm2 frint svei8mm svebf16 i8mm bf16 dgh rng bti
L1d cache:                               1 MiB (16 instances)
L1i cache:                               1 MiB (16 instances)
L2 cache:                                32 MiB (16 instances)
L3 cache:                                80 MiB (1 instance)
NUMA node(s):                            1
NUMA node0 CPU(s):                       0-15
Vulnerability Gather data sampling:      Not affected
Vulnerability Indirect target selection: Not affected
Vulnerability Itlb multihit:             Not affected
Vulnerability L1tf:                      Not affected
Vulnerability Mds:                       Not affected
Vulnerability Meltdown:                  Not affected
Vulnerability Mmio stale data:           Not affected
Vulnerability Reg file data sampling:    Not affected
Vulnerability Retbleed:                  Not affected
Vulnerability Spec rstack overflow:      Not affected
Vulnerability Spec store bypass:         Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1:                Mitigation; __user pointer sanitization
Vulnerability Spectre v2:                Mitigation; CSV2, BHB
Vulnerability Srbds:                     Not affected
Vulnerability Tsa:                       Not affected
Vulnerability Tsx async abort:           Not affected
Vulnerability Vmscape:                   Not affected
Details

group                                             main                                   selectivity_threshold_with_statistics
-----                                             ----                                   -------------------------------------
arrow_reader_clickbench/async/Q1                  1.00  1100.7±19.64µs        ? ?/sec    1.01   1115.7±6.72µs        ? ?/sec
arrow_reader_clickbench/async/Q10                 1.00      6.7±0.04ms        ? ?/sec    1.00      6.7±0.05ms        ? ?/sec
arrow_reader_clickbench/async/Q11                 1.00      7.7±0.05ms        ? ?/sec    1.00      7.7±0.06ms        ? ?/sec
arrow_reader_clickbench/async/Q12                 1.00     14.3±0.09ms        ? ?/sec    1.01     14.5±0.16ms        ? ?/sec
arrow_reader_clickbench/async/Q13                 1.00     17.0±0.17ms        ? ?/sec    1.00     17.1±0.19ms        ? ?/sec
arrow_reader_clickbench/async/Q14                 1.00     15.9±0.12ms        ? ?/sec    1.01     16.0±0.17ms        ? ?/sec
arrow_reader_clickbench/async/Q19                 1.00      3.1±0.03ms        ? ?/sec    1.00      3.1±0.04ms        ? ?/sec
arrow_reader_clickbench/async/Q20                 1.08     79.3±0.59ms        ? ?/sec    1.00     73.6±0.43ms        ? ?/sec
arrow_reader_clickbench/async/Q21                 1.19     97.6±0.90ms        ? ?/sec    1.00     82.2±0.65ms        ? ?/sec
arrow_reader_clickbench/async/Q22                 1.23    137.1±5.86ms        ? ?/sec    1.00    111.3±1.93ms        ? ?/sec
arrow_reader_clickbench/async/Q23                 1.00    255.6±0.99ms        ? ?/sec    1.01    258.1±2.42ms        ? ?/sec
arrow_reader_clickbench/async/Q24                 1.00     19.3±0.16ms        ? ?/sec    1.01     19.6±0.30ms        ? ?/sec
arrow_reader_clickbench/async/Q27                 1.01     59.1±0.43ms        ? ?/sec    1.00     58.5±0.59ms        ? ?/sec
arrow_reader_clickbench/async/Q28                 1.01     59.3±0.44ms        ? ?/sec    1.00     59.0±0.50ms        ? ?/sec
arrow_reader_clickbench/async/Q30                 1.00     18.7±0.17ms        ? ?/sec    1.00     18.7±0.19ms        ? ?/sec
arrow_reader_clickbench/async/Q36                 1.08     15.1±0.33ms        ? ?/sec    1.00     14.0±0.28ms        ? ?/sec
arrow_reader_clickbench/async/Q37                 1.17      5.4±0.03ms        ? ?/sec    1.00      4.7±0.03ms        ? ?/sec
arrow_reader_clickbench/async/Q38                 1.04     13.3±0.34ms        ? ?/sec    1.00     12.8±0.24ms        ? ?/sec
arrow_reader_clickbench/async/Q39                 1.00     24.9±0.50ms        ? ?/sec    1.00     24.9±0.45ms        ? ?/sec
arrow_reader_clickbench/async/Q40                 1.11      5.8±0.05ms        ? ?/sec    1.00      5.2±0.06ms        ? ?/sec
arrow_reader_clickbench/async/Q41                 1.10      5.1±0.04ms        ? ?/sec    1.00      4.6±0.07ms        ? ?/sec
arrow_reader_clickbench/async/Q42                 1.04      3.5±0.02ms        ? ?/sec    1.00      3.4±0.03ms        ? ?/sec
arrow_reader_clickbench/async_object_store/Q1     1.00   1059.5±2.50µs        ? ?/sec    1.02   1077.4±4.25µs        ? ?/sec
arrow_reader_clickbench/async_object_store/Q10    1.00      6.6±0.08ms        ? ?/sec    1.00      6.6±0.05ms        ? ?/sec
arrow_reader_clickbench/async_object_store/Q11    1.03      7.7±0.35ms        ? ?/sec    1.00      7.5±0.04ms        ? ?/sec
arrow_reader_clickbench/async_object_store/Q12    1.01     14.6±0.15ms        ? ?/sec    1.00     14.4±0.08ms        ? ?/sec
arrow_reader_clickbench/async_object_store/Q13    1.01     17.1±0.14ms        ? ?/sec    1.00     16.9±0.14ms        ? ?/sec
arrow_reader_clickbench/async_object_store/Q14    1.00     15.9±0.11ms        ? ?/sec    1.00     15.8±0.06ms        ? ?/sec
arrow_reader_clickbench/async_object_store/Q19    1.02      3.0±0.04ms        ? ?/sec    1.00      3.0±0.03ms        ? ?/sec
arrow_reader_clickbench/async_object_store/Q20    1.03     74.5±0.88ms        ? ?/sec    1.00     72.4±0.65ms        ? ?/sec
arrow_reader_clickbench/async_object_store/Q21    1.02     83.4±1.38ms        ? ?/sec    1.00     81.4±0.73ms        ? ?/sec
arrow_reader_clickbench/async_object_store/Q22    1.01    102.2±1.14ms        ? ?/sec    1.00    100.9±0.83ms        ? ?/sec
arrow_reader_clickbench/async_object_store/Q23    1.07   242.1±13.77ms        ? ?/sec    1.00    226.3±5.01ms        ? ?/sec
arrow_reader_clickbench/async_object_store/Q24    1.02     19.5±0.25ms        ? ?/sec    1.00     19.2±0.14ms        ? ?/sec
arrow_reader_clickbench/async_object_store/Q27    1.01     58.0±0.66ms        ? ?/sec    1.00     57.3±0.88ms        ? ?/sec
arrow_reader_clickbench/async_object_store/Q28    1.02     58.9±0.80ms        ? ?/sec    1.00     57.7±0.78ms        ? ?/sec
arrow_reader_clickbench/async_object_store/Q30    1.01     18.6±0.15ms        ? ?/sec    1.00     18.3±0.14ms        ? ?/sec
arrow_reader_clickbench/async_object_store/Q36    1.16     15.4±0.46ms        ? ?/sec    1.00     13.3±0.17ms        ? ?/sec
arrow_reader_clickbench/async_object_store/Q37    1.17      5.3±0.03ms        ? ?/sec    1.00      4.6±0.02ms        ? ?/sec
arrow_reader_clickbench/async_object_store/Q38    1.11     13.5±0.27ms        ? ?/sec    1.00     12.1±0.30ms        ? ?/sec
arrow_reader_clickbench/async_object_store/Q39    1.03     23.9±0.55ms        ? ?/sec    1.00     23.3±0.43ms        ? ?/sec
arrow_reader_clickbench/async_object_store/Q40    1.12      5.6±0.06ms        ? ?/sec    1.00      5.0±0.04ms        ? ?/sec
arrow_reader_clickbench/async_object_store/Q41    1.11      4.9±0.03ms        ? ?/sec    1.00      4.4±0.02ms        ? ?/sec
arrow_reader_clickbench/async_object_store/Q42    1.08      3.4±0.01ms        ? ?/sec    1.00      3.2±0.01ms        ? ?/sec
arrow_reader_clickbench/sync/Q1                   1.00    875.5±8.04µs        ? ?/sec    1.00    874.5±2.51µs        ? ?/sec
arrow_reader_clickbench/sync/Q10                  1.00      5.1±0.04ms        ? ?/sec    1.00      5.1±0.02ms        ? ?/sec
arrow_reader_clickbench/sync/Q11                  1.00      6.1±0.03ms        ? ?/sec    1.00      6.1±0.03ms        ? ?/sec
arrow_reader_clickbench/sync/Q12                  1.00     22.0±0.60ms        ? ?/sec    1.00     22.0±0.10ms        ? ?/sec
arrow_reader_clickbench/sync/Q13                  1.03     28.1±0.77ms        ? ?/sec    1.00     27.3±1.82ms        ? ?/sec
arrow_reader_clickbench/sync/Q14                  1.00     23.0±0.07ms        ? ?/sec    1.02     23.6±0.11ms        ? ?/sec
arrow_reader_clickbench/sync/Q19                  1.01      2.8±0.03ms        ? ?/sec    1.00      2.7±0.02ms        ? ?/sec
arrow_reader_clickbench/sync/Q20                  1.00    125.3±0.88ms        ? ?/sec    1.01    126.9±0.37ms        ? ?/sec
arrow_reader_clickbench/sync/Q21                  1.00     98.8±0.56ms        ? ?/sec    1.01    100.2±0.37ms        ? ?/sec
arrow_reader_clickbench/sync/Q22                  1.00    147.2±0.53ms        ? ?/sec    1.02    150.4±1.03ms        ? ?/sec
arrow_reader_clickbench/sync/Q23                  1.02   293.1±16.73ms        ? ?/sec    1.00   287.6±12.23ms        ? ?/sec
arrow_reader_clickbench/sync/Q24                  1.00     27.1±0.13ms        ? ?/sec    1.03     27.7±0.29ms        ? ?/sec
arrow_reader_clickbench/sync/Q27                  1.00    107.8±0.53ms        ? ?/sec    1.02    110.2±0.69ms        ? ?/sec
arrow_reader_clickbench/sync/Q28                  1.00    106.4±0.92ms        ? ?/sec    1.02    108.2±0.68ms        ? ?/sec
arrow_reader_clickbench/sync/Q30                  1.00     18.8±0.06ms        ? ?/sec    1.01     19.1±0.08ms        ? ?/sec
arrow_reader_clickbench/sync/Q36                  1.05     22.0±0.10ms        ? ?/sec    1.00     20.9±0.10ms        ? ?/sec
arrow_reader_clickbench/sync/Q37                  1.21      6.9±0.01ms        ? ?/sec    1.00      5.7±0.01ms        ? ?/sec
arrow_reader_clickbench/sync/Q38                  1.05     11.2±0.05ms        ? ?/sec    1.00     10.7±0.08ms        ? ?/sec
arrow_reader_clickbench/sync/Q39                  1.00     20.6±0.13ms        ? ?/sec    1.02     21.0±0.17ms        ? ?/sec
arrow_reader_clickbench/sync/Q40                  1.08      5.2±0.02ms        ? ?/sec    1.00      4.9±0.02ms        ? ?/sec
arrow_reader_clickbench/sync/Q41                  1.34      5.7±0.02ms        ? ?/sec    1.00      4.2±0.01ms        ? ?/sec
arrow_reader_clickbench/sync/Q42                  1.03      4.4±0.02ms        ? ?/sec    1.00      4.2±0.01ms        ? ?/sec

Resource Usage

base (merge-base)

Metric Value
Wall time 793.3s
Peak memory 2.7 GiB
Avg memory 2.6 GiB
CPU user 708.3s
CPU sys 84.6s
Disk read 0 B
Disk write 793.9 MiB

branch

Metric Value
Wall time 777.4s
Peak memory 2.9 GiB
Avg memory 2.8 GiB
CPU user 702.5s
CPU sys 74.7s
Disk read 0 B
Disk write 171.4 MiB

File an issue against this benchmark runner

@sdf-jkl
Copy link
Copy Markdown
Contributor Author

sdf-jkl commented Apr 3, 2026

@Dandandan not too bad

@Dandandan
Copy link
Copy Markdown
Contributor

I wonder - does it trigger in more cases then purely on the other treshold? #9414 (comment)

For clickbench I found it is often best to just use the first or first two row filters, the other ones don't seem to bring much extra.

@adriangb
Copy link
Copy Markdown
Contributor

adriangb commented Apr 4, 2026

@Dandandan not too bad
arrow_reader_clickbench/async/Q22 1.23 137.1±5.86ms ? ?/sec 1.00 111.3±1.93ms ? ?/sec

I'd even say pretty good!

@adriangb
Copy link
Copy Markdown
Contributor

adriangb commented Apr 4, 2026

For clickbench I found it is often best to just use the first or first two row filters, the other ones don't seem to bring much extra.

I think we can handle that on the datafusion side, what we have here (more efficient handling of the filters that do get pushed down) is good. We'll end up with more coarse control of IO / CPU patterns on the datafusion side and more granular and dynamic adaptivity in arrow-rs.

I do like that FilterSelectivityStat is exposed - I could see that being useful for adaptivity on the datafusion side.

@Dandandan
Copy link
Copy Markdown
Contributor

For clickbench I found it is often best to just use the first or first two row filters, the other ones don't seem to bring much extra.

I think we can handle that on the datafusion side, what we have here (more efficient handling of the filters that do get pushed down) is good. We'll end up with more coarse control of IO / CPU patterns on the datafusion side and more granular and dynamic adaptivity in arrow-rs.

I do like that FilterSelectivityStat is exposed - I could see that being useful for adaptivity on the datafusion side.

I agree with that - although I wonder what is the minimum amount of levers we need 🤔

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

parquet Changes to the parquet crate

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add selectivity threshold for filter pushdown

4 participants