Skip to content

ClickBench Q10 slows down when filter pushdown is enabled #20325

@alamb

Description

@alamb

Is your feature request related to a problem or challenge?

When you run Q10 with predicate pushdown enabled (see #20324 for details of what that means) it goes more slowly:

SELECT "MobilePhoneModel", COUNT(DISTINCT "UserID") AS u FROM hits WHERE "MobilePhoneModel" <> '' GROUP BY "MobilePhoneModel" ORDER BY u DESC LIMIT 10;

Specifically, on my test I see it going 30% slower

Benchmark clickbench_partitioned.json
┏━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query     ┃        HEAD ┃ alamb_pushdown_and_arrow_58 ┃        Change ┃
┡━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 10 │   333.30 ms │                   447.25 ms │  1.34x slower │

You can repro it like this:

Repro Script

set datafusion.execution.parquet.binary_as_string = true; -- needed for ClickBench data
SET datafusion.execution.target_partitions = 1; -- set to 1 to reduce variability
create external table hits stored as parquet location '/home/ec2-user/datafusion/benchmarks/data/hits_partitioned';

-- Q10 (default configuration, no pushdown)
SELECT "MobilePhoneModel", COUNT(DISTINCT "UserID") AS u FROM hits WHERE "MobilePhoneModel" <> '' GROUP BY "MobilePhoneModel" ORDER BY u DESC LIMIT 10;
SELECT "MobilePhoneModel", COUNT(DISTINCT "UserID") AS u FROM hits WHERE "MobilePhoneModel" <> '' GROUP BY "MobilePhoneModel" ORDER BY u DESC LIMIT 10;
SELECT "MobilePhoneModel", COUNT(DISTINCT "UserID") AS u FROM hits WHERE "MobilePhoneModel" <> '' GROUP BY "MobilePhoneModel" ORDER BY u DESC LIMIT 10;
SELECT "MobilePhoneModel", COUNT(DISTINCT "UserID") AS u FROM hits WHERE "MobilePhoneModel" <> '' GROUP BY "MobilePhoneModel" ORDER BY u DESC LIMIT 10;
SELECT "MobilePhoneModel", COUNT(DISTINCT "UserID") AS u FROM hits WHERE "MobilePhoneModel" <> '' GROUP BY "MobilePhoneModel" ORDER BY u DESC LIMIT 10;

-- Q10 enable with pushdown enabled
SET datafusion.execution.parquet.pushdown_filters = true;
SELECT "MobilePhoneModel", COUNT(DISTINCT "UserID") AS u FROM hits WHERE "MobilePhoneModel" <> '' GROUP BY "MobilePhoneModel" ORDER BY u DESC LIMIT 10;
SELECT "MobilePhoneModel", COUNT(DISTINCT "UserID") AS u FROM hits WHERE "MobilePhoneModel" <> '' GROUP BY "MobilePhoneModel" ORDER BY u DESC LIMIT 10;
SELECT "MobilePhoneModel", COUNT(DISTINCT "UserID") AS u FROM hits WHERE "MobilePhoneModel" <> '' GROUP BY "MobilePhoneModel" ORDER BY u DESC LIMIT 10;
SELECT "MobilePhoneModel", COUNT(DISTINCT "UserID") AS u FROM hits WHERE "MobilePhoneModel" <> '' GROUP BY "MobilePhoneModel" ORDER BY u DESC LIMIT 10;
SELECT "MobilePhoneModel", COUNT(DISTINCT "UserID") AS u FROM hits WHERE "MobilePhoneModel" <> '' GROUP BY "MobilePhoneModel" ORDER BY u DESC LIMIT 10;

When I run this with datafusion-cli -f /tmp/q.sql | grep Elapsed you can clearly see

Elapsed 0.001 seconds.
Elapsed 0.000 seconds.
Elapsed 0.063 seconds.
Elapsed 1.419 seconds. <-- First query run 
Elapsed 1.432 seconds.
Elapsed 1.387 seconds.
Elapsed 1.399 seconds.
Elapsed 1.388 seconds.
Elapsed 0.000 seconds. <---- turn on filter pushdown
Elapsed 1.690 seconds. <-- now the query runs 30% slower
Elapsed 1.671 seconds.
Elapsed 1.695 seconds.
Elapsed 1.695 seconds.
Elapsed 1.718 seconds.

Describe the solution you'd like

I want there to be no slowdown when I run Q10

Describe alternatives you've considered

No response

Additional context

No response

Metadata

Metadata

Assignees

No one assigned

    Labels

    enhancementNew feature or request

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions