Estimate aggregate output rows using existing NDV statistics #20926

Open
buraksenn wants to merge 3 commits into apache:main from buraksenn:top-k-aggregate-estimation
Conversation

@buraksenn
Contributor

Which issue does this PR close?

Part of #20766

Rationale for this change

Grouped aggregations currently estimate output rows as input_rows, ignoring available NDV statistics. Spark's AggregateEstimation and Trino's AggregationStatsRule both use NDV products to tighten this estimate; this PR draws heavily on both.

What changes are included in this PR?

  • Estimate aggregate output rows as min(input_rows, product(NDV_i + null_adj_i) * grouping_sets)
  • Cap by the Top-K limit when active, since the output row count cannot exceed K
  • Propagate distinct_count from child stats to group-by output columns
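
The estimate in the first bullet can be sketched roughly as follows; the helper name `estimate_group_rows` and the flat slice inputs are illustrative, not the PR's actual API:

```rust
// Hypothetical sketch of min(input_rows, product(NDV_i + null_adj_i) * grouping_sets).
fn estimate_group_rows(
    input_rows: usize,
    ndvs: &[usize],      // distinct_count per group-by column
    null_adjs: &[usize], // 1 if the column contains nulls, else 0
    grouping_sets: usize,
) -> usize {
    // product(NDV_i + null_adj_i), saturating to avoid overflow
    let product = ndvs
        .iter()
        .zip(null_adjs.iter())
        .fold(1usize, |acc, (ndv, adj)| acc.saturating_mul(ndv + adj));
    // cap by the input row count
    input_rows.min(product.saturating_mul(grouping_sets))
}

fn main() {
    // two group-by columns: NDV 10 (nullable) and NDV 5 (non-null), one grouping set
    assert_eq!(estimate_group_rows(1000, &[10, 5], &[1, 0], 1), 55);
    // the estimate can never exceed the input row count
    assert_eq!(estimate_group_rows(30, &[10, 5], &[1, 0], 1), 30);
    println!("ok");
}
```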

Are these changes tested?

Yes, existing and new tests cover different scenarios and edge cases.

Are there any user-facing changes?

No

@github-actions github-actions bot added the physical-plan Changes to the physical-plan crate label Mar 13, 2026
@github-actions github-actions bot added the core Core DataFusion crate label Mar 13, 2026
Member

@asolimando asolimando left a comment


Thanks @buraksenn for working on this, I have left a few comments, hoping this helps!

let ndv = *col_stats.distinct_count.get_value()?;
let null_adjustment = match col_stats.null_count.get_value() {
    Some(&n) if n > 0 => 1usize,
    _ => 0,
};
Member


Nit: it's a reasonable default, but I'd call it out explicitly in the function's doc comment.

let ndv_product = self.compute_group_ndv(child_statistics);
if let Some(ndv) = ndv_product {
    let grouping_set_num = self.group_by.groups.len();
    let ndv_estimate = ndv.saturating_mul(grouping_set_num);
Member


This doesn't look correct to me, grouping sets target different columns.

Consider GROUPING SETS ((a), (b), (a, b)), here you would compute NDV(a) * NDV(b) * 3, while the number of distinct values should be NDV(a) + NDV(b) + NDV(a)*NDV(b).

Trino bails out for (multiple) grouping sets; Spark, AFAIU, does not take them into account because it rewrites them as a union of aggregates in earlier phases.

It's fine to bail out like Trino does, but if you want to support this, the code would be something like the following (pseudocode):

res = 0;
for each grouping set gs:
  part_res = 1;
  for each col in gs:
     part_res *= ndv(col) + (null_count(col) > 0 ? 1 : 0);
  res += part_res;
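
A hedged Rust sketch of the pseudocode above (the helper name and the `(distinct_count, null_count)` pair representation are illustrative): sum, over grouping sets, the product of each column's NDV plus its null adjustment.

```rust
// Each grouping set is a list of (distinct_count, null_count) pairs, one per column.
fn grouping_sets_ndv(sets: &[Vec<(usize, usize)>]) -> usize {
    sets.iter()
        .map(|gs| {
            // product over columns of (NDV + 1 if the column has nulls)
            gs.iter().fold(1usize, |acc, &(ndv, nulls)| {
                acc.saturating_mul(ndv + usize::from(nulls > 0))
            })
        })
        // sum the per-grouping-set products
        .fold(0usize, |acc, p| acc.saturating_add(p))
}

fn main() {
    // GROUPING SETS ((a), (b), (a, b)) with NDV(a)=3, NDV(b)=4, no nulls:
    // 3 + 4 + 3*4 = 19, rather than NDV(a)*NDV(b)*3 = 36
    let sets = vec![vec![(3, 0)], vec![(4, 0)], vec![(3, 0), (4, 0)]];
    assert_eq!(grouping_sets_ndv(&sets), 19);
    println!("ok");
}
```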

child_statistics.num_rows.map(|x| x * grouping_set_num)
}
} else if let Some(limit_opts) = &self.limit_options {
Precision::Inexact(limit_opts.limit)
Member


Here we might have an NDV even when num_rows is unset; maybe we can return Inexact(min(ndv, limit))?

for (expr, _) in self.group_by.expr.iter() {
    let col = expr.as_any().downcast_ref::<Column>()?;
    let col_stats = &child_statistics.column_statistics[col.index()];
    let ndv = *col_stats.distinct_count.get_value()?;
Member


Since we multiply by NDV, we might end up with a total of zero if any NDV is zero. If a column has only null values, you might have num_rows >= 1 and ndv = 0, so let's use min(num_rows, ndv) here to be more robust.
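
One way to read this suggestion, combined with the null adjustment from the earlier snippet, is the per-column contribution sketched below; the function name and shape are hypothetical, not code from this PR:

```rust
// Clamp distinct_count by the row count (NDV cannot exceed num_rows), then
// add 1 if the column contains nulls, so an all-null column with a reported
// distinct_count of 0 still contributes 1 group instead of zeroing the product.
fn per_column_ndv(distinct_count: usize, null_count: usize, num_rows: usize) -> usize {
    let null_adj = usize::from(null_count > 0);
    num_rows.min(distinct_count) + null_adj
}

fn main() {
    // all-null column: distinct_count = 0, but the null group still counts as 1
    assert_eq!(per_column_ndv(0, 5, 5), 1);
    // stale stats reporting an NDV larger than the row count get clamped
    assert_eq!(per_column_ndv(10, 0, 7), 7);
    println!("ok");
}
```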

@buraksenn
Contributor Author

buraksenn commented Mar 16, 2026

Thanks @asolimando for the detailed review. I thought this one was fairly polished after studying Trino and Spark in detail, but I missed some important points. I'll carefully apply your feedback and adjust the implementation early tomorrow.

@asolimando
Member

> Thanks @asolimando for the detailed review. I thought this one was fairly polished after studying Trino and Spark in detail, but I missed some important points. I'll carefully apply your feedback and adjust the implementation early tomorrow.

No worries, I think the PR already brings a considerable improvement over the existing behavior, and the final outcome will be a more precise estimation than Trino's and Spark's. I am a bit busy today and tomorrow, so don't rush unless it fits your schedule. I will make sure to review the changes by the end of the week.
