feat: Extract NDV (distinct_count) statistics from Parquet metadata #19957

asolimando · 2026-01-23T17:55:23Z

Which issue does this PR close?

Part of Question about Statistics Collection(specifically NDV) #15265

(I am not sure if a new issue specifically for the scope of the PR is needed, happy to create it if needed)

Rationale for this change

This work originates from a discussion in datafusion-distributed about improving the TaskEstimator API:
datafusion-contrib/datafusion-distributed#296 (comment)

We agreed that improved statistics support in DataFusion would benefit both projects. For datafusion-distributed, better cardinality estimation helps decide how to split computation across network boundaries.

This also benefits DataFusion directly, since cost-based optimization (CBO) is already in place. For example, join cardinality estimation (joins/utils.rs:586-646) uses distinct_count via max_distinct_count to compute join selectivity.

Currently this field is always Absent when reading from Parquet, so this PR fills that gap.
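To illustrate why a populated distinct_count matters here, the following is a minimal sketch of the textbook inner equi-join estimate, rows = |L| * |R| / max(ndv_L, ndv_R). It is not a copy of the code in joins/utils.rs; the function name and the use of plain `Option<usize>` instead of `Precision` are assumptions made for illustration.

```rust
/// Textbook inner equi-join cardinality estimate (illustrative only).
/// Returns `None` when either side has no usable NDV, which is the
/// situation every Parquet scan was in before this change.
fn estimate_inner_join_rows(
    left_rows: usize,
    right_rows: usize,
    left_ndv: Option<usize>,
    right_ndv: Option<usize>,
) -> Option<usize> {
    // Selectivity is 1 / max(ndv_left, ndv_right).
    let max_ndv = left_ndv?.max(right_ndv?);
    if max_ndv == 0 {
        return None;
    }
    Some(left_rows.saturating_mul(right_rows) / max_ndv)
}
```

With 1,000 and 500 input rows and join-key NDVs of 100 and 50, the sketch estimates 5,000 output rows; with either NDV missing it gives up, whereas a real optimizer would fall back to some other heuristic.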

What changes are included in this PR?

Commit 1 - Reading NDV from Parquet files:

- Extract distinct_count from Parquet row group column statistics
- Single row group with NDV -> Precision::Exact(ndv)
- Multiple row groups with NDV -> Precision::Inexact(max) as conservative lower bound
- No NDV available -> Precision::Absent
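The cases listed above can be sketched as follows. This is a simplified model, not the code in the PR: `Precision` here is a local stand-in for DataFusion's type, `file_level_ndv` is a hypothetical name, and each row group's NDV is modelled as an `Option<u64>`. How partially available NDV is handled is an assumption of this sketch (it takes the max over the row groups that do report a value, which is still a lower bound).

```rust
/// Local stand-in for DataFusion's `Precision<usize>` (illustrative only).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Precision {
    Exact(usize),
    Inexact(usize),
    Absent,
}

/// Combine per-row-group NDV values into one file-level value.
/// `None` means the row group carries no distinct_count statistic.
fn file_level_ndv(row_group_ndvs: &[Option<u64>]) -> Precision {
    match row_group_ndvs {
        // A single row group that reports NDV: exact for the whole file.
        [Some(ndv)] => Precision::Exact(*ndv as usize),
        // Otherwise values cannot be summed (duplicates may span row
        // groups), so keep the largest reported value as a lower bound.
        _ => match row_group_ndvs.iter().flatten().max() {
            Some(max) => Precision::Inexact(*max as usize),
            None => Precision::Absent,
        },
    }
}
```

For example, row groups reporting 3 and 7 distinct values yield `Inexact(7)`: the file has at least 7 distinct values and at most 10.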

Commit 2 - Statistics propagation (can be split to a separate PR, if preferred):

- Statistics::try_merge(): use max as conservative lower bound instead of discarding NDV
- Projection: preserve NDV for single-column expressions as upper bound
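The projection rule rests on the fact that a deterministic expression over one column cannot produce more distinct values than its input. A rough sketch of that idea follows; the `ProjExpr` enum, the function name, and the choice to keep a bare column reference unchanged are all assumptions of this example and do not reflect the actual types in projection.rs.

```rust
/// Local stand-in for DataFusion's `Precision<usize>` (illustrative only).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Precision {
    Exact(usize),
    Inexact(usize),
    Absent,
}

/// Hypothetical, heavily simplified shape of a projected expression.
enum ProjExpr {
    /// A bare reference to input column `i`.
    Column(usize),
    /// Some function of input column `i` only, e.g. `abs(col)`.
    SingleColumnFn(usize),
    /// Anything else (literals, multi-column expressions, ...).
    Other,
}

/// NDV of a projected expression, given the NDV of each input column.
fn projected_ndv(expr: &ProjExpr, input_ndv: &[Precision]) -> Precision {
    match expr {
        // Same values as the input column, so the statistic carries over.
        ProjExpr::Column(i) => input_ndv[*i],
        // f(col) has at most NDV(col) distinct values: an upper bound only.
        ProjExpr::SingleColumnFn(i) => match input_ndv[*i] {
            Precision::Exact(n) | Precision::Inexact(n) => Precision::Inexact(n),
            Precision::Absent => Precision::Absent,
        },
        // No safe bound is derived in this sketch.
        ProjExpr::Other => Precision::Absent,
    }
}
```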

I'm including the second commit to showcase how I intend to use the statistics, but these changes can be split to a follow-up PR to keep review scope limited.

Are these changes tested?

Yes, 7 unit tests are added for NDV extraction:

- Single/multiple row groups with NDV
- Partial NDV availability across row groups
- Multiple columns with different NDV values
- Integration test reading a real Parquet file with distinct_count statistics (following the pattern in row_filter.rs:685-696, using parquet_to_arrow_schema to derive the schema from the file)

Are there any user-facing changes?

No breaking changes. Statistics consumers will now see populated distinct_count values when available in Parquet metadata.

Disclaimer: I used AI (Claude Code) to assist translating my ideas into code as I am still ramping up with the codebase and especially with Rust (guidance on both aspects is highly appreciated). I have a good understanding of the core concepts (statistics, CBO etc.) and have carefully double-checked that the PR matches my intentions and understanding.

cc: @gabotechs @jayshrivastava @NGA-TRAN @gene-bordegaray

This change adds support for reading Number of Distinct Values (NDV) statistics from Parquet file metadata when available.

Previously, `distinct_count` in `ColumnStatistics` was always set to `Precision::Absent`. Now it is populated from parquet row group column statistics when present:

- Single row group with NDV: `Precision::Exact(ndv)`
- Multiple row groups with NDV: `Precision::Inexact(max)` as lower bound (we can't accurately merge NDV since duplicates may exist across row groups; max is more conservative than sum for join cardinality estimation)
- No NDV available: `Precision::Absent`

This provides foundation for improved join cardinality estimation and other statistics-based optimizations.

Relates to apache#15265


gene-bordegaray

Have a few minor comments but this looks good 💯

gene-bordegaray · 2026-01-23T18:15:44Z

datafusion/common/src/stats.rs

+                        Precision::Inexact(*v)
+                    }
+                    (Precision::Absent, Precision::Absent) => Precision::Absent,
+                };


I think this verbosity could be reduced to something like:

    col_stats.distinct_count = col_stats.distinct_count.get_value()
        .max(item_col_stats.distinct_count.get_value())
        .map(|&v| Precision::Inexact(v))
        .unwrap_or(Precision::Absent);

or we could introduce some method like max_inexact() on Precision.
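The suggested chain works because `Option` is ordered with `None` below every `Some`, so `max` of two optional values returns whichever value is present, or the larger of the two. A self-contained sketch of that behaviour, using a local stand-in for `Precision` (the enum and `merge_ndv` below are illustrative, not DataFusion's definitions):

```rust
/// Local stand-in for DataFusion's `Precision<usize>` (illustrative only).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Precision {
    Exact(usize),
    Inexact(usize),
    Absent,
}

impl Precision {
    /// The carried value, if any, regardless of exactness.
    fn get_value(&self) -> Option<&usize> {
        match self {
            Precision::Exact(v) | Precision::Inexact(v) => Some(v),
            Precision::Absent => None,
        }
    }
}

/// Merge two NDV statistics: keep the larger value as an inexact lower bound.
fn merge_ndv(a: Precision, b: Precision) -> Precision {
    a.get_value()
        // `Option`'s ordering puts `None` first, so a present value wins.
        .max(b.get_value())
        .map(|&v| Precision::Inexact(v))
        .unwrap_or(Precision::Absent)
}
```

Note that the result is always `Inexact`, even when both inputs are `Exact`: merging `Exact(1)` with `Exact(1)` gives `Inexact(1)`, which is consistent with the partition-column behaviour described in the later test-fix commit.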

Thanks a lot, this is very neat, addressed in db182e5!

gene-bordegaray · 2026-01-23T18:21:24Z

datafusion/datasource-parquet/src/metadata.rs

    is_max_value_exact: &mut [Option<bool>],
    is_min_value_exact: &mut [Option<bool>],
    column_byte_sizes: &[Precision<usize>],
+    distinct_counts: &[Precision<usize>],


A nit, but maybe these parameters could be extracted into a struct that holds them as fields: either extend StatisticsAccumulators and use it here, or create a new struct.

Adopted StatisticsAccumulators as suggested. It feels better, and I got rid of the "too many arguments" warning suppression. Addressed in 4833ef5.
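The general shape of this refactor, sketched with hypothetical names and only two of the accumulator slices (the real StatisticsAccumulators has more fields and different types): a free function taking one slice per statistic becomes a method on a struct that borrows all of them, which keeps the argument count below the threshold of Clippy's `too_many_arguments` lint.

```rust
/// Hypothetical, trimmed-down accumulator: one borrowed slice per
/// statistic, indexed by column.
struct Accumulators<'a> {
    null_counts: &'a [Option<usize>],
    distinct_counts: &'a [Option<usize>],
}

/// Hypothetical per-column result.
#[derive(Debug, PartialEq)]
struct ColumnSummary {
    null_count: Option<usize>,
    distinct_count: Option<usize>,
}

impl<'a> Accumulators<'a> {
    /// Replaces a free function that took every slice as its own parameter.
    fn build_column_summary(&self, idx: usize) -> ColumnSummary {
        ColumnSummary {
            null_count: self.null_counts[idx],
            distinct_count: self.distinct_counts[idx],
        }
    }
}
```

Adding another statistic then means adding a field rather than threading one more parameter through every call site.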

gene-bordegaray · 2026-01-23T18:29:40Z

datafusion/datasource-parquet/src/metadata.rs

+            use parquet::arrow::parquet_to_arrow_schema;
+            use parquet::file::reader::{FileReader, SerializedFileReader};
+            use std::fs::File;
+            use std::path::PathBuf;


nit: since these tests are in their own module, I think moving these imports to the ndv_tests module level would be ok

Thanks for the suggestion, adopted in e36c46a

gene-bordegaray · 2026-01-23T18:30:38Z

datafusion/physical-expr/src/projection.rs

-                // TODO stats: estimate more statistics from expressions
-                // (expressions should compute their statistics themselves)
-                ColumnStatistics::new_unknown()
+                // TODO: expressions should compute their own statistics


noice, this is useful thanks for understanding implications with using and propagating distincts. thank you 😄


gene-bordegaray

great refactor, very clean

gene-bordegaray · 2026-01-23T21:11:29Z

datafusion/datasource-parquet/src/metadata.rs

 }

-fn summarize_min_max_null_counts(
+impl StatisticsAccumulators<'_> {


didn't know about this notation, an anonymous lifetime. cool 😄
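For readers who have not seen the notation either: `'_` in an `impl` header stands for a lifetime that does not need a name, so `impl Foo<'_>` is shorthand for `impl<'a> Foo<'a>` when the body never refers to `'a`. A tiny standalone illustration (the `Window` type is made up for this example):

```rust
/// A struct that borrows its data, so it carries a lifetime parameter.
struct Window<'a> {
    values: &'a [u32],
}

// Equivalent to `impl<'a> Window<'a> { ... }`; the lifetime is never
// named in the body, so the anonymous form is enough.
impl Window<'_> {
    fn largest(&self) -> Option<u32> {
        self.values.iter().copied().max()
    }
}
```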

asolimando added 2 commits January 23, 2026 13:21

Improve NDV propagation through statistics merge and projection

506f7a7

- Statistics merge: use max as conservative lower bound instead of discarding NDV (duplicates may exist across partitions)
- Projection: preserve NDV for single-column expressions as upper bound

github-actions bot added the physical-expr, common, and datasource labels Jan 23, 2026

fix: cargo fmt

dc97f07

gene-bordegaray reviewed Jan 23, 2026


fix: update partition_statistics tests for NDV preservation

4a1c7cd

Partition columns now preserve distinct_count as Inexact(1) when merging statistics, reflecting that each partition file has a single distinct partition value.

github-actions bot added the core label Jan 23, 2026

asolimando added 3 commits January 23, 2026 21:23

refactor: simplify distinct_count merge logic

db182e5

Use get_value().max() chain instead of verbose match statement for merging NDV in Statistics::try_merge()

refactor: add build_column_statistics method to StatisticsAccumulators

4833ef5

Encapsulate get_col_stats parameters by adding build_column_statistics() method to StatisticsAccumulators, removing the standalone function.

refactor: move ndv_tests imports to module level

e36c46a

Move imports to module level in ndv_tests since they're in their own module anyway.

gene-bordegaray approved these changes Jan 23, 2026

