feat: add approx_top_k aggregate function by sesteves · Pull Request #20968 · apache/datafusion

sesteves · 2026-03-16T17:17:46Z

Which issue does this PR close?

Closes #20967.

Rationale

Finding the most frequently occurring values in a column is a common analytical pattern: top error codes, most popular products, most active users, frequent URL paths, and so on. The exact approach (`GROUP BY value ORDER BY COUNT(*) DESC LIMIT k`) must materialize every distinct group, which is memory-intensive and slow on high-cardinality columns.
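For contrast, the exact approach can be sketched in plain Rust. This is an illustration only: `exact_top_k` is a hypothetical helper, not part of this PR or of DataFusion. It shows why memory grows with the number of distinct values, since every one of them gets its own counter before the top k can be selected.

```rust
use std::collections::HashMap;

/// Exact top-k (hypothetical helper): one counter per distinct value,
/// so memory grows with the cardinality of the input.
fn exact_top_k(values: &[&str], k: usize) -> Vec<(String, u64)> {
    let mut counts: HashMap<&str, u64> = HashMap::new();
    for &value in values {
        *counts.entry(value).or_insert(0) += 1;
    }
    let mut ranked: Vec<(String, u64)> = counts
        .into_iter()
        .map(|(value, count)| (value.to_string(), count))
        .collect();
    // Highest count first; ties broken by value so the output is deterministic.
    ranked.sort_by(|a, b| b.1.cmp(&a.1).then_with(|| a.0.cmp(&b.0)));
    ranked.truncate(k);
    ranked
}
```

Calling `exact_top_k(&["/a", "/b", "/a"], 1)` yields the single pair `("/a", 2)`, but only after all distinct paths have been counted.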

approx_top_k solves this with a streaming approximation algorithm that operates in bounded memory, making it practical for large-scale analytics.

This function is already available in ClickHouse (topK) and via extensions in PostgreSQL and Druid, and is a natural addition to DataFusion's existing family of approximate aggregate functions (approx_distinct, approx_median, approx_percentile_cont).

What changes are included in this PR?

Core implementation (`datafusion/functions-aggregate/src/approx_top_k.rs`)

- `SpaceSavingSummary`: implements the Filtered Space-Saving algorithm, a variant of Space-Saving (Metwally et al., 2005):
  - Fixed-size counter summary with eviction of minimum-count items when full
  - Alpha map for improved accuracy: tracks recently evicted items and filters low-frequency noise before it enters the main summary (same approach used by ClickHouse)
  - `CAPACITY_MULTIPLIER = 3` (matching ClickHouse's default): the summary tracks k × 3 counters internally
  - Merge support for parallel/distributed execution
  - Binary serialization/deserialization for intermediate state transfer
- `ApproxTopK`: `AggregateUDFImpl` with `#[user_doc]` documentation, supporting any hashable scalar type
- `ApproxTopKAccumulator`: `Accumulator` with `update_batch`, `merge_batch`, `evaluate` (returns `List(Struct({value: T, count: UInt64}))` ordered by count descending), and `state` (serialized summary + data type)
- 9 unit tests covering basic operations, eviction, alpha map, merge, serialization, accumulator update/evaluate, merge_batch, and a distributed merge simulation
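The eviction and merge behaviour listed above can be sketched with the standard library alone. This is a minimal sketch under stated assumptions, not the PR's `SpaceSavingSummary`: the type name `SpaceSavingSketch` is hypothetical, values are plain strings, there is no alpha map, and the merge simply sums counters and keeps the largest ones, which may differ from how the PR combines partial summaries.

```rust
use std::collections::HashMap;

/// Plain Space-Saving summary (no alpha map) with at most `capacity` counters.
/// Illustrative only; not the PR's `SpaceSavingSummary`.
struct SpaceSavingSketch {
    capacity: usize,
    /// value -> (estimated count, possible overestimation)
    counters: HashMap<String, (u64, u64)>,
}

impl SpaceSavingSketch {
    /// Tracks `k * multiplier` counters (the PR describes a multiplier of 3).
    fn new(k: usize, multiplier: usize) -> Self {
        assert!(k > 0 && multiplier > 0, "capacity must be non-zero");
        Self { capacity: k * multiplier, counters: HashMap::new() }
    }

    fn insert(&mut self, value: &str) {
        if let Some(entry) = self.counters.get_mut(value) {
            entry.0 += 1; // already monitored
            return;
        }
        if self.counters.len() < self.capacity {
            self.counters.insert(value.to_string(), (1, 0)); // free slot
            return;
        }
        // Full: evict the minimum-count item (ties broken by value). The
        // newcomer inherits that count, recorded as possible overestimation.
        let (min_count, victim) = self
            .counters
            .iter()
            .map(|(key, entry)| (entry.0, key.clone()))
            .min()
            .expect("a full summary is never empty");
        self.counters.remove(&victim);
        self.counters.insert(value.to_string(), (min_count + 1, min_count));
    }

    /// Simplified merge: sum matching counters, then keep the `capacity` largest.
    fn merge(&mut self, other: &SpaceSavingSketch) {
        for (value, (count, error)) in &other.counters {
            let entry = self.counters.entry(value.clone()).or_insert((0, 0));
            entry.0 += *count;
            entry.1 += *error;
        }
        let keep: Vec<String> = self
            .top_k(self.capacity)
            .into_iter()
            .map(|(value, _)| value)
            .collect();
        self.counters.retain(|value, _| keep.contains(value));
    }

    /// Up to `k` (value, estimated count) pairs, highest count first.
    fn top_k(&self, k: usize) -> Vec<(String, u64)> {
        let mut ranked: Vec<(String, u64)> = self
            .counters
            .iter()
            .map(|(value, entry)| (value.clone(), entry.0))
            .collect();
        ranked.sort_by(|a, b| b.1.cmp(&a.1).then_with(|| a.0.cmp(&b.0)));
        ranked.truncate(k);
        ranked
    }
}
```

With a capacity of only two counters, inserting `a, a, a, b, b, c` evicts `b` and reports `c` with an estimated count of 3, which shows the overestimation that a larger capacity multiplier is meant to reduce.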

Registration (`datafusion/functions-aggregate/src/lib.rs`)

Added module, expr_fn re-export, and registration in all_default_aggregate_functions()

SQL logic tests (`datafusion/sqllogictest/test_files/approx_top_k.slt`)

8 tests: basic string/int, GROUP BY, k=1, NULL handling, error cases (invalid k), WHERE clause filtering

Documentation

docs/source/user-guide/sql/aggregate_functions.md — TOC entry + full documentation section
docs/source/user-guide/expressions.md — quick-reference table entry

Integration tests

Proto roundtrip (datafusion/proto/tests/cases/roundtrip_logical_plan.rs) — approx_top_k(col("a"), lit(3))
DataFrame (datafusion/core/tests/dataframe/dataframe_functions.rs) — test_fn_approx_top_k with insta::assert_snapshot!

Are these changes tested?

Yes:

9 unit tests in approx_top_k.rs (algorithm correctness, serialization, accumulator lifecycle, distributed merge)
8 SQL logic tests in approx_top_k.slt (end-to-end SQL behavior including edge cases and errors)
1 DataFrame integration test with snapshot assertion
1 proto roundtrip test

Are there any user-facing changes?

Yes — new approx_top_k aggregate function available in SQL and the DataFrame API:

```sql
-- Top 3 most frequent values
SELECT approx_top_k(url_path, 3) FROM access_logs;

-- Top 3 per group
SELECT region, approx_top_k(product_id, 3) FROM sales GROUP BY region;
```

```rust
// DataFrame API
use datafusion::functions_aggregate::approx_top_k::approx_top_k;
use datafusion::prelude::{col, lit};

// Top 5 most frequent values, with no grouping columns
df.aggregate(vec![], vec![approx_top_k(col("url_path"), lit(5))])?;
```

datafusion/functions-aggregate/src/approx_top_k.rs

Add a new approx_top_k(expression, k) aggregate function that returns the approximate top-k most frequent values with their estimated counts, using the Filtered Space-Saving algorithm. The implementation uses a capacity multiplier of 3 (matching ClickHouse's default) and includes an alpha map for improved accuracy by filtering low-frequency noise before it enters the main summary. Return type is List(Struct({value: T, count: UInt64})) ordered by count descending, where T matches the input column type. Closes apache#20967
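The alpha-map filtering mentioned above can be illustrated as follows. This is a sketch based on a general reading of filtered Space-Saving, not the PR's code: the type `FilteredSketch`, its methods, the use of `DefaultHasher`, and the string-only values are all assumptions made for the example. A value that is not yet monitored is credited in a hash-indexed filter slot and enters the summary only once its slot count reaches the current minimum counter.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

/// Space-Saving summary with an alpha-map filter in front of it.
/// Hypothetical illustration; names and layout are not taken from the PR.
struct FilteredSketch {
    capacity: usize,
    counters: HashMap<String, u64>,
    /// Hash-indexed counts credited to values that are not monitored.
    alpha: Vec<u64>,
}

impl FilteredSketch {
    fn new(capacity: usize, alpha_slots: usize) -> Self {
        assert!(capacity > 0 && alpha_slots > 0, "sizes must be non-zero");
        Self { capacity, counters: HashMap::new(), alpha: vec![0; alpha_slots] }
    }

    /// Filter slot for a value.
    fn slot(&self, value: &str) -> usize {
        let mut hasher = DefaultHasher::new();
        value.hash(&mut hasher);
        (hasher.finish() % self.alpha.len() as u64) as usize
    }

    fn insert(&mut self, value: &str) {
        if let Some(count) = self.counters.get_mut(value) {
            *count += 1; // already monitored
            return;
        }
        if self.counters.len() < self.capacity {
            self.counters.insert(value.to_string(), 1); // free slot
            return;
        }
        let (min_count, victim) = self
            .counters
            .iter()
            .map(|(key, count)| (*count, key.clone()))
            .min()
            .expect("a full summary is never empty");
        let slot = self.slot(value);
        if self.alpha[slot] + 1 < min_count {
            // Not frequent enough yet: absorb the hit in the filter only.
            self.alpha[slot] += 1;
            return;
        }
        // Promote the newcomer and remember the evicted count in the filter.
        let promoted = self.alpha[slot] + 1;
        let victim_slot = self.slot(&victim);
        self.alpha[victim_slot] = min_count;
        self.counters.remove(&victim);
        self.counters.insert(value.to_string(), promoted);
    }

    /// Estimated count for a monitored value, if any.
    fn estimate(&self, value: &str) -> Option<u64> {
        self.counters.get(value).copied()
    }
}
```

In this sketch a single stray value cannot displace an established heavy hitter: with one counter holding `a` at 3, two occurrences of `b` stay in the filter, and only the third promotes `b` into the summary.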

github-actions bot added the documentation, core, sqllogictest, proto, and functions labels on Mar 16, 2026

sesteves force-pushed the feat/approx-top-k branch 3 times, most recently from 0a2f5f8 to 1465c70, on March 16, 2026 17:58

Dandandan reviewed Mar 17, 2026

datafusion/functions-aggregate/src/approx_top_k.rs (resolved)

Dandandan reviewed Mar 17, 2026

datafusion/functions-aggregate/src/approx_top_k.rs (resolved)

martin-g reviewed Mar 17, 2026

datafusion/functions-aggregate/src/approx_top_k.rs (resolved)

datafusion/functions-aggregate/src/approx_top_k.rs (outdated, resolved)

datafusion/functions-aggregate/src/approx_top_k.rs (outdated, resolved)

sesteves force-pushed the feat/approx-top-k branch from 1465c70 to 7d12109 on March 17, 2026 12:26

sesteves force-pushed the feat/approx-top-k branch from 7d12109 to aa7bc47 on March 17, 2026 13:48

sesteves wants to merge 1 commit into apache:main from sesteves:feat/approx-top-k
