feat: add approx_top_k aggregate function#20968
Open
sesteves wants to merge 1 commit intoapache:mainfrom
Open
Conversation
0a2f5f8 to
1465c70
Compare
Dandandan
reviewed
Mar 17, 2026
Dandandan
reviewed
Mar 17, 2026
martin-g
reviewed
Mar 17, 2026
1465c70 to
7d12109
Compare
Add a new approx_top_k(expression, k) aggregate function that returns
the approximate top-k most frequent values with their estimated counts,
using the Filtered Space-Saving algorithm.
The implementation uses a capacity multiplier of 3 (matching ClickHouse's
default) and includes an alpha map for improved accuracy by filtering
low-frequency noise before it enters the main summary.
Return type is List(Struct({value: T, count: UInt64})) ordered by count
descending, where T matches the input column type.
Closes apache#20967
7d12109 to
aa7bc47
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Which issue does this PR close?
Closes #20967.
Rationale
Finding the most frequently occurring values in a column is a very common analytical pattern — top error codes, most popular products, most active users, frequent URL paths, etc. The exact approach (
GROUP BY value ORDER BY COUNT(*) DESC LIMIT k) requires materializing all distinct groups, which is memory-intensive and slow on high-cardinality columns.approx_top_ksolves this with a streaming approximation algorithm that operates in bounded memory, making it practical for large-scale analytics.This function is already available in ClickHouse (
topK) and via extensions in PostgreSQL and Druid, and is a natural addition to DataFusion's existing family of approximate aggregate functions (approx_distinct,approx_median,approx_percentile_cont).What changes are included in this PR?
Core implementation (
datafusion/functions-aggregate/src/approx_top_k.rs)SpaceSavingSummary— Implements the Filtered Space-Saving algorithm (Metwally et al., 2005):CAPACITY_MULTIPLIER = 3(matching ClickHouse's default) — the summary tracksk × 3counters internallyApproxTopK—AggregateUDFImplwith#[user_doc]documentation, supporting any hashable scalar typeApproxTopKAccumulator— Accumulator withupdate_batch,merge_batch,evaluate(returnsList(Struct({value: T, count: UInt64}))ordered by count descending), andstate(serialized summary + data type)9 unit tests covering basic operations, eviction, alpha map, merge, serialization, accumulator update/evaluate, merge_batch, and a distributed merge simulation
Registration (
datafusion/functions-aggregate/src/lib.rs)expr_fnre-export, and registration inall_default_aggregate_functions()SQL logic tests (
datafusion/sqllogictest/test_files/approx_top_k.slt)Documentation
docs/source/user-guide/sql/aggregate_functions.md— TOC entry + full documentation sectiondocs/source/user-guide/expressions.md— quick-reference table entryIntegration tests
datafusion/proto/tests/cases/roundtrip_logical_plan.rs) —approx_top_k(col("a"), lit(3))datafusion/core/tests/dataframe/dataframe_functions.rs) —test_fn_approx_top_kwithinsta::assert_snapshot!Are these changes tested?
Yes:
approx_top_k.rs(algorithm correctness, serialization, accumulator lifecycle, distributed merge)approx_top_k.slt(end-to-end SQL behavior including edge cases and errors)Are there any user-facing changes?
Yes — new
approx_top_kaggregate function available in SQL and the DataFrame API: