Skip to content

feat: add approx_top_k aggregate function#20968

Open
sesteves wants to merge 1 commit intoapache:mainfrom
sesteves:feat/approx-top-k
Open

feat: add approx_top_k aggregate function#20968
sesteves wants to merge 1 commit intoapache:mainfrom
sesteves:feat/approx-top-k

Conversation

@sesteves
Copy link
Contributor

@sesteves sesteves commented Mar 16, 2026

Which issue does this PR close?

Closes #20967.

Rationale

Finding the most frequently occurring values in a column is a very common analytical pattern — top error codes, most popular products, most active users, frequent URL paths, etc. The exact approach (GROUP BY value ORDER BY COUNT(*) DESC LIMIT k) requires materializing all distinct groups, which is memory-intensive and slow on high-cardinality columns.

approx_top_k solves this with a streaming approximation algorithm that operates in bounded memory, making it practical for large-scale analytics.

This function is already available in ClickHouse (topK) and via extensions in PostgreSQL and Druid, and is a natural addition to DataFusion's existing family of approximate aggregate functions (approx_distinct, approx_median, approx_percentile_cont).

What changes are included in this PR?

Core implementation (datafusion/functions-aggregate/src/approx_top_k.rs)

  • SpaceSavingSummary — Implements the Filtered Space-Saving algorithm (Metwally et al., 2005):

    • Fixed-size counter summary with eviction of minimum-count items when full
    • Alpha map for improved accuracy — tracks recently evicted items and filters low-frequency noise before it enters the main summary (same approach used by ClickHouse)
    • CAPACITY_MULTIPLIER = 3 (matching ClickHouse's default) — the summary tracks k × 3 counters internally
    • Merge support for parallel/distributed execution
    • Binary serialization/deserialization for intermediate state transfer
  • ApproxTopKAggregateUDFImpl with #[user_doc] documentation, supporting any hashable scalar type

  • ApproxTopKAccumulator — Accumulator with update_batch, merge_batch, evaluate (returns List(Struct({value: T, count: UInt64})) ordered by count descending), and state (serialized summary + data type)

  • 9 unit tests covering basic operations, eviction, alpha map, merge, serialization, accumulator update/evaluate, merge_batch, and a distributed merge simulation

Registration (datafusion/functions-aggregate/src/lib.rs)

  • Added module, expr_fn re-export, and registration in all_default_aggregate_functions()

SQL logic tests (datafusion/sqllogictest/test_files/approx_top_k.slt)

  • 8 tests: basic string/int, GROUP BY, k=1, NULL handling, error cases (invalid k), WHERE clause filtering

Documentation

  • docs/source/user-guide/sql/aggregate_functions.md — TOC entry + full documentation section
  • docs/source/user-guide/expressions.md — quick-reference table entry

Integration tests

  • Proto roundtrip (datafusion/proto/tests/cases/roundtrip_logical_plan.rs) — approx_top_k(col("a"), lit(3))
  • DataFrame (datafusion/core/tests/dataframe/dataframe_functions.rs) — test_fn_approx_top_k with insta::assert_snapshot!

Are these changes tested?

Yes:

  • 9 unit tests in approx_top_k.rs (algorithm correctness, serialization, accumulator lifecycle, distributed merge)
  • 8 SQL logic tests in approx_top_k.slt (end-to-end SQL behavior including edge cases and errors)
  • 1 DataFrame integration test with snapshot assertion
  • 1 proto roundtrip test

Are there any user-facing changes?

Yes — new approx_top_k aggregate function available in SQL and the DataFrame API:

-- Top 3 most frequent values
SELECT approx_top_k(url_path, 3) FROM access_logs;

-- Top 3 per group
SELECT region, approx_top_k(product_id, 3) FROM sales GROUP BY region;
// DataFrame API
use datafusion::functions_aggregate::approx_top_k::approx_top_k;
df.aggregate(vec![], vec![approx_top_k(col("url_path"), lit(5))])?;

@github-actions github-actions bot added documentation Improvements or additions to documentation core Core DataFusion crate sqllogictest SQL Logic Tests (.slt) proto Related to proto crate functions Changes to functions implementation labels Mar 16, 2026
@sesteves sesteves force-pushed the feat/approx-top-k branch 3 times, most recently from 0a2f5f8 to 1465c70 Compare March 16, 2026 17:58
@sesteves sesteves force-pushed the feat/approx-top-k branch from 1465c70 to 7d12109 Compare March 17, 2026 12:26
Add a new approx_top_k(expression, k) aggregate function that returns
the approximate top-k most frequent values with their estimated counts,
using the Filtered Space-Saving algorithm.

The implementation uses a capacity multiplier of 3 (matching ClickHouse's
default) and includes an alpha map for improved accuracy by filtering
low-frequency noise before it enters the main summary.

Return type is List(Struct({value: T, count: UInt64})) ordered by count
descending, where T matches the input column type.

Closes apache#20967
@sesteves sesteves force-pushed the feat/approx-top-k branch from 7d12109 to aa7bc47 Compare March 17, 2026 13:48
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

core Core DataFusion crate documentation Improvements or additions to documentation functions Changes to functions implementation proto Related to proto crate sqllogictest SQL Logic Tests (.slt)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

feat: Add approx_top_k aggregate function for approximate heavy hitter / top-k frequency estimation

3 participants