Skip to content

feat(python/sedonadb): add DataFrame.agg for global aggregation#887

Draft
jiayuasu wants to merge 1 commit into
apache:mainfrom
jiayuasu:feature/df-agg
Draft

feat(python/sedonadb): add DataFrame.agg for global aggregation#887
jiayuasu wants to merge 1 commit into
apache:mainfrom
jiayuasu:feature/df-agg

Conversation

@jiayuasu
Copy link
Copy Markdown
Member

First DataFrame consumer of the function-registry dispatch landed in #885. Adds global (ungrouped) aggregation; grouped aggregation (DataFrame.group_by(*keys).agg(*aggs)) is the next small PR, sharing the same Rust binding.

API

sd = sedonadb.connect()
df = sd.create_data_frame(pd.DataFrame({"x": [1, 2, 3, 4]}))

df.agg(sd.funcs.sum(col("x")).alias("total"))
df.agg(
    sd.funcs.sum(col("x")).alias("sum_x"),
    sd.funcs.count(col("y")).alias("n"),
    sd.funcs.min(col("x")).alias("lo"),
    sd.funcs.max(col("x")).alias("hi"),
)

Why this is so small

The function-registry dispatch in #885 means sd.funcs.sum, sd.funcs.count, sd.funcs.min, sd.funcs.max, sd.funcs.avg — and every other built-in / plugin / Python-registered aggregate — are already callable. This PR doesn't need any per-aggregate plumbing on either the Rust or Python side. One Rust binding, one Python method, a test file.

Implementation

File Change
python/sedonadb/src/dataframe.rs New InternalDataFrame::aggregate(group_exprs, agg_exprs). Generic wrapper over DataFusion's DataFrame::aggregate. Shared with the upcoming group_by PR — that path passes a populated group_exprs.
python/sedonadb/python/sedonadb/dataframe.py DataFrame.agg(*exprs). Calls the Rust binding with an empty group_exprs.

Test plan

9 tests in tests/expr/test_dataframe_agg.py:

  • Positive: single sum; single count; paired min/max; avg over a compound expression col("x") + col("y"); four aggregates yielding a one-row four-column result.
  • Lazy return: isinstance(out, DataFrame).
  • Errors: empty agg()ValueError; non-Expr arg → TypeError.
  • Plan composition: chained filter().agg() produces the right result.

All assertions use pd.testing.assert_frame_equal for outputs.

Local: 9 unit + 22 doctests + ruff format + ruff check all clean.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds Python DataFrame.agg(*exprs) support for global, ungrouped aggregation, using the existing function-registry expression dispatch and a new Rust binding over DataFusion aggregation.

Changes:

  • Adds InternalDataFrame::aggregate(group_exprs, agg_exprs) in Rust.
  • Adds Python DataFrame.agg() validation and lazy DataFrame return path.
  • Adds coverage for aggregate execution, errors, lazy return, and filter composition.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

File Description
python/sedonadb/src/dataframe.rs Adds the Rust aggregate binding over DataFusion DataFrame::aggregate.
python/sedonadb/python/sedonadb/dataframe.py Adds the public Python DataFrame.agg(*exprs) API.
python/sedonadb/tests/expr/test_dataframe_agg.py Adds tests for global aggregation behavior and validation.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

df.agg("x")


def test_agg_chains_with_select(con):
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Renamed to test_agg_chains_with_filter in 6fdc21c.

Comment thread python/sedonadb/src/dataframe.rs Outdated
Comment on lines +244 to +248
/// The Python side guarantees `agg_exprs` is non-empty. Argument
/// shape validation (every entry being an aggregate-shaped `Expr`)
/// happens Python-side. DataFusion's plan-build raises a clear
/// error if a non-aggregate Expr is passed in `agg_exprs`, so we
/// don't try to enforce that here.
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rewrote the comment in 6fdc21c to drop the contradictory "Python-side validation" sentence — the Python wrapper only checks isinstance(e, Expr), not aggregate-shapedness, and DataFusion's plan-build catches the rest.

Copy link
Copy Markdown
Member

@paleolimbot paleolimbot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Exciting...thank you!

Mostly nits...I'm hoping we can rename to aggregate (which is what DuckDB and Ibis call this).

Comment on lines +550 to +552
For grouped aggregation use `DataFrame.group_by(...).agg(...)`
(lands in a follow-up PR).

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
For grouped aggregation use `DataFrame.group_by(...).agg(...)`
(lands in a follow-up PR).

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Dropped in 2196d03 — no more forward-reference to grouped agg in the docstring.

Comment on lines +559 to +562
>>> from sedonadb.expr import col
>>> sd = sedona.db.connect()
>>> df = sd.sql("SELECT * FROM (VALUES (1), (2), (3), (4)) AS t(x)")
>>> df.agg(sd.funcs.sum(col("x")).alias("total")).show()
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
>>> from sedonadb.expr import col
>>> sd = sedona.db.connect()
>>> df = sd.sql("SELECT * FROM (VALUES (1), (2), (3), (4)) AS t(x)")
>>> df.agg(sd.funcs.sum(col("x")).alias("total")).show()
>>> sd = sedona.db.connect()
>>> df = sd.sql("SELECT * FROM (VALUES (1), (2), (3), (4)) AS t(x)")
>>> df.agg(sd.funcs.sum(sd.col("x")).alias("total")).show()

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Switched to sd.col("x") in 2196d03 — the from sedonadb.expr import col line is gone from the doctest.

Comment on lines +542 to +543
def agg(self, *exprs: Expr) -> "DataFrame":
"""Aggregate the entire DataFrame to a single row.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • Can we call this aggregate()? (Ibis, DuckDB)
  • Can we expose **kwargs like is done in select()? df.aggregate(x_sum=df.x.sum()) is much more compact than df.aggregate(df.x.sum().alias("x_sum")) and is allowed by Ibis.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PySpark, Pandas and Polars all use agg. I'd like to keep it that way.

Copy link
Copy Markdown
Member Author

@jiayuasu jiayuasu May 30, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

kwargs added in 2196d03df.agg(total=sd.funcs.sum(sd.col("x"))) now desugars to …sum(sd.col("x")).alias("total"), and positional + named can mix. Three new tests cover the kwarg path, mixed positional/kwarg, and the non-Expr kwarg value rejection.

Comment thread python/sedonadb/src/dataframe.rs Outdated
Comment on lines +241 to +242
/// from `DataFrame.agg`) and grouped aggregation (called from
/// `DataFrame.group_by(...).agg(...)` once that lands).
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
/// from `DataFrame.agg`) and grouped aggregation (called from
/// `DataFrame.group_by(...).agg(...)` once that lands).
/// from `DataFrame.agg`) and grouped aggregation.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Simplified in 2196d03.

First DataFrame consumer of the function-registry dispatch landed in
apache#885. Builds the call site that grouped aggregation will also use.

API:

    df.agg(sd.funcs.sum(col("x")).alias("total"))
    df.agg(
        sd.funcs.sum(col("x")).alias("sum_x"),
        sd.funcs.count(col("y")).alias("n"),
        sd.funcs.min(col("x")).alias("lo"),
        sd.funcs.max(col("x")).alias("hi"),
    )

- Varargs of aggregate `Expr` values. Aggregate exprs come from
  `sd.funcs.<name>(args)` via apache#885; no per-aggregate plumbing in this
  PR (or any future PR — that's the whole point of the registry
  dispatch).
- Strings rejected — `df.agg("x")` has no meaning since a bare column
  isn't an aggregate. No auto-promotion.
- Empty `df.agg()` → ValueError; non-Expr arg → TypeError.
- Returns a one-row DataFrame.

Rust side: `InternalDataFrame::aggregate(group_exprs, agg_exprs)` is
the generic binding for both `DataFrame.agg` (this PR — passes an
empty `group_exprs`) and `DataFrame.group_by(*keys).agg(*aggs)`
(next PR — same Rust call, with `group_exprs` populated). One
binding serves both surfaces.

Tests: 9 covering single-aggregate (sum/count), min+max paired,
avg over a compound expression, multiple-aggregates-one-row,
lazy return, both error paths, and chained `filter().agg()` for
plan composition.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants