feat(python/sedonadb): add DataFrame.agg for global aggregation by jiayuasu · Pull Request #887 · apache/sedona-db

jiayuasu · 2026-05-29T04:37:58Z

First DataFrame consumer of the function-registry dispatch that landed in #885. Adds global (ungrouped) aggregation; grouped aggregation (DataFrame.group_by(*keys).agg(*aggs)) is the next small PR and will share the same Rust binding.

API

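import pandas as pd
import sedonadb
from sedonadb.expr import col

sd = sedonadb.connect()
df = sd.create_data_frame(pd.DataFrame({"x": [1, 2, 3, 4]}))

# Single aggregate
df.agg(sd.funcs.sum(col("x")).alias("total"))

# Several aggregates in one call; the result is still one row
df.agg(
    sd.funcs.sum(col("x")).alias("sum_x"),
    sd.funcs.count(col("x")).alias("n"),
    sd.funcs.min(col("x")).alias("lo"),
    sd.funcs.max(col("x")).alias("hi"),
)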

Varargs of aggregate Expr values, built via sd.funcs.<name>(args) from #885 (feat(python/sedonadb): Expose scalar and aggregate udfs from context registry).
Strings are not auto-promoted — a bare column isn't an aggregate.
Empty df.agg() → ValueError; non-Expr arg → TypeError.
Returns a one-row DataFrame.
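
The argument rules above can be sketched in plain Python. This is only an illustration, not the code in this PR: the `Expr` class here is a hypothetical stand-in for sedonadb's expression type, and the helper name `validate_agg_args` and the error messages are assumptions.

```python
class Expr:
    """Hypothetical stand-in for sedonadb's Expr; it only carries a label."""

    def __init__(self, label):
        self.label = label


def validate_agg_args(*exprs):
    """Apply the checks listed above: non-empty, and every argument an Expr."""
    if not exprs:
        raise ValueError("agg() requires at least one aggregate expression")
    for i, e in enumerate(exprs):
        if not isinstance(e, Expr):
            # Strings land here too: they are rejected, not promoted to columns.
            raise TypeError(
                f"agg() argument {i} must be an Expr, got {type(e).__name__}"
            )
    return list(exprs)
```

Note that a check like this only looks at the Python type. Whether an `Expr` is actually an aggregate is not something an `isinstance` test can tell.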

Why this is so small

The function-registry dispatch in #885 means sd.funcs.sum, sd.funcs.count, sd.funcs.min, sd.funcs.max, sd.funcs.avg — and every other built-in / plugin / Python-registered aggregate — are already callable. This PR doesn't need any per-aggregate plumbing on either the Rust or Python side. One Rust binding, one Python method, a test file.
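
To illustrate why no per-aggregate plumbing is needed, here is a minimal sketch of name-based dispatch against a registry. It is not the implementation from #885: the `Funcs` and `Expr` classes and the set-based registry are hypothetical, chosen only to show the idea that any registered name becomes callable without extra code.

```python
class Expr:
    """Hypothetical expression node: a function name plus its arguments."""

    def __init__(self, func_name, args):
        self.func_name = func_name
        self.args = args


class Funcs:
    """Hypothetical namespace that resolves attribute access via a registry."""

    def __init__(self, registry):
        self._registry = registry  # set of registered function names

    def __getattr__(self, name):
        # Only called when normal attribute lookup fails.
        if name.startswith("_") or name not in self._registry:
            raise AttributeError(f"unknown function: {name}")

        def call(*args):
            return Expr(name, args)

        return call


funcs = Funcs({"sum", "count", "min", "max", "avg"})
total = funcs.sum("x")  # no dedicated `sum` method exists anywhere
```

Registering a new name in the registry is enough to make `funcs.<name>(...)` work, which is the property the paragraph above relies on.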

Implementation

| File | Change |
| --- | --- |
| `python/sedonadb/src/dataframe.rs` | New `InternalDataFrame::aggregate(group_exprs, agg_exprs)`. Generic wrapper over DataFusion's `DataFrame::aggregate`. Shared with the upcoming `group_by` PR; that path passes a populated `group_exprs`. |
| `python/sedonadb/python/sedonadb/dataframe.py` | `DataFrame.agg(*exprs)`. Calls the Rust binding with an empty `group_exprs`. |
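
The "one binding, two surfaces" design can be sketched in pure Python over a list of dicts. This is a toy model, not the Rust binding: the function names, the `(reducer, column)` encoding of an aggregate, and the row representation are all assumptions made for illustration.

```python
def aggregate(rows, group_keys, aggs):
    """Toy model of a shared aggregate call.

    rows: list of dicts; group_keys: list of column names;
    aggs: mapping of output name -> (reducer, input column).
    """
    groups = {}
    for row in rows:
        key = tuple(row[k] for k in group_keys)
        groups.setdefault(key, []).append(row)

    out = []
    for key, members in groups.items():
        result = dict(zip(group_keys, key))
        for name, (reducer, column) in aggs.items():
            result[name] = reducer([m[column] for m in members])
        out.append(result)
    return out


def agg(rows, **aggs):
    """Global aggregation: the same call with no group keys."""
    return aggregate(rows, [], aggs)


rows = [{"k": "a", "x": 1}, {"k": "a", "x": 2}, {"k": "b", "x": 3}]
global_result = agg(rows, total=(sum, "x"))
grouped_result = aggregate(rows, ["k"], {"total": (sum, "x")})
```

With no group keys every row falls into the single group keyed by the empty tuple, so the global case returns one row without any special handling.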

Test plan

9 tests in tests/expr/test_dataframe_agg.py:

Positive: single sum; single count; paired min/max; avg over a compound expression col("x") + col("y"); four aggregates yielding a one-row four-column result.
Lazy return: isinstance(out, DataFrame).
Errors: empty agg() → ValueError; non-Expr arg → TypeError.
Plan composition: chained filter().agg() produces the right result.

All assertions use pd.testing.assert_frame_equal for outputs.

Local: 9 unit + 22 doctests + ruff format + ruff check all clean.

Copilot

Pull request overview

Adds Python DataFrame.agg(*exprs) support for global, ungrouped aggregation, using the existing function-registry expression dispatch and a new Rust binding over DataFusion aggregation.

Changes:

Adds InternalDataFrame::aggregate(group_exprs, agg_exprs) in Rust.
Adds Python DataFrame.agg() validation and lazy DataFrame return path.
Adds coverage for aggregate execution, errors, lazy return, and filter composition.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| `python/sedonadb/src/dataframe.rs` | Adds the Rust aggregate binding over DataFusion `DataFrame::aggregate`. |
| `python/sedonadb/python/sedonadb/dataframe.py` | Adds the public Python `DataFrame.agg(*exprs)` API. |
| `python/sedonadb/tests/expr/test_dataframe_agg.py` | Adds tests for global aggregation behavior and validation. |

jiayuasu · 2026-05-29T05:14:01Z

+        df.agg("x")
+
+
+def test_agg_chains_with_select(con):


Renamed to test_agg_chains_with_filter in 6fdc21c.

jiayuasu · 2026-05-29T05:14:03Z

+    /// The Python side guarantees `agg_exprs` is non-empty. Argument
+    /// shape validation (every entry being an aggregate-shaped `Expr`)
+    /// happens Python-side. DataFusion's plan-build raises a clear
+    /// error if a non-aggregate Expr is passed in `agg_exprs`, so we
+    /// don't try to enforce that here.


Rewrote the comment in 6fdc21c to drop the contradictory "Python-side validation" sentence — the Python wrapper only checks isinstance(e, Expr), not aggregate-shapedness, and DataFusion's plan-build catches the rest.

paleolimbot

Exciting...thank you!

Mostly nits...I'm hoping we can rename to aggregate (which is what DuckDB and Ibis call this).

paleolimbot · 2026-05-30T01:18:44Z

+        For grouped aggregation use `DataFrame.group_by(...).agg(...)`
+        (lands in a follow-up PR).
+


Suggested change

-        For grouped aggregation use `DataFrame.group_by(...).agg(...)`
-        (lands in a follow-up PR).

Dropped in 2196d03 — no more forward-reference to grouped agg in the docstring.

paleolimbot · 2026-05-30T01:19:11Z

+            >>> from sedonadb.expr import col
+            >>> sd = sedona.db.connect()
+            >>> df = sd.sql("SELECT * FROM (VALUES (1), (2), (3), (4)) AS t(x)")
+            >>> df.agg(sd.funcs.sum(col("x")).alias("total")).show()


Suggested change

-            >>> from sedonadb.expr import col
-            >>> sd = sedona.db.connect()
-            >>> df = sd.sql("SELECT * FROM (VALUES (1), (2), (3), (4)) AS t(x)")
-            >>> df.agg(sd.funcs.sum(col("x")).alias("total")).show()
+            >>> sd = sedona.db.connect()
+            >>> df = sd.sql("SELECT * FROM (VALUES (1), (2), (3), (4)) AS t(x)")
+            >>> df.agg(sd.funcs.sum(sd.col("x")).alias("total")).show()

Switched to sd.col("x") in 2196d03 — the from sedonadb.expr import col line is gone from the doctest.

paleolimbot · 2026-05-30T01:43:09Z

+    def agg(self, *exprs: Expr) -> "DataFrame":
+        """Aggregate the entire DataFrame to a single row.


Can we call this aggregate()? (Ibis, DuckDB)

Can we expose **kwargs like is done in select()? df.aggregate(x_sum=df.x.sum()) is much more compact than df.aggregate(df.x.sum().alias("x_sum")) and is allowed by Ibis.

PySpark, Pandas and Polars all use agg. I'd like to keep it that way.

kwargs added in 2196d03 — df.agg(total=sd.funcs.sum(sd.col("x"))) now desugars to …sum(sd.col("x")).alias("total"), and positional + named can mix. Three new tests cover the kwarg path, mixed positional/kwarg, and the non-Expr kwarg value rejection.
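
The kwarg desugaring described above can be sketched as follows. This is an assumption-laden illustration rather than the code from 2196d03: `Expr` is a hypothetical stand-in, `normalize_agg_args` is an invented helper name, and placing positional expressions before named ones is an assumed ordering.

```python
class Expr:
    """Hypothetical stand-in for sedonadb's Expr with an alias() method."""

    def __init__(self, text, name=None):
        self.text = text
        self.name = name

    def alias(self, name):
        return Expr(self.text, name)


def normalize_agg_args(*exprs, **named_exprs):
    """Turn agg(*exprs, **named_exprs) into one list of aliased Exprs."""
    for value in (*exprs, *named_exprs.values()):
        if not isinstance(value, Expr):
            raise TypeError(f"expected Expr, got {type(value).__name__}")
    # name=expr is rewritten to expr.alias(name).
    combined = list(exprs) + [e.alias(n) for n, e in named_exprs.items()]
    if not combined:
        raise ValueError("agg() requires at least one aggregate expression")
    return combined


mixed = normalize_agg_args(Expr("min(x)").alias("lo"), total=Expr("sum(x)"))
```

Here `mixed` holds two expressions named `lo` and `total`, so the keyword form and the explicit `.alias(...)` form end up equivalent.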

paleolimbot · 2026-05-30T01:49:48Z

+    /// from `DataFrame.agg`) and grouped aggregation (called from
+    /// `DataFrame.group_by(...).agg(...)` once that lands).


Suggested change

-    /// from `DataFrame.agg`) and grouped aggregation (called from
-    /// `DataFrame.group_by(...).agg(...)` once that lands).
+    /// from `DataFrame.agg`) and grouped aggregation.

Simplified in 2196d03.

First DataFrame consumer of the function-registry dispatch landed in apache#885. Builds the call site that grouped aggregation will also use.

API:

df.agg(sd.funcs.sum(col("x")).alias("total"))
df.agg(
    sd.funcs.sum(col("x")).alias("sum_x"),
    sd.funcs.count(col("y")).alias("n"),
    sd.funcs.min(col("x")).alias("lo"),
    sd.funcs.max(col("x")).alias("hi"),
)

- Varargs of aggregate `Expr` values. Aggregate exprs come from `sd.funcs.<name>(args)` via apache#885; no per-aggregate plumbing in this PR (or any future PR; that's the whole point of the registry dispatch).
- Strings rejected: `df.agg("x")` has no meaning since a bare column isn't an aggregate. No auto-promotion.
- Empty `df.agg()` → ValueError; non-Expr arg → TypeError.
- Returns a one-row DataFrame.

Rust side: `InternalDataFrame::aggregate(group_exprs, agg_exprs)` is the generic binding for both `DataFrame.agg` (this PR, which passes an empty `group_exprs`) and `DataFrame.group_by(*keys).agg(*aggs)` (next PR: same Rust call, with `group_exprs` populated). One binding serves both surfaces.

Tests: 9 covering single-aggregate (sum/count), min+max paired, avg over a compound expression, multiple-aggregates-one-row, lazy return, both error paths, and chained `filter().agg()` for plan composition.

github-actions Bot requested a review from prantogg May 29, 2026 04:38

jiayuasu requested a review from Copilot May 29, 2026 04:39

Copilot started reviewing on behalf of jiayuasu May 29, 2026 04:39

Copilot AI reviewed May 29, 2026

jiayuasu force-pushed the feature/df-agg branch from f42bb35 to 6fdc21c Compare May 29, 2026 05:13

paleolimbot reviewed May 30, 2026

jiayuasu force-pushed the feature/df-agg branch from 6fdc21c to 2196d03 Compare May 30, 2026 03:24
