Skip to content

feat(python/sedonadb): add DataFrame.sort with composable SortExpr#859

Merged
jiayuasu merged 1 commit into
apache:mainfrom
jiayuasu:feature/df-sort-values
May 23, 2026
Merged

feat(python/sedonadb): add DataFrame.sort with composable SortExpr#859
jiayuasu merged 1 commit into
apache:mainfrom
jiayuasu:feature/df-sort-values

Conversation

@jiayuasu
Copy link
Copy Markdown
Member

@jiayuasu jiayuasu commented May 19, 2026

Redesigned per @paleolimbot's review on the earlier draft — replacing pandas-style sort_values(by=, ascending=) with the composable API pattern from DataFusion-python / DuckDB / Ibis.

Continues Phase P2 of #791.

API

from sedonadb.expr import col, sort_expr

# Common cases — Expr.asc()/.desc() return SortExpr
df.sort("x")                          # str auto-promotes to ascending
df.sort(col("x").desc())
df.sort(col("x"), col("y").desc())    # varargs, multi-key
df.sort(col("x") + col("y"))          # arbitrary Expr key

# Full control via the factory
df.sort(sort_expr(col("x"), asc=False, nulls_first=True))

DataFrame.sort accepts each key as str | Expr | SortExpr. Strings and bare Expr auto-promote to ascending keys with nulls last; Expr.asc() / Expr.desc() cover the common direction switch; sedonadb.expr.sort_expr(expr, asc=..., nulls_first=...) exposes the full DataFusion SortExpr knobs.

What's gone vs. the earlier draft

  • sort_values() — removed entirely. Per the discussion: no pandas-compat duplicate at this layer; that can come later in a dedicated GeoPandas-compat surface.
  • by= / ascending= keyword args — replaced by varargs (formats better with Ruff and matches the three reference interfaces).

Null placement

Expr.asc() and Expr.desc() default to nulls_first=False (nulls last) regardless of direction — matches the previous draft's pandas-style default and what the rest of the SedonaDB Python API ships. SQL-style nulls-first-on-descending is available via sort_expr(col("x"), asc=False, nulls_first=True).

Implementation

File Change
python/sedonadb/src/expr.rs New PySortExpr PyO3 class wrapping datafusion_expr::SortExpr. Expr.asc(nulls_first) / .desc(nulls_first) methods. expr_sort_expr(expr, asc, nulls_first) factory.
python/sedonadb/src/dataframe.rs InternalDataFrame::sort(Vec<PySortExpr>); old sort_by_keys gone.
python/sedonadb/src/lib.rs Register PySortExpr + expr_sort_expr.
python/sedonadb/python/sedonadb/expr/expression.py Python SortExpr class, sort_expr() factory, Expr.asc/.desc methods.
python/sedonadb/python/sedonadb/expr/__init__.py Re-export SortExpr, sort_expr.
python/sedonadb/python/sedonadb/dataframe.py DataFrame.sort(*keys). Module-level SortExpr import (no lazy imports, per the policy locked on #852).

Tests

  • tests/expr/test_dataframe_sort.py — 14 tests covering string/Expr/SortExpr keys, mixed-direction multi-key, computed-Expr keys, nulls-last default in both directions, sort_expr(nulls_first=True), lazy return type (isinstance(out, DataFrame)), and the empty / bad-type / unknown-column error paths.
  • tests/expr/test_expression.py — 9 new tests pinning the exact repr() of SortExpr for asc / desc / both null placements / the factory / both type-guard rejections.

Local: 97 unit + 29 doctests + ruff format + ruff check all clean.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

@github-actions github-actions Bot requested a review from prantogg May 19, 2026 05:51
Copy link
Copy Markdown
Member

@paleolimbot paleolimbot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since Pandas invented sort_values() there have been more elegant/composable ways to handle this that have evolved.

For our purposes, I think we should:

  • Expose all the bells and whistles of DataFusion's SortExpr via sedonadb.expr.sort_expr(expr, <options like asc/dsc/nulls>)
  • Add methods to Expr (.asc(nulls), .desc(nulls)) that return SortExpr
  • Accept str, Expr, or SortExpr in sort_values
  • Name it either sort() (DataFusion python, DuckDB) or order_by() (Ibis)
  • Accept multiple arguments instead of a list (auto formats with Ruff into multiple lines better, all three newer interfaces do this)

Some more elegant interfaces for reference:

@jiayuasu
Copy link
Copy Markdown
Member Author

Should we keep both sort_values and sort to maintain some compatibility with Pandas?

@paleolimbot
Copy link
Copy Markdown
Member

Should we keep both sort_values and sort to maintain some compatibility with Pandas?

I don't think so. The GeoPandas compatibility layer is separate where we can do that thing and do that thing well. We can always add later...I'd prefer to start without duplicate "convenience" until we're sure they're actually convenient.

jiayuasu added a commit to jiayuasu/sedona-db that referenced this pull request May 22, 2026
Replace the pandas-style `sort_values(by=, ascending=)` proposed in
the earlier draft with the composable API pattern preferred in
DataFusion-python / DuckDB / Ibis: a `sort(*keys)` varargs method
plus a first-class `SortExpr` type users build via `Expr.asc/desc()`
or `sedonadb.expr.sort_expr()`.

API:

    df.sort("x")
    df.sort(col("x").desc())
    df.sort(col("x"), col("y").desc())                       # varargs
    df.sort(sort_expr(col("x") + col("y"),
                      asc=False, nulls_first=False))         # full knobs

`DataFrame.sort` accepts each key as `str`, `Expr`, or `SortExpr`.
Strings and bare `Expr` auto-promote to ascending keys with nulls
placed last. Use `Expr.asc()` / `Expr.desc()` for the common
direction switch, and `sedonadb.expr.sort_expr(expr, asc=...,
nulls_first=...)` for full control (e.g. SQL-style nulls-first on
descending).

Rust side:
- New `PySortExpr` PyO3 class wrapping `datafusion_expr::SortExpr`,
  exposed to Python as `_lib.InternalSortExpr`.
- `PyExpr.asc(nulls_first=false)` / `PyExpr.desc(nulls_first=false)`
  return `PySortExpr`.
- `expr_sort_expr(expr, asc=true, nulls_first=false)` factory.
- `InternalDataFrame::sort_by_keys` removed in favor of
  `InternalDataFrame::sort(Vec<PySortExpr>)`.

Python side:
- New `SortExpr` user-facing class in `sedonadb.expr.expression`.
- Top-level `sort_expr()` factory plus `Expr.asc/desc` methods.
- `sedonadb.expr.SortExpr` and `sedonadb.expr.sort_expr` re-exported.
- `DataFrame.sort_values` removed; `DataFrame.sort(*keys)` added.
  Module-level import of `SortExpr` keeps the lazy-import policy
  consistent with the rest of `dataframe.py`.

Tests: 14 in `test_dataframe_sort.py` covering string/Expr/SortExpr
keys, mixed-direction multi-key, computed-Expr keys, nulls-last
default in both directions, `sort_expr(nulls_first=True)`, lazy
return type, and the empty/bad-type/unknown-column error paths.
9 new tests in `test_expression.py` lock the exact `repr()` of
`SortExpr` for asc/desc, both null placements, the factory, and
the type-guard rejections.

Closes the design discussion on apache#859.
@jiayuasu jiayuasu force-pushed the feature/df-sort-values branch from 5176de9 to d88b91f Compare May 22, 2026 06:33
@jiayuasu jiayuasu changed the title feat(python/sedonadb): add DataFrame.sort_values feat(python/sedonadb): add DataFrame.sort with composable SortExpr May 22, 2026
@jiayuasu
Copy link
Copy Markdown
Member Author

@paleolimbot — redesigned per your review. Force-pushed d88b91f6 over the earlier sort_values commit. Branch name is stale (kept it to preserve this PR thread); the PR title and description are updated.

All five of your asks landed:

  1. sedonadb.expr.sort_expr(expr, asc=True, nulls_first=False) — full-knob factory.
  2. Expr.asc(nulls_first=False) / Expr.desc(nulls_first=False) — common-case sugar, return SortExpr.
  3. DataFrame.sort accepts str | Expr | SortExpr — strings and bare Expr auto-promote to ascending keys.
  4. Named sort() (matching DataFusion-python and DuckDB) rather than order_by().
  5. Varargsdf.sort(col("x"), col("y").desc()), no list.

sort_values() is gone — no pandas-compat duplicate at this layer, per the thread.

The Expr.asc/.desc shortcuts default to nulls_first=False (nulls last) in both directions, matching what the rest of the SedonaDB Python API already ships. SQL-style nulls-first-on-descending is reachable via sort_expr(col("x"), asc=False, nulls_first=True).

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 8 out of 8 changed files in this pull request and generated 1 comment.

Comment on lines +304 to +317
coerced: List = []
for k in keys:
if isinstance(k, SortExpr):
coerced.append(k._impl)
elif isinstance(k, Expr):
# Default direction is ascending, nulls last.
coerced.append(k.asc()._impl)
elif isinstance(k, str):
coerced.append(_col(k).asc()._impl)
else:
raise TypeError(
f"sort() expects str, Expr, or SortExpr arguments, "
f"got {type(k).__name__}"
)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you're fine here 🙂

Copy link
Copy Markdown
Member

@paleolimbot paleolimbot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you!

Comment on lines +20 to +26
from sedonadb._lib import InternalExpr as _InternalExpr
from sedonadb._lib import InternalSortExpr as _InternalSortExpr
from sedonadb._lib import expr_binary as _expr_binary
from sedonadb._lib import expr_col as _expr_col
from sedonadb._lib import expr_lit as _expr_lit
from sedonadb._lib import expr_not as _expr_not
from sedonadb._lib import expr_sort_expr as _expr_sort_expr
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you consense these into a single import? Ruff's "format imports" might do this for you nicely.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Combined in d17eccc — single from sedonadb._lib import (..., ...) block. Didn't enable a workspace-wide isort rule in this PR; happy to do it separately if you'd like.

Comment on lines +304 to +317
coerced: List = []
for k in keys:
if isinstance(k, SortExpr):
coerced.append(k._impl)
elif isinstance(k, Expr):
# Default direction is ascending, nulls last.
coerced.append(k.asc()._impl)
elif isinstance(k, str):
coerced.append(_col(k).asc()._impl)
else:
raise TypeError(
f"sort() expects str, Expr, or SortExpr arguments, "
f"got {type(k).__name__}"
)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you're fine here 🙂

Replace the pandas-style `sort_values(by=, ascending=)` proposed in
the earlier draft with the composable API pattern preferred in
DataFusion-python / DuckDB / Ibis: a `sort(*keys)` varargs method
plus a first-class `SortExpr` type users build via `Expr.asc/desc()`
or `sedonadb.expr.sort_expr()`.

API:

    df.sort("x")
    df.sort(col("x").desc())
    df.sort(col("x"), col("y").desc())                       # varargs
    df.sort(sort_expr(col("x") + col("y"),
                      asc=False, nulls_first=False))         # full knobs

`DataFrame.sort` accepts each key as `str`, `Expr`, or `SortExpr`.
Strings and bare `Expr` auto-promote to ascending keys with nulls
placed last. Use `Expr.asc()` / `Expr.desc()` for the common
direction switch, and `sedonadb.expr.sort_expr(expr, asc=...,
nulls_first=...)` for full control (e.g. SQL-style nulls-first on
descending).

Rust side:
- New `PySortExpr` PyO3 class wrapping `datafusion_expr::SortExpr`,
  exposed to Python as `_lib.InternalSortExpr`.
- `PyExpr.asc(nulls_first=false)` / `PyExpr.desc(nulls_first=false)`
  return `PySortExpr`.
- `expr_sort_expr(expr, asc=true, nulls_first=false)` factory.
- `InternalDataFrame::sort_by_keys` removed in favor of
  `InternalDataFrame::sort(Vec<PySortExpr>)`.

Python side:
- New `SortExpr` user-facing class in `sedonadb.expr.expression`.
- Top-level `sort_expr()` factory plus `Expr.asc/desc` methods.
- `sedonadb.expr.SortExpr` and `sedonadb.expr.sort_expr` re-exported.
- `DataFrame.sort_values` removed; `DataFrame.sort(*keys)` added.
  Module-level import of `SortExpr` keeps the lazy-import policy
  consistent with the rest of `dataframe.py`.

Tests: 14 in `test_dataframe_sort.py` covering string/Expr/SortExpr
keys, mixed-direction multi-key, computed-Expr keys, nulls-last
default in both directions, `sort_expr(nulls_first=True)`, lazy
return type, and the empty/bad-type/unknown-column error paths.
9 new tests in `test_expression.py` lock the exact `repr()` of
`SortExpr` for asc/desc, both null placements, the factory, and
the type-guard rejections.

Closes the design discussion on apache#859.
@jiayuasu jiayuasu force-pushed the feature/df-sort-values branch from d88b91f to d17eccc Compare May 22, 2026 19:47
@jiayuasu jiayuasu marked this pull request as ready for review May 22, 2026 19:47
@jiayuasu jiayuasu merged commit fdd0d84 into apache:main May 23, 2026
5 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants