feat(python/sedonadb): add DataFrame.sort with composable SortExpr#859
Conversation
There was a problem hiding this comment.
Since Pandas invented sort_values() there have been more elegant/composable ways to handle this that have evolved.
For our purposes, I think we should:
- Expose all the bells and whistles of DataFusion's
SortExprviasedonadb.expr.sort_expr(expr, <options like asc/dsc/nulls>) - Add methods to
Expr(.asc(nulls),.desc(nulls)) that returnSortExpr - Accept
str,Expr, orSortExprin sort_values - Name it either
sort()(DataFusion python, DuckDB) ororder_by()(Ibis) - Accept multiple arguments instead of a list (auto formats with Ruff into multiple lines better, all three newer interfaces do this)
Some more elegant interfaces for reference:
|
Should we keep both |
I don't think so. The GeoPandas compatibility layer is separate where we can do that thing and do that thing well. We can always add later...I'd prefer to start without duplicate "convenience" until we're sure they're actually convenient. |
Replace the pandas-style `sort_values(by=, ascending=)` proposed in
the earlier draft with the composable API pattern preferred in
DataFusion-python / DuckDB / Ibis: a `sort(*keys)` varargs method
plus a first-class `SortExpr` type users build via `Expr.asc/desc()`
or `sedonadb.expr.sort_expr()`.
API:
df.sort("x")
df.sort(col("x").desc())
df.sort(col("x"), col("y").desc()) # varargs
df.sort(sort_expr(col("x") + col("y"),
asc=False, nulls_first=False)) # full knobs
`DataFrame.sort` accepts each key as `str`, `Expr`, or `SortExpr`.
Strings and bare `Expr` auto-promote to ascending keys with nulls
placed last. Use `Expr.asc()` / `Expr.desc()` for the common
direction switch, and `sedonadb.expr.sort_expr(expr, asc=...,
nulls_first=...)` for full control (e.g. SQL-style nulls-first on
descending).
Rust side:
- New `PySortExpr` PyO3 class wrapping `datafusion_expr::SortExpr`,
exposed to Python as `_lib.InternalSortExpr`.
- `PyExpr.asc(nulls_first=false)` / `PyExpr.desc(nulls_first=false)`
return `PySortExpr`.
- `expr_sort_expr(expr, asc=true, nulls_first=false)` factory.
- `InternalDataFrame::sort_by_keys` removed in favor of
`InternalDataFrame::sort(Vec<PySortExpr>)`.
Python side:
- New `SortExpr` user-facing class in `sedonadb.expr.expression`.
- Top-level `sort_expr()` factory plus `Expr.asc/desc` methods.
- `sedonadb.expr.SortExpr` and `sedonadb.expr.sort_expr` re-exported.
- `DataFrame.sort_values` removed; `DataFrame.sort(*keys)` added.
Module-level import of `SortExpr` keeps the lazy-import policy
consistent with the rest of `dataframe.py`.
Tests: 14 in `test_dataframe_sort.py` covering string/Expr/SortExpr
keys, mixed-direction multi-key, computed-Expr keys, nulls-last
default in both directions, `sort_expr(nulls_first=True)`, lazy
return type, and the empty/bad-type/unknown-column error paths.
9 new tests in `test_expression.py` lock the exact `repr()` of
`SortExpr` for asc/desc, both null placements, the factory, and
the type-guard rejections.
Closes the design discussion on apache#859.
5176de9 to
d88b91f
Compare
|
@paleolimbot — redesigned per your review. Force-pushed All five of your asks landed:
The |
| coerced: List = [] | ||
| for k in keys: | ||
| if isinstance(k, SortExpr): | ||
| coerced.append(k._impl) | ||
| elif isinstance(k, Expr): | ||
| # Default direction is ascending, nulls last. | ||
| coerced.append(k.asc()._impl) | ||
| elif isinstance(k, str): | ||
| coerced.append(_col(k).asc()._impl) | ||
| else: | ||
| raise TypeError( | ||
| f"sort() expects str, Expr, or SortExpr arguments, " | ||
| f"got {type(k).__name__}" | ||
| ) |
| from sedonadb._lib import InternalExpr as _InternalExpr | ||
| from sedonadb._lib import InternalSortExpr as _InternalSortExpr | ||
| from sedonadb._lib import expr_binary as _expr_binary | ||
| from sedonadb._lib import expr_col as _expr_col | ||
| from sedonadb._lib import expr_lit as _expr_lit | ||
| from sedonadb._lib import expr_not as _expr_not | ||
| from sedonadb._lib import expr_sort_expr as _expr_sort_expr |
There was a problem hiding this comment.
Can you consense these into a single import? Ruff's "format imports" might do this for you nicely.
There was a problem hiding this comment.
Combined in d17eccc — single from sedonadb._lib import (..., ...) block. Didn't enable a workspace-wide isort rule in this PR; happy to do it separately if you'd like.
| coerced: List = [] | ||
| for k in keys: | ||
| if isinstance(k, SortExpr): | ||
| coerced.append(k._impl) | ||
| elif isinstance(k, Expr): | ||
| # Default direction is ascending, nulls last. | ||
| coerced.append(k.asc()._impl) | ||
| elif isinstance(k, str): | ||
| coerced.append(_col(k).asc()._impl) | ||
| else: | ||
| raise TypeError( | ||
| f"sort() expects str, Expr, or SortExpr arguments, " | ||
| f"got {type(k).__name__}" | ||
| ) |
Replace the pandas-style `sort_values(by=, ascending=)` proposed in
the earlier draft with the composable API pattern preferred in
DataFusion-python / DuckDB / Ibis: a `sort(*keys)` varargs method
plus a first-class `SortExpr` type users build via `Expr.asc/desc()`
or `sedonadb.expr.sort_expr()`.
API:
df.sort("x")
df.sort(col("x").desc())
df.sort(col("x"), col("y").desc()) # varargs
df.sort(sort_expr(col("x") + col("y"),
asc=False, nulls_first=False)) # full knobs
`DataFrame.sort` accepts each key as `str`, `Expr`, or `SortExpr`.
Strings and bare `Expr` auto-promote to ascending keys with nulls
placed last. Use `Expr.asc()` / `Expr.desc()` for the common
direction switch, and `sedonadb.expr.sort_expr(expr, asc=...,
nulls_first=...)` for full control (e.g. SQL-style nulls-first on
descending).
Rust side:
- New `PySortExpr` PyO3 class wrapping `datafusion_expr::SortExpr`,
exposed to Python as `_lib.InternalSortExpr`.
- `PyExpr.asc(nulls_first=false)` / `PyExpr.desc(nulls_first=false)`
return `PySortExpr`.
- `expr_sort_expr(expr, asc=true, nulls_first=false)` factory.
- `InternalDataFrame::sort_by_keys` removed in favor of
`InternalDataFrame::sort(Vec<PySortExpr>)`.
Python side:
- New `SortExpr` user-facing class in `sedonadb.expr.expression`.
- Top-level `sort_expr()` factory plus `Expr.asc/desc` methods.
- `sedonadb.expr.SortExpr` and `sedonadb.expr.sort_expr` re-exported.
- `DataFrame.sort_values` removed; `DataFrame.sort(*keys)` added.
Module-level import of `SortExpr` keeps the lazy-import policy
consistent with the rest of `dataframe.py`.
Tests: 14 in `test_dataframe_sort.py` covering string/Expr/SortExpr
keys, mixed-direction multi-key, computed-Expr keys, nulls-last
default in both directions, `sort_expr(nulls_first=True)`, lazy
return type, and the empty/bad-type/unknown-column error paths.
9 new tests in `test_expression.py` lock the exact `repr()` of
`SortExpr` for asc/desc, both null placements, the factory, and
the type-guard rejections.
Closes the design discussion on apache#859.
d88b91f to
d17eccc
Compare
Redesigned per @paleolimbot's review on the earlier draft — replacing pandas-style
sort_values(by=, ascending=)with the composable API pattern from DataFusion-python / DuckDB / Ibis.Continues Phase P2 of #791.
API
DataFrame.sortaccepts each key asstr | Expr | SortExpr. Strings and bareExprauto-promote to ascending keys with nulls last;Expr.asc()/Expr.desc()cover the common direction switch;sedonadb.expr.sort_expr(expr, asc=..., nulls_first=...)exposes the full DataFusionSortExprknobs.What's gone vs. the earlier draft
sort_values()— removed entirely. Per the discussion: no pandas-compat duplicate at this layer; that can come later in a dedicated GeoPandas-compat surface.by=/ascending=keyword args — replaced by varargs (formats better with Ruff and matches the three reference interfaces).Null placement
Expr.asc()andExpr.desc()default tonulls_first=False(nulls last) regardless of direction — matches the previous draft's pandas-style default and what the rest of the SedonaDB Python API ships. SQL-style nulls-first-on-descending is available viasort_expr(col("x"), asc=False, nulls_first=True).Implementation
python/sedonadb/src/expr.rsPySortExprPyO3 class wrappingdatafusion_expr::SortExpr.Expr.asc(nulls_first)/.desc(nulls_first)methods.expr_sort_expr(expr, asc, nulls_first)factory.python/sedonadb/src/dataframe.rsInternalDataFrame::sort(Vec<PySortExpr>); oldsort_by_keysgone.python/sedonadb/src/lib.rsPySortExpr+expr_sort_expr.python/sedonadb/python/sedonadb/expr/expression.pySortExprclass,sort_expr()factory,Expr.asc/.descmethods.python/sedonadb/python/sedonadb/expr/__init__.pySortExpr,sort_expr.python/sedonadb/python/sedonadb/dataframe.pyDataFrame.sort(*keys). Module-levelSortExprimport (no lazy imports, per the policy locked on #852).Tests
tests/expr/test_dataframe_sort.py— 14 tests covering string/Expr/SortExpr keys, mixed-direction multi-key, computed-Expr keys, nulls-last default in both directions,sort_expr(nulls_first=True), lazy return type (isinstance(out, DataFrame)), and the empty / bad-type / unknown-column error paths.tests/expr/test_expression.py— 9 new tests pinning the exactrepr()ofSortExprfor asc / desc / both null placements / the factory / both type-guard rejections.Local: 97 unit + 29 doctests +
ruff format+ruff checkall clean.