feat(python/sedonadb): add DataFrame.filter / DataFrame.where #835
jiayuasu wants to merge 2 commits into
Conversation
Wire the Expr layer into the existing lazy DataFrame so users can
filter rows without writing SQL predicates.
- `DataFrame.filter(*exprs)` accepts one or more `Expr` predicates.
Multiple predicates are combined with logical AND, so
`df.filter(a, b)` is equivalent to `df.filter(a & b)` and to
`df.filter(a).filter(b)` (the first two produce one filter node
in the plan, the third produces two).
- `DataFrame.where = filter` aliases the same path.
- Strings and bare `Literal` values are both rejected at the Python
boundary. Strings would be SQL predicates (a separate feature).
`filter(lit(True))` is almost certainly a typo; if you really want
a constant predicate, wrap a column expression like
`col("flag") == lit(True)`. DataFusion would silently accept
`WHERE 7` as truthy, so catching at the boundary avoids a
no-op-that-looks-like-a-filter bug.
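The boundary checks described above can be sketched as follows. This is a hypothetical illustration, not sedonadb's actual code: the `Expr`/`Literal` classes and the `validate_filter_args` helper are stand-ins invented for the example.

```python
# Hypothetical sketch of the Python-boundary validation described above.
# Expr and Literal are toy stand-ins for sedonadb's real classes.

class Expr:
    """Stand-in for sedonadb's expression wrapper."""

class Literal(Expr):
    """Stand-in for a bare literal expression (the result of lit(...))."""

def validate_filter_args(*exprs):
    """Reject the argument shapes filter() refuses at the boundary."""
    if not exprs:
        raise ValueError("filter() requires at least one predicate")
    for e in exprs:
        if isinstance(e, str):
            raise TypeError(
                "filter() does not accept SQL strings; pass an Expr predicate"
            )
        if isinstance(e, Literal):
            raise TypeError(
                "bare Literal predicate is almost certainly a typo; "
                "wrap a column instead, e.g. col('flag') == lit(True)"
            )
        if not isinstance(e, Expr):
            raise TypeError(f"expected Expr, got {type(e).__name__}")
    return exprs
```

Checking `Literal` before the generic `Expr` check matters, since a literal is itself an expression and would otherwise slip through.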
Rust side: `InternalDataFrame::filter` AND-combines all predicates
with `Expr::and` and hands the single composed Expr to DataFusion's
`DataFrame::filter`, giving the optimizer one conjunction to reason
about rather than N stacked filter nodes.
Tests assert exact output via `pd.testing.assert_frame_equal` and
lock the contract that the unknown-column error message lists the
valid field names (DataFusion provides this at plan-build time).
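The exact-output assertion style looks roughly like the sketch below, using plain pandas boolean filtering as a stand-in for the real `df.filter(...).to_pandas()` pipeline (the sedonadb connection and `col`/`lit` expressions are not reproduced here):

```python
import pandas as pd

def test_filter_exact_output_sketch():
    # Stand-in: pandas boolean indexing plays the role of
    # df.filter(col("x") > lit(2)).to_pandas() in the real suite.
    df = pd.DataFrame({"x": [1, 2, 3, 4]})
    result = df[df["x"] > 2].reset_index(drop=True)
    expected = pd.DataFrame({"x": [3, 4]})
    # assert_frame_equal checks values, dtypes, and index exactly,
    # so the test locks the full output contract rather than a sample.
    pd.testing.assert_frame_equal(result, expected)
```

Resetting the index keeps the comparison about row content rather than surviving positional labels.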
Pull request overview
Adds pandas-style row filtering to the Python sedonadb.DataFrame API by wiring the existing Expr layer into the lazy DataFusion-backed DataFrame, including a .where() alias.
Changes:
- Add `DataFrame.filter(*exprs: Expr)` and `DataFrame.where` (alias), with Python-side type checks and clear error paths (empty args, string args, `Literal` args).
- Add Rust `InternalDataFrame::filter(Vec<PyExpr>)` that AND-combines multiple predicates into a single DataFusion filter expression.
- Add a dedicated pytest module covering positive behavior, laziness, and key error contracts.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| python/sedonadb/python/sedonadb/dataframe.py | Adds DataFrame.filter() implementation + where alias with Python-boundary validation and docs. |
| python/sedonadb/src/dataframe.rs | Exposes InternalDataFrame.filter() and folds multi-predicate calls via Expr::and before calling DataFusion. |
| python/sedonadb/tests/expr/test_dataframe_filter.py | Introduces unit tests for filter/where behavior, composition, laziness, and error messages. |
```python
where = filter
```

Can we not add an alias for filter here? We can specifically add PySpark compatibility at a different time if there is interest.

Done in 92afb9b — `where = filter` is gone and `test_where_alias_produces_same_output` along with it. The PR description has been updated to drop the `where` mentions.
```rust
let mut iter = exprs.into_iter().map(|e| e.inner);
let mut combined = iter.next().unwrap();
for next in iter {
    combined = combined.and(next);
}
```
There's a helper for this in DataFusion (I think it's called conjunction!(); there may be one for both physical and logical expressions and you'll need the logical one).
Switched in 92afb9b — now using datafusion_expr::utils::conjunction, matching the pattern in sedona-pointcloud and sedona-query-planner. unwrap/expect is safe because the empty case is already rejected at the Python boundary.
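The shape of that helper (a left-fold with AND that yields nothing for an empty input) can be sketched in Python. This is a toy mirror of `datafusion_expr::utils::conjunction`, not DataFusion's type; the tuple-based expression encoding is invented for the example.

```python
from functools import reduce

def conjunction(predicates):
    """Toy mirror of datafusion_expr::utils::conjunction:
    left-fold predicates with AND; return None when empty
    (the real helper returns Option<Expr> for the same reason)."""
    predicates = list(predicates)
    if not predicates:
        return None
    # Left-associated fold: [a, b, c] -> (a AND b) AND c,
    # the same tree the manual loop it replaced produced.
    return reduce(lambda acc, p: ("AND", acc, p), predicates)

assert conjunction(["a", "b", "c"]) == ("AND", ("AND", "a", "b"), "c")
assert conjunction([]) is None
```

The `None` case is why the caller still needs an unwrap/expect: here it is safe only because the Python boundary has already rejected empty argument lists.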
```python
def _xy_df(con):
    return con.create_data_frame(
        pd.DataFrame({"x": [1, 2, 3, 4], "y": [10, 20, 30, 40]})
    )
```
Can you inline this into each test? Probably pd.DataFrame({"x": [1, 2, 3]}) would be sufficient.
Done in 92afb9b — _xy_df helper removed; each test builds its own df inline from the smallest pd.DataFrame({...}) sufficient for the assertion. Most tests use {"x": [1, 2, 3]}; tests that need a second column use {"x": ..., "y": ...}.
- Remove the `where = filter` class-level alias. PySpark-compat naming can land later if there's demand; not part of this layer's scope.
- Replace the manual left-fold of `Expr::and` in `InternalDataFrame::filter` with DataFusion's `conjunction` helper from `datafusion_expr::utils`. Same semantics, fewer lines, and matches the pattern used elsewhere in the workspace (`sedona-pointcloud`, `sedona-query-planner`).
- Drop the `_xy_df` helper in `test_dataframe_filter.py`; each test now builds its own `df` via `con.create_data_frame(pd.DataFrame(...))` inline, with the smallest data sufficient for the assertion. Same "full context in the failing test function" rationale as the select PR.
- Drop `test_where_alias_produces_same_output` along with the alias.
Wire the `Expr` layer into the existing lazy `DataFrame` so users can filter rows without writing SQL predicates. This is the fourth and final small PR completing Phase P1 of #791, building on #807 (Expr foundation), #823 (operators), and #832 (`DataFrame.select`).

What's new

Multiple predicates are AND-combined into a single composed `Expr` via DataFusion's `conjunction` helper, giving the optimizer one conjunction to reason about rather than N stacked filter nodes. Chained `.filter().filter()` calls produce two filter nodes in the plan but the same result.

Decision worth flagging — rejecting `Literal`

`filter()` rejects both `str` and `Literal` arguments at the Python boundary. Strings would be a `query()`-style feature, not this PR's surface. `Literal` is rejected because `filter(lit(True))` (or any `filter(lit(value))`) is almost always a typo: DataFusion would silently accept `WHERE 7` as truthy, producing a filter that looks present in the source but is actually a no-op. If a user genuinely wants a constant predicate they can write `col("flag") == lit(True)`, which forces the intent.

This is more restrictive than `select`, which does accept `Literal`. The asymmetry is intentional: a literal as a projection is meaningful (a constant column), but a literal as a filter is almost never what you want. Happy to revisit if you'd rather have `filter(lit(...))` accepted and the resulting no-op surface as an optimizer-time warning instead.

Implementation

Rust: `InternalDataFrame::filter(Vec<PyExpr>)` — the empty list errors at the Python boundary; multi-element lists are folded via `datafusion_expr::utils::conjunction`, matching the pattern used elsewhere in the workspace (`sedona-pointcloud`, `sedona-query-planner`). Step-by-step comments explain the conjunction choice.

Python: `DataFrame.filter(*exprs: "Expr") -> DataFrame` with explicit `isinstance` checks rejecting `str`, `Literal`, and other types.

Test plan

12 tests in `tests/expr/test_dataframe_filter.py`:
- Comparison and boolean predicates (`&`, `|`, `~`, `isin`); chained `filter().filter()` matches multi-AND.
- Laziness: filtering returns a `DataFrame` without materializing.
- Error contracts: empty args → `ValueError`; string arg → `TypeError`; `Literal` arg → `TypeError` with the actionable suggestion; unknown column → `SedonaError` whose message includes the valid field names.

Each test builds its own input `df` inline from the smallest `pd.DataFrame({...})` that exercises the predicate. Existing select/expression/literal tests all still pass. `ruff format` and doctests are both clean.

What's next

This completes the issue-tracker Phase P1 (Expr layer + select/filter on DataFrame). Phase P2 in the design doc covers `Series`/`Scalar`, `groupby`, `merge`, the curated `sd.st` module, and the cookbook track.

🛎️ cc @paleolimbot