feat(python/sedonadb): add DataFrame.filter / DataFrame.where #835

Draft
jiayuasu wants to merge 2 commits into apache:main from jiayuasu:feature/df-filter

Conversation

Member

@jiayuasu jiayuasu commented May 12, 2026

Wire the Expr layer into the existing lazy DataFrame so users can filter rows without writing SQL predicates.

This is the fourth and final small PR completing Phase P1 of #791, building on #807 (Expr foundation), #823 (operators), and #832 (DataFrame.select).

What's new

from sedonadb.expr import col

df.filter(col("x") > 0)
df.filter(col("x") > 0, col("y") < 10)              # multiple preds → AND
df.filter((col("x") > 0) & col("y").is_not_null())  # explicit & also fine
df.filter(col("x") > 1).filter(col("x") < 4)        # chained: two filter nodes

Multiple predicates are AND-combined into a single composed Expr via DataFusion's conjunction helper, giving the optimizer one conjunction to reason about rather than N stacked filter nodes. Chained .filter().filter() calls produce two filter nodes in the plan but the same result.
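The AND-combination can be sketched in plain Python with a toy expression type (an illustration only, not the actual sedonadb Expr class or DataFusion's conjunction helper):

```python
from functools import reduce

class ToyExpr:
    """Minimal stand-in for a predicate expression supporting `&`."""
    def __init__(self, fn, label):
        self.fn = fn        # row -> bool
        self.label = label  # printable form of the predicate

    def __and__(self, other):
        return ToyExpr(
            lambda row: self.fn(row) and other.fn(row),
            f"({self.label} AND {other.label})",
        )

def combine(preds):
    # Same shape as the conjunction fold: left-fold the predicates with AND,
    # yielding one composed expression rather than N stacked filter nodes.
    return reduce(lambda a, b: a & b, preds)

combined = combine([
    ToyExpr(lambda r: r["x"] > 0, "x > 0"),
    ToyExpr(lambda r: r["y"] < 10, "y < 10"),
])
print(combined.label)  # (x > 0 AND y < 10)
```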

Decision worth flagging — rejecting Literal 🛎️ cc @paleolimbot

filter() rejects both str and Literal arguments at the Python boundary.

  • Strings are not interpreted as SQL predicates — that's a separate query()-style feature, not this PR's surface.
  • Literal is rejected because filter(lit(True)) (or any filter(lit(value))) is almost always a typo: DataFusion would silently accept WHERE 7 as truthy, producing a filter that looks present in the source but is actually a no-op. If a user genuinely wants a constant predicate they can write col("flag") == lit(True), which forces the intent.

This is more restrictive than select, which does accept Literal. The asymmetry is intentional: a literal as a projection is meaningful (a constant column), but a literal as a filter is almost never what you want. Happy to revisit if you'd rather have filter(lit(...)) accepted and the resulting no-op surface as an optimizer-time warning instead.

Implementation

Rust: InternalDataFrame::filter(Vec<PyExpr>) — empty list errors at the Python boundary; multi-element lists are folded via datafusion_expr::utils::conjunction, matching the pattern used elsewhere in the workspace (sedona-pointcloud, sedona-query-planner). Step-by-step comments explain the conjunction choice.

Python: DataFrame.filter(*exprs: "Expr") -> DataFrame with explicit isinstance checks rejecting str, Literal, and other types.
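The boundary validation can be sketched roughly as follows; the class names and error wording here are placeholders, not the actual sedonadb source:

```python
# Hypothetical sketch of the Python-boundary checks; names are assumptions.
class Expr:
    pass

class Literal(Expr):
    pass

def check_filter_args(*exprs):
    if not exprs:
        raise ValueError("filter() requires at least one predicate")
    for e in exprs:
        if isinstance(e, str):
            # Strings are not interpreted as SQL predicates here.
            raise TypeError("filter() does not accept SQL strings; pass Expr predicates")
        if isinstance(e, Literal):
            # A bare literal is almost always a typo; force explicit intent.
            raise TypeError("bare Literal rejected; write col('flag') == lit(True) instead")
        if not isinstance(e, Expr):
            raise TypeError(f"expected Expr, got {type(e).__name__}")
    return exprs
```

Note that the Literal check must run before the general Expr acceptance, since a Literal instance would otherwise pass the isinstance(e, Expr) check.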

Test plan

12 tests in tests/expr/test_dataframe_filter.py:

  • Positive: simple predicate, multi-AND, explicit &, |, ~, isin, chained filter().filter() matches multi-AND.
  • Lazy: filter returns a DataFrame without materializing.
  • Errors: empty filter → ValueError; string arg → TypeError; Literal arg → TypeError with the actionable suggestion; unknown column → SedonaError whose message includes the valid field names.

Each test builds its own input df inline from the smallest pd.DataFrame({...}) that exercises the predicate. Existing select/expression/literal tests all still pass. ruff format and doctests both clean.
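Two of the contracts above, laziness and chained-equals-multi, can be illustrated with a toy lazy frame (a standalone sketch, not the DataFusion-backed implementation):

```python
class LazyFrame:
    def __init__(self, rows, preds=()):
        self.rows = rows
        self.preds = tuple(preds)

    def filter(self, *preds):
        # Lazy: no rows are scanned here, only the plan grows.
        return LazyFrame(self.rows, self.preds + preds)

    def collect(self):
        return [r for r in self.rows if all(p(r) for p in self.preds)]

rows = [{"x": 1}, {"x": 2}, {"x": 3}, {"x": 4}]
multi = LazyFrame(rows).filter(lambda r: r["x"] > 1, lambda r: r["x"] < 4)
chained = LazyFrame(rows).filter(lambda r: r["x"] > 1).filter(lambda r: r["x"] < 4)
assert multi.collect() == chained.collect() == [{"x": 2}, {"x": 3}]
```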

What's next

This completes the issue-tracker Phase P1 (Expr layer + select/filter on DataFrame). Phase P2 in the design doc covers Series / Scalar, groupby, merge, the curated sd.st module, and the cookbook track.

Wire the Expr layer into the existing lazy DataFrame so users can
filter rows without writing SQL predicates.

- `DataFrame.filter(*exprs)` accepts one or more `Expr` predicates.
  Multiple predicates are combined with logical AND, so
  `df.filter(a, b)` is equivalent to `df.filter(a & b)` and to
  `df.filter(a).filter(b)` (the first two produce one filter node
  in the plan, the third produces two).
- `DataFrame.where = filter` aliases the same path.
- Strings and bare `Literal` values are both rejected at the Python
  boundary. Strings would be SQL predicates (a separate feature).
  `filter(lit(True))` is almost certainly a typo; if you really want
  a constant predicate, wrap a column expression like
  `col("flag") == lit(True)`. DataFusion would silently accept
  `WHERE 7` as truthy, so catching at the boundary avoids a
  no-op-that-looks-like-a-filter bug.

Rust side: `InternalDataFrame::filter` AND-combines all predicates
with `Expr::and` and hands the single composed Expr to DataFusion's
`DataFrame::filter`, giving the optimizer one conjunction to reason
about rather than N stacked filter nodes.

Tests assert exact output via `pd.testing.assert_frame_equal` and
lock the contract that the unknown-column error message lists valid
field names (DataFusion provides this in plan-build).
Contributor

Copilot AI left a comment


Pull request overview

Adds pandas-style row filtering to the Python sedonadb.DataFrame API by wiring the existing Expr layer into the lazy DataFusion-backed DataFrame, including a .where() alias.

Changes:

  • Add DataFrame.filter(*exprs: Expr) and DataFrame.where (alias), with Python-side type checks and clear error paths (empty args, string args, Literal args).
  • Add Rust InternalDataFrame::filter(Vec<PyExpr>) that AND-combines multiple predicates into a single DataFusion filter expression.
  • Add a dedicated pytest module covering positive behavior, laziness, and key error contracts.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated no comments.

File / Description

  • python/sedonadb/python/sedonadb/dataframe.py — Adds the DataFrame.filter() implementation plus the where alias, with Python-boundary validation and docs.
  • python/sedonadb/src/dataframe.rs — Exposes InternalDataFrame.filter() and folds multi-predicate calls via Expr::and before calling DataFusion.
  • python/sedonadb/tests/expr/test_dataframe_filter.py — Introduces unit tests for filter/where behavior, composition, laziness, and error messages.


Comment on lines +200 to +201

where = filter
Member


Can we not add an alias for filter here? We can add PySpark compatibility separately later if there is interest.

Suggested change
where = filter

Member Author


Done in 92afb9b — where = filter is gone, and test_where_alias_produces_same_output along with it. The PR description has been updated to drop the where mentions.

Comment thread python/sedonadb/src/dataframe.rs Outdated
Comment on lines +145 to +149
let mut iter = exprs.into_iter().map(|e| e.inner);
let mut combined = iter.next().unwrap();
for next in iter {
    combined = combined.and(next);
}
Member


There's a helper for this in DataFusion (I think it's called conjunction!(); there may be one for both physical and logical expressions and you'll need the logical one).

Member Author


Switched in 92afb9b — now using datafusion_expr::utils::conjunction, matching the pattern in sedona-pointcloud and sedona-query-planner. unwrap/expect is safe because the empty case is already rejected at the Python boundary.
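For reference, the helper's behavior can be sketched in plain Python (a behavioral sketch only; the real helper operates on DataFusion logical Exprs):

```python
from functools import reduce

def conjunction(preds):
    # Fold a list of predicates with AND, yielding None for the empty list.
    # This is why the unwrap is safe once the Python boundary rejects
    # empty input.
    if not preds:
        return None
    return reduce(lambda a, b: ("AND", a, b), preds)

assert conjunction([]) is None
assert conjunction(["p"]) == "p"
assert conjunction(["p", "q", "r"]) == ("AND", ("AND", "p", "q"), "r")
```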

Comment on lines +34 to +37
def _xy_df(con):
return con.create_data_frame(
pd.DataFrame({"x": [1, 2, 3, 4], "y": [10, 20, 30, 40]})
)
Member


Can you inline this into each test? Probably pd.DataFrame({"x": [1, 2, 3]}) would be sufficient.

Member Author


Done in 92afb9b — the _xy_df helper is removed; each test builds its own df inline from the smallest pd.DataFrame({...}) sufficient for the assertion. Most tests use {"x": [1, 2, 3]}; tests that need a second column use {"x": ..., "y": ...}.

- Remove the `where = filter` class-level alias. PySpark-compat naming
  can land later if there's demand; not part of this layer's scope.
- Replace the manual left-fold of `Expr::and` in
  `InternalDataFrame::filter` with DataFusion's `conjunction` helper
  from `datafusion_expr::utils`. Same semantics, fewer lines, and
  matches the pattern used elsewhere in the workspace
  (`sedona-pointcloud`, `sedona-query-planner`).
- Drop the `_xy_df` helper in `test_dataframe_filter.py`; each test
  now builds its own `df` via `con.create_data_frame(pd.DataFrame(...))`
  inline, with the smallest data sufficient for the assertion. Same
  "full context in the failing test function" rationale as the select
  PR.
- Drop `test_where_alias_produces_same_output` along with the alias.
