feat(python/sedonadb): add DataFrame.filter / DataFrame.where #835

Draft
jiayuasu wants to merge 2 commits into apache:main from jiayuasu:feature/df-filter

Conversation

Member

@jiayuasu jiayuasu commented May 12, 2026

Wire the Expr layer into the existing lazy DataFrame so users can filter rows without writing SQL predicates.

This is the fourth and final small PR completing Phase P1 of #791, building on #807 (Expr foundation), #823 (operators), and #832 (DataFrame.select).

What's new

from sedonadb.expr import col

df.filter(col("x") > 0)
df.filter(col("x") > 0, col("y") < 10)              # multiple preds → AND
df.filter((col("x") > 0) & col("y").is_not_null())  # explicit & also fine
df.filter(col("x") > 1).filter(col("x") < 4)        # chained: two filter nodes

Multiple predicates are AND-combined into a single composed Expr via DataFusion's conjunction helper, giving the optimizer one conjunction to reason about rather than N stacked filter nodes. Chained .filter().filter() calls produce two filter nodes in the plan but the same result.
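The AND-combination can be sketched in plain Python with a toy expression type (an illustration only, not the actual sedonadb Expr class or DataFusion's conjunction helper):

```python
from functools import reduce

class ToyExpr:
    """Minimal stand-in for a predicate expression supporting `&`."""
    def __init__(self, fn, label):
        self.fn = fn        # row -> bool
        self.label = label  # printable form of the predicate

    def __and__(self, other):
        return ToyExpr(
            lambda row: self.fn(row) and other.fn(row),
            f"({self.label} AND {other.label})",
        )

def combine(preds):
    # Same shape as the conjunction fold: left-fold the predicates with AND,
    # yielding one composed expression rather than N stacked filter nodes.
    return reduce(lambda a, b: a & b, preds)

combined = combine([
    ToyExpr(lambda r: r["x"] > 0, "x > 0"),
    ToyExpr(lambda r: r["y"] < 10, "y < 10"),
])
print(combined.label)  # (x > 0 AND y < 10)
```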

Decision worth flagging — rejecting Literal 🛎️ cc @paleolimbot

filter() rejects both str and Literal arguments at the Python boundary.

  • Strings are not interpreted as SQL predicates — that's a separate query()-style feature, not this PR's surface.
  • Literal is rejected because filter(lit(True)) (or any filter(lit(value))) is almost always a typo: DataFusion would silently accept WHERE 7 as truthy, producing a filter that looks present in the source but is actually a no-op. If a user genuinely wants a constant predicate they can write col("flag") == lit(True), which forces the intent.

This is more restrictive than select, which does accept Literal. The asymmetry is intentional: a literal as a projection is meaningful (a constant column), but a literal as a filter is almost never what you want. Happy to revisit if you'd rather have filter(lit(...)) accepted and the resulting no-op surface as an optimizer-time warning instead.

Implementation

Rust: InternalDataFrame::filter(Vec<PyExpr>) — empty list errors at the Python boundary; multi-element lists are folded via datafusion_expr::utils::conjunction, matching the pattern used elsewhere in the workspace (sedona-pointcloud, sedona-query-planner). Step-by-step comments explain the conjunction choice.

Python: DataFrame.filter(*exprs: "Expr") -> DataFrame with explicit isinstance checks rejecting str, Literal, and other types.
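The boundary validation can be sketched roughly as follows; the class names and error wording here are placeholders, not the actual sedonadb source:

```python
# Hypothetical sketch of the Python-boundary checks; names are assumptions.
class Expr:
    pass

class Literal(Expr):
    pass

def check_filter_args(*exprs):
    if not exprs:
        raise ValueError("filter() requires at least one predicate")
    for e in exprs:
        if isinstance(e, str):
            # Strings are not interpreted as SQL predicates here.
            raise TypeError("filter() does not accept SQL strings; pass Expr predicates")
        if isinstance(e, Literal):
            # A bare literal is almost always a typo; force explicit intent.
            raise TypeError("bare Literal rejected; write col('flag') == lit(True) instead")
        if not isinstance(e, Expr):
            raise TypeError(f"expected Expr, got {type(e).__name__}")
    return exprs
```

Note that the Literal check must run before the general Expr acceptance, since a Literal instance would otherwise pass the isinstance(e, Expr) check.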

Test plan

12 tests in tests/expr/test_dataframe_filter.py:

  • Positive: simple predicate, multi-AND, explicit &, |, ~, isin, chained filter().filter() matches multi-AND.
  • Lazy: filter returns a DataFrame without materializing.
  • Errors: empty filter → ValueError; string arg → TypeError; Literal arg → TypeError with the actionable suggestion; unknown column → SedonaError whose message includes the valid field names.

Each test builds its own input df inline from the smallest pd.DataFrame({...}) that exercises the predicate. Existing select/expression/literal tests all still pass. ruff format and doctests both clean.
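Two of the contracts above, laziness and chained-equals-multi, can be illustrated with a toy lazy frame (a standalone sketch, not the DataFusion-backed implementation):

```python
class LazyFrame:
    def __init__(self, rows, preds=()):
        self.rows = rows
        self.preds = tuple(preds)

    def filter(self, *preds):
        # Lazy: no rows are scanned here, only the plan grows.
        return LazyFrame(self.rows, self.preds + preds)

    def collect(self):
        return [r for r in self.rows if all(p(r) for p in self.preds)]

rows = [{"x": 1}, {"x": 2}, {"x": 3}, {"x": 4}]
multi = LazyFrame(rows).filter(lambda r: r["x"] > 1, lambda r: r["x"] < 4)
chained = LazyFrame(rows).filter(lambda r: r["x"] > 1).filter(lambda r: r["x"] < 4)
assert multi.collect() == chained.collect() == [{"x": 2}, {"x": 3}]
```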

What's next

This completes the issue-tracker Phase P1 (Expr layer + select/filter on DataFrame). Phase P2 in the design doc covers Series / Scalar, groupby, merge, the curated sd.st module, and the cookbook track.

Wire the Expr layer into the existing lazy DataFrame so users can
filter rows without writing SQL predicates.

- `DataFrame.filter(*exprs)` accepts one or more `Expr` predicates.
  Multiple predicates are combined with logical AND, so
  `df.filter(a, b)` is equivalent to `df.filter(a & b)` and to
  `df.filter(a).filter(b)` (the first two produce one filter node
  in the plan, the third produces two).
- `DataFrame.where = filter` aliases the same path.
- Strings and bare `Literal` values are both rejected at the Python
  boundary. Strings would be SQL predicates (a separate feature).
  `filter(lit(True))` is almost certainly a typo; if you really want
  a constant predicate, wrap a column expression like
  `col("flag") == lit(True)`. DataFusion would silently accept
  `WHERE 7` as truthy, so catching at the boundary avoids a
  no-op-that-looks-like-a-filter bug.

Rust side: `InternalDataFrame::filter` AND-combines all predicates
with `Expr::and` and hands the single composed Expr to DataFusion's
`DataFrame::filter`, giving the optimizer one conjunction to reason
about rather than N stacked filter nodes.

Tests assert exact output via `pd.testing.assert_frame_equal` and
lock the contract that the unknown-column error message lists valid
field names (DataFusion provides this in plan-build).
Contributor

Copilot AI left a comment


Pull request overview

Adds pandas-style row filtering to the Python sedonadb.DataFrame API by wiring the existing Expr layer into the lazy DataFusion-backed DataFrame, including a .where() alias.

Changes:

  • Add DataFrame.filter(*exprs: Expr) and DataFrame.where (alias), with Python-side type checks and clear error paths (empty args, string args, Literal args).
  • Add Rust InternalDataFrame::filter(Vec<PyExpr>) that AND-combines multiple predicates into a single DataFusion filter expression.
  • Add a dedicated pytest module covering positive behavior, laziness, and key error contracts.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated no comments.

File / Description

  • python/sedonadb/python/sedonadb/dataframe.py — Adds the DataFrame.filter() implementation plus the where alias, with Python-boundary validation and docs.
  • python/sedonadb/src/dataframe.rs — Exposes InternalDataFrame.filter() and folds multi-predicate calls via Expr::and before calling DataFusion.
  • python/sedonadb/tests/expr/test_dataframe_filter.py — Introduces unit tests for filter/where behavior, composition, laziness, and error messages.


Comment on lines +200 to +201

where = filter
Member


Can we not add an alias for filter here? We can add PySpark compatibility separately later if there is interest.

Suggested change
where = filter

Member Author


Done in 92afb9b — where = filter is gone, and test_where_alias_produces_same_output along with it. The PR description has been updated to drop the where mentions.

Comment thread python/sedonadb/src/dataframe.rs Outdated
Comment on lines +145 to +149
let mut iter = exprs.into_iter().map(|e| e.inner);
let mut combined = iter.next().unwrap();
for next in iter {
    combined = combined.and(next);
}
Member


There's a helper for this in DataFusion (I think it's called conjunction!(); there may be one for both physical and logical expressions and you'll need the logical one).

Member Author


Switched in 92afb9b — now using datafusion_expr::utils::conjunction, matching the pattern in sedona-pointcloud and sedona-query-planner. unwrap/expect is safe because the empty case is already rejected at the Python boundary.
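For reference, the helper's behavior can be sketched in plain Python (a behavioral sketch only; the real helper operates on DataFusion logical Exprs):

```python
from functools import reduce

def conjunction(preds):
    # Fold a list of predicates with AND, yielding None for the empty list.
    # This is why the unwrap is safe once the Python boundary rejects
    # empty input.
    if not preds:
        return None
    return reduce(lambda a, b: ("AND", a, b), preds)

assert conjunction([]) is None
assert conjunction(["p"]) == "p"
assert conjunction(["p", "q", "r"]) == ("AND", ("AND", "p", "q"), "r")
```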

Comment on lines +34 to +37
def _xy_df(con):
return con.create_data_frame(
pd.DataFrame({"x": [1, 2, 3, 4], "y": [10, 20, 30, 40]})
)
Member


Can you inline this into each test? Probably pd.DataFrame({"x": [1, 2, 3]}) would be sufficient.

Member Author


Done in 92afb9b — the _xy_df helper is removed; each test builds its own df inline from the smallest pd.DataFrame({...}) sufficient for the assertion. Most tests use {"x": [1, 2, 3]}; tests that need a second column use {"x": ..., "y": ...}.

- Remove the `where = filter` class-level alias. PySpark-compat naming
  can land later if there's demand; not part of this layer's scope.
- Replace the manual left-fold of `Expr::and` in
  `InternalDataFrame::filter` with DataFusion's `conjunction` helper
  from `datafusion_expr::utils`. Same semantics, fewer lines, and
  matches the pattern used elsewhere in the workspace
  (`sedona-pointcloud`, `sedona-query-planner`).
- Drop the `_xy_df` helper in `test_dataframe_filter.py`; each test
  now builds its own `df` via `con.create_data_frame(pd.DataFrame(...))`
  inline, with the smallest data sufficient for the assertion. Same
  "full context in the failing test function" rationale as the select
  PR.
- Drop `test_where_alias_produces_same_output` along with the alias.
