feat(python/sedonadb): add DataFrame.join (common-key equi-join) #908
jiayuasu wants to merge 1 commit into
Pull request overview
Adds a pandas/Polars/PySpark-style DataFrame.join() API to SedonaDB’s Python DataFrame layer, implementing common-key equi-joins with pandas-shaped output (single copy of join keys) by aliasing both inputs and projecting a de-duplicated schema after the DataFusion join.
Changes:
- Added a Rust `InternalDataFrame::join(...)` wrapper that maps `how` strings to DataFusion `JoinType` and calls `DataFrame::join`.
- Added Python `DataFrame.join(other, on, how)` that normalizes/validates inputs, performs internal aliasing, and projects to dedupe join keys (including `COALESCE` for outer joins).
- Added a new Python test module covering core join behaviors and input validation.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| python/sedonadb/tests/expr/test_dataframe_join.py | New test suite for DataFrame.join() behavior and validation. |
| python/sedonadb/src/dataframe.rs | Rust-side join binding for Python InternalDataFrame. |
| python/sedonadb/python/sedonadb/dataframe.py | Public DataFrame.join() implementation with alias-and-project dedup logic. |
      use datafusion::logical_expr::SortExpr;
      use datafusion::prelude::{DataFrame, SessionContext};
    - use datafusion_common::{Column, DataFusionError, ParamValues};
    + use datafusion_common::{Column, DataFusionError, JoinType, ParamValues};
      use datafusion_execution::TaskContextProvider;
      use datafusion_expr::{ExplainFormat, ExplainOption, Expr};
Moved in ef61ea6 — JoinType is now imported from datafusion_expr to match the dominant convention in the workspace (8 sites already use that path).
    # that side's key; COALESCE picks the populated one.
    from sedonadb.expr.expression import ScalarUdf

    coalesce_udf = ScalarUdf(self._ctx.scalar_udf("coalesce"))
self._ctx in this DataFrame class is actually the internal context handle (_lib.InternalContext), not the user-facing SedonaContext Python class — and InternalContext exposes scalar_udf(name) (added in #885). The outer-join test test_join_outer covers this code path and passes. Added a clarifying comment at the call site in ef61ea6 since the naming is genuinely confusing here.
For what it's worth I updated this in #901 (it was confusing there, too)
    elif how in ("right_semi", "right_anti"):
        projection = [_col(c, RIGHT_ALIAS)._impl for c in right_cols]
Added test_join_right_semi and test_join_right_anti in ef61ea6, parallel to the existing left-semi/anti cases.
First of the join sub-PRs from apache#791. Predicate-based / spatial joins land next; cross_join is a separate small follow-up.

API:

    df.join(other, on="k")
    df.join(other, on=["k1", "k2"], how="left")
    df.join(other, on="k", how="outer")

- `on` accepts `str` or `list[str]`: common column names that exist on both sides. Predicate `Expr` form is the next sub-PR.
- `how` is a string literal: `inner` (default), `left`, `right`, `outer`, `left_semi`, `left_anti`, `right_semi`, `right_anti`.
- Result has a single copy of each join key, matching pandas / Polars / PySpark, not DataFusion's DataFrame default (which keeps both copies).

Rust side: thin wrapper over DataFusion's `DataFrame::join`. Maps the `how` string to `JoinType`; passes the column lists through. No filter / residual predicate in this sub-PR.

Python wrapper does the heavy lifting to get the pandas-shaped output. DataFusion's DataFrame join (a) rejects two unaliased inputs that share column names because both default to the `?table?` qualifier, and (b) keeps both copies of the join key in the result. To match user expectations:

1. Alias both sides internally with sentinel qualifiers so the merged schema has no qualified-name collisions.
2. Run the join.
3. Project with fully qualified refs: left's full column list plus right's non-key columns. The unified join key comes from the left for inner/left, the right for right joins, and via COALESCE(left.k, right.k) for outer joins. The qualified col() projection strips qualifiers from the output names so users see the unqualified pandas shape.
4. Semi/anti joins skip the projection logic: DataFusion already drops the right (or left) columns, so we just take the surviving side.

Tests: 14 covering single/multi-key inner, left, right, outer, left_semi, left_anti, lazy return, and the type/empty/bad-how error paths.

Limitations to follow up later: non-key column-name collisions between left and right are not auto-suffixed (`_x`/`_y` like pandas); the duplicate names propagate and become ambiguous to reference. Documented as a deferred-suffix limitation.
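
The alias-and-project steps described in the commit message can be sketched in plain Python. This is only an illustration of the planning logic, not the code in the PR: `plan_join_projection` is a hypothetical helper that returns the projection as strings and never calls SedonaDB or DataFusion. It also assumes the unified key keeps the position it has in the left column list, which the commit message does not state explicitly.

```python
from typing import List, Tuple

# Sentinel qualifiers named in the PR description.
LEFT_ALIAS = "_sd_join_left_"
RIGHT_ALIAS = "_sd_join_right_"

# The `how` literals listed in the commit message.
JOIN_TYPES = {
    "inner", "left", "right", "outer",
    "left_semi", "left_anti", "right_semi", "right_anti",
}


def plan_join_projection(
    left_cols: List[str], right_cols: List[str], on: List[str], how: str
) -> List[Tuple[str, str]]:
    """Return (output_name, source_expression) pairs for the post-join projection.

    Hypothetical helper: it describes the plan as strings only.
    """
    if how not in JOIN_TYPES:
        raise ValueError(f"unsupported join type: {how!r}")

    # Semi/anti joins keep only the surviving side's columns.
    if how in ("left_semi", "left_anti"):
        return [(c, f"{LEFT_ALIAS}.{c}") for c in left_cols]
    if how in ("right_semi", "right_anti"):
        return [(c, f"{RIGHT_ALIAS}.{c}") for c in right_cols]

    keys = set(on)
    plan = []
    for c in left_cols:
        if c not in keys:
            plan.append((c, f"{LEFT_ALIAS}.{c}"))
        elif how == "right":
            # Right join: unmatched rows have NULL on the left, so read the right key.
            plan.append((c, f"{RIGHT_ALIAS}.{c}"))
        elif how == "outer":
            # Outer join: either side may be NULL, so take whichever is populated.
            plan.append((c, f"COALESCE({LEFT_ALIAS}.{c}, {RIGHT_ALIAS}.{c})"))
        else:
            # Inner and left joins: the left key is always populated.
            plan.append((c, f"{LEFT_ALIAS}.{c}"))

    # The right side contributes only its non-key columns.
    plan.extend((c, f"{RIGHT_ALIAS}.{c}") for c in right_cols if c not in keys)
    return plan
```

Calling `plan_join_projection(["k", "a"], ["k", "b"], ["k"], "outer")` yields one `k` entry backed by the `COALESCE` expression, followed by `a` from the left and `b` from the right, which is the single-key output shape the commit message describes.
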
First of the three join sub-PRs flagged in #791. Predicate-based / spatial joins are the next sub-PR; `cross_join` is a tiny follow-up after that.

API (this sub-PR)

- `on` accepts `str` or `list[str]`: common column names that exist on both sides. Predicate `Expr` form lands in sub-PR #2.
- `how` is a string literal: `inner` (default), `left`, `right`, `outer`, `left_semi`, `left_anti`, `right_semi`, `right_anti`. Cross join gets its own method.

Result has a single copy of each join key, matching pandas / Polars / PySpark rather than DataFusion's DataFrame default, which keeps both copies.
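
As a rough illustration of the input contract for `on` and `how`, the checks might look like the sketch below. The helper names `normalize_on` and `validate_how`, the exception types, and the messages are all assumptions made for this example; the PR's actual validation code is not shown in this conversation.

```python
from typing import List, Union

# The eight `how` literals listed in the PR description.
JOIN_TYPES = (
    "inner", "left", "right", "outer",
    "left_semi", "left_anti", "right_semi", "right_anti",
)


def normalize_on(on: Union[str, List[str]]) -> List[str]:
    """Normalize `on` to a non-empty list of column names (hypothetical helper)."""
    if isinstance(on, str):
        return [on]
    if not isinstance(on, list):
        raise TypeError(f"on must be str or list[str], got {type(on).__name__}")
    if not on:
        raise ValueError("on must name at least one column")
    for name in on:
        if not isinstance(name, str):
            raise TypeError(f"on entries must be str, got {type(name).__name__}")
    return list(on)


def validate_how(how: str) -> str:
    """Reject anything that is not one of the supported join-type literals."""
    if how not in JOIN_TYPES:
        raise ValueError(f"how must be one of {JOIN_TYPES}, got {how!r}")
    return how
```

Normalizing a bare string to a one-element list up front means the rest of the join logic only has to handle the list form.
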
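Worth flagging: the auto-dedup machinery

DataFusion's `DataFrame::join` has two behaviors that diverge from user expectations:

1. It rejects two unaliased inputs that share column names, because both default to the `?table?` qualifier; the merged schema has unresolvable collisions before the join even runs.
2. It keeps both copies of the join key in the result. The SQL `USING` parser does the dedup at parse time; the DataFrame API doesn't.

To get the pandas shape, the Python wrapper:

1. Aliases both sides with sentinel qualifiers (`_sd_join_left_` / `_sd_join_right_`).
2. Runs the join.
3. Projects with qualified `col(name, alias)` refs to dedupe the key columns and strip the sentinel qualifiers from the output.

The unified join-key column:

- inner / left joins: taken from the left side.
- right joins: taken from the right side.
- outer joins: `COALESCE(left.k, right.k)`, which picks the populated side for rows unmatched on either input.

Implementation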
- `python/sedonadb/src/dataframe.rs`: `InternalDataFrame::join(right, join_cols, how)`. Thin wrapper over DataFusion's `DataFrame::join`. Maps `how` strings to `JoinType`. No residual filter; that's sub-PR #2.
- `python/sedonadb/python/sedonadb/dataframe.py`: `DataFrame.join(other, on, how)`. Handles validation, alias-and-project for the pandas shape, COALESCE for outer-join keys.

Test plan
14 tests in `tests/expr/test_dataframe_join.py`:

- Join behavior: single/multi-key inner, left, right, outer, left_semi, left_anti.
- Lazy return: `isinstance(out, DataFrame)`.
- Error paths: invalid `other`; bad `on` type; empty `on` list; non-str element in `on`; invalid `how`; unknown column (DataFusion plan-build error).

All output assertions use exact `pd.testing.assert_frame_equal` after sorting.

Local: 14 unit + 24 doctests + `ruff format` + `ruff check` all clean.

Known limitations (for follow-up)
- Non-key column-name collisions between left and right are not auto-suffixed (`_x`/`_y` like pandas). The duplicate names propagate and become ambiguous to reference. Deferred to a later PR per the design discussion.
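
To make the limitation concrete, the sketch below works only on column-name lists. It shows which output names end up duplicated under the documented shape (left columns followed by right non-key columns) and what pandas-style `_x`/`_y` suffixing would produce instead. All three helpers are hypothetical and exist only for this illustration; none of them is part of SedonaDB.

```python
from typing import List, Sequence, Tuple


def colliding_names(left_cols: Sequence[str], right_cols: Sequence[str], on: Sequence[str]) -> List[str]:
    """Non-key names present on both sides; these appear twice in the join output."""
    keys = set(on)
    right_set = set(right_cols)
    return [c for c in left_cols if c not in keys and c in right_set]


def output_names(left_cols: Sequence[str], right_cols: Sequence[str], on: Sequence[str]) -> List[str]:
    """Output names as described in this PR: left columns, then right non-key columns."""
    keys = set(on)
    return list(left_cols) + [c for c in right_cols if c not in keys]


def suffixed_names(
    left_cols: Sequence[str],
    right_cols: Sequence[str],
    on: Sequence[str],
    suffixes: Tuple[str, str] = ("_x", "_y"),
) -> List[str]:
    """Names that pandas-style suffixing would yield (deferred, not done by this PR)."""
    keys = set(on)
    clash = set(colliding_names(left_cols, right_cols, on))
    left = [c + suffixes[0] if c in clash else c for c in left_cols]
    right = [c + suffixes[1] if c in clash else c for c in right_cols if c not in keys]
    return left + right
```

With left columns `["k", "v", "a"]` and right columns `["k", "v", "b"]` joined on `"k"`, `output_names` returns `v` twice, while `suffixed_names` returns `v_x` and `v_y`. Until suffixing lands, renaming the clashing column on one side before the join avoids the ambiguity.
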