feat(python/sedonadb): add DataFrame.join (common-key equi-join) by jiayuasu · Pull Request #908 · apache/sedona-db

jiayuasu · 2026-06-03T04:03:23Z

First of the three join sub-PRs flagged in #791. Predicate-based / spatial joins are the next sub-PR; cross_join is a tiny follow-up after that.

API (this sub-PR)

df.join(other, on="k")                       # inner equi-join
df.join(other, on=["k1", "k2"], how="left")  # multi-key
df.join(other, on="k", how="outer")

on accepts str or list[str] — common column names that exist on both sides. Predicate Expr form lands in sub-PR Use new sedona_internal_err to avoid misleading datafusion internal err #2.
how is a string literal: inner (default), left, right, outer, left_semi, left_anti, right_semi, right_anti. Cross join gets its own method.

Result has a single copy of each join key, matching pandas / Polars / PySpark — not DataFusion's DataFrame default, which keeps both copies.

Worth flagging — the auto-dedup machinery

DataFusion's DataFrame::join has two behaviors that diverge from user expectations:

Rejects two unaliased inputs that share column names with "duplicate qualified field name". Both default to the ?table? qualifier; the merged schema has unresolvable collisions before the join even runs.
Keeps both copies of the join key in the result when inputs are aliased. The SQL USING parser does the dedup at parse time; the DataFrame API doesn't.

To get the pandas shape, the Python wrapper:

Aliases both sides internally with sentinel qualifiers (_sd_join_left_ / _sd_join_right_).
Runs the join.
Projects with fully qualified col(name, alias) refs to dedupe the key columns and strip the sentinel qualifiers from the output.

The unified join-key column:

inner / left: sourced from the left side (always populated).
right: sourced from the right side (left may be NULL for unmatched rows).
outer: COALESCE(left.k, right.k) — picks the populated side for rows unmatched on either input.
left_semi / left_anti / right_semi / right_anti: only one side's columns appear; projection is straightforward.

Implementation

File	Change
`python/sedonadb/src/dataframe.rs`	New `InternalDataFrame::join(right, join_cols, how)`. Thin wrapper over DataFusion's `DataFrame::join`. Maps `how` strings to `JoinType`. No residual filter — that's sub-PR #2.
`python/sedonadb/python/sedonadb/dataframe.py`	New `DataFrame.join(other, on, how)`. Handles validation, alias-and-project for the pandas shape, COALESCE for outer-join keys.

Test plan

14 tests in tests/expr/test_dataframe_join.py:

Positive: single-key inner / left / right / outer; multi-key inner; left_semi; left_anti.
Lazy return: isinstance(out, DataFrame).
Errors: non-DataFrame other; bad on type; empty on list; non-str element in on; invalid how; unknown column (DataFusion plan-build error).

All output assertions use exact pd.testing.assert_frame_equal after sorting.

Local: 14 unit + 24 doctests + ruff format + ruff check all clean.

Known limitations (for follow-up)

Non-key column-name collisions between left and right are not auto-suffixed (_x/_y like pandas). The duplicate names propagate and become ambiguous to reference. Deferred to a later PR per the design discussion.
Self-join requires the user to alias one side explicitly. The internal sentinel aliasing handles the unaliased common case but doesn't disambiguate a self-join.

Copilot

Pull request overview

Adds a pandas/Polars/PySpark-style DataFrame.join() API to SedonaDB’s Python DataFrame layer, implementing common-key equi-joins with pandas-shaped output (single copy of join keys) by aliasing both inputs and projecting a de-duplicated schema after the DataFusion join.

Changes:

Added a Rust InternalDataFrame::join(...) wrapper that maps how strings to DataFusion JoinType and calls DataFrame::join.
Added Python DataFrame.join(other, on, how) that normalizes/validates inputs, performs internal aliasing, and projects to dedupe join keys (including COALESCE for outer joins).
Added a new Python test module covering core join behaviors and input validation.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.

File	Description
python/sedonadb/tests/expr/test_dataframe_join.py	New test suite for `DataFrame.join()` behavior and validation.
python/sedonadb/src/dataframe.rs	Rust-side join binding for Python `InternalDataFrame`.
python/sedonadb/python/sedonadb/dataframe.py	Public `DataFrame.join()` implementation with alias-and-project dedup logic.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

jiayuasu · 2026-06-03T04:59:01Z

 use datafusion::logical_expr::SortExpr;
 use datafusion::prelude::{DataFrame, SessionContext};
-use datafusion_common::{Column, DataFusionError, ParamValues};
+use datafusion_common::{Column, DataFusionError, JoinType, ParamValues};
 use datafusion_execution::TaskContextProvider;
 use datafusion_expr::{ExplainFormat, ExplainOption, Expr};


Moved in ef61ea6 — JoinType is now imported from datafusion_expr to match the dominant convention in the workspace (8 sites already use that path).

jiayuasu · 2026-06-03T04:59:02Z

+                    # that side's key; COALESCE picks the populated one.
+                    from sedonadb.expr.expression import ScalarUdf
+
+                    coalesce_udf = ScalarUdf(self._ctx.scalar_udf("coalesce"))


self._ctx in this DataFrame class is actually the internal context handle (_lib.InternalContext), not the user-facing SedonaContext Python class — and InternalContext exposes scalar_udf(name) (added in #885). The outer-join test test_join_outer covers this code path and passes. Added a clarifying comment at the call site in ef61ea6 since the naming is genuinely confusing here.

For what it's worth I updated this in #901 (it was confusing there, too)

jiayuasu · 2026-06-03T04:59:03Z

+        elif how in ("right_semi", "right_anti"):
+            projection = [_col(c, RIGHT_ALIAS)._impl for c in right_cols]


Added test_join_right_semi and test_join_right_anti in ef61ea6, parallel to the existing left-semi/anti cases.

First of the join sub-PRs from apache#791. Predicate-based / spatial joins land next; cross_join is a separate small follow-up. API: df.join(other, on="k") df.join(other, on=["k1", "k2"], how="left") df.join(other, on="k", how="outer") - `on` accepts `str` or `list[str]` — common column names that exist on both sides. Predicate `Expr` form is the next sub-PR. - `how` is a string literal: `inner` (default), `left`, `right`, `outer`, `left_semi`, `left_anti`, `right_semi`, `right_anti`. - Result has a single copy of each join key — matching pandas / Polars / PySpark, not DataFusion's DataFrame default (which keeps both copies). Rust side: thin wrapper over DataFusion's `DataFrame::join`. Maps the `how` string to `JoinType`; passes the column lists through. No filter / residual predicate in this sub-PR. Python wrapper does the heavy lifting to get the pandas-shaped output. DataFusion's DataFrame join (a) rejects two unaliased inputs that share column names because both default to the `?table?` qualifier, and (b) keeps both copies of the join key in the result. To match user expectations: 1. Alias both sides internally with sentinel qualifiers so the merged schema has no qualified-name collisions. 2. Run the join. 3. Project with fully qualified refs: left's full column list plus right's non-key columns. The unified join key comes from the left for inner/left, the right for right joins, and via COALESCE(left.k, right.k) for outer joins. The qualified col() projection strips qualifiers from the output names so users see the unqualified pandas shape. 4. Semi/anti joins skip the projection logic — DataFusion already drops the right (or left) columns, so we just take the surviving side. Tests: 14 covering single/multi-key inner, left, right, outer, left_semi, left_anti, lazy return, and the type/empty/bad-how error paths. Limitations to follow up later: non-key column-name collisions between left and right are not auto-suffixed (`_x`/`_y` like pandas); the duplicate names propagate and become ambiguous to reference. Documented as a deferred-suffix limitation.

github-actions Bot requested a review from paleolimbot June 3, 2026 04:03

jiayuasu force-pushed the feature/df-join branch from 23d1256 to 1b6df25 Compare June 3, 2026 04:12

jiayuasu requested a review from Copilot June 3, 2026 04:37

Copilot started reviewing on behalf of jiayuasu June 3, 2026 04:37 View session

Copilot AI reviewed Jun 3, 2026

View reviewed changes

jiayuasu force-pushed the feature/df-join branch from 1b6df25 to ef61ea6 Compare June 3, 2026 04:58

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(python/sedonadb): add DataFrame.join (common-key equi-join)#908

feat(python/sedonadb): add DataFrame.join (common-key equi-join)#908
jiayuasu wants to merge 1 commit into
apache:mainfrom
jiayuasu:feature/df-join

jiayuasu commented Jun 3, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

jiayuasu Jun 3, 2026

Uh oh!

jiayuasu Jun 3, 2026

Uh oh!

paleolimbot Jun 3, 2026

Uh oh!

jiayuasu Jun 3, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

		elif how in ("right_semi", "right_anti"):
		projection = [_col(c, RIGHT_ALIAS)._impl for c in right_cols]

Conversation

jiayuasu commented Jun 3, 2026

API (this sub-PR)

Worth flagging — the auto-dedup machinery

Implementation

Test plan

Known limitations (for follow-up)

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

jiayuasu Jun 3, 2026

Choose a reason for hiding this comment

Uh oh!

jiayuasu Jun 3, 2026

Choose a reason for hiding this comment

Uh oh!

paleolimbot Jun 3, 2026

Choose a reason for hiding this comment

Uh oh!

jiayuasu Jun 3, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants