feat(python/sedonadb): add DataFrame.join (common-key equi-join)#908

Draft
jiayuasu wants to merge 1 commit into apache:main from jiayuasu:feature/df-join

Conversation

@jiayuasu
Member

@jiayuasu jiayuasu commented Jun 3, 2026

First of the three join sub-PRs flagged in #791. Predicate-based / spatial joins are the next sub-PR; cross_join is a tiny follow-up after that.

API (this sub-PR)

df.join(other, on="k")                       # inner equi-join
df.join(other, on=["k1", "k2"], how="left")  # multi-key
df.join(other, on="k", how="outer")

Result has a single copy of each join key, matching pandas / Polars / PySpark — not DataFusion's DataFrame default, which keeps both copies.

Worth flagging — the auto-dedup machinery

DataFusion's DataFrame::join has two behaviors that diverge from user expectations:

  1. Rejects two unaliased inputs that share column names with "duplicate qualified field name". Both default to the ?table? qualifier; the merged schema has unresolvable collisions before the join even runs.
  2. Keeps both copies of the join key in the result when inputs are aliased. The SQL USING parser does the dedup at parse time; the DataFrame API doesn't.

To get the pandas shape, the Python wrapper:

  1. Aliases both sides internally with sentinel qualifiers (_sd_join_left_ / _sd_join_right_).
  2. Runs the join.
  3. Projects with fully qualified col(name, alias) refs to dedupe the key columns and strip the sentinel qualifiers from the output.
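
The three steps above can be sketched without DataFusion at all. The following is a minimal illustration on lists of dicts, not the wrapper's actual code: the helper names `qualify`, `inner_join`, and `project` are hypothetical, and only the two sentinel alias strings come from this PR.

```python
# Sentinel qualifiers named in this PR; everything else here is illustrative.
LEFT_ALIAS = "_sd_join_left_"
RIGHT_ALIAS = "_sd_join_right_"


def qualify(rows, alias):
    """Step 1: tag every column with a qualifier so names cannot collide."""
    return [{(alias, name): value for name, value in row.items()} for row in rows]


def inner_join(left, right, keys):
    """Step 2: naive nested-loop inner equi-join; both key copies survive."""
    return [
        {**lrow, **rrow}
        for lrow in left
        for rrow in right
        if all(lrow[(LEFT_ALIAS, k)] == rrow[(RIGHT_ALIAS, k)] for k in keys)
    ]


def project(rows, left_cols, right_cols, keys):
    """Step 3: keep left columns plus right non-key columns, unqualified."""
    refs = [(LEFT_ALIAS, c) for c in left_cols]
    refs += [(RIGHT_ALIAS, c) for c in right_cols if c not in keys]
    return [{name: row[(alias, name)] for alias, name in refs} for row in rows]


left = [{"k": 1, "a": "x"}, {"k": 2, "a": "y"}]
right = [{"k": 1, "b": 10}, {"k": 3, "b": 30}]

joined = inner_join(qualify(left, LEFT_ALIAS), qualify(right, RIGHT_ALIAS), ["k"])
result = project(joined, ["k", "a"], ["k", "b"], ["k"])
print(result)  # [{'k': 1, 'a': 'x', 'b': 10}]
```

After step 2 each joined row still carries two copies of `k` (one per qualifier); the projection in step 3 is what collapses them into the single unqualified key column.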

The unified join-key column:

  • inner / left: sourced from the left side (always populated).
  • right: sourced from the right side (left may be NULL for unmatched rows).
  • outer: COALESCE(left.k, right.k) — picks the populated side for rows unmatched on either input.
  • left_semi / left_anti / right_semi / right_anti: only one side's columns appear; projection is straightforward.
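
As a rough sketch of the rule above, the choice of key source per join type could be written as a small pure-Python function. The names `coalesce` and `unified_key` are hypothetical and stand in for the expression-level logic; `None` plays the role of SQL NULL.

```python
def coalesce(*values):
    """Return the first non-None argument, mirroring SQL COALESCE."""
    return next((v for v in values if v is not None), None)


def unified_key(how, left_key, right_key):
    """Pick the output value of one join-key column for one result row."""
    if how in ("inner", "left", "left_semi", "left_anti"):
        # The left side is always populated for these join types.
        return left_key
    if how in ("right", "right_semi", "right_anti"):
        # The left side may be NULL for unmatched rows, so use the right.
        return right_key
    if how == "outer":
        # Either side may be NULL; take whichever is populated.
        return coalesce(left_key, right_key)
    raise ValueError(f"invalid how: {how!r}")


print(unified_key("left", 2, None))   # 2
print(unified_key("right", None, 3))  # 3
print(unified_key("outer", None, 3))  # 3
```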

Implementation

| File | Change |
| --- | --- |
| python/sedonadb/src/dataframe.rs | New InternalDataFrame::join(right, join_cols, how). Thin wrapper over DataFusion's DataFrame::join. Maps how strings to JoinType. No residual filter; that's sub-PR #2. |
| python/sedonadb/python/sedonadb/dataframe.py | New DataFrame.join(other, on, how). Handles validation, alias-and-project for the pandas shape, COALESCE for outer-join keys. |

Test plan

14 tests in tests/expr/test_dataframe_join.py:

  • Positive: single-key inner / left / right / outer; multi-key inner; left_semi; left_anti.
  • Lazy return: isinstance(out, DataFrame).
  • Errors: non-DataFrame other; bad on type; empty on list; non-str element in on; invalid how; unknown column (DataFusion plan-build error).
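
The `on` / `how` error cases listed above suggest input validation roughly like the following. This is a sketch under assumptions: the function names are hypothetical, and the choice of `TypeError` versus `ValueError` for each case is a guess rather than something this PR states. The accepted `how` values are the eight named in the description.

```python
VALID_HOW = (
    "inner", "left", "right", "outer",
    "left_semi", "left_anti", "right_semi", "right_anti",
)


def normalize_on(on):
    """Turn `on` into a non-empty list of column names, or raise."""
    if isinstance(on, str):
        return [on]
    if not isinstance(on, list):
        raise TypeError(f"on must be str or list[str], got {type(on).__name__}")
    if not on:
        raise ValueError("on must name at least one column")
    for item in on:
        if not isinstance(item, str):
            raise TypeError(f"on elements must be str, got {type(item).__name__}")
    return list(on)


def validate_how(how):
    """Return `how` unchanged if it is a supported join type, else raise."""
    if how not in VALID_HOW:
        raise ValueError(f"invalid how {how!r}; expected one of {VALID_HOW}")
    return how


print(normalize_on("k"))           # ['k']
print(normalize_on(["k1", "k2"]))  # ['k1', 'k2']
print(validate_how("left_anti"))   # left_anti
```

The unknown-column case is not covered here because, per the test plan, that error surfaces from DataFusion when the plan is built.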

All output assertions use exact pd.testing.assert_frame_equal after sorting.

Local: 14 unit + 24 doctests + ruff format + ruff check all clean.

Known limitations (for follow-up)

  • Non-key column-name collisions between left and right are not auto-suffixed (_x/_y like pandas). The duplicate names propagate and become ambiguous to reference. Deferred to a later PR per the design discussion.
  • Self-join requires the user to alias one side explicitly. The internal sentinel aliasing handles the unaliased common case but doesn't disambiguate a self-join.
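
To make the first limitation concrete, here is a small sketch of how the output column list ends up with duplicate names when a non-key column exists on both sides. The helper names are hypothetical; the output ordering (left columns, then right non-key columns) follows the projection described earlier in this PR.

```python
def output_columns(left_cols, right_cols, keys):
    """Left's full column list followed by right's non-key columns."""
    return list(left_cols) + [c for c in right_cols if c not in keys]


def non_key_collisions(left_cols, right_cols, keys):
    """Non-key names present on both sides; these are not auto-suffixed."""
    right_non_key = {c for c in right_cols if c not in keys}
    return [c for c in left_cols if c not in keys and c in right_non_key]


left_cols = ["k", "name", "v"]
right_cols = ["k", "name", "w"]

print(output_columns(left_cols, right_cols, ["k"]))
# ['k', 'name', 'v', 'name', 'w']
print(non_key_collisions(left_cols, right_cols, ["k"]))
# ['name']
```

Until suffixing lands, a check like `non_key_collisions` could let a caller rename the overlapping columns on one side before joining.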

Contributor

Copilot AI left a comment

Pull request overview

Adds a pandas/Polars/PySpark-style DataFrame.join() API to SedonaDB’s Python DataFrame layer, implementing common-key equi-joins with pandas-shaped output (single copy of join keys) by aliasing both inputs and projecting a de-duplicated schema after the DataFusion join.

Changes:

  • Added a Rust InternalDataFrame::join(...) wrapper that maps how strings to DataFusion JoinType and calls DataFrame::join.
  • Added Python DataFrame.join(other, on, how) that normalizes/validates inputs, performs internal aliasing, and projects to dedupe join keys (including COALESCE for outer joins).
  • Added a new Python test module covering core join behaviors and input validation.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| python/sedonadb/tests/expr/test_dataframe_join.py | New test suite for DataFrame.join() behavior and validation. |
| python/sedonadb/src/dataframe.rs | Rust-side join binding for Python InternalDataFrame. |
| python/sedonadb/python/sedonadb/dataframe.py | Public DataFrame.join() implementation with alias-and-project dedup logic. |

Comment thread python/sedonadb/src/dataframe.rs Outdated
Comment on lines 27 to 31
 use datafusion::logical_expr::SortExpr;
 use datafusion::prelude::{DataFrame, SessionContext};
-use datafusion_common::{Column, DataFusionError, ParamValues};
+use datafusion_common::{Column, DataFusionError, JoinType, ParamValues};
 use datafusion_execution::TaskContextProvider;
 use datafusion_expr::{ExplainFormat, ExplainOption, Expr};
Member Author

Moved in ef61ea6: JoinType is now imported from datafusion_expr to match the dominant convention in the workspace (8 sites already use that path).

# that side's key; COALESCE picks the populated one.
from sedonadb.expr.expression import ScalarUdf

coalesce_udf = ScalarUdf(self._ctx.scalar_udf("coalesce"))
Member Author

self._ctx in this DataFrame class is actually the internal context handle (_lib.InternalContext), not the user-facing SedonaContext Python class — and InternalContext exposes scalar_udf(name) (added in #885). The outer-join test test_join_outer covers this code path and passes. Added a clarifying comment at the call site in ef61ea6 since the naming is genuinely confusing here.

Member

For what it's worth I updated this in #901 (it was confusing there, too)

Comment on lines +745 to +746
elif how in ("right_semi", "right_anti"):
    projection = [_col(c, RIGHT_ALIAS)._impl for c in right_cols]
Member Author

Added test_join_right_semi and test_join_right_anti in ef61ea6, parallel to the existing left-semi/anti cases.

First of the join sub-PRs from apache#791. Predicate-based / spatial joins
land next; cross_join is a separate small follow-up.

API:

    df.join(other, on="k")
    df.join(other, on=["k1", "k2"], how="left")
    df.join(other, on="k", how="outer")

- `on` accepts `str` or `list[str]` — common column names that exist
  on both sides. Predicate `Expr` form is the next sub-PR.
- `how` is a string literal: `inner` (default), `left`, `right`,
  `outer`, `left_semi`, `left_anti`, `right_semi`, `right_anti`.
- Result has a single copy of each join key — matching pandas /
  Polars / PySpark, not DataFusion's DataFrame default (which keeps
  both copies).

Rust side: thin wrapper over DataFusion's `DataFrame::join`. Maps
the `how` string to `JoinType`; passes the column lists through.
No filter / residual predicate in this sub-PR.

Python wrapper does the heavy lifting to get the pandas-shaped
output. DataFusion's DataFrame join (a) rejects two unaliased
inputs that share column names because both default to the `?table?`
qualifier, and (b) keeps both copies of the join key in the result.
To match user expectations:

  1. Alias both sides internally with sentinel qualifiers so the
     merged schema has no qualified-name collisions.
  2. Run the join.
  3. Project with fully qualified refs: left's full column list plus
     right's non-key columns. The unified join key comes from the
     left for inner/left, the right for right joins, and via
     COALESCE(left.k, right.k) for outer joins. The qualified col()
     projection strips qualifiers from the output names so users
     see the unqualified pandas shape.
  4. Semi/anti joins skip the projection logic — DataFusion already
     drops the right (or left) columns, so we just take the
     surviving side.

Tests: 14 covering single/multi-key inner, left, right, outer,
left_semi, left_anti, lazy return, and the type/empty/bad-how
error paths.

Limitations to follow up later: non-key column-name collisions
between left and right are not auto-suffixed (`_x`/`_y` like
pandas); the duplicate names propagate and become ambiguous to
reference. Documented as a deferred-suffix limitation.