feat(python/sedonadb): add DataFrame __getitem__ by jiayuasu · Pull Request #846 · apache/sedona-db

jiayuasu · 2026-05-15T06:45:43Z

Pandas-style bracket indexing on DataFrame, dispatching to the existing select / filter / col paths landed in earlier PRs (#807, #823, #832, #835).

This is a small follow-on PR continuing Phase P1/P2 of #791. Pure Python sugar — no new Rust code.

What's new

df["x"]                       # → Expr(col("x"))
df[["x", "y"]]                # → DataFrame.select("x", "y")
df[df["x"] > 0]               # → DataFrame.filter(col("x") > 0)
df[(df["x"] + df["y"]) > 30]  # composes cleanly with operator overloads
df[df["x"] > 1][["y"]]        # chain: filter → project

Row-position indexing (ints, slices, .loc, .iloc) is intentionally not supported and raises TypeError with a message pointing at the three supported forms. SedonaDB has no row-ordering or index concept in this scope; the non-goal is documented in #791.

Note on `df["x"]` return type

For now, df["x"] returns an Expr (a column reference). A Series wrapper is planned for the next phase — when that lands, df["x"] will start returning Series instead and Expr will fall back to an internal escape hatch. The compose-with-operators idiom continues to work either way because Series will share the same operator surface as Expr.

Test plan

10 tests in tests/expr/test_dataframe_getitem.py:

Positive: string lookup → Expr (exact repr()); list projection (two-col, single-col, reorder); bool-Expr filter; arithmetic compose; filter-then-project chain.
Errors: list with non-string element → TypeError; int → TypeError; slice → TypeError.

All 10 pass locally. No regressions in existing tests; doctests + ruff format + ruff check all clean.

Pandas-style bracket indexing dispatches to the existing select / filter / col paths shipped in earlier PRs: - `df["x"]` returns an `Expr` referencing column `x` (a Series wrapper will land in a later phase). - `df[["x", "y"]]` returns a new `DataFrame` projecting the listed columns. - `df[bool_expr]` returns a new `DataFrame` filtered by the boolean expression. Row-position indexing (ints, slices, `.loc`, `.iloc`) is intentionally not supported and raises `TypeError` with a message pointing at the three supported forms. Pure Python sugar — no new Rust code. Composes cleanly with the operator overloads from apache#823, so the natural pandas idiom `df[(df.x + df.y) > 30]` works as-is.

Copilot

Pull request overview

Adds pandas-style bracket indexing (DataFrame.__getitem__) to SedonaDB’s Python DataFrame, providing ergonomic sugar that routes to the already-landed expression (Expr), projection (select), and filtering (filter) APIs.

Changes:

Implement DataFrame.__getitem__ supporting df["col"] -> Expr, df[["c1","c2"]] -> select, and df[expr] -> filter, with non-supported row-based indexing raising TypeError.
Add a focused test module covering supported forms, composition/chaining, and key error cases.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

File	Description
python/sedonadb/python/sedonadb/dataframe.py	Adds `DataFrame.__getitem__` dispatching to `col()`, `select()`, and `filter()` with clear TypeErrors for unsupported indexers.
python/sedonadb/tests/expr/test_dataframe_getitem.py	Adds unit tests validating the new bracket-indexing behavior and error paths.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

zhangfengcdt

LGTM

paleolimbot

@zhangfengcdt Please let the original committer do the merging (if a PR is not from a committer, give them a reasonable window to update if they want to do another round of self-review).

@jiayuasu Can we turn off the round robin reviews? We have a huge variety in our code base and should split up review based on familiarity, not randomly.

paleolimbot · 2026-05-15T16:34:25Z

+        from sedonadb.expr import Expr
+        from sedonadb.expr import col as _col


These should be module-level imports (and they should be combined). Lazy imports in this module are only for optional dependencies (pyarrow is the usual culprit)

paleolimbot · 2026-05-15T16:37:47Z

+        if isinstance(key, str):
+            return _col(key)
+        if isinstance(key, Expr):
+            return self.filter(key)
+        if isinstance(key, list):
+            for k in key:
+                if not isinstance(k, str):
+                    raise TypeError(
+                        f"DataFrame[list] expects a list of column names, "
+                        f"got {type(k).__name__}"
+                    )
+            return self.select(*key)
+        raise TypeError(
+            f"DataFrame indexing is not supported for {type(key).__name__}. "
+            f"Use df['x'] for a column expression, df[['x', 'y']] to project "
+            f"columns, or df[bool_expr] to filter rows."
+        )


I think we should restrict this to only the first case. The ability to put a filter or multi-column selection here is pandas-y but prevents type hinting from working well on the output (i.e., IDEs and LLMs that use the language server or can use type hints can't auto complete tab["col"].<tab>).

This should also support integer indexing (e.g., tab[0] gives you the first column).

FWIW Ibis used to accept the second two cases but deprecated them (FutureWarning: Selecting/filtering arbitrary expressions in Table.getitemis deprecated and will be removed in version 10.0. Please useTable.selectorTable.filter instead.).

paleolimbot · 2026-05-15T16:42:42Z

        """
        return self.limit(n)

+    def __getitem__(self, key):


Suggested change

def __getitem__(self, key):

def __getitem__(self, key: Union[str, int]) -> Expr:

Eventually we can give Expr a subclass for columns to improve the type hinting

paleolimbot · 2026-05-15T16:45:49Z

+        if isinstance(key, str):
+            return _col(key)


This should reach into the underlying DFSchema and pull out the actual qualified column expression (displaying a nice error for columns that don't exist). You'll need this for join expressions.

jiayuasu · 2026-05-17T06:55:11Z

@paleolimbot I’ll create a follow-up PR for this. I’d still like to keep the round-robin review assignment so our engineers can build familiarity with different components of the project. That said, this does not mean I’ll merge a PR solely based on their approval.

@paleolimbot

Address post-merge feedback on apache#846. - Restrict `DataFrame.__getitem__` to `(key: Union[str, int]) -> Expr`. Drops the polymorphic list-projection and bool-Expr-filter forms that apache#846 added. Reasons mirror Ibis's deprecation of the same forms: a single return type keeps IDE/type-checker inference on `df["x"].<method>` clean, and the multi-column / filter cases have obvious named entry points (`select`, `filter`). - Integer indexing now resolves a column by 0-based position, with negative indexing supported. Out-of-range integers raise `IndexError`; unknown string names raise `KeyError` listing the available columns. `bool` is rejected explicitly so a stray `df[True]` doesn't silently mean `df[1]` (bool is a subclass of int in Python). List, slice, and Expr keys raise `TypeError` with messages pointing at `select` / `filter`. - Move the lazy `Expr` / `Literal` / `col` / `_to_expr` imports inside `__getitem__`, `select`, and `filter` to module level (combined). Per project policy in this module, lazy imports are reserved for optional dependencies like pyarrow. The fix while here keeps the import discipline consistent across the three Expr-aware methods rather than churning them separately later. - Update annotations on `select` and `filter` to use the new module-level names directly (no more string forward references). Tests rewritten for the new shape: 11 cases covering name and position lookup, negative indexing, operator composition, and the `KeyError` / `IndexError` / `TypeError` error paths. cc @paleolimbot

github-actions Bot requested a review from zhangfengcdt May 15, 2026 06:46

jiayuasu requested a review from Copilot May 15, 2026 07:04

Copilot started reviewing on behalf of jiayuasu May 15, 2026 07:05 View session

Copilot AI reviewed May 15, 2026

View reviewed changes

jiayuasu marked this pull request as ready for review May 15, 2026 07:45

zhangfengcdt approved these changes May 15, 2026

View reviewed changes

zhangfengcdt merged commit 5650acb into apache:main May 15, 2026
6 checks passed

paleolimbot reviewed May 15, 2026

View reviewed changes

jiayuasu mentioned this pull request May 17, 2026

feat(python/sedonadb): restrict DataFrame __getitem__ to single-column lookup #852

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(python/sedonadb): add DataFrame getitem#846

feat(python/sedonadb): add DataFrame getitem#846
zhangfengcdt merged 1 commit into
apache:mainfrom
jiayuasu:feature/df-getitem

jiayuasu commented May 15, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

zhangfengcdt left a comment

Uh oh!

Uh oh!

paleolimbot left a comment

Uh oh!

paleolimbot May 15, 2026

Uh oh!

paleolimbot May 15, 2026

Uh oh!

paleolimbot May 15, 2026

Uh oh!

paleolimbot May 15, 2026

Uh oh!

jiayuasu commented May 17, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

		from sedonadb.expr import Expr
		from sedonadb.expr import col as _col

	def __getitem__(self, key):
	def __getitem__(self, key: Union[str, int]) -> Expr:

Conversation

jiayuasu commented May 15, 2026

What's new

Note on df["x"] return type

Test plan

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

zhangfengcdt left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

paleolimbot left a comment

Choose a reason for hiding this comment

Uh oh!

paleolimbot May 15, 2026

Choose a reason for hiding this comment

Uh oh!

paleolimbot May 15, 2026

Choose a reason for hiding this comment

Uh oh!

paleolimbot May 15, 2026

Choose a reason for hiding this comment

Uh oh!

paleolimbot May 15, 2026

Choose a reason for hiding this comment

Uh oh!

jiayuasu commented May 17, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Note on `df["x"]` return type