feat(python/sedonadb): add DataFrame __getitem__#846
Conversation
Pandas-style bracket indexing dispatches to the existing select / filter / col paths shipped in earlier PRs: - `df["x"]` returns an `Expr` referencing column `x` (a Series wrapper will land in a later phase). - `df[["x", "y"]]` returns a new `DataFrame` projecting the listed columns. - `df[bool_expr]` returns a new `DataFrame` filtered by the boolean expression. Row-position indexing (ints, slices, `.loc`, `.iloc`) is intentionally not supported and raises `TypeError` with a message pointing at the three supported forms. Pure Python sugar — no new Rust code. Composes cleanly with the operator overloads from apache#823, so the natural pandas idiom `df[(df.x + df.y) > 30]` works as-is.
There was a problem hiding this comment.
Pull request overview
Adds pandas-style bracket indexing (DataFrame.__getitem__) to SedonaDB’s Python DataFrame, providing ergonomic sugar that routes to the already-landed expression (Expr), projection (select), and filtering (filter) APIs.
Changes:
- Implement
DataFrame.__getitem__supportingdf["col"] -> Expr,df[["c1","c2"]] -> select, anddf[expr] -> filter, with non-supported row-based indexing raisingTypeError. - Add a focused test module covering supported forms, composition/chaining, and key error cases.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| python/sedonadb/python/sedonadb/dataframe.py | Adds DataFrame.__getitem__ dispatching to col(), select(), and filter() with clear TypeErrors for unsupported indexers. |
| python/sedonadb/tests/expr/test_dataframe_getitem.py | Adds unit tests validating the new bracket-indexing behavior and error paths. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
paleolimbot
left a comment
There was a problem hiding this comment.
@zhangfengcdt Please let the original committer do the merging (if a PR is not from a committer, give them a reasonable window to update if they want to do another round of self-review).
@jiayuasu Can we turn off the round robin reviews? We have a huge variety in our code base and should split up review based on familiarity, not randomly.
| from sedonadb.expr import Expr | ||
| from sedonadb.expr import col as _col |
There was a problem hiding this comment.
These should be module-level imports (and they should be combined). Lazy imports in this module are only for optional dependencies (pyarrow is the usual culprit)
| if isinstance(key, str): | ||
| return _col(key) | ||
| if isinstance(key, Expr): | ||
| return self.filter(key) | ||
| if isinstance(key, list): | ||
| for k in key: | ||
| if not isinstance(k, str): | ||
| raise TypeError( | ||
| f"DataFrame[list] expects a list of column names, " | ||
| f"got {type(k).__name__}" | ||
| ) | ||
| return self.select(*key) | ||
| raise TypeError( | ||
| f"DataFrame indexing is not supported for {type(key).__name__}. " | ||
| f"Use df['x'] for a column expression, df[['x', 'y']] to project " | ||
| f"columns, or df[bool_expr] to filter rows." | ||
| ) |
There was a problem hiding this comment.
I think we should restrict this to only the first case. The ability to put a filter or multi-column selection here is pandas-y but prevents type hinting from working well on the output (i.e., IDEs and LLMs that use the language server or can use type hints can't auto complete tab["col"].<tab>).
This should also support integer indexing (e.g., tab[0] gives you the first column).
FWIW Ibis used to accept the second two cases but deprecated them (FutureWarning: Selecting/filtering arbitrary expressions in Table.getitemis deprecated and will be removed in version 10.0. Please useTable.selectorTable.filter instead.).
| """ | ||
| return self.limit(n) | ||
|
|
||
| def __getitem__(self, key): |
There was a problem hiding this comment.
| def __getitem__(self, key): | |
| def __getitem__(self, key: Union[str, int]) -> Expr: |
Eventually we can give Expr a subclass for columns to improve the type hinting
| if isinstance(key, str): | ||
| return _col(key) |
There was a problem hiding this comment.
This should reach into the underlying DFSchema and pull out the actual qualified column expression (displaying a nice error for columns that don't exist). You'll need this for join expressions.
|
@paleolimbot I’ll create a follow-up PR for this. I’d still like to keep the round-robin review assignment so our engineers can build familiarity with different components of the project. That said, this does not mean I’ll merge a PR solely based on their approval. |
Address post-merge feedback on apache#846. - Restrict `DataFrame.__getitem__` to `(key: Union[str, int]) -> Expr`. Drops the polymorphic list-projection and bool-Expr-filter forms that apache#846 added. Reasons mirror Ibis's deprecation of the same forms: a single return type keeps IDE/type-checker inference on `df["x"].<method>` clean, and the multi-column / filter cases have obvious named entry points (`select`, `filter`). - Integer indexing now resolves a column by 0-based position, with negative indexing supported. Out-of-range integers raise `IndexError`; unknown string names raise `KeyError` listing the available columns. `bool` is rejected explicitly so a stray `df[True]` doesn't silently mean `df[1]` (bool is a subclass of int in Python). List, slice, and Expr keys raise `TypeError` with messages pointing at `select` / `filter`. - Move the lazy `Expr` / `Literal` / `col` / `_to_expr` imports inside `__getitem__`, `select`, and `filter` to module level (combined). Per project policy in this module, lazy imports are reserved for optional dependencies like pyarrow. The fix while here keeps the import discipline consistent across the three Expr-aware methods rather than churning them separately later. - Update annotations on `select` and `filter` to use the new module-level names directly (no more string forward references). Tests rewritten for the new shape: 11 cases covering name and position lookup, negative indexing, operator composition, and the `KeyError` / `IndexError` / `TypeError` error paths. cc @paleolimbot
Pandas-style bracket indexing on
DataFrame, dispatching to the existingselect/filter/colpaths landed in earlier PRs (#807, #823, #832, #835).This is a small follow-on PR continuing Phase P1/P2 of #791. Pure Python sugar — no new Rust code.
What's new
Row-position indexing (ints, slices,
.loc,.iloc) is intentionally not supported and raisesTypeErrorwith a message pointing at the three supported forms. SedonaDB has no row-ordering or index concept in this scope; the non-goal is documented in #791.Note on
df["x"]return typeFor now,
df["x"]returns anExpr(a column reference). ASerieswrapper is planned for the next phase — when that lands,df["x"]will start returningSeriesinstead andExprwill fall back to an internal escape hatch. The compose-with-operators idiom continues to work either way becauseSerieswill share the same operator surface asExpr.Test plan
10 tests in
tests/expr/test_dataframe_getitem.py:repr()); list projection (two-col, single-col, reorder); bool-Expr filter; arithmetic compose; filter-then-project chain.TypeError; int →TypeError; slice →TypeError.All 10 pass locally. No regressions in existing tests; doctests +
ruff format+ruff checkall clean.