
Commit 8ba06e4

timsaucer and claude authored
Update datafusion dependency to latest in preparation for DF54 (#1532)
* feat: upgrade upstream DataFusion 53 → main (pre-54)

Bump workspace deps to apache/datafusion@3d06bedc (git pin) in preparation for the 54.0.0 release. Workspace package version moves to 54.0.0 to track the upstream major convention.

Compile fixes:
- Drop as_any impls (trait now has Any as supertrait) and use the upstream-provided downcast_ref helper on dyn trait objects.
- Reconcile FFI provider From conversions to drop redundant `+ Send` on Arc<dyn ...> bounds.
- Cast/TryCast: data_type → field.data_type() (FieldRef rename).
- Stub match arms for new Expr::HigherOrderFunction / Lambda / LambdaVariable and ScalarValue::ListView / LargeListView variants; proper exposure deferred to PR 3 audit.
- DatasetExec: partition_statistics returns Arc<Statistics>; add required apply_expressions trait method.
- Suppress TableFunctionImpl::call deprecation pending call_with_args refactor that needs Session plumbing.

User-facing test updates for upstream behavior changes:
- median / approx_median / approx_percentile_cont now return Float64.
- String functions (concat_ws, lower, upper, repeat, reverse, split_part, translate) return StringView when given StringView.
- overlay appends past end-of-string rather than replacing the input.
- arrays_zip / list_zip struct field names "c0"/"c1" → "1"/"2".
- Filter on mismatched cast types now errors (was 0 matches).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat: expose DataFrame.alias and tidy public API after DF53→54 audit

Companion to the upstream DataFusion 53 → main bump. The check-upstream audit (PR 3 of dev/release/upstream-sync.md) surfaced a small set of trivial wins; this commit ships them.

Trivial wins:
- DataFrame.alias(name) — wraps the logical plan in a SubqueryAlias.
- functions.__all__: add `instr` and `position` (both were defined as public defs but missing from `__all__`, so they didn't show up in `from datafusion.functions import *` or generated docs).
- top-level `datafusion.__all__`: re-export `TableProviderFactory` and `TableProviderFactoryExportable` (previously only reachable via the `datafusion.catalog` submodule).

Non-trivial gaps surfaced by the audit (DataFrame.registry, into_*/task_ctx, SessionContext extensibility surface, distinct-aware aggregate variants, TableFunctionImpl::call_with_args migration, FFI Protocol pipeline gaps) are deferred — each warrants its own design and PR.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* taplo fmt

* Update unit test to go along with apache/datafusion#22133

* docs: demonstrate alias via self-join in DataFrame.alias example

Prior example called alias("t") then to_pydict(), which did not show the qualifier effect. Replace with a self-join that uses col("l.val") and col("r.val") so the disambiguation behavior is visible.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat: wrap higher-order, lambda, and lambda-variable Expr variants

DataFusion 54 introduces Expr::HigherOrderFunction, Expr::Lambda, and Expr::LambdaVariable. PyExpr::to_variant previously errored on each with py_unsupported_variant_err. Add PyHigherOrderFunction, PyLambda, and PyLambdaVariable wrappers, register them in the expr pymodule and re-export from python/datafusion/expr.py, and dispatch to_variant to the new wrappers.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat: wire rex_type and rex_call_operands for new Expr variants

Map HigherOrderFunction and Lambda to RexType::Call; LambdaVariable to RexType::Reference. In rex_call_operands return the args for HigherOrderFunction, the body for Lambda, and self for LambdaVariable (mirroring Column). In rex_call_operator return the underlying UDF name for HigherOrderFunction and the literal "lambda" for Lambda.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat: support LargeList/ListView/LargeListView in map_from_scalar_to_arrow

These ScalarValue variants all wrap Arc<...Array>, exposing the outer DataType via Array::data_type(), so we can mirror the existing ScalarValue::List arm instead of returning PyNotImplementedError. This makes Expr.types() work for plans that round-trip through SQL or proto where these scalar variants surface.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* refactor: switch PyTableFunction to non-deprecated call_with_args

DataFusion 53.0.0 deprecated TableFunctionImpl::call in favor of call_with_args(args: TableFunctionArgs), which threads a Session reference alongside the exprs. Implement call_with_args on PyTableFunction (delegating to the FFI variant's call_with_args, or ignoring the session for the pure-Python variant which doesn't use it) and have __call__ build a TableFunctionArgs from the global session. Drops both #[allow(deprecated)] attributes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* build: revert workspace version to 53.0.0 and move DF overrides to [patch.crates-io]

The workspace version was prematurely bumped to 54.0.0 in the DF53→pre-54 upgrade. Restore it to 53.0.0 until we are actually ready to cut the 54 release.

The same change had moved every datafusion-* dependency from a crates.io version constraint to a direct git dep in [workspace.dependencies]. Switch them back to "version = \"53\"" and move the git rev overrides into [patch.crates-io] so the published manifest will be patch-free.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* taplo format

* test: sort FFI test results by partition key before equality compare

Multi-partition `collect()` returns batches in execution-scheduling order, which is non-deterministic and differs between local and CI runners. Sort by the first value of column 0 (unique per partition in each affected test) so the expected/actual comparison is stable.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Bump datafusion main commit

* test: cover new DF54 expr wrappers, catalog factories, and DataFrame.alias

Add module-metadata checks for HigherOrderFunction, Lambda, LambdaVariable and the top-level TableProviderFactory / TableProviderFactoryExportable re-exports, plus a self-join regression test exercising the new DataFrame.alias() qualifier-based selection path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent baef8f0 commit 8ba06e4

35 files changed

Lines changed: 865 additions & 622 deletions

.ai/skills/check-upstream/SKILL.md

Lines changed: 45 additions & 0 deletions
@@ -29,6 +29,29 @@ You are auditing the datafusion-python project to find features from the upstrea

 **IMPORTANT: The Python API is the source of truth for coverage.** A function or method is considered "exposed" if it exists in the Python API (e.g., `python/datafusion/functions.py`), even if there is no corresponding entry in the Rust bindings. Many upstream functions are aliases of other functions — the Python layer can expose these aliases by calling a different underlying Rust binding. Do NOT report a function as missing if it appears in the Python `__all__` list and has a working implementation, regardless of whether a matching `#[pyfunction]` exists in Rust.

+**IMPORTANT: audit the total upstream surface, not the delta since the last pin.** Gaps accumulate across syncs. A patch-release bump with a "bug fixes only" changelog does not mean there is nothing to find — pre-existing gaps from earlier majors still need to be surfaced. Always run the full comparison.
+
+## Compile-Signal Triggers
+
+If a recent upstream bump required *any* of the following while fixing
+compile errors in `crates/core/` or the FFI example, treat that as a
+**hard signal** that user-facing surface area grew and run this skill
+before considering the bump done. Each pattern corresponds to a class of
+gap that frequently shows up in the audit:
+
+| Signal during PR 1 compile fix | Likely gap to check |
+|---|---|
+| New `Expr::*` variant added to a non-exhaustive `match` (`HigherOrderFunction`, `Lambda`, `LambdaVariable`, …) | New lambda / higher-order scalar functions (`any_match`, `array_transform`, `list_transform`, …) |
+| New `ScalarValue::*` variant (`ListView`, `LargeListView`, …) | New scalar / array functions that consume or produce the type |
+| New required trait method on `ExecutionPlan` / `TableProvider` / `*UDFImpl` (`apply_expressions`, …) | Corresponding capability on the Python wrapper class |
+| Renamed or restructured struct field (e.g. `Cast.data_type` → `Cast.field: FieldRef`) | Any Python accessor / SKILL.md doc that read the old field |
+| Newly deprecated trait method with a `_with_args` / `_with_options` replacement | The `*_with_options` variant frequently warrants a separate Python entry point |
+
+PR 1 of `dev/release/upstream-sync.md` asks you to log these signals as
+they appear. When you run this skill, use that log as a checklist: every
+entry must either show up in the audit output or be explicitly skipped
+with a reason.
+
 ## Areas to Check

 The user may specify an area via `$ARGUMENTS`. If no area is specified or "all" is given, check all areas.
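
The checklist rule that closes the Compile-Signal Triggers section in the hunk above (every logged signal must appear in the audit output or be skipped with a reason) can be sketched as a small check. This is only an illustration: the function name `unresolved_signals`, the shape of the log, and the sample entries are hypothetical and do not come from the repository.

```python
# Hypothetical sketch: the data shapes and names here are assumptions,
# not part of datafusion-python or its release tooling.

def unresolved_signals(logged, audit_findings, skipped):
    """Return logged signals that are neither audited nor skipped with a reason."""
    unresolved = []
    for signal in logged:
        # A skip only counts when it carries a non-empty reason.
        reason = skipped.get(signal, "").strip()
        if signal not in audit_findings and not reason:
            unresolved.append(signal)
    return unresolved


# Illustrative log entries, loosely modelled on the table above.
logged = ["Expr::Lambda", "ScalarValue::ListView", "apply_expressions"]
audit_findings = {"Expr::Lambda"}
skipped = {
    "ScalarValue::ListView": "hypothetical reason: nothing user-facing yet",
    "apply_expressions": "",  # skipped without a reason, so still unresolved
}

print(unresolved_signals(logged, audit_findings, skipped))
```

An empty result would mean the log has been fully accounted for; anything returned still needs either an audit entry or a written skip reason.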
@@ -173,6 +196,28 @@ These upstream FFI types have been reviewed and do not need to be independently
 - FFI example in `examples/datafusion-ffi-example/`
 - Type appears in union type hints where accepted

+### 8. `__all__` Hygiene (functions.py)
+
+Independent of upstream parity, also flag public `def` symbols in
+`python/datafusion/functions.py` that are missing from the module's
+`__all__`. These are functions a user can call but that do not show up in
+`from datafusion.functions import *`, in tab-completion against the
+namespace, or in generated API docs — typically an oversight rather than
+an intentional omission.
+
+**How to check:**
+1. Grep for `^def ([a-z_][a-z0-9_]*)\(` in `python/datafusion/functions.py`
+to enumerate every public function definition.
+2. Read the `__all__` list at the top of the same file.
+3. Report any function in (1) that is not in (2). Skip private helpers
+(names starting with `_`).
+
+A historical example: `instr` and `position` shipped as public `def`s but
+were absent from `__all__` until the gap was caught here.
+
+For each finding, propose adding the name to `__all__` in alphabetical
+position with the existing entries.
+
 ## Checking for Existing GitHub Issues

 After identifying missing APIs, search the open issues at https://github.com/apache/datafusion-python/issues for each gap to see if an issue already exists requesting that API be exposed. Search using the function or method name as the query.
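
The three "How to check" steps in section 8 of the hunk above could be automated roughly as follows. This is a minimal sketch, assuming `__all__` is a plain list literal assigned at module level; the helper names `read_all` and `missing_from_all` and the sample module text are made up for illustration and are not part of the repository.

```python
import ast
import re

# Same pattern the skill suggests grepping for; MULTILINE lets ^ match each line.
PUBLIC_DEF = re.compile(r"^def ([a-z_][a-z0-9_]*)\(", re.MULTILINE)


def read_all(source: str) -> set[str]:
    """Step 2: collect the names in a literal top-level ``__all__`` assignment."""
    for node in ast.parse(source).body:
        if isinstance(node, ast.Assign):
            targets = [t.id for t in node.targets if isinstance(t, ast.Name)]
            if "__all__" in targets:
                return set(ast.literal_eval(node.value))
    return set()


def missing_from_all(source: str) -> list[str]:
    """Steps 1 and 3: public top-level defs that are not listed in ``__all__``."""
    defined = set(PUBLIC_DEF.findall(source))
    exported = read_all(source)
    return sorted(
        name for name in defined
        if not name.startswith("_") and name not in exported
    )


# Hypothetical stand-in for the contents of functions.py.
SAMPLE = '''
__all__ = ["abs", "lower"]

def abs(arg):
    return arg

def instr(string, substr):
    return string

def lower(arg):
    return arg

def position(string, substr):
    return string

def _helper():
    return None
'''

print(missing_from_all(SAMPLE))
```

To run it against the real module, the file text could be read with `pathlib.Path("python/datafusion/functions.py").read_text()` and passed to `missing_from_all`. An `__all__` built dynamically rather than as a literal would need different handling.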
