
feat: expose variety of features from DF54 update #1554

Open: timsaucer wants to merge 9 commits into apache:main from timsaucer:feat/df54-followups-wave1



timsaucer (Member) commented May 21, 2026

Which issue does this PR close?

No single issue — this is wave 1 of follow-up work after the DataFusion 54 upgrade (#1532). Each commit is self-contained and can be reviewed independently.

Rationale for this change

DataFusion 54 introduced or deprecated several pieces of upstream API surface that the Python bindings had not yet caught up with. This PR closes the highest-value gaps.

What changes are included in this PR?

  • Add LogicalExtensionCodecExportable / PhysicalExtensionCodecExportable Protocols so the codec setter type hints are easier to understand
  • Cover upstream get_field_path by folding it into a variadic get_field, which is more Pythonic, instead of exposing it separately
  • Expose SessionContext.read_batches / read_batch
  • Expose UDF / UDAF / UDWF lookup and listing helpers
  • Bump the pre-commit actionlint hook so it stops failing CI checks
  • Make minor unit test changes so a deprecation warning no longer leaks and the timestamp[s] xfail test is no longer needed

Are there any user-facing changes?

Yes, but they are all additions. No breaking changes to existing public APIs.

timsaucer and others added 8 commits May 21, 2026 14:55
DataFusion 53 deprecated `TableFunctionImpl::call(args: &[Expr])` in
favor of `call_with_args(args: TableFunctionArgs)`. `PyTableFunction`
was migrated in 5a64b0d; this brings the FFI example along so it no
longer relies on the deprecated entry point.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PR apache#1541 introduced `with_logical_extension_codec` /
`with_physical_extension_codec` setters typed as `codec: Any`. The Rust
extractors accept either a raw `PyCapsule` or any object exposing
`__datafusion_logical_extension_codec__` /
`__datafusion_physical_extension_codec__`.

Add `LogicalExtensionCodecExportable` / `PhysicalExtensionCodecExportable`
Protocols in `python/datafusion/user_defined.py` (matching the existing
`ScalarUDFExportable` pattern) and tighten both setter signatures to
`Protocol | _PyCapsule`. Pure typing change; no runtime behavior diff.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Upstream exposes both `get_field(expr, name)` and
`get_field_path(expr, [names...])`, but both ultimately call the same
scalar UDF with a base expression plus one or more name args. Collapse
the Python surface into a single variadic `get_field(expr, *names)`
that accepts either a one-step lookup or a path of names, dispatching
through a single Rust binding.

Note in `.ai/skills/check-upstream/SKILL.md` that `get_field_path` is
covered by the variadic form so future audits do not flag it as a gap.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
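
To illustrate the variadic shape, here is a pure-Python analogue that walks nested dictionaries instead of building DataFusion expressions. It is only a sketch of the calling convention: the real `get_field` returns an expression, and raising `ValueError` when no names are given is an assumption made for this example.

```python
from typing import Any, Mapping


def get_field(value: Mapping[str, Any], *names: str) -> Any:
    """Follow one or more field names into a nested mapping."""
    if not names:
        # Assumed error type; the source only says the empty case is an error.
        raise ValueError("get_field requires at least one field name")
    current: Any = value
    for name in names:
        # Each name descends one level, so a single name is a one-step lookup.
        current = current[name]
    return current


row = {"user": {"address": {"city": "Oslo"}}}

# One-step lookup, equivalent to upstream get_field(expr, name).
one_step = get_field(row, "user")

# Path lookup, covering what upstream get_field_path(expr, [names...]) does.
city = get_field(row, "user", "address", "city")
print(city)
```

A single variadic signature keeps the one-name call unchanged for existing users while making the path form a natural extension of it.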
Wrap upstream `SessionContext::read_batches`, which materializes a
DataFrame directly from a sequence of `RecordBatch`es without
registering a named table. The single-batch convenience
`SessionContext.read_batch` is implemented in pure Python by calling
`read_batches([batch])`, so the Rust side only needs the one binding.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
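
The delegation described above can be sketched as below. `SketchContext` and the list-based `Batch` alias are hypothetical stand-ins so the example runs without DataFusion or pyarrow; in the real bindings `read_batches` is backed by Rust and returns a DataFrame.

```python
from typing import List, Sequence

Batch = List[int]  # stand-in for a RecordBatch


class SketchContext:
    """Hypothetical context showing one binding plus a thin convenience."""

    def read_batches(self, batches: Sequence[Batch]) -> List[Batch]:
        # Stand-in for the single Rust-backed binding.
        return [list(batch) for batch in batches]

    def read_batch(self, batch: Batch) -> List[Batch]:
        # Pure-Python convenience: wrap the batch and reuse read_batches.
        return self.read_batches([batch])


ctx = SketchContext()
print(ctx.read_batch([1, 2, 3]))
```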
Expose `udf(name)` / `udaf(name)` / `udwf(name)` lookups symmetric with
the existing `register_udf` / `register_udaf` / `register_udwf` setters,
plus `udfs()` / `udafs()` / `udwfs()` for enumerating registered
function names. Looked-up functions come back as the same
`ScalarUDF` / `AggregateUDF` / `WindowUDF` wrappers users already get
from registration, so they can be called as expressions or re-registered
into a different session.

The list helpers return a sorted `Vec<String>` rather than the raw
`HashSet` that upstream returns, so calling code gets a stable ordering.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
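
A rough sketch of the lookup pattern, using a plain dictionary in place of the session's function registry. `FunctionRegistry`, the `double` / `negate` functions, and the `config` mapping are hypothetical, and raising `KeyError` for an unknown name is an assumption; only the method names `register_udf`, `udf`, and `udfs` come from the commit message.

```python
from typing import Callable, Dict, List


class FunctionRegistry:
    """Hypothetical registry mirroring register_udf / udf / udfs."""

    def __init__(self) -> None:
        self._udfs: Dict[str, Callable[[int], int]] = {}

    def register_udf(self, name: str, func: Callable[[int], int]) -> None:
        self._udfs[name] = func

    def udf(self, name: str) -> Callable[[int], int]:
        # Look up by name; the error type for a missing name is assumed.
        try:
            return self._udfs[name]
        except KeyError:
            raise KeyError(f"no UDF named {name!r}") from None

    def udfs(self) -> List[str]:
        # Sort so callers see a stable order, unlike raw set iteration.
        return sorted(self._udfs)


registry = FunctionRegistry()
registry.register_udf("negate", lambda x: -x)
registry.register_udf("double", lambda x: 2 * x)

# Late binding: the function name comes from configuration, not from code.
config = {"transform": "double"}
transform = registry.udf(config["transform"])
print(registry.udfs(), transform(4))
```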
pyarrow.parquet promotes timestamp[s] to timestamp[ms] on write (apache/arrow#41382),
so the read array never matched the input. Cast the expected array to timestamp[ms]
in test_simple_select to assert DataFusion reads what Arrow actually stored.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
DataFrameHtmlFormatter(repr_rows=..., max_rows=...) fires the deprecation
warning before raising ValueError, but pytest.raises does not catch warnings.
The escaping warning surfaced in every pytest run. Wrap the call in both
pytest.raises and pytest.warns so the warning is asserted, not leaked.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
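
The combined assertion might look like the sketch below. `make_formatter` is a hypothetical stand-in for `DataFrameHtmlFormatter` that warns and then raises in the same order; the warning and error messages, and the nesting order of the two context managers, are assumptions.

```python
import warnings

import pytest


def make_formatter(repr_rows=None, max_rows=None):
    """Hypothetical stand-in: warn on the deprecated name, then reject the conflict."""
    if repr_rows is not None:
        warnings.warn(
            "repr_rows is deprecated; use max_rows",
            DeprecationWarning,
            stacklevel=2,
        )
    if repr_rows is not None and max_rows is not None:
        raise ValueError("specify only one of repr_rows and max_rows")
    return {"max_rows": max_rows if max_rows is not None else repr_rows}


def test_conflicting_row_limits():
    # pytest.raises handles the ValueError; pytest.warns captures the
    # DeprecationWarning so it is asserted instead of reaching the test output.
    with pytest.raises(ValueError), pytest.warns(DeprecationWarning):
        make_formatter(repr_rows=5, max_rows=10)
```

With only `pytest.raises`, the test would still pass, but the warning would be reported in the run summary.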
timsaucer changed the title from "DF54 follow-ups wave 1: SessionContext APIs, codec typing, test fixes" to "feat: expose variety of features from DF54 update" on May 21, 2026
Add Examples docstrings (doctest) for `udf` / `udaf` / `udwf` / `udfs` /
`udafs` / `udwfs` that demonstrate the lookup pattern, including a
late-binding example where the function name comes from configuration.
Add tests covering config-driven dispatch and built-in UDAF / UDWF
lookup so the documented patterns are exercised end-to-end.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
timsaucer marked this pull request as ready for review May 22, 2026 10:38
timsaucer requested a review from Copilot May 22, 2026 14:28
Copilot AI (Contributor) left a comment


Pull request overview

This PR updates the Python bindings and examples to expose additional DataFusion 54-era functionality (notably UDF/UDAF/UDWF discovery + lookup helpers and Arrow RecordBatch ingestion conveniences), and adjusts tests/tooling accordingly.

Changes:

  • Add SessionContext.read_batch / read_batches plus UDF/UDAF/UDWF lookup & listing helpers (udf/udaf/udwf, udfs/udafs/udwfs).
  • Extend functions.get_field to support multi-segment nested field paths (and update the Rust binding accordingly).
  • Update tests to cover the new API surface and adjust timestamp/parquet and deprecation-warning expectations; bump pre-commit hook version.

Reviewed changes

Copilot reviewed 13 out of 13 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| `python/datafusion/context.py` | Adds batch-reading helpers and UDF/UDAF/UDWF discovery + lookup methods; improves codec type hints. |
| `python/datafusion/functions.py` | Updates get_field to accept nested paths. |
| `python/datafusion/user_defined.py` | Introduces Protocol type hints for logical/physical extension codec exportables. |
| `crates/core/src/context.rs` | Exposes read_batches and function-registry lookup/listing to Python via PyO3. |
| `crates/core/src/functions.rs` | Updates internal get_field binding to accept a vector of path segments. |
| `examples/datafusion-ffi-example/src/table_function.rs` | Updates example for upstream TableFunctionImpl API changes (call_with_args). |
| `python/tests/test_context.py` | Adds coverage for read_batch/read_batches. |
| `python/tests/test_dataframe.py` | Adjusts test to assert both DeprecationWarning and ValueError. |
| `python/tests/test_functions.py` | Adds coverage for nested-path get_field and empty-arg error behavior. |
| `python/tests/test_sql.py` | Removes timestamp[s] xfail and compensates for parquet timestamp unit promotion. |
| `python/tests/test_udf.py` | Adds coverage for UDF/UDAF/UDWF lookup + late-binding dispatch. |
| `.pre-commit-config.yaml` | Bumps actionlint hook version to fix CI failures. |
| `.ai/skills/check-upstream/SKILL.md` | Documents that get_field_path is covered by variadic get_field. |


Comment thread python/datafusion/functions.py
