feat: expose variety of features from DF54 update #1554

Open
timsaucer wants to merge 9 commits into apache:main from timsaucer:feat/df54-followups-wave1
timsaucer (Member) commented May 21, 2026

Which issue does this PR close?

No single issue — this is wave 1 of follow-up work after the DataFusion 54 upgrade (#1532). Each commit is self-contained and can be reviewed independently.

Rationale for this change

DataFusion 54 introduced or deprecated several pieces of upstream API surface that the Python bindings had not yet caught up with. This PR closes the highest-value gaps.

What changes are included in this PR?

  • Add LogicalExtensionCodecExportable / PhysicalExtensionCodecExportable Protocols so the codec setter type hints are easier to understand
  • Cover upstream get_field_path by folding it into a variadic get_field, which is more Pythonic
  • Expose SessionContext.read_batches / read_batch
  • Expose UDF lookup helpers
  • Bump pre-commit so it stops failing CI checks
  • Adjust unit tests so the deprecation warning no longer shows and no xfail test remains

Are there any user-facing changes?

Yes, but they are all additions. No breaking changes to existing public APIs.

timsaucer and others added 8 commits May 21, 2026 14:55

DataFusion 53 deprecated `TableFunctionImpl::call(args: &[Expr])` in
favor of `call_with_args(args: TableFunctionArgs)`. `PyTableFunction`
was migrated in 5a64b0d; this brings the FFI example along so it no
longer relies on the deprecated entry point.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

PR apache#1541 introduced `with_logical_extension_codec` /
`with_physical_extension_codec` setters typed as `codec: Any`. The Rust
extractors accept either a raw `PyCapsule` or any object exposing
`__datafusion_logical_extension_codec__` /
`__datafusion_physical_extension_codec__`.

Add `LogicalExtensionCodecExportable` / `PhysicalExtensionCodecExportable`
Protocols in `python/datafusion/user_defined.py` (matching the existing
`ScalarUDFExportable` pattern) and tighten both setter signatures to
`Protocol | _PyCapsule`. Pure typing change; no runtime behavior diff.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
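
The structural typing described in the commit above can be sketched with the standard library alone. This is a minimal illustration, not the code in `python/datafusion/user_defined.py`: the Protocol name and the `__datafusion_logical_extension_codec__` hook come from the commit message, while `MyCodec`, the `Any` return type, the `runtime_checkable` decorator, and the body of the setter are assumptions made for the example. The real setters also accept a raw `PyCapsule`, which is omitted here.

```python
from typing import Any, Protocol, runtime_checkable


@runtime_checkable
class LogicalExtensionCodecExportable(Protocol):
    """Structural type: any object exposing the codec export hook."""

    def __datafusion_logical_extension_codec__(self) -> Any: ...


class MyCodec:
    """Hypothetical codec provider; it does not inherit from the Protocol."""

    def __datafusion_logical_extension_codec__(self) -> Any:
        # A real provider would return a PyCapsule; a placeholder stands in here.
        return object()


def with_logical_extension_codec(codec: LogicalExtensionCodecExportable) -> str:
    """Hypothetical setter that only reports which codec type it received."""
    if not isinstance(codec, LogicalExtensionCodecExportable):
        raise TypeError("codec must expose __datafusion_logical_extension_codec__")
    return type(codec).__name__


accepted = with_logical_extension_codec(MyCodec())
print(accepted)
```

Because the check is structural, `MyCodec` satisfies the annotation without subclassing anything, which is what lets a type checker describe the accepted objects more precisely than `Any`.
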

Upstream exposes both `get_field(expr, name)` and
`get_field_path(expr, [names...])`, but both ultimately call the same
scalar UDF with a base expression plus one or more name args. Collapse
the Python surface into a single variadic `get_field(expr, *names)`
that accepts either a one-step lookup or a path of names, dispatching
through a single Rust binding.

Note in `.ai/skills/check-upstream/SKILL.md` that `get_field_path` is
covered by the variadic form so future audits do not flag it as a gap.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
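
The shape of the variadic `get_field(expr, *names)` described above can be illustrated on plain nested dictionaries. This is only a sketch of the calling convention: the real binding operates on DataFusion expressions, and the rejection of an empty name list is an assumption made here, not documented behavior.

```python
from typing import Any, Mapping


def get_field(record: Mapping[str, Any], *names: str) -> Any:
    """Follow one or more field names into nested mappings."""
    if not names:
        # Assumed behavior for this sketch: at least one name is required.
        raise ValueError("get_field needs at least one field name")
    value: Any = record
    for name in names:
        value = value[name]
    return value


row = {"user": {"address": {"city": "Oslo"}}}

# One name behaves like the original single-step lookup.
single_step = get_field(row, "user")

# Several names behave like the path form, with no second function needed.
full_path = get_field(row, "user", "address", "city")
print(full_path)
```

A single entry point keeps the one-step call unchanged while letting a path be written as extra positional arguments.
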

Wrap upstream `SessionContext::read_batches`, which materializes a
DataFrame directly from a sequence of `RecordBatch`es without
registering a named table. The single-batch convenience
`SessionContext.read_batch` is implemented in pure Python by calling
`read_batches([batch])`, so the Rust side only needs the one binding.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
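
The delegation pattern described above, where the single-batch helper wraps its argument and calls the multi-batch method, looks roughly like this. `MiniContext` is a hypothetical stand-in that treats a batch as a plain list of rows; it does not use Arrow or DataFusion.

```python
from typing import Iterable, List


class MiniContext:
    """Hypothetical stand-in for a session context; batches are lists of rows."""

    def read_batches(self, batches: Iterable[list]) -> List[list]:
        # The only method that does real work: collect every batch in order.
        return [list(batch) for batch in batches]

    def read_batch(self, batch: list) -> List[list]:
        # Convenience wrapper: wrap the single batch and delegate.
        return self.read_batches([batch])


ctx = MiniContext()
one = ctx.read_batch([1, 2, 3])
many = ctx.read_batches([[1, 2, 3], [4]])
print(one, many)
```

Keeping the convenience method as a thin wrapper means only one code path has to cross the language boundary.
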

Expose `udf(name)` / `udaf(name)` / `udwf(name)` lookups symmetric with
the existing `register_udf` / `register_udaf` / `register_udwf` setters,
plus `udfs()` / `udafs()` / `udwfs()` for enumerating registered
function names. Looked-up functions come back as the same
`ScalarUDF` / `AggregateUDF` / `WindowUDF` wrappers users already get
from registration, so they can be called as expressions or re-registered
into a different session.

The list helpers return a sorted `Vec<String>` rather than the raw
`HashSet` that upstream returns, so calling code gets a stable ordering.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
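
The lookup and listing behavior described above can be sketched with a small registry. `FunctionRegistry` is a hypothetical stand-in for the session's function registry, and raising `KeyError` for an unknown name is an assumption of this sketch; only the method names `register_udf`, `udf`, and `udfs` come from the commit message.

```python
from typing import Callable, Dict, List


class FunctionRegistry:
    """Hypothetical stand-in for a session's scalar-function registry."""

    def __init__(self) -> None:
        # Keyed by name; callers should not depend on registration order.
        self._udfs: Dict[str, Callable[[int], int]] = {}

    def register_udf(self, name: str, func: Callable[[int], int]) -> None:
        self._udfs[name] = func

    def udf(self, name: str) -> Callable[[int], int]:
        # Assumed behavior: an unknown name raises KeyError.
        return self._udfs[name]

    def udfs(self) -> List[str]:
        # Sort so callers see the same order on every call.
        return sorted(self._udfs)


source = FunctionRegistry()
source.register_udf("double", lambda x: x * 2)
source.register_udf("add_one", lambda x: x + 1)

# Look a function up by name and re-register it in a second registry.
target = FunctionRegistry()
target.register_udf("double", source.udf("double"))
print(source.udfs(), target.udfs())
```

Sorting at the boundary trades a small cost per call for output that is safe to compare in tests and documentation.
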

pyarrow.parquet promotes timestamp[s] to timestamp[ms] on write (apache/arrow#41382),
so the read array never matched the input. Cast the expected array to timestamp[ms]
in test_simple_select to assert DataFusion reads what Arrow actually stored.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

DataFrameHtmlFormatter(repr_rows=..., max_rows=...) fires the deprecation
warning before raising ValueError, but pytest.raises does not catch warnings.
The escaping warning surfaced in every pytest run. Wrap the call in both
pytest.raises and pytest.warns so the warning is asserted, not leaked.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
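
The same "assert the warning and the exception together" idea can be shown with the standard library's `unittest` helpers, used here as a stand-in for nesting `pytest.raises` inside `pytest.warns`. `make_formatter` and its `old_limit` / `new_limit` arguments are hypothetical; they only mimic a call that warns first and then raises.

```python
import unittest
import warnings


def make_formatter(old_limit=None, new_limit=None):
    """Hypothetical constructor: warns about a deprecated argument, then rejects the pair."""
    if old_limit is not None:
        warnings.warn("old_limit is deprecated", DeprecationWarning, stacklevel=2)
    if old_limit is not None and new_limit is not None:
        raise ValueError("pass only one of old_limit and new_limit")
    return new_limit if new_limit is not None else old_limit


case = unittest.TestCase()

# Outer context asserts the warning; inner context asserts (and swallows) the error.
with case.assertWarns(DeprecationWarning) as warned:
    with case.assertRaises(ValueError) as raised:
        make_formatter(old_limit=5, new_limit=10)

print(type(warned.warning).__name__, type(raised.exception).__name__)
```

Placing the exception check innermost lets the warning check run after the error has already been handled, so neither one escapes into the test output.
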

timsaucer changed the title from "DF54 follow-ups wave 1: SessionContext APIs, codec typing, test fixes" to "feat: expose variety of features from DF54 update" on May 21, 2026

Add Examples docstrings (doctest) for `udf` / `udaf` / `udwf` / `udfs` /
`udafs` / `udwfs` that demonstrate the lookup pattern, including a
late-binding example where the function name comes from configuration.
Add tests covering config-driven dispatch and built-in UDAF / UDWF
lookup so the documented patterns are exercised end-to-end.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
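
A docstring example of the late-binding pattern mentioned above might look like the following sketch, where the function name is read from configuration and the docstring is executed with `doctest`. `REGISTRY`, `lookup`, and the `config` dictionary are hypothetical and are not the docstrings added in this PR.

```python
import doctest

# Hypothetical stand-in for the functions registered in a session.
REGISTRY = {"double": lambda x: x * 2, "negate": lambda x: -x}


def lookup(name):
    """Return the registered function called ``name``.

    Examples:
        >>> config = {"transform": "double"}  # name chosen at runtime
        >>> fn = lookup(config["transform"])
        >>> fn(21)
        42
    """
    return REGISTRY[name]


# Parse the docstring and run its examples, as a doctest run would.
test = doctest.DocTestParser().get_doctest(
    lookup.__doc__, {"lookup": lookup}, "lookup", None, 0
)
results = doctest.DocTestRunner(verbose=False).run(test)
print(results.failed, results.attempted)
```

Running the examples as tests keeps the documented pattern from drifting away from the behavior it describes.
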

timsaucer marked this pull request as ready for review May 22, 2026 10:38