feat: expose variety of features from DF54 update #1554
Open
timsaucer wants to merge 9 commits into
DataFusion 53 deprecated `TableFunctionImpl::call(args: &[Expr])` in favor of `call_with_args(args: TableFunctionArgs)`. `PyTableFunction` was migrated in 5a64b0d; this brings the FFI example along so it no longer relies on the deprecated entry point. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PR apache#1541 introduced `with_logical_extension_codec` / `with_physical_extension_codec` setters typed as `codec: Any`. The Rust extractors accept either a raw `PyCapsule` or any object exposing `__datafusion_logical_extension_codec__` / `__datafusion_physical_extension_codec__`. Add `LogicalExtensionCodecExportable` / `PhysicalExtensionCodecExportable` Protocols in `python/datafusion/user_defined.py` (matching the existing `ScalarUDFExportable` pattern) and tighten both setter signatures to `Protocol | _PyCapsule`. Pure typing change; no runtime behavior diff. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
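For reviewers less familiar with the `*Exportable` pattern, here is a minimal sketch of what such a Protocol could look like. The method name comes from the commit message above; the zero-argument signature, the `Any` return type, and the `MyCodecProvider` / `NotACodec` / `accepts_codec` names are assumptions made for illustration and may not match what `python/datafusion/user_defined.py` actually declares.

```python
from typing import Any, Protocol, runtime_checkable


@runtime_checkable
class LogicalExtensionCodecExportable(Protocol):
    """Structural type for objects that can export a logical extension codec."""

    # Assumed signature: no arguments, returns a PyCapsule (typed loosely here).
    def __datafusion_logical_extension_codec__(self) -> Any: ...


class MyCodecProvider:
    """Hypothetical provider; a real one would return an actual PyCapsule."""

    def __datafusion_logical_extension_codec__(self) -> Any:
        return object()  # placeholder standing in for a PyCapsule


class NotACodec:
    """Hypothetical class that lacks the export method."""


def accepts_codec(codec: object) -> bool:
    # runtime_checkable only verifies that the method exists, not its signature.
    return isinstance(codec, LogicalExtensionCodecExportable)
```

Because the Protocol is structural, `MyCodecProvider` satisfies it without inheriting from anything, which is why the change can stay typing-only.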
Upstream exposes both `get_field(expr, name)` and `get_field_path(expr, [names...])`, but both ultimately call the same scalar UDF with a base expression plus one or more name args. Collapse the Python surface into a single variadic `get_field(expr, *names)` that accepts either a one-step lookup or a path of names, dispatching through a single Rust binding. Note in `.ai/skills/check-upstream/SKILL.md` that `get_field_path` is covered by the variadic form so future audits do not flag it as a gap. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
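To illustrate why one variadic signature can cover both upstream functions, here is a small stand-in that walks nested dicts instead of building DataFusion expressions. It is not the binding itself: the dict-based "struct" and the `ValueError` on an empty name list are assumptions for the sketch.

```python
from typing import Any, Mapping


def get_field(value: Mapping[str, Any], *names: str) -> Any:
    """Walk a nested mapping one name at a time.

    Stand-in for the expression-level `get_field(expr, *names)`; the
    "struct" here is a plain dict rather than a DataFusion expression.
    """
    if not names:
        # Assumed behavior for the sketch; the real binding may differ.
        raise ValueError("get_field requires at least one field name")
    current: Any = value
    for name in names:
        current = current[name]
    return current


row = {"address": {"city": "Oslo", "geo": {"lat": 59.9}}}

# One name behaves like the upstream one-step get_field.
single_step = get_field(row, "address")

# Several names behave like the upstream get_field_path.
path_lookup = get_field(row, "address", "geo", "lat")
```

The one-step call is just the path call with a single element, so a separate `get_field_path` entry point adds nothing on the Python side.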
Wrap upstream `SessionContext::read_batches`, which materializes a DataFrame directly from a sequence of `RecordBatch`es without registering a named table. The single-batch convenience `SessionContext.read_batch` is implemented in pure Python by calling `read_batches([batch])`, so the Rust side only needs the one binding. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Expose `udf(name)` / `udaf(name)` / `udwf(name)` lookups symmetric with the existing `register_udf` / `register_udaf` / `register_udwf` setters, plus `udfs()` / `udafs()` / `udwfs()` for enumerating registered function names. Looked-up functions come back as the same `ScalarUDF` / `AggregateUDF` / `WindowUDF` wrappers users already get from registration, so they can be called as expressions or re-registered into a different session. Returns Vec<String> from the list helpers (sorted) rather than the raw HashSet upstream returns, so calling code gets a stable ordering. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
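The lookup and listing behavior described above can be modeled with a tiny in-memory registry. This is a hypothetical sketch, not the `SessionContext` implementation: `FunctionRegistry`, the `KeyError` on a missing name, and the plain callables standing in for `ScalarUDF` wrappers are all assumptions.

```python
from typing import Callable


class FunctionRegistry:
    """Hypothetical stand-in for the UDF part of a session context."""

    def __init__(self) -> None:
        self._udfs: dict[str, Callable[[str], str]] = {}

    def register_udf(self, name: str, func: Callable[[str], str]) -> None:
        self._udfs[name] = func

    def udf(self, name: str) -> Callable[[str], str]:
        # Assumed error type for the sketch; the real binding may raise differently.
        if name not in self._udfs:
            raise KeyError(f"no scalar UDF registered as {name!r}")
        return self._udfs[name]

    def udfs(self) -> list[str]:
        # Sort the names so callers get a stable ordering.
        return sorted(self._udfs)


registry = FunctionRegistry()
registry.register_udf("upper", str.upper)
registry.register_udf("lower", str.lower)

# Late binding: the function name comes from configuration, not from code.
config = {"normalize_with": "lower"}
normalize = registry.udf(config["normalize_with"])
result = normalize("MiXed")
```

Returning a sorted list rather than a set keeps doctest output and snapshot-style assertions deterministic.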
pyarrow.parquet promotes timestamp[s] to timestamp[ms] on write (apache/arrow#41382), so the read array never matched the input. Cast the expected array to timestamp[ms] in test_simple_select to assert DataFusion reads what Arrow actually stored. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
DataFrameHtmlFormatter(repr_rows=..., max_rows=...) fires the deprecation warning before raising ValueError, but pytest.raises does not catch warnings. The escaping warning surfaced in every pytest run. Wrap the call in both pytest.raises and pytest.warns so the warning is asserted, not leaked. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
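A minimal sketch of the combined assertion, using a hypothetical `make_formatter` function in place of `DataFrameHtmlFormatter`. The `DeprecationWarning` category and the error messages are assumptions; only the warn-then-raise ordering is taken from the commit message.

```python
import warnings

import pytest


def make_formatter(repr_rows=None, max_rows=None):
    """Hypothetical stand-in that warns about a deprecated argument, then raises."""
    if repr_rows is not None:
        warnings.warn(
            "repr_rows is deprecated; use max_rows",
            DeprecationWarning,
            stacklevel=2,
        )
    if repr_rows is not None and max_rows is not None:
        raise ValueError("cannot specify both repr_rows and max_rows")
    return {"max_rows": max_rows if max_rows is not None else repr_rows}


def test_conflicting_args_warn_and_raise():
    # pytest.warns is the outer context: the inner pytest.raises absorbs the
    # ValueError, then pytest.warns checks the recorded warning on exit.
    with pytest.warns(DeprecationWarning), pytest.raises(ValueError):
        make_formatter(repr_rows=5, max_rows=10)
```

Keeping `pytest.warns` as the outer context means its check runs after the exception has already been handled, so the warning is asserted rather than leaked into the test summary.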
Add Examples docstrings (doctest) for `udf` / `udaf` / `udwf` / `udfs` / `udafs` / `udwfs` that demonstrate the lookup pattern, including a late-binding example where the function name comes from configuration. Add tests covering config-driven dispatch and built-in UDAF / UDWF lookup so the documented patterns are exercised end-to-end. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Which issue does this PR close?
No single issue — this is wave 1 of follow-up work after the DataFusion 54 upgrade (#1532). Each commit is self-contained and can be reviewed independently.
Rationale for this change
DataFusion 54 introduced or deprecated several pieces of upstream API surface that the Python bindings had not yet caught up with. This PR closes the highest-value gaps.
What changes are included in this PR?
- Add `LogicalExtensionCodecExportable` / `PhysicalExtensionCodecExportable` to make hinting signatures more understandable
- Do not expose `get_field_path`, but instead fold it into `get_field` to be more pythonic
- Add `SessionContext.read_batches` / `read_batch`
Are there any user-facing changes?
Yes, but they are all additions. No breaking changes to existing public APIs.