Pass Field information back and forth when using scalar UDFs#1299
timsaucer merged 9 commits into apache:main
Conversation
@kosiew Here is an alternate approach. Instead of relying on extension type features, it is going to pass the Field information when creating the FFI array. This will capture pyarrow extensions as well as any other metadata that any user assigns on the input. I'm going to leave it in draft until I can finish up those additional items on my check list. What do you think? cc @paleolimbot
Definitely! Passing the argument fields/return fields should do it. We have a slightly different signature model in SedonaDB ("type matchers") because the existing signature matching doesn't consider metadata, but at the Arrow/FFI level we're doing approximately the same thing: apache/sedona-db#228. We do use the concept of …
src/udf.rs (outdated)

```rust
    "_import_from_c",
    (
        addr_of!(array) as Py_uintptr_t,
        addr_of!(schema) as Py_uintptr_t,
    ),
```
Is the use of PyArrow's private `_import_from_c` advisable?
This code is a near duplicate of how we already convert `ArrayData` into a pyarrow object. You can see the original here. The difference in this function is that we know the Field instead of only the data type.
A more modern way is to use `__arrow_c_schema__` (although I think `_import_from_c` will be around for a while). It's only a few lines:
https://github.com/apache/sedona-db/blob/main/python/sedonadb/src/import_from.rs#L151-L157
Opened a new issue so there isn't too much scope creep.
Pull request overview
This PR enhances Python UDFs to support PyArrow Field information (including metadata and nullability) instead of only DataType information, enabling more sophisticated data type handling in Python-written scalar UDFs.
Changes:
- Implements a custom `ScalarUDFImpl` for Python UDFs instead of using the generic `create_udf` function
- Adds a `PyArrowArrayExportable` struct to support FFI conversion with Field schema information
- Updates the Python API to accept `pa.Field` objects while maintaining backwards compatibility with `pa.DataType`
Reviewed changes
Copilot reviewed 7 out of 8 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
| src/udf.rs | Replaces create_udf with custom PythonFunctionScalarUDF implementation supporting Field-based metadata |
| src/lib.rs | Adds new array module to the crate |
| src/array.rs | Implements PyArrowArrayExportable for FFI conversion with Field information |
| python/tests/test_udf.py | Adds tests for UUID metadata handling and nullability preservation |
| python/datafusion/user_defined.py | Updates API to accept Field/DataType with helper conversion functions |
| pyproject.toml | Adds minimum PyArrow version constraint |
| docs/source/user-guide/common-operations/udf-and-udfa.rst | Documents Field vs DataType usage and references Rust UDF blog post |
Which issue does this PR close?
Closes #1172
Rationale for this change
Since we now have the ability to pass Field information instead of just DataType with scalar UDFs, this feature adds similar support for UDFs written in Python. Without this feature you must write your UDFs in Rust and expose them to Python. This enhancement greatly expands the use cases where PyArrow data can be leveraged.
What changes are included in this PR?
- Updates around the `create_udf` function

Are there any user-facing changes?
This expands on the current API and is backwards compatible.