Pass Field information back and forth when using scalar UDFs#1299
timsaucer merged 9 commits into apache:main
Conversation
@kosiew Here is an alternate approach. Instead of relying on extension type features, it is going to pass the Field information when creating the FFI array. This will capture pyarrow extensions as well as any other metadata that any user assigns on the input. I'm going to leave it in draft until I can finish up those additional items on my check list. What do you think? cc @paleolimbot
Definitely! Passing the argument fields/return fields should do it. We have a slightly different signature model in SedonaDB ("type matchers") because the existing signature matching doesn't consider metadata, but at the Arrow/FFI level we're doing approximately the same thing: apache/sedona-db#228. We do use the concept of …
src/udf.rs (outdated)

```rust
    "_import_from_c",
    (
        addr_of!(array) as Py_uintptr_t,
        addr_of!(schema) as Py_uintptr_t,
    ),
```
Is the use of PyArrow's private `_import_from_c` advisable?
This code is a near duplicate of how we already convert `ArrayData` into a pyarrow object. You can see the original here. The difference in this function is that we know the Field instead of only the data type.
A more modern way is to use `__arrow_c_schema__` (although I think `_import_from_c` will be around for a while). It's only a few lines:
https://github.com/apache/sedona-db/blob/main/python/sedonadb/src/import_from.rs#L151-L157
Opened a new issue so there isn't too much scope creep.
Pull request overview
This PR enhances Python UDFs to support PyArrow Field information (including metadata and nullability) instead of only DataType information, enabling more sophisticated data type handling in Python-written scalar UDFs.
Changes:
- Implements a custom `ScalarUDFImpl` for Python UDFs instead of using the generic `create_udf` function
- Adds a `PyArrowArrayExportable` struct to support FFI conversion with Field schema information
- Updates the Python API to accept `pa.Field` objects while maintaining backwards compatibility with `pa.DataType`
Reviewed changes
Copilot reviewed 7 out of 8 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
| src/udf.rs | Replaces create_udf with custom PythonFunctionScalarUDF implementation supporting Field-based metadata |
| src/lib.rs | Adds new array module to the crate |
| src/array.rs | Implements PyArrowArrayExportable for FFI conversion with Field information |
| python/tests/test_udf.py | Adds tests for UUID metadata handling and nullability preservation |
| python/datafusion/user_defined.py | Updates API to accept Field/DataType with helper conversion functions |
| pyproject.toml | Adds minimum PyArrow version constraint |
| docs/source/user-guide/common-operations/udf-and-udfa.rst | Documents Field vs DataType usage and references Rust UDF blog post |
Which issue does this PR close?
Closes #1172
Rationale for this change
Since we now have the ability to pass Field information instead of just DataType with scalar UDFs, this feature adds similar support for UDFs written in Python. Without this feature you must write your UDFs in Rust and expose them to Python. This enhancement greatly expands the use cases where PyArrow data can be leveraged.
What changes are included in this PR?
- Updates around the `create_udf` function

Are there any user-facing changes?
This expands on the current API and is backwards compatible.