@@ -69,20 +69,25 @@ What travels with the expression
6969
7070* **Built-in functions ** (``abs ``, ``length ``, arithmetic, comparisons, etc.)
7171 — fully portable. Worker needs nothing pre-registered.
72- * **Python scalar UDFs ** (defined with :py:func: `datafusion.udf `) — fully
73- portable. The callable and its signature travel inside the pickled bytes
74- and are reconstructed on the worker automatically.
75- * **Aggregate UDFs **, **window UDFs **, **UDFs imported via the FFI capsule
76- protocol ** — travel **by name only **. The worker must already have a
77- matching registration on its :py:class: `SessionContext `. Without that
78- registration, evaluation raises an error.
72+ * **Python UDFs ** — fully portable. The callable, its signature, and any
73+ state captured in closures travel inside the pickled bytes and are
74+ reconstructed on the worker automatically. Applies equally to:
75+
76+ * **scalar UDFs ** (:py:func: `datafusion.udf `)
77+ * **aggregate UDFs ** (:py:func: `datafusion.udaf `)
78+ * **window UDFs ** (:py:func: `datafusion.udwf `)
79+ * **UDFs imported via the FFI capsule protocol ** — travel **by name only **.
80+ The worker must already have a matching registration on its
81+ :py:class: `SessionContext `. Without that registration, evaluation raises
82+ an error.
7983
8084Registering shared UDFs on workers
8185----------------------------------
8286
83- When an expression references something that travels by name only (aggregate
84- UDF, window UDF, FFI UDF), set up the worker's :py:class: `SessionContext `
85- once per process and install it as the *worker context *:
87+ When an expression references an FFI capsule UDF (or any UDF the worker
88+ must resolve from its registered functions), set up the worker's
89+ :py:class: `SessionContext ` once per process and install it as the
90+ *worker context *:
8691
8792.. code-block :: python
8893
@@ -92,7 +97,7 @@ once per process and install it as the *worker context*:
9297
9398 def init_worker ():
9499 ctx = SessionContext()
95- ctx.register_udaf(my_aggregate )
100+ ctx.register_udaf(my_ffi_aggregate )
96101 set_worker_ctx(ctx)
97102
98103
@@ -104,8 +109,8 @@ once per process and install it as the *worker context*:
104109 Inside a worker, expressions reconstructed by :py:func: `pickle.loads ` resolve
105110their by-name references against the installed worker context. If no worker
106111context is installed, a fresh empty :py:class: `SessionContext ` is used —
107- fine for expressions that only reference built-ins and Python scalar UDFs,
108- but anything by-name-only will fail to resolve.
112+ fine for expressions that only reference built-ins and Python UDFs, but
113+ FFI-capsule-backed registrations will fail to resolve.
109114
110115Python 3.14 default change
111116--------------------------
@@ -122,30 +127,25 @@ Practical considerations
122127
123128* **Pickled size scales with what travels inline. ** A pickled expression of
124129 just built-ins is small (tens of bytes). An expression carrying a Python
125- scalar UDF is hundreds of bytes (the callable and its signature). When the
126- same UDF is shipped many times, pre-registering it on each worker via
127- :py:func: `~datafusion.ipc.set_worker_ctx ` and referring to it by name
128- cuts the per-blob overhead.
129- * **Closure capture. ** When a Python scalar UDF closes over surrounding
130- state — local variables, module-level objects, file paths — that state
131- is captured at pickling time. Surprises are possible if the captured
132- state is large, mutable, or not portable to the worker's environment.
133- * **Aggregate and window UDFs always travel by name. ** Their Python state
134- is held inside opaque factory closures that cannot be reconstructed from
135- bytes alone. Use :py:func: `~datafusion.ipc.set_worker_ctx ` to register
136- them on each worker.
130+ UDF is hundreds of bytes (the callable and its signature). When the same
131+ UDF is shipped many times, registering an equivalent FFI-capsule UDF on
132+ each worker via :py:func: `~datafusion.ipc.set_worker_ctx ` and referring
133+ to it by name cuts the per-blob overhead.
134+ * **Closure capture. ** When a Python UDF closes over surrounding state —
135+ local variables, module-level objects, file paths — that state is
136+ captured at pickling time. Surprises are possible if the captured state
137+ is large, mutable, or not portable to the worker's environment.
137138
138139Security
139140--------
140141
141142.. warning ::
142143
143- Reconstructing an expression containing a Python scalar UDF executes
144- arbitrary Python code on the receiver. Only :py:func: `pickle.loads `
145- expressions from trusted sources. For untrusted-source workflows,
146- restrict senders to built-in functions and pre-registered Rust-side
147- UDFs, and never feed externally supplied bytes through
148- :py:func: `pickle.loads `.
144+ Reconstructing an expression containing a Python UDF executes arbitrary
145+ Python code on the receiver. Only :py:func: `pickle.loads ` expressions
146+ from trusted sources. For untrusted-source workflows, restrict senders
147+ to built-in functions and pre-registered Rust-side UDFs, and never feed
148+ externally supplied bytes through :py:func: `pickle.loads `.
149149
150150See also
151151--------
0 commit comments