1818 Distributing expressions across processes
1919 =========================================
2020
21- DataFusion expressions (:py:class:`~datafusion.Expr`) can be serialized and
22- shipped across process boundaries — useful for distributing work over a
23- ``multiprocessing.Pool``, a Ray actor pool, or any framework that supports a
24- per-worker initialization hook.
21+ A common pattern is to build a DataFusion expression
22+ (:py:class:`~datafusion.Expr`) in a driver process, hand it to a pool of
23+ worker processes (``multiprocessing.Pool``, a Ray actor pool, or any other
24+ framework with a per-worker initialization hook), and have each worker
25+ evaluate the expression against its own slice of data.
2526
26- Pickle support
27- --------------
27+ DataFusion expressions support this directly: they can be sent through
28+ :py:mod:`pickle` like any other Python object. Python scalar UDFs ride along
29+ inside the pickled bytes — the receiver does not need to pre-register them.
2830
29- :py:class:`~datafusion.Expr` implements the pickle protocol directly. Call
30- :py:func:`pickle.dumps` on an expression and ship the bytes; the receiver
31- calls :py:func:`pickle.loads`. Python *scalar UDFs* are cloudpickled into the
32- proto wire format by a Rust-side codec (``PythonUDFCodec``), so the blob is
33- self-contained — the receiver does not need to pre-register the UDF.
31+ Basic worker-pool example
32+ -------------------------
3433
3534 .. code-block:: python
3635
3736     import multiprocessing as mp
3837     import pickle
3938
4039     import pyarrow as pa
41-     from datafusion import SessionContext, col, lit, udf
40+     from datafusion import SessionContext, col, udf
4241
43-     def init_worker():
44-         # Optional: install a worker context for aggregate / window UDFs,
45-         # table providers, or Rust-side function registrations. Not needed
46-         # for built-ins or Python scalar UDFs.
47-         pass
4842
4943     def evaluate(blob_and_batch):
5044         blob, batch = blob_and_batch
51-         expr = pickle.loads(blob)
45+         expr = pickle.loads(blob)  # Python scalar UDFs ride along inline.
5246         ctx = SessionContext()
5347         df = ctx.from_pydict({"a": batch})
5448         return df.with_column("result", expr).select("result").to_pydict()["result"]
5549
50+
5651     if __name__ == "__main__":
5752         double = udf(
5853             lambda arr: pa.array([(v.as_py() or 0) * 2 for v in arr]),
@@ -68,74 +63,92 @@ self-contained — the receiver does not need to pre-register the UDF.
6863 )
6964 print(results)  # [[2, 4, 6], [20, 40, 60]]
7065
71- Worker-scoped context
72- ---------------------
7366
74- For references the codec cannot inline — aggregate UDFs, window UDFs, FFI
75- capsule UDFs, or anything resolved through the
76- :class:`SessionContext`'s function registry — set a worker-scoped context
77- once per process using :py:func:`datafusion.ipc.set_worker_ctx`:
67+ What travels with the expression
68+ --------------------------------
69+
70+ * **Built-in functions** (``abs``, ``length``, arithmetic, comparisons, etc.)
71+   — fully portable. Worker needs nothing pre-registered.
72+ * **Python scalar UDFs** (defined with :py:func:`datafusion.udf`) — fully
73+   portable. The callable and its signature travel inside the pickled bytes
74+   and are reconstructed on the worker automatically.
75+ * **Aggregate UDFs**, **window UDFs**, **UDFs imported via the FFI capsule
76+   protocol** — travel **by name only**. The worker must already have a
77+   matching registration on its :py:class:`SessionContext`. Without that
78+   registration, evaluation raises an error.
79+
80+ Registering shared UDFs on workers
81+ ----------------------------------
82+
83+ When an expression references something that travels by name only (aggregate
84+ UDF, window UDF, FFI UDF), set up the worker's :py:class:`SessionContext`
85+ once per process and install it as the *worker context*:
7886
7987 .. code-block:: python
8088
8189     from datafusion import SessionContext
8290     from datafusion.ipc import set_worker_ctx
8391
92+
8493     def init_worker():
8594         ctx = SessionContext()
86-         ctx.register_udaf(my_aggregate)  # if needed
95+         ctx.register_udaf(my_aggregate)
8796         set_worker_ctx(ctx)
8897
98+
8999     with mp.get_context("forkserver").Pool(
90100         processes=4, initializer=init_worker
91101     ) as pool:
92102         ...
93103
94- Without a worker context, unpickling falls back to a fresh
95- :py:class:`SessionContext`. Built-in functions resolve; Python scalar UDFs
96- ride along inside the blob via the codec. References to aggregate / window
97- UDFs or other registry-only entries raise an informative error if not
98- registered on the worker.
104+ Inside a worker, expressions reconstructed by :py:func:`pickle.loads` resolve
105+ their by-name references against the installed worker context. If no worker
106+ context is installed, a fresh empty :py:class:`SessionContext` is used —
107+ fine for expressions that only reference built-ins and Python scalar UDFs,
108+ but anything by-name-only will fail to resolve.
99109
100110 Python 3.14 default change
101111 --------------------------
102112
103113 Python 3.14 changed the POSIX default start method for
104- :py:mod:`multiprocessing` from ``fork`` to ``forkserver``. With ``fork``, a
105- context set in the parent was visible in workers via copy-on-write; with
106- ``forkserver`` and ``spawn`` it is not. The codec + worker-init pattern works
107- on every start method — prefer it over relying on inherited state.
108-
109- Trade-offs of inline UDFs
110- -------------------------
111-
112- * **Blob size** — cloudpickled callables add bytes per blob. A trivial
113-   built-in expression is ~20 bytes; an expression referencing a Python scalar
114-   UDF is hundreds of bytes (the cloudpickled callable + signature). Pre-register
115-   shared UDFs on workers via :py:func:`~datafusion.ipc.set_worker_ctx` when
116-   the same UDF is shipped many times and you want to avoid the overhead.
117- * **Closure capture** — cloudpickle captures closure state. Surprises are
118-   possible if the UDF closes over large objects, module-level mutable state,
119-   or non-portable file paths.
120- * **FFI scalar UDFs cannot be inlined** — PyCapsule-backed UDFs have no
121-   Python callable to cloudpickle. The codec leaves their ``fun_definition``
122-   empty; the receiver must have a matching registration.
123- * **Aggregate and window UDFs cannot be inlined yet** — their Python state
124-   is held inside opaque factory closures on the Rust side. Pre-register on
125-   the worker.
114+ :py:mod:`multiprocessing` from ``fork`` to ``forkserver``. With ``fork``, any
115+ state set in the parent was visible in workers via copy-on-write; with
116+ ``forkserver`` and ``spawn`` it is not. The
117+ :py:func:`~datafusion.ipc.set_worker_ctx` pattern works on every start
118+ method — prefer it over relying on inherited state.
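
To make the start-method point concrete, the sketch below uses only the standard library to choose an explicit start method rather than depending on the platform default. The helper names ``resolve_start_method`` and ``make_pool_context`` are hypothetical and not part of DataFusion, and the preference order (``forkserver`` first, then ``spawn``) is an assumption.

```python
import multiprocessing as mp


def resolve_start_method(preferred=("forkserver", "spawn")):
    """Return the first preferred start method this platform supports.

    Hypothetical helper: the preference order is an assumption, chosen so
    that workers never rely on state inherited through ``fork``.
    """
    available = mp.get_all_start_methods()
    for method in preferred:
        if method in available:
            return method
    raise RuntimeError(f"none of {preferred} available; found {available}")


def make_pool_context():
    """Build a multiprocessing context pinned to an explicit start method."""
    return mp.get_context(resolve_start_method())
```

A pool created as ``make_pool_context().Pool(processes=4, initializer=init_worker)`` then starts its workers the same way regardless of the interpreter's default, so any per-worker setup has to happen in the initializer.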
119+
120+ Practical considerations
121+ ------------------------
122+
123+ * **Pickled size scales with what travels inline.** A pickled expression of
124+   just built-ins is small (tens of bytes). An expression carrying a Python
125+   scalar UDF is hundreds of bytes (the callable and its signature). When the
126+   same UDF is shipped many times, pre-registering it on each worker via
127+   :py:func:`~datafusion.ipc.set_worker_ctx` and referring to it by name
128+   cuts the per-blob overhead.
129+ * **Closure capture.** When a Python scalar UDF closes over surrounding
130+   state — local variables, module-level objects, file paths — that state
131+   is captured at pickling time. Surprises are possible if the captured
132+   state is large, mutable, or not portable to the worker's environment.
133+ * **Aggregate and window UDFs always travel by name.** Their Python state
134+   is held inside opaque factory closures that cannot be reconstructed from
135+   bytes alone. Use :py:func:`~datafusion.ipc.set_worker_ctx` to register
136+   them on each worker.
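
One way to check the closure-capture point before shipping a UDF is to look at what the callable closes over. The sketch below is a standard-library illustration only: ``captured_sizes`` and ``make_scaler`` are hypothetical names, it measures sizes with plain :py:mod:`pickle`, and the bytes DataFusion actually embeds may differ.

```python
import inspect
import pickle


def captured_sizes(func):
    """Map each variable captured by ``func``'s closure to its pickled size in bytes.

    Hypothetical helper; raises if a captured value cannot be pickled.
    """
    closure = inspect.getclosurevars(func)
    return {name: len(pickle.dumps(value)) for name, value in closure.nonlocals.items()}


def make_scaler(factor):
    lookup = list(range(1000))  # large object the inner function closes over

    def scale(x):
        return x * factor + len(lookup)

    return scale


sizes = captured_sizes(make_scaler(3))
for name, size in sorted(sizes.items()):
    print(f"{name}: {size} bytes")
```

Here ``lookup`` dominates the captured state, which is the kind of surprise worth catching before the same callable is pickled into many blobs.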
126137
127138 Security
128139 --------
129140
130141 .. warning::
131142
132-     Pickle blobs containing inlined UDFs deserialize via :py:mod:`cloudpickle`,
133-     which executes arbitrary code on the receiver. Only :py:func:`pickle.loads`
134-     blobs from trusted sources. For untrusted-source workflows, restrict the
135-     sender to built-in functions and pre-registered Rust-side UDFs.
143+     Reconstructing an expression containing a Python scalar UDF executes
144+     arbitrary Python code on the receiver. Only :py:func:`pickle.loads`
145+     expressions from trusted sources. For untrusted-source workflows,
146+     restrict senders to built-in functions and pre-registered Rust-side
147+     UDFs, and never feed externally supplied bytes through
148+     :py:func:`pickle.loads`.
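
The warning applies to pickle in general, not only to DataFusion. The following standard-library sketch shows the mechanism with a harmless stand-in: the ``Payload`` class is hypothetical and unrelated to any DataFusion type, and the string it evaluates is deliberately benign.

```python
import pickle


class Payload:
    """Stand-in for a hostile object; not related to any DataFusion type."""

    def __reduce__(self):
        # pickle records "call eval with this string"; the receiver runs it.
        return (eval, ("6 * 7",))


blob = pickle.dumps(Payload())
result = pickle.loads(blob)  # the string is evaluated on the receiving side
print(result)  # 42
```

The receiver gets back the result of the evaluated string rather than a ``Payload`` instance, so whoever produced the bytes chose what code ran during :py:func:`pickle.loads`.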
136149
137150 See also
138151 --------
139152
140- * :py:mod:`datafusion.ipc` — module-level API reference.
153+ * :py:mod:`datafusion.ipc` — worker context API.
141154 * ``examples/ray_pickle_expr.py`` — runnable Ray actor example.