## Summary
Propose adding a cloudpickle.patch_multiprocessing() helper that replaces multiprocessing.reduction.ForkingPickler with a cloudpickle-based pickler, enabling Pool.map(lambda x: x**2, range(10)) to work out of the box.
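For context, the failure mode this helper would remove can be reproduced with the stdlib alone (a minimal sketch; `square` is just an illustrative name):

```python
import pickle

square = lambda x: x**2

# multiprocessing's ForkingPickler subclasses pickle.Pickler, which
# serializes functions by reference (module + qualified name), so a
# lambda handed to a Pool worker fails to pickle:
try:
    pickle.dumps(square)
except pickle.PicklingError as exc:
    print(f"plain pickle fails: {exc}")
```

cloudpickle serializes such functions by value instead, which is what makes the `Pool.map(lambda ...)` example above possible once the patch is in place.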
## Motivation: ecosystem fragmentation
Every project that needs cloudpickle + multiprocessing.Pool independently reinvents this patching. At least 6 projects maintain their own version:
| Project | Approach |
|---|---|
| loky/joblib | Full custom _LokyPickler subsystem in loky/backend/reduction.py |
| PySpark | Own CloudPickleSerializer wrapping cloudpickle.dumps/loads |
| Ray | Bundled fork as ray.cloudpickle with custom object store |
| Dask | Custom serialization protocol in distributed scheduler |
| multiprocess | Complete fork of CPython's multiprocessing with dill substituted |
| trading-strategy/exec-sandbox/pypeln/pyrocko | Ad-hoc monkey patches of varying correctness |
Most ad-hoc implementations are incomplete because of a non-obvious CPython pitfall (see below).
## The `_ForkingPickler` double-binding pitfall
CPython has two separate name bindings for `ForkingPickler`:

```python
# multiprocessing/reduction.py
class ForkingPickler(pickle.Pickler):
    ...
```

```python
# multiprocessing/connection.py
from .context import reduction
_ForkingPickler = reduction.ForkingPickler  # captured at import time

class Connection:
    def send(self, obj):
        self._send_bytes(_ForkingPickler.dumps(obj))  # uses the captured reference
```

Patching `reduction.ForkingPickler` alone is insufficient — `Connection.send()` still uses the stale `_ForkingPickler` reference captured at import time. You must also patch `multiprocessing.connection._ForkingPickler`. Most ad-hoc implementations miss this.

Additionally, `reduction.dump()` is a module-level function that also needs replacing for completeness.
## Proposed API

```python
import cloudpickle

cloudpickle.patch_multiprocessing()
```

One call, idempotent, patches all three binding sites:

- `multiprocessing.reduction.ForkingPickler` — the class
- `multiprocessing.reduction.dump` — the module-level helper
- `multiprocessing.connection._ForkingPickler` — the import-time captured reference
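One way the idempotency could look, sketched here with the pickler class as a parameter purely to keep the example self-contained (the real helper would hard-wire cloudpickle's own `Pickler`):

```python
import multiprocessing.connection
import multiprocessing.reduction


def patch_multiprocessing(pickler_cls):
    """Patch all three binding sites; calling it again is a no-op."""
    if multiprocessing.reduction.ForkingPickler is pickler_cls:
        return  # already patched: idempotent
    multiprocessing.reduction.ForkingPickler = pickler_cls
    multiprocessing.reduction.dump = (
        lambda obj, file, protocol=None: pickler_cls(file, protocol).dump(obj)
    )
    multiprocessing.connection._ForkingPickler = pickler_cls
```

After the first call, subsequent calls return immediately, so library code can call the helper defensively without worrying about double-patching.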
## Reference implementation
Here's a minimal working implementation (tested on Python 3.14):

```python
import copyreg
import io
import multiprocessing.connection
import multiprocessing.reduction

import cloudpickle


class CloudForkingPickler(cloudpickle.Pickler):
    """ForkingPickler replacement backed by cloudpickle."""

    _extra_reducers = {}
    _copyreg_dispatch_table = copyreg.dispatch_table

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.dispatch_table = self._copyreg_dispatch_table.copy()
        self.dispatch_table.update(self._extra_reducers)

    @classmethod
    def register(cls, type, reduce):
        cls._extra_reducers[type] = reduce

    @classmethod
    def dumps(cls, obj, protocol=None):
        buf = io.BytesIO()
        cls(buf, protocol).dump(obj)
        return buf.getbuffer()

    loads = staticmethod(cloudpickle.loads)


def patch_multiprocessing():
    """Replace multiprocessing's ForkingPickler with a cloudpickle-based version."""
    # 1. The class itself
    multiprocessing.reduction.ForkingPickler = CloudForkingPickler
    # 2. The module-level dump() helper
    multiprocessing.reduction.dump = lambda obj, file, protocol=None: \
        CloudForkingPickler(file, protocol).dump(obj)
    # 3. The import-time captured reference in connection.py
    multiprocessing.connection._ForkingPickler = CloudForkingPickler
```

After `patch_multiprocessing()`:

```python
from multiprocessing import Pool

with Pool(4) as p:
    print(p.map(lambda x: x**2, range(10)))
# [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
```

## Why cloudpickle (not CPython)
There's an open discussion on discuss.python.org about adding a pluggable pickler API to multiprocessing, but no PEP has materialized. cloudpickle is the pragmatic place for this — it already provides Pickler/dumps/loads, and adding a one-shot integration helper is a small, natural extension.
## Alternatives considered

- **"Just use loky/joblib"** — Valid for many users, but loky replaces the entire process-management layer. Many projects only need cloudpickle serialization with the stdlib `multiprocessing.Pool`.
- **"Just use multiprocess (dill)"** — Requires replacing all `multiprocessing` imports. dill is heavier than cloudpickle and has different serialization semantics.
- **"Document the pattern instead"** — The `_ForkingPickler` double-binding makes documentation insufficient; people will keep getting it wrong.
Happy to submit a PR if there's interest.