
Feature request: cloudpickle.patch_multiprocessing() utility for ForkingPickler replacement #589

@clemlesne

Description

Summary

Propose adding a cloudpickle.patch_multiprocessing() helper that replaces multiprocessing.reduction.ForkingPickler with a cloudpickle-based pickler, enabling Pool.map(lambda x: x**2, range(10)) to work out of the box.
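For context, the reason the stdlib fails today is that pickle serializes functions by qualified name rather than by value, and a lambda has no importable name. A minimal reproduction:

```python
import pickle

# Stdlib pickle looks up functions by module-qualified name;
# a lambda cannot be resolved that way, so dumps() raises.
try:
    pickle.dumps(lambda x: x ** 2)
    failed = False
except (pickle.PicklingError, AttributeError):
    failed = True
# failed is True: this is exactly what breaks Pool.map(lambda ...)
```

cloudpickle avoids this by serializing the function's code and closure by value.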

Motivation: ecosystem fragmentation

Every project that needs cloudpickle + multiprocessing.Pool independently reinvents this patching. At least 6 projects maintain their own version:

  • loky/joblib: full custom _LokyPickler subsystem in loky/backend/reduction.py
  • PySpark: its own CloudPickleSerializer wrapping cloudpickle.dumps/loads
  • Ray: bundled fork as ray.cloudpickle with a custom object store
  • Dask: custom serialization protocol in the distributed scheduler
  • multiprocess: complete fork of CPython's multiprocessing with dill substituted
  • trading-strategy, exec-sandbox, pypeln, pyrocko: ad-hoc monkey patches of varying correctness

Most ad-hoc implementations are incomplete because of a non-obvious CPython pitfall (see below).

The _ForkingPickler double-binding pitfall

CPython has two separate name bindings for ForkingPickler:

# multiprocessing/reduction.py
class ForkingPickler(pickle.Pickler):
    ...
# multiprocessing/connection.py
from .context import reduction
_ForkingPickler = reduction.ForkingPickler   # captured at import time

class Connection:
    def send(self, obj):
        self._send_bytes(_ForkingPickler.dumps(obj))  # uses the captured reference

Patching reduction.ForkingPickler alone is insufficient: Connection.send() still uses the stale _ForkingPickler reference captured at import time. You must also patch multiprocessing.connection._ForkingPickler. Most ad-hoc implementations miss this.

Additionally, reduction.dump() is a module-level function that also needs replacing for completeness.
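The staleness is easy to demonstrate. FakePickler below is a throwaway stand-in for any replacement class; the snippet assumes the private multiprocessing.connection._ForkingPickler binding shown above:

```python
import multiprocessing.connection
import multiprocessing.reduction


class FakePickler:
    """Throwaway stand-in for a replacement pickler class."""


# Before any patching, both names refer to the same class.
assert (multiprocessing.connection._ForkingPickler
        is multiprocessing.reduction.ForkingPickler)

# Rebind only the reduction module, as most ad-hoc patches do...
multiprocessing.reduction.ForkingPickler = FakePickler

# ...and the reference captured by connection.py is now stale.
stale = multiprocessing.connection._ForkingPickler is not FakePickler
```

Any Connection.send() call would keep going through the old class.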

Proposed API

import cloudpickle

cloudpickle.patch_multiprocessing()

One call, idempotent, patches all three binding sites:

  1. multiprocessing.reduction.ForkingPickler — the class
  2. multiprocessing.reduction.dump — the module-level helper
  3. multiprocessing.connection._ForkingPickler — the import-time captured reference

Reference implementation

Here's a minimal working implementation (tested on Python 3.14):

import copyreg
import io
import multiprocessing.connection
import multiprocessing.reduction

import cloudpickle


class CloudForkingPickler(cloudpickle.Pickler):
    """ForkingPickler replacement backed by cloudpickle."""
    _extra_reducers = {}
    _copyreg_dispatch_table = copyreg.dispatch_table

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.dispatch_table = self._copyreg_dispatch_table.copy()
        self.dispatch_table.update(self._extra_reducers)

    @classmethod
    def register(cls, type, reduce):
        cls._extra_reducers[type] = reduce

    @classmethod
    def dumps(cls, obj, protocol=None):
        buf = io.BytesIO()
        cls(buf, protocol).dump(obj)
        return buf.getbuffer()

    loads = staticmethod(cloudpickle.loads)


def patch_multiprocessing():
    """Replace multiprocessing's ForkingPickler with cloudpickle-based version."""
    # 1. The class itself
    multiprocessing.reduction.ForkingPickler = CloudForkingPickler
    # 2. The module-level dump() helper
    def _dump(obj, file, protocol=None):
        CloudForkingPickler(file, protocol).dump(obj)

    multiprocessing.reduction.dump = _dump
    # 3. The import-time captured reference in connection.py
    multiprocessing.connection._ForkingPickler = CloudForkingPickler

After patch_multiprocessing():

from multiprocessing import Pool
with Pool(4) as p:
    print(p.map(lambda x: x**2, range(10)))
# [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
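The register()/dispatch-table mechanism CloudForkingPickler uses mirrors the stdlib ForkingPickler, and the pattern can be sketched with plain pickle alone. Point, reduce_point, and DemoPickler are illustrative names, not part of the proposed API:

```python
import copyreg
import io
import pickle


class Point:
    def __init__(self, x, y):
        self.x, self.y = x, y


def reduce_point(p):
    # A reducer returns (callable, args) used to rebuild the object.
    return (Point, (p.x, p.y))


class DemoPickler(pickle.Pickler):
    """Per-class reducer registry via a copied dispatch table."""
    _extra_reducers = {}

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # Copy so registrations never mutate the global copyreg table.
        self.dispatch_table = copyreg.dispatch_table.copy()
        self.dispatch_table.update(self._extra_reducers)

    @classmethod
    def register(cls, type, reduce):
        cls._extra_reducers[type] = reduce


DemoPickler.register(Point, reduce_point)

buf = io.BytesIO()
DemoPickler(buf).dump(Point(1, 2))
restored = pickle.loads(buf.getvalue())
# restored is rebuilt via reduce_point's (Point, (1, 2)) recipe
```

Copying copyreg.dispatch_table in __init__ is what keeps per-pickler reducers from leaking into unrelated pickle users.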

Why cloudpickle (not CPython)

There's an open discussion on discuss.python.org about adding a pluggable pickler API to multiprocessing, but no PEP has materialized. cloudpickle is the pragmatic place for this — it already provides Pickler/dumps/loads, and adding a one-shot integration helper is a small, natural extension.

Alternatives considered

  • "Just use loky/joblib" — Valid for many users, but loky replaces the entire process management layer. Many projects only need cloudpickle serialization with stdlib multiprocessing.Pool.
  • "Just use multiprocess (dill)" — Requires replacing all multiprocessing imports. dill is heavier than cloudpickle and has different serialization semantics.
  • "Document the pattern instead" — The _ForkingPickler double-binding makes documentation insufficient; people will keep getting it wrong.

Happy to submit a PR if there's interest.
