Skip to content

Conversation

@XuehaiPan
Copy link
Contributor

@XuehaiPan XuehaiPan commented Dec 13, 2025

Description

Fixes #5926

For !defined(PYBIND11_HAS_SUBINTERPRETER_SUPPORT), the previous behavior is unchanged.

For defined(PYBIND11_HAS_SUBINTERPRETER_SUPPORT), the pybind11 internals is already a per-interpreter-dependent singleton stored in the interpreter's state dict. This PR adds a new map to the interpreter-dependent internals from the gil_safe_call_once_and_store's address to the result, where the gil_safe_call_once_and_store is a static object that has a fixed address.

UPDATE: The storage map is moved to a separate capsule in the per-interpreter state dict instead of putting it as a member of internals. Because the per-interpreter internals are leaked in the program, the storage map is never GC-ed on (sub-)interpreter shutdown.

Suggested changelog entry:

  • Add per-interpreter storage for gil_safe_call_once_and_store to make it safe under multi-interpreters.

📚 Documentation preview 📚: https://pybind11--5933.org.readthedocs.build/

@XuehaiPan XuehaiPan force-pushed the subinterp-call-once-and-store branch 3 times, most recently from 58834ac to a46973b Compare December 14, 2025 04:07
@XuehaiPan XuehaiPan force-pushed the subinterp-call-once-and-store branch from a46973b to e741760 Compare December 14, 2025 04:08
@XuehaiPan XuehaiPan marked this pull request as ready for review December 14, 2025 06:47
Comment on lines 599 to 608
// FIXME: We cannot use `get_num_interpreters_seen() > 1` here to create a fast path for
// the multi-interpreter case. The singleton may be initialized by a subinterpreter not the
// main interpreter.
//
// For multi-interpreter support, the subinterpreters can be initialized concurrently, and
// the first time this function may not be called in the main interpreter.
// For example, a clean main interpreter that does not import any pybind11 module and then
// spawns multiple subinterpreters using `InterpreterPoolExecutor` that each imports a
// pybind11 module concurrently.
if (get_num_interpreters_seen() > 1) {
Copy link
Contributor Author

@XuehaiPan XuehaiPan Dec 16, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cc @b-pass @rwgk for thoughts about this, see the code comment referenced for more details.

I test the patch in this PR in:

Most things works fine except test_import_in_subinterpreter_before_main:

run_in_subprocess(
    """
    with contextlib.closing(interpreters.create()) as subinterpreter:
        subinterpreter.exec('import optree')

    import optree
    """
)

If I remove the get_num_interpreters_seen() > 1 condition, my import test works but the cpptest in pybind11 CI breaks because internals_singleton_pp_ is never initialized. For instance, the memory address get_internals() should be a shared constant for differenet pybind11 modules in a single program.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

get_num_interpreters_seen() > 1 check is meant to keep the "there are no subinterpreters" case (the most common case) as fast as possible.

Can you help me understand the problem that concurrent subinterpreter imports is causing?

I'm not understanding what exactly the problem is... doing concurrent imports of pybind11 modules each in its own subinterpreter should still result in each subinterpreter getting its own internals (as it should). The one that gets to this code first would populate the internals_singleton_pp_, but this does not have to be the main interpreter, and the next time get_pp is called from that first subinterprter it will take the other (> 1) code path. Since get_num_interpreters_seen() never decreases (it is "seen" not "currently alive"), once it goes up to 2 the inner code path will be used and what is stored in the internals_singleton_pp_ doesn't matter for the rest of the program.

What's the purpose of get_pp_for_main_interpreter() and the associated once flag?

Copy link
Contributor Author

@XuehaiPan XuehaiPan Dec 17, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let me elaborate more about my failed test. I have a C++ singleton instance using gil_safe_call_once_and_store to ensure exactly one instance per-interpreter. With the patch in this PR, the call-once result is stored in the internals (*get_pp()), where the internals is stored in the interpreter's state dict.

MyClass &get_singleton() {
    PYBIND11_CONSTINIT static py::gil_safe_call_once_and_store<MyClass> storage;
    return storage
        .call_once_and_store_result([]() -> MyClass {
            MyClass instance{};

            {
                // initialize
                ...
            }
            return instance;
        })
        .get_stored();
}
  • OK - test_import_in_subinterpreters_concurrently: import the module in multiple subinterpreters concurrently, without any import for a pybind11 module in the main interpreter.
  • OK - test_import_in_subinterpreter_after_main: import the module in the main interpreter, then import in a subinterpreter.
  • FAIL - test_import_in_subinterpreter_before_main: import the module in a subinterpreter (the main interpreter does not import any pybind11 module yet), then import the module in the main interpreter.

    The import succeeds in the subinterpreter, but the instance is ill-initialized in the main interpreter. If I remove the get_num_interpreters_seen() > 1 entirely or initialize the atomic counter for seen interpreters with 2 instead of 0, my test passes.


What's the purpose of get_pp_for_main_interpreter() and the associated once flag?

Just to ensure the singleton is initialized once. I'm not sure if the scoped GIL is still working when there are multiple interpreters running. I can revert the call_once approach back to scopded GIL approach.

@XuehaiPan XuehaiPan force-pushed the subinterp-call-once-and-store branch from 3a6ec0e to 0830872 Compare December 17, 2025 04:09
@XuehaiPan XuehaiPan force-pushed the subinterp-call-once-and-store branch from 0830872 to ac02a32 Compare December 17, 2025 04:11
@XuehaiPan
Copy link
Contributor Author

I have updated the approach to move the storage map as a separate capsule in the per-interpreter state dict instead of putting it in the internals. The PR description is also updated.

The CI tests are looking good so far. But I found some concurrency bugs for this patch in my downstream PR metaopt/optree#245. I'm not sure if it is coming from pybind11 or the Python core itself.

I will do further investigation about these.


Bug 1: Duplicate native enum registration.

ImportError: pybind11::native_enum<...>("PyTreeKind") is already registered!
Test Case
def test_import_in_subinterpreter_before_main():
    # OK
    check_script_in_subprocess(
        textwrap.dedent(
            """
            import contextlib
            import gc
            from concurrent import interpreters

            subinterpreter = None
            with contextlib.closing(interpreters.create()) as subinterpreter:
                subinterpreter.exec('import optree')

            import optree

            del optree, subinterpreter
            for _ in range(10):
                gc.collect()
            """,
        ).strip(),
        output='',
        rerun=NUM_FLAKY_RERUNS,
    )

    # FAIL
    check_script_in_subprocess(
        textwrap.dedent(
            f"""
            import contextlib
            import gc
            import random
            from concurrent import interpreters

            subinterpreter = subinterpreters = stack = None
            with contextlib.ExitStack() as stack:
                subinterpreters = [
                    stack.enter_context(contextlib.closing(interpreters.create()))
                    for _ in range({NUM_FUTURES})
                ]
                random.shuffle(subinterpreters)
                for subinterpreter in subinterpreters:
                    subinterpreter.exec('import optree')

            import optree

            del optree, subinterpreter, subinterpreters, stack
            for _ in range(10):
                gc.collect()
            """,
        ).strip(),
        output='',
        rerun=NUM_FLAKY_RERUNS,
    )

    # FAIL
    check_script_in_subprocess(
        textwrap.dedent(
            f"""
            import contextlib
            import gc
            import random
            from concurrent import interpreters

            subinterpreter = subinterpreters = stack = None
            with contextlib.ExitStack() as stack:
                subinterpreters = [
                    stack.enter_context(contextlib.closing(interpreters.create()))
                    for _ in range({NUM_FUTURES})
                ]
                random.shuffle(subinterpreters)
                for subinterpreter in subinterpreters:
                    subinterpreter.exec('import optree')

                import optree

            del optree, subinterpreter, subinterpreters, stack
            for _ in range(10):
                gc.collect()
            """,
        ).strip(),
        output='',
        rerun=NUM_FLAKY_RERUNS,
    )
Traceback
__________________________________________________________________ test_import_in_subinterpreter_before_main ___________________________________________________________________

    def test_import_in_subinterpreter_before_main():
        check_script_in_subprocess(
            textwrap.dedent(
                """
                import contextlib
                import gc
                from concurrent import interpreters
    
                subinterpreter = None
                with contextlib.closing(interpreters.create()) as subinterpreter:
                    subinterpreter.exec('import optree')
    
                import optree
    
                del optree, subinterpreter
                for _ in range(10):
                    gc.collect()
                """,
            ).strip(),
            output='',
            rerun=NUM_FLAKY_RERUNS,
        )
    
>       check_script_in_subprocess(
            textwrap.dedent(
                f"""
                import contextlib
                import gc
                import random
                from concurrent import interpreters
    
                subinterpreter = subinterpreters = stack = None
                with contextlib.ExitStack() as stack:
                    subinterpreters = [
                        stack.enter_context(contextlib.closing(interpreters.create()))
                        for _ in range({NUM_FUTURES})
                    ]
                    random.shuffle(subinterpreters)
                    for subinterpreter in subinterpreters:
                        subinterpreter.exec('import optree')
    
                import optree
    
                del optree, subinterpreter, subinterpreters, stack
                for _ in range(10):
                    gc.collect()
                """,
            ).strip(),
            output='',
            rerun=NUM_FLAKY_RERUNS,
        )


concurrent/test_subinterpreters.py:267: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

script = "import contextlib\nimport gc\nimport random\nfrom concurrent import interpreters\n\nsubinterpreter = subinterpreters ...optree')\n\nimport optree\n\ndel optree, subinterpreter, subinterpreters, stack\nfor _ in range(10):\n    gc.collect()"

    def check_script_in_subprocess(script, /, *, output, env=None, cwd=TEST_ROOT, rerun=1):
        result = ''
        for _ in range(rerun):
            try:
                result = subprocess.check_output(
                    [sys.executable, '-Walways', '-Werror', '-c', script],
                    stderr=subprocess.STDOUT,
                    text=True,
                    encoding='utf-8',
                    cwd=cwd,
                    env={
                        key: value
                        for key, value in (env if env is not None else os.environ).items()
                        if not key.startswith(('PYTHON', 'PYTEST', 'COV_'))
                    },
                )
            except subprocess.CalledProcessError as ex:
>               raise CalledProcessError(ex.returncode, ex.cmd, ex.output, ex.stderr) from None
E               helpers.CalledProcessError: Command '['/home/PanXuehai/Projects/optree/venv/bin/python3', '-Walways', '-Werror', '-c', "import contextlib\nimport gc\nimport random\nfrom concurrent import interpreters\n\nsubinterpreter = subinterpreters = stack = None\nwith contextlib.ExitStack() as stack:\n    subinterpreters = [\n        stack.enter_context(contextlib.closing(interpreters.create()))\n        for _ in range(16)\n    ]\n    random.shuffle(subinterpreters)\n    for subinterpreter in subinterpreters:\n        subinterpreter.exec('import optree')\n\nimport optree\n\ndel optree, subinterpreter, subinterpreters, stack\nfor _ in range(10):\n    gc.collect()"]' returned non-zero exit status 1.
E               Output:
E               Traceback (most recent call last):
E                 File "<string>", line 16, in <module>
E                   import optree
E                 File "/home/PanXuehai/Projects/optree/optree/__init__.py", line 17, in <module>
E                   from optree import accessors, dataclasses, functools, integrations, pytree, treespec, typing
E                 File "/home/PanXuehai/Projects/optree/optree/accessors.py", line 25, in <module>
E                   import optree._C as _C
E               ImportError: pybind11::native_enum<...>("PyTreeKind") is already registered!

_          = 0
cwd        = PosixPath('/home/PanXuehai/Projects/optree/tests')
env        = None
output     = ''
rerun      = 8
result     = ''
script     = "import contextlib\nimport gc\nimport random\nfrom concurrent import interpreters\n\nsubinterpreter = subinterpreters ...optree')\n\nimport optree\n\ndel optree, subinterpreter, subinterpreters, stack\nfor _ in range(10):\n    gc.collect()"

helpers.py:187: CalledProcessError

Bug 2: Thread-safety issues for multiple-interpreters are ever created.

  • Case 1: No multiple interpreters, only the main interpreter. Everything works fine.

    # Run tests separately
    make test PYTESTOPTS='-k "subinterpreter"'       # OK, no threading concurrency tests
    make test PYTESTOPTS='-k "not subinterpreter"'   # OK, the program only has the main interpreter
  • Case 2: Create multiple interpreters, and destroy them before doing threading concurrency tests (only the main interpreter is alive during the concurrency tests).

    The C++ objects are locked with critical sections, and the GIL is enabled during the test.

    # There is only the main interpreter alive during threading concurrency tests
    #
    #     ```python
    #     with contextlib.closing(concurrent.interpreters.create()) as subintrepreter:
    #         # do something
    #         ...
    #
    #     # at this point, only the main interpreter is alive
    #
    #     # threading tests
    #     ...
    #     ```
    
    # Run tests alltogather
    make test  # subinterpreter tests all passed, but threading concurrency tests produce incorrect results

@XuehaiPan
Copy link
Contributor Author

The test is still flaky on some platforms after removing get_num_interpreters_seen() > 1. But the number of failing tests goes down.

I don't know why the get_num_interpreters_seen() <= 1 path does not work and why the tests pass on Linux 3.14t and macOS 3.14/3.14t but still fail on Linux 3.14.

Copy link
Collaborator

@b-pass b-pass left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think these internals changes are a good idea. Which things are being flaky and on which python versions? I'd like to get to the bottom of those issues, but taking out big chunks of the internals management definitely makes me nervous!

@b-pass
Copy link
Collaborator

b-pass commented Dec 24, 2025

I did some poking around in your new tests, and at least in the case of MyEnum, that name is used in another test so I think some of the errors you're getting are caused by that. MyClass is also used elsewhere, but I think the MyClass errors might be a different problem with the tests.

It's possible there's a bug with the "import into sub-interpreter first" case ... we can probably detect that case during module init and increment the get_num_interpreters_seen to at least 2 if needed. But I can't tell from everything going on in this PR whether that one is still a problem or not....

Are there tests in here not directly related to gil_safe_call_once that could be split out into a separate PR?

@XuehaiPan
Copy link
Contributor Author

I did some poking around in your new tests, and at least in the case of MyEnum, that name is used in another test so I think some of the errors you're getting are caused by that. MyClass is also used elsewhere, but I think the MyClass errors might be a different problem with the tests.

The import test is in a completely new subprocess with a clean state. Modules built for other tests will not affect the correctness of the import tests for multiple-interpreters.

@XuehaiPan
Copy link
Contributor Author

Are there tests in here not directly related to gil_safe_call_once that could be split out into a separate PR?

I can do that. But I want to point out that PYBIND11_MODULE(..., py::multiple_interpreters::per_interpreter_gil()) is not usable at this moment due to duplicate registrations of C++ types.

I will comment out the newly added test in this PR and let this get merged first. Then open a new PR for concurrency issues for internals.

@XuehaiPan XuehaiPan requested review from b-pass and rwgk December 24, 2025 04:06
@b-pass
Copy link
Collaborator

b-pass commented Dec 24, 2025

But I want to point out that PYBIND11_MODULE(..., py::multiple_interpreters::per_interpreter_gil()) is not usable at this moment due to duplicate registrations of C++ types.

The existing tests do define a class ("A") and import the module into multiple subinterpreters at the same time.

Do you have a test that shows this without using concurrent pools? The concurrency makes it a lot more difficult to debug, and if it doesn't work at all it should be easy to demonstrate...?

Comment on lines +120 to +123
// "Move Subinterpreter" test is disabled on free-threaded Python 3.14+ due to a hang
// in Py_EndInterpreter() when the subinterpreter is destroyed from a different thread
// than it was created on. See: https://github.com/pybind/pybind11/pull/5940
# if PY_VERSION_HEX >= 0x030D0000 && !(PY_VERSION_HEX >= 0x030E0000 && defined(Py_GIL_DISABLED))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The fix for this was merged to master, so this change should come out.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[BUG]: pybind11::gil_safe_call_once_and_store is not safe for Python subinterpreters

4 participants