Skip to content

V5.0.x revert reorder and fix ucc#10

Open
janjust wants to merge 2 commits into
v5.0.xfrom
v5.0.x-revert-reorder-and-fix-ucc
Open

V5.0.x revert reorder and fix ucc#10
janjust wants to merge 2 commits into
v5.0.xfrom
v5.0.x-revert-reorder-and-fix-ucc

Conversation

@janjust
Copy link
Copy Markdown
Owner

@janjust janjust commented Apr 8, 2026

No description provided.

janjust added 2 commits April 7, 2026 21:04
The reordering in 4370cd8 moved the cid>=3 communicator cleanup loop
to before OBJ_DESTRUCT(&ompi_mpi_comm_world), intending to keep
MPI_COMM_WORLD alive while user-comm coll modules tear down.  However
this breaks any coll component (e.g. coll/han) that creates
OMPI-internal sub-communicators on behalf of MPI_COMM_WORLD: those
sub-comms all have cid >= 3 > cid(COMM_WORLD)=1 and therefore carry no
EXTRA_RETAIN flag.  The loop frees them before COMM_WORLD is destructed,
leaving COMM_WORLD's module destructors (HAN, ACOLL, …) with dangling
pointers that they then attempt to ompi_comm_free() → use-after-free.

Restore the original ordering: COMM_SELF and COMM_WORLD are destructed
first (their coll module destructors run while the world is still fully
functional for OOB), then MPI_COMM_NULL, then the cid>=3 leak-cleanup
loop.  The verbose opal_output_verbose() logging added in the earlier
commit is retained as a useful improvement.

The coll/ucc comment that referenced the old ordering is updated to
reflect the correct invariant.

Signed-off-by: Tomislav Janjusic <tomislavj@nvidia.com>
With the original ompi_comm_finalize() ordering restored, any
OMPI-internal communicators created by coll components (e.g. HAN
sub-comms) may still hold live UCC teams when MPI_COMM_WORLD's module
destructor fires mca_coll_ucc_finalize_ctx().  Those communicators are
cleaned up by the cid>=3 loop after COMM_WORLD is gone, so their module
destructors would call ucc_team_destroy() on an already-destroyed
context.

Fix: maintain an active_modules list on the component.  Every module
that successfully enables itself appends to the list; its destructor
removes itself from the list before doing any team teardown.
mca_coll_ucc_finalize_ctx() now iterates active_modules and destroys
any remaining teams before calling ucc_context_destroy(), ensuring the
context is only destroyed once all teams are gone.  Late-firing module
destructors find ucc_team already NULL and skip team teardown safely.

Signed-off-by: Tomislav Janjusic <tomislavj@nvidia.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant