V5.0.x revert reorder and fix ucc#10
Open
janjust wants to merge 2 commits into
Open
Conversation
The reordering in 4370cd8 moved the cid>=3 communicator cleanup loop to before OBJ_DESTRUCT(&ompi_mpi_comm_world), intending to keep MPI_COMM_WORLD alive while user-comm coll modules tear down. However this breaks any coll component (e.g. coll/han) that creates OMPI-internal sub-communicators on behalf of MPI_COMM_WORLD: those sub-comms all have cid >= 3 > cid(COMM_WORLD)=1 and therefore carry no EXTRA_RETAIN flag. The loop frees them before COMM_WORLD is destructed, leaving COMM_WORLD's module destructors (HAN, ACOLL, …) with dangling pointers that they then attempt to ompi_comm_free() → use-after-free. Restore the original ordering: COMM_SELF and COMM_WORLD are destructed first (their coll module destructors run while the world is still fully functional for OOB), then MPI_COMM_NULL, then the cid>=3 leak-cleanup loop. The verbose opal_output_verbose() logging added in the earlier commit is retained as a useful improvement. The coll/ucc comment that referenced the old ordering is updated to reflect the correct invariant. Signed-off-by: Tomislav Janjusic <tomislavj@nvidia.com>
With the original ompi_comm_finalize() ordering restored, any OMPI-internal communicators created by coll components (e.g. HAN sub-comms) may still hold live UCC teams when MPI_COMM_WORLD's module destructor fires mca_coll_ucc_finalize_ctx(). Those communicators are cleaned up by the cid>=3 loop after COMM_WORLD is gone, so their module destructors would call ucc_team_destroy() on an already-destroyed context. Fix: maintain an active_modules list on the component. Every module that successfully enables itself appends to the list; its destructor removes itself from the list before doing any team teardown. mca_coll_ucc_finalize_ctx() now iterates active_modules and destroys any remaining teams before calling ucc_context_destroy(), ensuring the context is only destroyed once all teams are gone. Late-firing module destructors find ucc_team already NULL and skip team teardown safely. Signed-off-by: Tomislav Janjusic <tomislavj@nvidia.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
No description provided.