@bendudson

Catch exceptions and call MPI_Abort rather than trying to continue.

There is no way to recover and synchronise processors unless all processors threw an exception at the same point. If only one processor throws an exception then the others will wait indefinitely on the next MPI communication or collective operation.
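
As a rough illustration of the pattern described above (not the actual change in this PR), the sketch below wraps a unit of work in a `try`/`catch` and, on any exception, reports it and invokes an abort callback. The names `run_or_abort`, `step` and `abort_all` are hypothetical. In an MPI program the callback would presumably call `MPI_Abort(MPI_COMM_WORLD, code)`; it is passed in as a parameter here only so the sketch compiles and can be exercised without MPI.

```cpp
#include <exception>
#include <functional>
#include <iostream>
#include <stdexcept>

// Hypothetical helper, not part of BOUT++ or Hermes-3.
// Runs `step`. If it throws, the error is reported and `abort_all` is called
// with a non-zero code instead of letting this rank continue.
// In an MPI code, `abort_all` would be something like:
//   [](int code) { MPI_Abort(MPI_COMM_WORLD, code); }
// which terminates every rank, so none are left waiting in a collective.
bool run_or_abort(const std::function<void()>& step,
                  const std::function<void(int)>& abort_all) {
  try {
    step();
    return true;  // step completed normally
  } catch (const std::exception& e) {
    std::cerr << "Fatal error: " << e.what() << '\n';
  } catch (...) {
    std::cerr << "Fatal error: unknown exception\n";
  }
  abort_all(1);
  return false;  // only reached if abort_all returns (it would not with MPI_Abort)
}
```

The design point is that the handler never tries to resume: once one rank has failed, the only safe action is to bring all ranks down together.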

Fixes Hermes-3 issue boutproject/hermes-3#448

bendudson and others added 2 commits January 15, 2026 15:54
@mikekryjak

If the failures are due to random transient NaNs, could we replace them with an average of the surrounding cells to prevent the failure?
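
To make the suggestion concrete, here is a minimal 1D sketch of what "average of surrounding cells" could look like. It assumes a plain `std::vector<double>` rather than a BOUT++ field type, and `patch_nans` is a hypothetical name. Whether masking NaNs in this way is acceptable for the solver is a separate question that the sketch does not address.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper: replaces each NaN in `f` with the mean of its finite
// immediate neighbours (left and right). Returns the number of values patched.
// A NaN with no finite neighbour is left unchanged.
std::size_t patch_nans(std::vector<double>& f) {
  // Read from a copy so that a patched value never feeds into another patch.
  const std::vector<double> orig = f;
  std::size_t patched = 0;
  for (std::size_t i = 0; i < orig.size(); ++i) {
    if (!std::isnan(orig[i])) {
      continue;
    }
    double sum = 0.0;
    int n = 0;
    if (i > 0 && std::isfinite(orig[i - 1])) {
      sum += orig[i - 1];
      ++n;
    }
    if (i + 1 < orig.size() && std::isfinite(orig[i + 1])) {
      sum += orig[i + 1];
      ++n;
    }
    if (n > 0) {
      f[i] = sum / n;
      ++patched;
    }
  }
  return patched;
}
```

In a parallel run, every rank would still need to take the same code path afterwards, otherwise the synchronisation problem described in the PR remains.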
