Fix scheduler exception propagation to Python #2983
Propagates exceptions from CPU scheduler worker threads to Python, preventing process crashes when operations like `mx.linalg.solve` encounter errors (e.g., singular matrices).

Closes #2888. Supersedes #2964.
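For reviewers who want the general idea in isolation, here is a minimal sketch of the capture-and-rethrow pattern. It is not the code in this PR: `TinyStream`, `enqueue`, `wait_and_rethrow`, and `pending` are hypothetical names made up for illustration, and the real scheduler is structured differently. The worker wrapper catches whatever the task throws, always decrements the pending counter, and the stored exception is rethrown on the calling thread.

```cpp
#include <condition_variable>
#include <exception>
#include <functional>
#include <mutex>
#include <stdexcept>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical stand-in for a per-stream CPU task runner (not MLX's scheduler).
// enqueue() is assumed to be called from a single producer thread.
class TinyStream {
 public:
  // Run `task` on a worker thread. The wrapper catches anything the task
  // throws, keeps the first exception, and always decrements the counter.
  void enqueue(std::function<void()> task) {
    {
      std::lock_guard<std::mutex> lk(mtx_);
      ++pending_;
    }
    workers_.emplace_back([this, task = std::move(task)]() {
      std::exception_ptr err;
      try {
        task();
      } catch (...) {
        err = std::current_exception();
      }
      // Reached on both the success and the failure path, so a throwing
      // task cannot leave the counter elevated and block waiters forever.
      std::lock_guard<std::mutex> lk(mtx_);
      if (err && !error_) {
        error_ = err;
      }
      --pending_;
      cv_.notify_all();
    });
  }

  // Block until every task has finished, then rethrow the first stored
  // exception (if any) on the calling thread and clear it.
  void wait_and_rethrow() {
    std::unique_lock<std::mutex> lk(mtx_);
    cv_.wait(lk, [this] { return pending_ == 0; });
    if (error_) {
      std::exception_ptr err = error_;
      error_ = nullptr;
      lk.unlock();
      std::rethrow_exception(err);
    }
  }

  // Number of tasks that have been enqueued but not yet completed.
  int pending() {
    std::lock_guard<std::mutex> lk(mtx_);
    return pending_;
  }

  ~TinyStream() {
    for (auto& t : workers_) {
      if (t.joinable()) {
        t.join();
      }
    }
  }

 private:
  std::mutex mtx_;
  std::condition_variable cv_;
  int pending_ = 0;
  std::exception_ptr error_;
  std::vector<std::thread> workers_;
};
```

In this sketch, a task that throws `std::runtime_error` on the worker makes `wait_and_rethrow()` throw the same exception on the caller, where a binding layer could translate it into a Python exception instead of the process terminating.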
This revision addresses @awni's feedback from the original PR:
- On state management - the task wrapper in `encoder.h` now uses try-catch to guarantee `notify_task_completion()` always runs. This prevents the task counter from staying elevated and causing deadlocks. Memory cleanup is handled via RAII.
- On timing - added `check_cpu_exceptions()` to check all CPU streams, and exceptions are now checked after every `wait_for_one()` call in `transforms.cpp`. Exceptions surface during the evaluation loop, not just at `synchronize()`.

@awni - please let me know if there are any areas I might have missed! 🙏