@Anri-Lombard
Contributor

Propagates exceptions from CPU scheduler worker threads to Python, preventing process crashes when operations like mx.linalg.solve encounter errors (e.g., singular matrices).

Closes #2888. Supersedes #2964.

This revision addresses @awni's feedback from the original PR:

On state management - the task wrapper in encoder.h now uses try-catch to guarantee notify_task_completion() always runs. This prevents the task counter from staying elevated and causing deadlocks. Memory cleanup is handled via RAII.
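A minimal sketch of the wrapper idea described above, not the actual code in encoder.h. The names `n_active_tasks`, `first_error`, and `wrap_task` are hypothetical stand-ins, and the error slot is not thread-safe here:

```cpp
#include <atomic>
#include <exception>
#include <functional>
#include <stdexcept>
#include <utility>

// Hypothetical stand-in for the scheduler's count of in-flight tasks.
std::atomic<int> n_active_tasks{0};
// Hypothetical slot for the first exception raised by a task.
// Unsynchronized in this sketch; a real scheduler would guard it with a lock.
std::exception_ptr first_error;

// Stand-in for notify_task_completion(): marks one task as finished.
void notify_task_completion() {
  n_active_tasks.fetch_sub(1);
}

// Wrap a task so that completion is signalled whether or not the task throws.
std::function<void()> wrap_task(std::function<void()> task) {
  n_active_tasks.fetch_add(1);
  return [task = std::move(task)]() {
    try {
      task();
    } catch (...) {
      // Keep only the first error; later ones are dropped in this sketch.
      if (!first_error) {
        first_error = std::current_exception();
      }
    }
    // Reached on both the success and the failure path, so the counter
    // cannot stay elevated after a throwing task.
    notify_task_completion();
  };
}
```

With this shape, a task that throws still brings the counter back down, so anything waiting for the counter to reach zero is not left blocked.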

On timing - added check_cpu_exceptions() to check all CPU streams, and exceptions are now checked after every wait_for_one() call in transforms.cpp. Exceptions therefore surface during the evaluation loop, not just at synchronize().
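The polling step can be sketched roughly as follows. `CpuStream`, `set_error`, and `take_error` are hypothetical names for illustration, and this `check_cpu_exceptions` takes the streams as an argument, which may not match the real signature:

```cpp
#include <exception>
#include <mutex>
#include <stdexcept>
#include <vector>

// Hypothetical per-stream error slot; the real stream type may look different.
struct CpuStream {
  std::mutex mtx;
  std::exception_ptr error;

  // Called from the worker side when a task throws. Keeps the first error.
  void set_error(std::exception_ptr e) {
    std::lock_guard<std::mutex> lk(mtx);
    if (!error) {
      error = e;
    }
  }

  // Return the pending error (or null) and clear the slot.
  std::exception_ptr take_error() {
    std::lock_guard<std::mutex> lk(mtx);
    std::exception_ptr e = error;
    error = nullptr;
    return e;
  }
};

// Stand-in for check_cpu_exceptions(): rethrow the first pending error
// found on any CPU stream, on the calling thread.
void check_cpu_exceptions(std::vector<CpuStream>& streams) {
  for (auto& s : streams) {
    if (std::exception_ptr e = s.take_error()) {
      std::rethrow_exception(e);
    }
  }
}
```

Calling a function like this after each wait in the evaluation loop means a failure is reported at the next wait instead of being deferred until the final synchronize().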

@awni - please let me know if there are any areas I might have missed! 🙏

Exceptions thrown in CPU scheduler worker threads (e.g., when LAPACK
detects a singular matrix) were not being caught, causing process
crashes instead of raising Python exceptions.

Added exception capture in StreamThread and re-throwing during
synchronize() so errors propagate as RuntimeError to Python.
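As a rough illustration of capture-and-rethrow across threads, assuming a single worker with a task queue. The `Worker` class below is hypothetical and much simpler than the actual StreamThread:

```cpp
#include <condition_variable>
#include <exception>
#include <functional>
#include <mutex>
#include <queue>
#include <stdexcept>
#include <thread>
#include <utility>

// Hypothetical, simplified worker; not the actual StreamThread.
class Worker {
 public:
  Worker() : thread_([this] { run(); }) {}

  ~Worker() {
    {
      std::lock_guard<std::mutex> lk(mtx_);
      stop_ = true;
    }
    cv_.notify_all();
    thread_.join();
  }

  void enqueue(std::function<void()> task) {
    {
      std::lock_guard<std::mutex> lk(mtx_);
      tasks_.push(std::move(task));
      ++pending_;
    }
    cv_.notify_all();
  }

  // Wait for all queued tasks, then rethrow a captured error on the caller.
  void synchronize() {
    std::unique_lock<std::mutex> lk(mtx_);
    done_cv_.wait(lk, [this] { return pending_ == 0; });
    if (error_) {
      std::rethrow_exception(std::exchange(error_, nullptr));
    }
  }

 private:
  void run() {
    for (;;) {
      std::function<void()> task;
      {
        std::unique_lock<std::mutex> lk(mtx_);
        cv_.wait(lk, [this] { return stop_ || !tasks_.empty(); });
        if (tasks_.empty()) return;  // stop requested and queue drained
        task = std::move(tasks_.front());
        tasks_.pop();
      }
      std::exception_ptr err;
      try {
        task();
      } catch (...) {
        // Capture instead of letting the exception escape the thread,
        // which would call std::terminate and crash the process.
        err = std::current_exception();
      }
      {
        std::lock_guard<std::mutex> lk(mtx_);
        if (err && !error_) error_ = err;
        --pending_;
      }
      done_cv_.notify_all();
    }
  }

  std::mutex mtx_;
  std::condition_variable cv_, done_cv_;
  std::queue<std::function<void()>> tasks_;
  int pending_ = 0;
  bool stop_ = false;
  std::exception_ptr error_;
  std::thread thread_;  // declared last so the other members exist before run() starts
};
```

A caller that enqueues a throwing task and then calls `synchronize()` sees an ordinary C++ exception on its own thread, which a binding layer can then translate into a Python exception.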

Fixes ml-explore#2888
- Wrap task in try-catch to guarantee notify_task_completion runs
- Add check_cpu_exceptions() to check all CPU streams
- Check exceptions after wait_for_one() calls in transforms.cpp

Fixes state management concern: task counter always decremented.
Fixes timing concern: exceptions checked during eval, not just sync.
Development

Successfully merging this pull request may close these issues.

[BUG] Singular matrix