Add handling for memory access faults in the MP tuner #1680
Pull request overview
This PR improves the MP (multiprocessing) tuner's robustness by adding comprehensive error handling for GPU memory access faults and kernel execution issues. When tuning GPU kernels, some configurations can cause memory faults or hangs that would previously interrupt the entire tuning process. The changes isolate failures to individual tasks and allow tuning to continue.
Key changes include:
- Enhanced exception handling in the worker process with specific handling for CUDA errors, timeouts, and memory faults
- Introduction of distinct error codes: `us=-1` for unsupported operations/errors and `us=inf` for memory faults/hangs (see the sketch after this list)
- Configurable warmup iterations to improve profiling accuracy for short-running kernels
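The mechanism behind these error codes can be illustrated with a short sketch. This is a minimal, hypothetical rendering of the idea, not the PR's actual code: `profile_kernel`, the pool size, and the default timeout are all assumptions. Each config runs in a worker process with a result timeout; an ordinary failure is recorded as `us=-1`, while a hang or memory access fault is recorded as `us=inf` and the pool is rebuilt so the remaining configs keep tuning.

```python
import multiprocessing as mp

INVALID_TIME = -1        # ordinary error / unsupported config
INF_TIME = float("inf")  # memory access fault or hang

def run_one(task):
    # Worker body: profile one kernel config. `profile_kernel` is a
    # hypothetical helper standing in for the real per-config benchmark;
    # it may raise, hang, or crash the worker process outright.
    return profile_kernel(task)

def tune_all(tasks, timeout=30):
    results = []
    pool = mp.Pool(processes=1)
    for task in tasks:
        try:
            results.append(pool.apply_async(run_one, (task,)).get(timeout=timeout))
        except mp.TimeoutError:
            # A hang, or a memory access fault that killed the worker:
            # record inf and rebuild the pool so later configs still run.
            results.append(INF_TIME)
            pool.terminate()
            pool = mp.Pool(processes=1)
        except Exception:
            # Recoverable failure confined to this config.
            results.append(INVALID_TIME)
    pool.close()
    return results
```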
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 28 comments.
| File | Description |
|---|---|
| aiter/utility/mp_tuner.py | Major refactoring to add timeout handling, process pool restart on crashes, explicit GPU ID passing, and comprehensive exception handling with multiple error types |
| aiter/utility/base_tuner.py | Added constants for error classification (INVALID_TIME, INF_TIME, INVLAID_ERR_RATIO), configurable warmup/iteration counts, and updated filtering logic to handle both error types |
| gradlib/gradlib/GemmTuner.py | Added num_warms parameter to configure warmup iterations, increased default cold/warm iteration counts, and integrated set_run_iters for dynamic profiling configuration |
| hsa/gfx942/fmoe_2stages/tune.py | Fixed tensor reshaping for scale parameters, updated return values from (-1, -1) to (0, 0) for invalid cases, and improved filtering to exclude both -1 and inf times |
| aiter/jit/core.py | Added check to prevent duplicate "cu_num" key when updating config files |
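As a concrete illustration of the filtering change described for aiter/utility/base_tuner.py, a result list containing both sentinel values might be filtered roughly as follows. This is a sketch with an assumed result layout, not the file's actual code:

```python
INVALID_TIME = -1        # ordinary error / unsupported case
INF_TIME = float("inf")  # memory access fault or hang

# Keep only configs whose measured time is a real measurement,
# excluding both the -1 and inf sentinels.
results = [("cfg_a", 12.3), ("cfg_b", INVALID_TIME), ("cfg_c", INF_TIME)]
valid = [(cfg, us) for cfg, us in results if us not in (INVALID_TIME, INF_TIME)]
best = min(valid, key=lambda r: r[1]) if valid else None
print(best)  # ('cfg_a', 12.3)
```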
Comments suppressed due to low confidence (1)

aiter/utility/mp_tuner.py:125
This comment appears to contain commented-out code:

```python
#try:
#    torch.cuda.synchronize()
#    torch.cuda.empty_cache()
#except:
#    pass
```
Commits
- refactor tuner to support memory access fault not interrupting tuning
- reboot pool and us=inf for access fault, us=-1 for other error
- fix conflict
- Update tune.py
- fix error when tuning hipblaslt
- add timeout and iters args
- clean code
- update
- update
- rm print and update check_interval=10s
- update mp_tuner.py
- format
- fix batch_gemm tune
- update
- rm redundant code
- fix format
Motivation
When new implementations are added, some cases can trigger a memory access fault, which previously interrupted the entire tuning process. This PR handles those faults so that tuning can continue.
Technical Details
New command-line arguments (see the sketch below):
- `--timeout`: timeout for getting the result of each test (s)
- `--warmup`: warmup iterations run before profiling
- `--iters`: iterations used to gather perf data
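A rough idea of how these arguments could be wired with `argparse`; the defaults shown here are placeholders, not the PR's actual values:

```python
import argparse

parser = argparse.ArgumentParser(description="MP tuner options (sketch)")
parser.add_argument("--timeout", type=int, default=30,
                    help="timeout for getting the result of each test (s)")
parser.add_argument("--warmup", type=int, default=5,
                    help="warmup iterations run before profiling")
parser.add_argument("--iters", type=int, default=100,
                    help="iterations used to gather perf data")
args = parser.parse_args()
```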
Test Plan
Test Result
Submission Checklist