
Conversation

@yzhou103 (Contributor) commented on Dec 18, 2025

Motivation

When adding new implementations, some cases can cause a memory access fault, which may interrupt the tuning process. This PR handles those faults so that tuning can continue.

Technical Details

  1. Fix memory access faults during tuning: us=inf means a memory fault or hang, and us=-1 means some other error (for example, an unsupported case); a minimal sketch follows this list.
  2. Add warmup iterations: in some cases the kernel time is very short, which can make the profiled time inaccurate, so it may be better to set the warmup time above 100us.
  3. Add three parameters:
    --timeout, timeout for getting the result of each test (seconds)
    --warmup, warmup iterations before profiling
    --iters, iterations used to gather perf data
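
A minimal sketch of how a tuner might map failures onto these codes, assuming a multiprocessing worker whose result is fetched with a timeout; `collect_result`, the constants, and the error-string checks are illustrative, not aiter's actual implementation:

```python
import math
from multiprocessing import TimeoutError as MPTimeoutError

INVALID_TIME = -1        # other errors, e.g. an unsupported case
INF_TIME = math.inf      # memory access fault or hang

def collect_result(async_result, timeout_s):
    """Fetch one tuning result, mapping failures to the us codes above."""
    try:
        return async_result.get(timeout=timeout_s)  # kernel time in us
    except MPTimeoutError:
        return INF_TIME      # hang, or a fault that killed the worker
    except RuntimeError as e:
        msg = str(e)
        if "memory access fault" in msg or "HIP error" in msg:
            return INF_TIME  # GPU memory fault surfaced as a runtime error
        return INVALID_TIME  # anything else is treated as unsupported
```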

Test Plan

Test Result

Submission Checklist

Copilot AI left a comment

Pull request overview

This PR improves the MP (multiprocessing) tuner's robustness by adding comprehensive error handling for GPU memory access faults and kernel execution issues. When tuning GPU kernels, some configurations can cause memory faults or hangs that would previously interrupt the entire tuning process. The changes isolate failures to individual tasks and allow tuning to continue.

Key changes include:

  • Enhanced exception handling in the worker process with specific handling for CUDA errors, timeouts, and memory faults
  • Introduction of distinct error codes: us=-1 for unsupported operations/errors and us=inf for memory faults/hangs
  • Configurable warmup iterations to improve profiling accuracy for short-running kernels (see the timing sketch below)
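
As a rough illustration of the warmup point, a timing loop like the following amortizes launch overhead across many iterations; this is a sketch assuming a PyTorch CUDA/HIP environment, and `time_kernel_us` with its defaults is hypothetical, standing in for the PR's --warmup/--iters options:

```python
import torch

def time_kernel_us(kernel, num_warmup=10, num_iters=100):
    for _ in range(num_warmup):   # warm caches/clocks so short kernels
        kernel()                  # are not dominated by startup cost
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(num_iters):
        kernel()
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) * 1000.0 / num_iters  # ms -> us per iter
```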

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 28 comments.

| File | Description |
| --- | --- |
| aiter/utility/mp_tuner.py | Major refactoring to add timeout handling, process-pool restart on crashes, explicit GPU ID passing, and comprehensive exception handling with multiple error types (a restart sketch appears after this review) |
| aiter/utility/base_tuner.py | Added constants for error classification (INVALID_TIME, INF_TIME, INVLAID_ERR_RATIO), configurable warmup/iteration counts, and updated filtering logic to handle both error types (a filtering sketch follows this table) |
| gradlib/gradlib/GemmTuner.py | Added num_warms parameter to configure warmup iterations, increased default cold/warm iteration counts, and integrated set_run_iters for dynamic profiling configuration |
| hsa/gfx942/fmoe_2stages/tune.py | Fixed tensor reshaping for scale parameters, updated return values from (-1, -1) to (0, 0) for invalid cases, and improved filtering to exclude both -1 and inf times |
| aiter/jit/core.py | Added a check to prevent a duplicate "cu_num" key when updating config files |
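
The filtering change noted for aiter/utility/base_tuner.py and hsa/gfx942/fmoe_2stages/tune.py might look roughly like this sketch; the (config, us) pair layout and names are assumptions, not the actual code:

```python
import math

INVALID_TIME = -1.0

def keep_valid(results):
    """Drop entries marked us=-1 (unsupported) or us=inf (fault/hang)."""
    return [(cfg, us) for cfg, us in results
            if us != INVALID_TIME and math.isfinite(us) and us > 0]
```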
Comments suppressed due to low confidence (1)

aiter/utility/mp_tuner.py:125

  • This comment appears to contain commented-out code.

```python
        #try:
        #    torch.cuda.synchronize()
        #    torch.cuda.empty_cache()
        #except:
        #    pass
```
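
The process-pool restart described for aiter/utility/mp_tuner.py in the table above could look roughly like the following; the pool sizing, `worker`, and the inf bookkeeping are illustrative assumptions, not the merged code:

```python
import math
import multiprocessing as mp

def run_all(tasks, worker, timeout_s=30):
    results = []
    pool = mp.Pool(processes=1)
    for task in tasks:
        try:
            results.append(pool.apply_async(worker, (task,)).get(timeout_s))
        except Exception:
            pool.terminate()             # worker crashed or hung: drop the pool
            pool.join()
            pool = mp.Pool(processes=1)  # restart so later tasks still run
            results.append(math.inf)     # record us=inf for this task
    pool.close()
    pool.join()
    return results
```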


@valarLip merged commit 67b95d4 into main on Dec 30, 2025
22 checks passed
@valarLip deleted the mp_tuner_time branch on December 30, 2025 at 13:22
farlukas pushed a commit that referenced this pull request on Jan 5, 2026:
* refactor tuner to support memory access fault not interrupting tuning

* reboot pool and us=inf for access fault, us=-1 for other error

* fix conflict

* Update tune.py

* fix error when tuning hipblaslt

* add timeout and iters args

* clean code

* update

* update

* rm print and update check_interval=10s

* update mp_tuner.py

* format

* fix batch_gemm tune

* update

* rm redundant code

* fix format
