
Conversation

@yzhou103 (Contributor) commented on Dec 18, 2025

Motivation

When adding new implementations, some cases can cause a memory access fault, which may interrupt the tuning process. This PR handles those faults so that tuning can continue.

Technical Details

  1. Fix memory access faults during tuning: us=inf means a memory fault or hang, and us=-1 means some other error (for example, an unsupported case); a minimal sketch follows this list.
  2. Add warmup iterations: in some cases the kernel time is very short, which can make the profiled time inaccurate, so it may be better to set the warmup time above 100us.
  3. Add three parameters:
    --timeout, timeout for getting the result of each test (seconds)
    --warmup, warmup iterations before profiling
    --iters, iterations used to gather perf data
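
A minimal sketch of how a tuner might map failures onto these codes, assuming a multiprocessing worker whose result is fetched with a timeout; `collect_result`, the constants, and the error-string checks are illustrative, not aiter's actual implementation:

```python
import math
from multiprocessing import TimeoutError as MPTimeoutError

INVALID_TIME = -1        # other errors, e.g. an unsupported case
INF_TIME = math.inf      # memory access fault or hang

def collect_result(async_result, timeout_s):
    """Fetch one tuning result, mapping failures to the us codes above."""
    try:
        return async_result.get(timeout=timeout_s)  # kernel time in us
    except MPTimeoutError:
        return INF_TIME      # hang, or a fault that killed the worker
    except RuntimeError as e:
        msg = str(e)
        if "memory access fault" in msg or "HIP error" in msg:
            return INF_TIME  # GPU memory fault surfaced as a runtime error
        return INVALID_TIME  # anything else is treated as unsupported
```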

Test Plan

Test Result

Submission Checklist

Copilot AI left a comment

Pull request overview

This PR improves the MP (multiprocessing) tuner's robustness by adding comprehensive error handling for GPU memory access faults and kernel execution issues. When tuning GPU kernels, some configurations can cause memory faults or hangs that would previously interrupt the entire tuning process. The changes isolate failures to individual tasks and allow tuning to continue.

Key changes include:

  • Enhanced exception handling in the worker process with specific handling for CUDA errors, timeouts, and memory faults
  • Introduction of distinct error codes: us=-1 for unsupported operations/errors and us=inf for memory faults/hangs
  • Configurable warmup iterations to improve profiling accuracy for short-running kernels (see the timing sketch below)
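
As a rough illustration of the warmup point, a timing loop like the following amortizes launch overhead across many iterations; this is a sketch assuming a PyTorch CUDA/HIP environment, and `time_kernel_us` with its defaults is hypothetical, standing in for the PR's --warmup/--iters options:

```python
import torch

def time_kernel_us(kernel, num_warmup=10, num_iters=100):
    for _ in range(num_warmup):   # warm caches/clocks so short kernels
        kernel()                  # are not dominated by startup cost
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(num_iters):
        kernel()
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) * 1000.0 / num_iters  # ms -> us per iter
```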

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 28 comments.

| File | Description |
| --- | --- |
| aiter/utility/mp_tuner.py | Major refactoring to add timeout handling, process-pool restart on crashes, explicit GPU ID passing, and comprehensive exception handling with multiple error types (a restart sketch appears after this review) |
| aiter/utility/base_tuner.py | Added constants for error classification (INVALID_TIME, INF_TIME, INVLAID_ERR_RATIO), configurable warmup/iteration counts, and updated filtering logic to handle both error types (a filtering sketch follows this table) |
| gradlib/gradlib/GemmTuner.py | Added num_warms parameter to configure warmup iterations, increased default cold/warm iteration counts, and integrated set_run_iters for dynamic profiling configuration |
| hsa/gfx942/fmoe_2stages/tune.py | Fixed tensor reshaping for scale parameters, updated return values from (-1, -1) to (0, 0) for invalid cases, and improved filtering to exclude both -1 and inf times |
| aiter/jit/core.py | Added a check to prevent a duplicate "cu_num" key when updating config files |
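
The filtering change noted for aiter/utility/base_tuner.py and hsa/gfx942/fmoe_2stages/tune.py might look roughly like this sketch; the (config, us) pair layout and names are assumptions, not the actual code:

```python
import math

INVALID_TIME = -1.0

def keep_valid(results):
    """Drop entries marked us=-1 (unsupported) or us=inf (fault/hang)."""
    return [(cfg, us) for cfg, us in results
            if us != INVALID_TIME and math.isfinite(us) and us > 0]
```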
Comments suppressed due to low confidence (1)

aiter/utility/mp_tuner.py:125

  • This comment appears to contain commented-out code.

```python
        #try:
        #    torch.cuda.synchronize()
        #    torch.cuda.empty_cache()
        #except:
        #    pass
```
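
The process-pool restart described for aiter/utility/mp_tuner.py in the table above could look roughly like the following; the pool sizing, `worker`, and the inf bookkeeping are illustrative assumptions, not the merged code:

```python
import math
import multiprocessing as mp

def run_all(tasks, worker, timeout_s=30):
    results = []
    pool = mp.Pool(processes=1)
    for task in tasks:
        try:
            results.append(pool.apply_async(worker, (task,)).get(timeout_s))
        except Exception:
            pool.terminate()             # worker crashed or hung: drop the pool
            pool.join()
            pool = mp.Pool(processes=1)  # restart so later tasks still run
            results.append(math.inf)     # record us=inf for this task
    pool.close()
    pool.join()
    return results
```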


@valarLip merged commit 67b95d4 into main on Dec 30, 2025
22 checks passed
@valarLip deleted the mp_tuner_time branch on December 30, 2025 at 13:22
farlukas pushed a commit that referenced this pull request on Jan 5, 2026:
* refactor tuner to support memory access fault not interrupting tuning

* reboot pool and us=inf for access fault, us=-1 for other error

* fix conflict

* Update tune.py

* fix error when tuning hipblaslt

* add timeout and iters args

* clean code

* update

* update

* rm print and update check_interval=10s

* update mp_tuner.py

* format

* fix batch_gemm tune

* update

* rm redundant code

* fix format
