NVFP4: Work around intermittent incorrect results for backward GEMMs #580
matthiasdiener wants to merge 5 commits into dev
Conversation
```python
# FIXME: hipBLASLt BF16xBF16->FP32 GEMM algos with ALPHA_DEVICE_VECTOR produce
# incorrect results intermittently on AMDGPU. Skip backward-containing sub-tests for
# nvfp4.
if IS_HIP_EXTENSION and QUANTIZATION == "nvfp4":
```
Is there a ticket for that?
nit: keep the original test_dict and add a conditional override after it, so the w/a will be easier to remove in the future and cause fewer merge conflicts
Also, should the GPU family be checked here?
> Is there a ticket for that?
I did not create one; I found it impossible to reproduce this issue without TE. Since the current implementation with the bf16 Dequant+GEMM is only a stop-gap until we have nvfp4 support in hipBLASLt, it may not be necessary to create a ticket.
> nit: keep original test_dict and add conditional override after that so it will be easier to remove the w/a in the future and less merging conflict
The workaround has been restructured in b609614.
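The restructured pattern the reviewer asked for can be sketched as follows. This is a hypothetical minimal sketch, not the exact code from b609614: the sub-test names in `test_dict` are illustrative, while `IS_HIP_EXTENSION` and `QUANTIZATION` follow the names in the PR diff.

```python
IS_HIP_EXTENSION = True   # assumption: running on ROCm
QUANTIZATION = "nvfp4"    # assumption: nvfp4 quantization selected

# The original test list stays untouched, which keeps the diff against
# upstream small and makes the workaround easy to delete later.
test_dict = ["forward", "backward", "forward_backward"]

# FIXME: hipBLASLt BF16xBF16->FP32 GEMM algos with ALPHA_DEVICE_VECTOR
# produce incorrect results intermittently on AMDGPU. Skip
# backward-containing sub-tests for nvfp4.
if IS_HIP_EXTENSION and QUANTIZATION == "nvfp4":
    test_dict = [t for t in test_dict if "backward" not in t]

print(test_dict)  # only the forward-only sub-tests remain
```

Because the override sits in its own clearly-fenced block after the original definition, removing the workaround later is a single-hunk revert.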
> Also, should gpu family be checked here?
No, I observed the same issue on gfx942 and gfx950.
… GEMMs" (This reverts commit 8f9f431.)
```python
    return {"rtol": 0.4, "atol": 0.25}
elif QUANTIZATION == "nvfp4":
    # TODO(zhongboz): investigate why the tolerance is so large
    if IS_HIP_EXTENSION:
```
nit: move this NV upstream TODO to their tolerance return statement; otherwise it looks like zhongboz added this IS_HIP_EXTENSION branch
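The nit can be illustrated with a hypothetical sketch of the tolerance helper: the upstream `TODO(zhongboz)` stays attached to the NVIDIA return statement, so the `IS_HIP_EXTENSION` branch reads as a clearly separate ROCm addition. The function shape and all tolerance values here are illustrative placeholders, not the ones from the PR.

```python
IS_HIP_EXTENSION = True   # assumption: running on ROCm
QUANTIZATION = "nvfp4"

def get_tolerances():
    """Illustrative tolerance lookup; values are placeholders."""
    if QUANTIZATION == "nvfp4":
        if IS_HIP_EXTENSION:
            # ROCm stop-gap path (bf16 Dequant+GEMM) needs its own
            # tolerances; values here are made up for the sketch.
            return {"rtol": 0.5, "atol": 0.3}
        # TODO(zhongboz): investigate why the tolerance is so large
        return {"rtol": 0.4, "atol": 0.25}
    return {"rtol": 1e-3, "atol": 1e-5}  # assumed default branch

print(get_tolerances())
```

With the TODO placed on the NVIDIA return, `git blame` and future readers attribute each branch to its actual author.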
Description
Main failing test:

```shell
torchrun --nproc_per_node=4 /dockerx/TransformerEngine/tests/pytorch/distributed/run_numerics.py --quantization nvfp4
```

Observed on gfx942 and gfx950.
Smaller reproducer
Type of change
Changes
Please list the changes introduced in this PR:
Checklist: