Training loss is prone to NaN or gradient explosion when training the repo on a local device (NVIDIA H100) #32

I encountered a training error. When I train the repo on an H100, the training loss becomes NaN within a few epochs. I am using the default config_large yaml, and the training dataset is MUSDB18. The log looks like this:
 Training Epoch 3 ...
2025-06-05 14:59:49,388 - INFO - Train Summary | Epoch 3 | Loss=0.0977 | Grad=26896.5504
2025-06-05 14:59:49,388 - INFO - ----------------------------------------------------------------------
2025-06-05 14:59:49,389 - INFO - Cross validation...
2025-06-05 15:09:57,869 - INFO - Valid Summary | Epoch 3 | Loss=0.1629 | Nsdr=5.748
2025-06-05 15:09:57,870 - INFO - New best valid nsdr 5.7475
2025-06-05 15:10:00,124 - INFO - Learning rate adjusted to 0.0003
2025-06-05 15:10:00,125 - INFO - ----------------------------------------------------------------------
2025-06-05 15:10:00,125 - INFO - Training Epoch 4 ...
2025-06-05 15:23:13,110 - INFO - Train Summary | Epoch 4 | Loss=nan | Grad=nan
2025-06-05 15:23:13,111 - INFO - ----------------------------------------------------------------------
2025-06-05 15:23:13,111 - INFO - Cross validation...
2025-06-05 15:33:26,859 - INFO - Valid Summary | Epoch 4 | Loss=0.1709 | Nsdr=5.462
2025-06-05 15:33:29,100 - INFO - Learning rate adjusted to 0.0003

However, when I switch from the H100 to an NVIDIA RTX 4090, the error disappears, with the same conda environment and code.
The torch version is 2.1.2 and CUDA is 12. How can I solve this error? Thank you.
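
For anyone triaging similar runs, here is a minimal sketch of how the "Train Summary" lines in the log above could be scanned for the first signs of instability. It is not part of the repo: the function name `find_unstable_epochs` and the `grad_limit` threshold of 1000 are assumptions chosen for illustration, and the regular expression assumes the exact log format shown above.

```python
import math
import re

# Matches "Train Summary" lines in the log format shown in this issue.
TRAIN_SUMMARY = re.compile(
    r"Train Summary \| Epoch (\d+) \| Loss=(\S+) \| Grad=(\S+)"
)


def find_unstable_epochs(log_text, grad_limit=1000.0):
    """Return (epoch, loss, grad) for epochs that look numerically unstable.

    An epoch is flagged when its loss or gradient norm is non-finite, or when
    the gradient norm exceeds grad_limit (a hypothetical threshold, not a
    value taken from the repo).
    """
    flagged = []
    for match in TRAIN_SUMMARY.finditer(log_text):
        epoch = int(match.group(1))
        loss = float(match.group(2))  # float("nan") parses without error
        grad = float(match.group(3))
        if not math.isfinite(loss) or not math.isfinite(grad) or grad > grad_limit:
            flagged.append((epoch, loss, grad))
    return flagged
```

Run against the log above, this would flag epoch 3 for its large gradient norm (26896.5504) before the loss turns NaN in epoch 4, which suggests the gradients were already exploding one epoch earlier.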
