Training loss becomes NaN (gradient explosion) when training the repo on a local NVIDIA H100 #32

@yangyiname

I ran into a training error. When I train the repo on an H100, the training loss becomes NaN within a few epochs. I am using the default config_large yaml, and the training dataset is MUSDB18. The log looks like this:
Training Epoch 3 ...
2025-06-05 14:59:49,388 - INFO - Train Summary | Epoch 3 | Loss=0.0977 | Grad=26896.5504
2025-06-05 14:59:49,388 - INFO - ----------------------------------------------------------------------
2025-06-05 14:59:49,389 - INFO - Cross validation...
2025-06-05 15:09:57,869 - INFO - Valid Summary | Epoch 3 | Loss=0.1629 | Nsdr=5.748
2025-06-05 15:09:57,870 - INFO - New best valid nsdr 5.7475
2025-06-05 15:10:00,124 - INFO - Learning rate adjusted to 0.0003
2025-06-05 15:10:00,125 - INFO - ----------------------------------------------------------------------
2025-06-05 15:10:00,125 - INFO - Training Epoch 4 ...
2025-06-05 15:23:13,110 - INFO - Train Summary | Epoch 4 | Loss=nan | Grad=nan
2025-06-05 15:23:13,111 - INFO - ----------------------------------------------------------------------
2025-06-05 15:23:13,111 - INFO - Cross validation...
2025-06-05 15:33:26,859 - INFO - Valid Summary | Epoch 4 | Loss=0.1709 | Nsdr=5.462
2025-06-05 15:33:29,100 - INFO - Learning rate adjusted to 0.0003

However, when I switch from the H100 to an NVIDIA RTX 4090, the error disappears, with the same conda environment and code.
The torch version is 2.1.2 and CUDA is 12. How can I solve this error? Thank you.
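
To make the H100 and RTX 4090 runs easier to compare, here is a minimal sketch of a log scanner that flags the first epoch where the training loss or gradient norm stops being finite, and any epoch where the gradient norm is already very large. It only assumes the `Train Summary | Epoch N | Loss=... | Grad=...` line format shown in the log above. The function name `scan_train_log` and the `grad_warn` threshold are made up for illustration and are not part of the repo.

```python
import math
import re

# Matches lines such as:
# "... - INFO - Train Summary | Epoch 4 | Loss=nan | Grad=nan"
SUMMARY_RE = re.compile(
    r"Train Summary \| Epoch (\d+) \| Loss=(\S+) \| Grad=(\S+)"
)


def scan_train_log(lines, grad_warn=1000.0):
    """Return one (epoch, loss, grad, status) tuple per Train Summary line.

    grad_warn is an arbitrary illustrative threshold, not a value from the repo.
    """
    report = []
    for line in lines:
        match = SUMMARY_RE.search(line)
        if match is None:
            continue  # skip validation lines, separators, etc.
        epoch = int(match.group(1))
        loss = float(match.group(2))  # float() also parses "nan" and "inf"
        grad = float(match.group(3))
        if not (math.isfinite(loss) and math.isfinite(grad)):
            status = "non-finite"
        elif grad > grad_warn:
            status = "large-grad"
        else:
            status = "ok"
        report.append((epoch, loss, grad, status))
    return report


# Two lines copied from the log in this issue.
sample_log = [
    "2025-06-05 14:59:49,388 - INFO - Train Summary | Epoch 3 | Loss=0.0977 | Grad=26896.5504",
    "2025-06-05 15:23:13,110 - INFO - Train Summary | Epoch 4 | Loss=nan | Grad=nan",
]

report = scan_train_log(sample_log)
for epoch, loss, grad, status in report:
    print(f"epoch {epoch}: loss={loss} grad={grad} -> {status}")
```

With the threshold assumed here, epoch 3 is already flagged for its gradient norm (26896.5504) one epoch before the loss turns into NaN, so the same scan on the RTX 4090 log would show whether the gradient norm grows there as well.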
