I encountered a training error. When I train this repo on an H100 device, the training loss becomes NaN within a few epochs. I am using the default config_large yaml, and the training dataset is MUSDB18. The log looks like this:
Training Epoch 3 ...
2025-06-05 14:59:49,388 - INFO - Train Summary | Epoch 3 | Loss=0.0977 | Grad=26896.5504
2025-06-05 14:59:49,388 - INFO - ----------------------------------------------------------------------
2025-06-05 14:59:49,389 - INFO - Cross validation...
2025-06-05 15:09:57,869 - INFO - Valid Summary | Epoch 3 | Loss=0.1629 | Nsdr=5.748
2025-06-05 15:09:57,870 - INFO - New best valid nsdr 5.7475
2025-06-05 15:10:00,124 - INFO - Learning rate adjusted to 0.0003
2025-06-05 15:10:00,125 - INFO - ----------------------------------------------------------------------
2025-06-05 15:10:00,125 - INFO - Training Epoch 4 ...
2025-06-05 15:23:13,110 - INFO - Train Summary | Epoch 4 | Loss=nan | Grad=nan
2025-06-05 15:23:13,111 - INFO - ----------------------------------------------------------------------
2025-06-05 15:23:13,111 - INFO - Cross validation...
2025-06-05 15:33:26,859 - INFO - Valid Summary | Epoch 4 | Loss=0.1709 | Nsdr=5.462
2025-06-05 15:33:29,100 - INFO - Learning rate adjusted to 0.0003
However, when I switch the device from the H100 to an NVIDIA RTX 4090, the error disappears, using the same conda env and code.
The torch version is 2.1.2 and CUDA is 12. How can I solve this error? Thank you.