Conversation

@Likhithsai2580

Add fix for CUDA out of memory error during backward pass.

* **Environment Variable**: Set `PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True` at the beginning of `lpm_kernel/L2/train.py` to reduce memory fragmentation.
* **Error Handling**: Wrap the backward pass in a try-except block that catches `torch.cuda.OutOfMemoryError`.
  * Log an error message when the out-of-memory error occurs.
  * Free cached GPU memory with `torch.cuda.empty_cache()` in the except block.
  * Retry the backward pass after freeing the memory (see the sketch below).
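
A minimal sketch of what these changes could look like. The helper name `backward_with_oom_retry` and the `loss` tensor are illustrative placeholders, not names taken from `lpm_kernel/L2/train.py`, and catching `torch.cuda.OutOfMemoryError` assumes PyTorch 1.13 or newer:

```python
import os

# The allocator setting must be in place before the first CUDA allocation,
# so it is applied before `torch` is imported (assumed placement at the top
# of lpm_kernel/L2/train.py).
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

import logging

import torch

logger = logging.getLogger(__name__)


def backward_with_oom_retry(loss: torch.Tensor) -> None:
    """Run loss.backward(), retrying once after freeing cached GPU memory on OOM."""
    try:
        loss.backward()
    except torch.cuda.OutOfMemoryError:
        logger.error(
            "CUDA out of memory during backward pass; emptying cache and retrying."
        )
        torch.cuda.empty_cache()
        # Retry once; if this also runs out of memory, the exception propagates
        # to the caller.
        loss.backward()
```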
@CLAassistant

CLAassistant commented Apr 12, 2025

CLA assistant check
All committers have signed the CLA.

@kevin-mindverse changed the base branch from master to develop on April 24, 2025 at 06:08
@ScarletttMoon
Collaborator

Hi @Likhithsai2580 👋,

Thank you so much for your contribution to this PR! Your work is really appreciated. If you haven’t already, feel free to join our Discord community here: Discord Invite Link. It's a great place to connect with our team and other contributors, share ideas, and stay up to date with the project! You can find me as @scarlettt_moon there!

Looking forward to connecting! 😊
