Conversation

@Likhithsai2580

Add fix for CUDA out of memory error during backward pass.

* **Environment Variable**: Set `PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True` at the beginning of `lpm_kernel/L2/train.py` to reduce memory fragmentation.
* **Error Handling**: Wrap the backward pass in a try-except block that catches `torch.cuda.OutOfMemoryError`.
  * Log an error message when the out-of-memory error occurs.
  * Free cached GPU memory with `torch.cuda.empty_cache()` in the except block.
  * Retry the backward pass after freeing the memory (see the sketch below).
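
A minimal sketch of what these changes could look like. The helper name `backward_with_oom_retry` and the `loss` tensor are illustrative placeholders, not names taken from `lpm_kernel/L2/train.py`, and catching `torch.cuda.OutOfMemoryError` assumes PyTorch 1.13 or newer:

```python
import os

# The allocator setting must be in place before the first CUDA allocation,
# so it is applied before `torch` is imported (assumed placement at the top
# of lpm_kernel/L2/train.py).
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

import logging

import torch

logger = logging.getLogger(__name__)


def backward_with_oom_retry(loss: torch.Tensor) -> None:
    """Run loss.backward(), retrying once after freeing cached GPU memory on OOM."""
    try:
        loss.backward()
    except torch.cuda.OutOfMemoryError:
        logger.error(
            "CUDA out of memory during backward pass; emptying cache and retrying."
        )
        torch.cuda.empty_cache()
        # Retry once; if this also runs out of memory, the exception propagates
        # to the caller.
        loss.backward()
```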
@CLAassistant

CLAassistant commented Apr 12, 2025

CLA assistant check
All committers have signed the CLA.

@kevin-mindverse changed the base branch from master to develop on April 24, 2025 at 06:08
@ScarletttMoon
Collaborator

Hi @Likhithsai2580 👋,

Thank you so much for your contribution to this PR! Your work is really appreciated. If you haven’t already, feel free to join our Discord community here: Discord Invite Link. It's a great place to connect with our team and other contributors, share ideas, and stay up to date with the project! You can find me as @scarlettt_moon there!

Looking forward to connecting! 😊
