[TE/Examples] Memory initialization and HIP cleanup fixes #1179
Conversation
The GPU driver may optimize transfers of uninitialized memory, whereas in real applications the memory is always initialized, so the benchmark should initialize its memory as well. Added a new command-line argument --init_mem to transfer_engine_bench; its default value is true.
- transfer_engine_validator checks data on random iterations, so it may happen that it never checks data at all. Added a check on at least the first iteration.
- Print a message when data validation passes.
hipMemRetainAllocationHandle fails if the memory was allocated with hipMalloc, which can happen inside NvlinkTransport::allocatePinnedLocalMemory. So, simply call freePinnedLocalMemory, which decides how to free the memory.
Summary of Changes (Gemini Code Assist)
This pull request enhances the reliability and accuracy of the Transfer Engine examples and validation tools. It addresses memory initialization in benchmarks to ensure more realistic performance measurements, guarantees that data integrity checks actually run within the validator, and refines memory management for HIP environments, resolving potential deallocation failures.
Code Review
This pull request introduces several useful fixes for the transfer engine examples. The changes to initialize memory for more realistic benchmarking and to ensure the validator checks data at least once are great improvements. The fix for HIP memory cleanup is also important. However, I've found a critical resource leak in the memory deallocation logic for CUDA/MUSA that was highlighted by the new HIP-specific code. My review includes suggestions to fix this leak by simplifying and unifying the deallocation logic across all platforms, which will also improve code consistency.
```cpp
#if defined(USE_CUDA) || defined(USE_MUSA)
        CUmemGenericAllocationHandle handle;
        auto result = cuMemRetainAllocationHandle(&handle, addr);
        if (result == CUDA_SUCCESS) {
            mooncake::NvlinkTransport::freePinnedLocalMemory(addr);
            return;
        }
#endif
#elif USE_HIP
        // hipMemRetainAllocationHandle fails if memory was allocated using
        // hipMalloc, which is possible inside
        // NvlinkTransport::allocatePinnedLocalMemory. So, call just
        // freePinnedLocalMemory, which will handle how to free
        if (FLAGS_use_vram) {
            mooncake::NvlinkTransport::freePinnedLocalMemory(addr);
            return;
        }
#endif  // USE_HIP
```
The logic for CUDA/MUSA has a resource leak. cuMemRetainAllocationHandle increments the reference count, but there's no corresponding cuMemRelease call for the handle obtained here. This will leak resources.
The implementation for USE_HIP is better as it delegates the deallocation logic to freePinnedLocalMemory, which correctly handles different memory types. This approach avoids redundant checks and resource leaks.
I recommend applying the same simplified logic to all cases (CUDA, MUSA, and HIP) to fix the leak and improve code consistency.
```cpp
if (FLAGS_use_vram) {
    // freePinnedLocalMemory can handle memory allocated by either
    // NvlinkTransport::allocatePinnedLocalMemory or regular hipMalloc/cudaMalloc.
    mooncake::NvlinkTransport::freePinnedLocalMemory(addr);
    return;
}
```
Done
From gemini: The logic for CUDA/MUSA has a resource leak. cuMemRetainAllocationHandle increments the reference count, but there's no corresponding cuMemRelease call for the handle obtained here. This will leak resources. The implementation for USE_HIP is better as it delegates the deallocation logic to freePinnedLocalMemory, which correctly handles different memory types. This approach avoids redundant checks and resource leaks.
```cpp
//
// Modifications Copyright(C) 2025 Advanced Micro Devices, Inc.
// All rights reserved.
```
Please omit the copyright notice if it's not essential. Thanks a lot. Otherwise LGTM. ^^
@stmatengss, I am afraid that is not possible. Internal rules at my company require adding copyright notices to any modified source files.
Description
Changes to the transfer engine examples:
Type of Change
How Has This Been Tested?
Ran transfer_engine_bench and transfer_engine_validator on nodes with AMD GPUs.
Checklist