
Conversation

@amd-arozanov (Contributor) commented Dec 8, 2025

Description

Transfer engine examples changes:

  1. Initialize allocated memory:
    • The GPU driver may optimize transfers of uninitialized memory, whereas in real applications the memory is always initialized. So the benchmarks need to initialize memory too (see the sketch after this list).
  2. Check data at least once in TransferEngineValidator:
    • transfer_engine_validator checks data on random iterations, so it may happen that it never checks data at all. Added a check on the first iteration to guarantee at least one validation.
  3. Directly call freePinnedLocalMemory for HIP:
    • hipMemRetainAllocationHandle fails if memory was allocated with hipMalloc, which is possible inside NvlinkTransport::allocatePinnedLocalMemory. So just call freePinnedLocalMemory, which handles how to free the memory.
    • UPDATE: Directly call freePinnedLocalMemory for CUDA and MUSA too, as suggested by Gemini.
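
A minimal sketch of the initialization idea in item 1, assuming a CUDA-style runtime (the equivalent hipMemset call applies on HIP); the helper name and fill pattern are illustrative, not the benchmark's actual code:

    #include <cuda_runtime.h>

    // Hypothetical helper: fill a freshly allocated device buffer so the
    // driver cannot treat the transfer source as untouched memory.
    void initDeviceBuffer(void *dev_ptr, size_t size) {
        // Non-zero byte pattern; all-zero pages may still be optimizable.
        cudaMemset(dev_ptr, 0xA5, size);
    }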

Type of Change

  • Types
    • Bug fix
    • New feature
      • Transfer Engine
      • Mooncake Store
      • Mooncake EP
      • Integration
      • P2P Store
      • Python Wheel
    • Breaking change
    • CI/CD
    • Documentation update
    • Other

How Has This Been Tested?

Ran transfer_engine_bench and transfer_engine_validator on nodes with AMD GPUs.

Checklist

  • I have performed a self-review of my own code.
  • I have updated the documentation.
  • I have added tests to prove my changes are effective.

GPU driver may optimize transfers of non-initialized memory,
while in real applications the memory is always initialized.
So the memory needs to be initialized in the benchmark too.

Added a new command-line argument --init_mem to transfer_engine_bench.
Default value is true.
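
A sketch of how such a flag is typically declared with gflags (which the benchmark already uses, e.g. FLAGS_use_vram); the description string here is assumed:

    #include <gflags/gflags.h>

    // Hypothetical wording; only the flag name and default come from the PR.
    DEFINE_bool(init_mem, true,
                "Initialize allocated memory before running transfers");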
- transfer_engine_validator checks data on random iterations.
  It may happen that it does not check data at all.
  So, added a check at least on the first iteration.
- Print that data validation passed.
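
A minimal sketch of this guarantee, assuming the validator draws a random per-iteration decision; the function name and probability parameter are illustrative:

    #include <random>

    // Always validate on the first iteration, then randomly thereafter,
    // so at least one data check is guaranteed per run.
    bool shouldValidate(int iteration, std::mt19937 &rng, double probability) {
        if (iteration == 0) return true;  // guaranteed first check
        std::bernoulli_distribution coin(probability);
        return coin(rng);
    }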
hipMemRetainAllocationHandle fails if memory was allocated using hipMalloc,
which is possible inside NvlinkTransport::allocatePinnedLocalMemory.

So, just call freePinnedLocalMemory, which handles how to free the memory.
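
For context, one plausible shape for a free routine that "handles how to free" on HIP; this is a hedged guess at the pattern, not the actual NvlinkTransport::freePinnedLocalMemory implementation:

    #include <hip/hip_runtime.h>

    // Hypothetical: probe whether addr came from the VMM API
    // (hipMemCreate/hipMemMap); fall back to hipFree for hipMalloc memory.
    void freePinnedLocalMemorySketch(void *addr) {
        hipMemGenericAllocationHandle_t handle;
        if (hipMemRetainAllocationHandle(&handle, addr) == hipSuccess) {
            hipMemRelease(handle);  // balance the refcount bump from the probe
            // ... unmap and release the VMM allocation here ...
        } else {
            hipFree(addr);  // plain hipMalloc allocation
        }
    }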
@gemini-code-assist (Contributor) commented

Summary of Changes

Hello @amd-arozanov, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request enhances the reliability and accuracy of the Transfer Engine examples and validation tools. It addresses critical issues related to memory initialization in benchmarks to ensure more realistic performance measurements and guarantees data integrity checks within the validator. Additionally, it refines the memory management for HIP environments, resolving potential deallocation failures.

Highlights

  • Memory Initialization in Examples: Allocated memory in Transfer Engine examples is now initialized by default, preventing GPU driver optimizations for uninitialized memory that could skew benchmark results. A new --init_mem flag controls this behavior.
  • Guaranteed Data Validation: The TransferEngineValidator now ensures that data is checked at least once during its execution, specifically on the first iteration, to prevent scenarios where random checks might miss validation entirely.
  • HIP Memory Deallocation Fix: Improved HIP memory deallocation logic in freeMemoryPool to directly call NvlinkTransport::freePinnedLocalMemory when hipMemRetainAllocationHandle might fail for hipMalloc-allocated memory.

@gemini-code-assist (gemini-code-assist bot) left a comment

Code Review

This pull request introduces several useful fixes for the transfer engine examples. The changes to initialize memory for more realistic benchmarking and to ensure the validator checks data at least once are great improvements. The fix for HIP memory cleanup is also important. However, I've found a critical resource leak in the memory deallocation logic for CUDA/MUSA that was highlighted by the new HIP-specific code. My review includes suggestions to fix this leak by simplifying and unifying the deallocation logic across all platforms, which will also improve code consistency.

Comment on lines 133 to 149
#if defined(USE_CUDA) || defined(USE_MUSA)
    CUmemGenericAllocationHandle handle;
    auto result = cuMemRetainAllocationHandle(&handle, addr);
    if (result == CUDA_SUCCESS) {
        mooncake::NvlinkTransport::freePinnedLocalMemory(addr);
        return;
    }
#endif
#elif USE_HIP
    // hipMemRetainAllocationHandle fails if memory was allocated using
    // hipMalloc, which is possible inside
    // NvlinkTransport::allocatePinnedLocalMemory. So, call just
    // freePinnedLocalMemory, which will handle how to free
    if (FLAGS_use_vram) {
        mooncake::NvlinkTransport::freePinnedLocalMemory(addr);
        return;
    }
#endif  // USE_HIP
Contributor

critical

The logic for CUDA/MUSA has a resource leak. cuMemRetainAllocationHandle increments the reference count, but there's no corresponding cuMemRelease call for the handle obtained here. This will leak resources.

The implementation for USE_HIP is better as it delegates the deallocation logic to freePinnedLocalMemory, which correctly handles different memory types. This approach avoids redundant checks and resource leaks.

I recommend applying the same simplified logic to all cases (CUDA, MUSA, and HIP) to fix the leak and improve code consistency.

    if (FLAGS_use_vram) {
        // freePinnedLocalMemory can handle memory allocated by either
        // NvlinkTransport::allocatePinnedLocalMemory or regular hipMalloc/cudaMalloc.
        mooncake::NvlinkTransport::freePinnedLocalMemory(addr);
        return;
    }

Contributor Author

Done



From gemini:
The logic for CUDA/MUSA has a resource leak. cuMemRetainAllocationHandle increments the reference count, but there's no corresponding cuMemRelease call for the handle obtained here. This will leak resources.

The implementation for USE_HIP is better as it delegates the deallocation logic to freePinnedLocalMemory, which correctly handles different memory types. This approach avoids redundant checks and resource leaks.
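
For illustration, had the probe been kept instead, the retain would need a matching release; a sketch of that alternative (not what the PR adopted, which simply delegates to freePinnedLocalMemory):

    #include <cuda.h>

    // Hypothetical alternative fix: balance cuMemRetainAllocationHandle
    // with cuMemRelease so the probe itself does not leak a reference.
    bool isVmmAllocation(void *addr) {
        CUmemGenericAllocationHandle handle;
        if (cuMemRetainAllocationHandle(&handle, addr) == CUDA_SUCCESS) {
            cuMemRelease(handle);  // undo the refcount increment
            return true;
        }
        return false;
    }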
Comment on lines +14 to +16
//
// Modifications Copyright(C) 2025 Advanced Micro Devices, Inc.
// All rights reserved.
Collaborator

Please omit the copyright notice if it's not essential. Thanks a lot. Otherwise LGTM. ^^

Contributor Author

@stmatengss, I am afraid that is not possible. Internal rules at my company require adding copyright notices to any modifications in source files.
