
Conversation

@amd-arozanov (Contributor) commented Dec 8, 2025

Description

Transfer engine examples changes:

  1. Initialize allocated memory:
    • The GPU driver may optimize transfers of uninitialized memory, whereas in real applications the memory is always initialized. So the benchmarks need to initialize memory too (see the sketch after this list).
  2. Check data at least once in TransferEngineValidator:
    • transfer_engine_validator checks data on random iterations, so it may happen that it never checks data at all. Added a check on the first iteration to guarantee at least one validation.
  3. Directly call freePinnedLocalMemory for HIP:
    • hipMemRetainAllocationHandle fails if memory was allocated with hipMalloc, which is possible inside NvlinkTransport::allocatePinnedLocalMemory. So just call freePinnedLocalMemory, which handles how to free the memory.
    • UPDATE: Directly call freePinnedLocalMemory for CUDA and MUSA too, as suggested by Gemini.
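
A minimal sketch of the initialization idea in item 1, assuming a CUDA-style runtime (the equivalent hipMemset call applies on HIP); the helper name and fill pattern are illustrative, not the benchmark's actual code:

    #include <cuda_runtime.h>

    // Hypothetical helper: fill a freshly allocated device buffer so the
    // driver cannot treat the transfer source as untouched memory.
    void initDeviceBuffer(void *dev_ptr, size_t size) {
        // Non-zero byte pattern; all-zero pages may still be optimizable.
        cudaMemset(dev_ptr, 0xA5, size);
    }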

Type of Change

  • Types
    • Bug fix
    • New feature
      • Transfer Engine
      • Mooncake Store
      • Mooncake EP
      • Integration
      • P2P Store
      • Python Wheel
    • Breaking change
    • CI/CD
    • Documentation update
    • Other

How Has This Been Tested?

Ran transfer_engine_bench and transfer_engine_validator on nodes with AMD GPUs.

Checklist

  • I have performed a self-review of my own code.
  • I have updated the documentation.
  • I have added tests to prove my changes are effective.

GPU driver may optimize transfers of non-initialized memory,
while in real applications the memory is always initialized.
So the memory needs to be initialized in the benchmark too.

Added a new command-line argument --init_mem to transfer_engine_bench.
Default value is true.
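
A sketch of how such a flag is typically declared with gflags (which the benchmark already uses, e.g. FLAGS_use_vram); the description string here is assumed:

    #include <gflags/gflags.h>

    // Hypothetical wording; only the flag name and default come from the PR.
    DEFINE_bool(init_mem, true,
                "Initialize allocated memory before running transfers");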
- transfer_engine_validator checks data on random iterations.
  It may happen that it does not check data at all.
  So, added a check at least on the first iteration.
- Print that data validation passed.
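
A minimal sketch of this guarantee, assuming the validator draws a random per-iteration decision; the function name and probability parameter are illustrative:

    #include <random>

    // Always validate on the first iteration, then randomly thereafter,
    // so at least one data check is guaranteed per run.
    bool shouldValidate(int iteration, std::mt19937 &rng, double probability) {
        if (iteration == 0) return true;  // guaranteed first check
        std::bernoulli_distribution coin(probability);
        return coin(rng);
    }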
hipMemRetainAllocationHandle fails if memory was allocated using hipMalloc,
which is possible inside NvlinkTransport::allocatePinnedLocalMemory.

So, just call freePinnedLocalMemory, which handles how to free the memory.
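
For context, one plausible shape for a free routine that "handles how to free" on HIP; this is a hedged guess at the pattern, not the actual NvlinkTransport::freePinnedLocalMemory implementation:

    #include <hip/hip_runtime.h>

    // Hypothetical: probe whether addr came from the VMM API
    // (hipMemCreate/hipMemMap); fall back to hipFree for hipMalloc memory.
    void freePinnedLocalMemorySketch(void *addr) {
        hipMemGenericAllocationHandle_t handle;
        if (hipMemRetainAllocationHandle(&handle, addr) == hipSuccess) {
            hipMemRelease(handle);  // balance the refcount bump from the probe
            // ... unmap and release the VMM allocation here ...
        } else {
            hipFree(addr);  // plain hipMalloc allocation
        }
    }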
@gemini-code-assist (Contributor) commented

Summary of Changes

Hello @amd-arozanov, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request enhances the reliability and accuracy of the Transfer Engine examples and validation tools. It addresses critical issues related to memory initialization in benchmarks to ensure more realistic performance measurements and guarantees data integrity checks within the validator. Additionally, it refines the memory management for HIP environments, resolving potential deallocation failures.

Highlights

  • Memory Initialization in Examples: Allocated memory in Transfer Engine examples is now initialized by default, preventing GPU driver optimizations for uninitialized memory that could skew benchmark results. A new --init_mem flag controls this behavior.
  • Guaranteed Data Validation: The TransferEngineValidator now ensures that data is checked at least once during its execution, specifically on the first iteration, to prevent scenarios where random checks might miss validation entirely.
  • HIP Memory Deallocation Fix: Improved HIP memory deallocation logic in freeMemoryPool to directly call NvlinkTransport::freePinnedLocalMemory when hipMemRetainAllocationHandle might fail for hipMalloc-allocated memory.

@gemini-code-assist (gemini-code-assist bot) left a comment

Code Review

This pull request introduces several useful fixes for the transfer engine examples. The changes to initialize memory for more realistic benchmarking and to ensure the validator checks data at least once are great improvements. The fix for HIP memory cleanup is also important. However, I've found a critical resource leak in the memory deallocation logic for CUDA/MUSA that was highlighted by the new HIP-specific code. My review includes suggestions to fix this leak by simplifying and unifying the deallocation logic across all platforms, which will also improve code consistency.

Comment on lines 133 to 149
#if defined(USE_CUDA) || defined(USE_MUSA)
    CUmemGenericAllocationHandle handle;
    auto result = cuMemRetainAllocationHandle(&handle, addr);
    if (result == CUDA_SUCCESS) {
        mooncake::NvlinkTransport::freePinnedLocalMemory(addr);
        return;
    }
#endif
#elif USE_HIP
    // hipMemRetainAllocationHandle fails if memory was allocated using
    // hipMalloc, which is possible inside
    // NvlinkTransport::allocatePinnedLocalMemory. So, call just
    // freePinnedLocalMemory, which will handle how to free
    if (FLAGS_use_vram) {
        mooncake::NvlinkTransport::freePinnedLocalMemory(addr);
        return;
    }
#endif  // USE_HIP
Contributor

critical

The logic for CUDA/MUSA has a resource leak. cuMemRetainAllocationHandle increments the reference count, but there's no corresponding cuMemRelease call for the handle obtained here. This will leak resources.

The implementation for USE_HIP is better as it delegates the deallocation logic to freePinnedLocalMemory, which correctly handles different memory types. This approach avoids redundant checks and resource leaks.

I recommend applying the same simplified logic to all cases (CUDA, MUSA, and HIP) to fix the leak and improve code consistency.

    if (FLAGS_use_vram) {
        // freePinnedLocalMemory can handle memory allocated by either
        // NvlinkTransport::allocatePinnedLocalMemory or regular hipMalloc/cudaMalloc.
        mooncake::NvlinkTransport::freePinnedLocalMemory(addr);
        return;
    }

Contributor Author

Done



From gemini:
The logic for CUDA/MUSA has a resource leak. cuMemRetainAllocationHandle increments the reference count, but there's no corresponding cuMemRelease call for the handle obtained here. This will leak resources.

The implementation for USE_HIP is better as it delegates the deallocation logic to freePinnedLocalMemory, which correctly handles different memory types. This approach avoids redundant checks and resource leaks.
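
For illustration, had the probe been kept instead, the retain would need a matching release; a sketch of that alternative (not what the PR adopted, which simply delegates to freePinnedLocalMemory):

    #include <cuda.h>

    // Hypothetical alternative fix: balance cuMemRetainAllocationHandle
    // with cuMemRelease so the probe itself does not leak a reference.
    bool isVmmAllocation(void *addr) {
        CUmemGenericAllocationHandle handle;
        if (cuMemRetainAllocationHandle(&handle, addr) == CUDA_SUCCESS) {
            cuMemRelease(handle);  // undo the refcount increment
            return true;
        }
        return false;
    }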
Comment on lines +14 to +16
//
// Modifications Copyright(C) 2025 Advanced Micro Devices, Inc.
// All rights reserved.
Collaborator

Please omit the copyright notice if it's not essential. Thanks a lot. Otherwise LGTM. ^^

Contributor Author

@stmatengss, I am afraid that is not possible. Internal rules at my company require adding copyright notices to any modifications in source files.
