TransferBench v1.67.0 #273
Merged
* Adding support for GFX/DMA executors accessing remote memory via UALoE
* CUDA Driver API addition
* NVML initiation
* MNNVL support
* pod presets
…eck issue for VMs
merge develop changes
* fix qpCount storage limit to allow 256+ (#237)
* fix function header GetClosestGpusToNic (#238)
  Fix the function header of GetClosestGpusToNic to match the function definition and function calls.
* Fixed CQ size for high-QP cases and poll CQ in batches
  CQ size: max(100, qpCount), dynamically sized. This avoids hangs at large QP counts, notably experienced with small message sizes (e.g. 256 QPs, 8M message size).
  Polling: up to 32 completions per poll call, to reduce the number of poll calls.
* improve DMABUF zcat check
  Improve the DMABUF zcat check, similar to ROCM-2855.
* add NIC_CQ_POLL_BATCH option as CQ poll batch size
  Add NIC_CQ_POLL_BATCH as an option to ibv_poll_cq for the CQ poll batch size. Set the default value to `4`, which appears to be the current RCCL default. Replace fixed wc_array with vector wc.data.
  Files changed:
  - `src/header/TransferBench.hpp`
  - `src/client/EnvVars.hpp`
* align with develop
* wc_array move out of the while loop from PR review
* Update CHANGELOG.md
* Revert "fix function header GetClosestGpusToNic (#238)"
  This reverts commit a8cf384.
* Revert "improve DMABUF zcat check"
  This reverts commit 6d88473.

---------

Co-authored-by: Pak Nin Lui <pak.lui@amd.com>
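The CQ sizing and batched polling described above can be sketched as follows. This is a minimal illustration, not the code in `TransferBench.hpp`: `WorkCompletion`, `PollFn`, `ComputeCqSize`, `GetCqPollBatch`, and `DrainCompletions` are hypothetical names, and `PollFn` only models the shape of an `ibv_poll_cq(cq, numEntries, wc)` call so the sketch compiles without libibverbs. The fallback behavior for invalid `NIC_CQ_POLL_BATCH` values is also an assumption.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdlib>
#include <functional>
#include <vector>

// Hypothetical stand-in for ibv_wc; only the field this sketch needs.
struct WorkCompletion { int status = 0; };

// CQ depth as described in the commit message: max(100, qpCount).
inline int ComputeCqSize(int qpCount) {
  return std::max(100, qpCount);
}

// Read the poll batch size from NIC_CQ_POLL_BATCH, defaulting to 4.
// Assumption: unparsable or out-of-range values fall back to the default.
inline int GetCqPollBatch() {
  const char* env = std::getenv("NIC_CQ_POLL_BATCH");
  if (env == nullptr) return 4;
  char* end = nullptr;
  long value = std::strtol(env, &end, 10);
  if (end == env || *end != '\0' || value <= 0 || value > 65536) return 4;
  return static_cast<int>(value);
}

// Models ibv_poll_cq: writes up to maxEntries completions into wc and
// returns how many were written (0 if none are ready, negative on error).
using PollFn = std::function<int(int maxEntries, WorkCompletion* wc)>;

// Polls until `expected` completions have arrived, asking for at most
// `batch` per call. Returns the number of poll calls made, or -1 on error.
inline int DrainCompletions(const PollFn& pollFn, int expected, int batch) {
  assert(batch > 0);
  std::vector<WorkCompletion> wc(batch);  // allocated once, outside the loop
  int received = 0;
  int calls = 0;
  while (received < expected) {
    int n = pollFn(std::min(batch, expected - received), wc.data());
    ++calls;
    if (n < 0) return -1;
    for (int i = 0; i < n; ++i) {
      if (wc[i].status != 0) return -1;  // non-zero models a failed completion
    }
    received += n;
  }
  return calls;
}
```

With a batch of 4, draining 10 ready completions takes three poll calls instead of ten, which is the reduction in poll calls the commit is after.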
Added a preset which sweeps all combinations of tuning parameters for a single transfer
Co-authored-by: Tim <43156029+AtlantaPepsi@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
- Move all toolchain logic (ROCM_PATH detection, compiler selection, per-config build flags) into a pre-project() block in CMakeLists, where it executes unconditionally on every configure step. Detection priority is preserved.
- Replace plain set(CMAKE_CXX_FLAGS_DEBUG ...) calls with CMake-idiomatic CMAKE_<LANG>_FLAGS_<CONFIG>_INIT variables.
- Bump cmake_minimum_required to 3.16 (for MPI::MPI_CXX and hip:: targets)
- Fix GPU_TARGETS seeding and respect AMDGPU_TARGETS
- Add cmake_push_check_state/pop around check_symbol_exists calls to prevent CMAKE_REQUIRED_* leaking between checks
- Fix HSA find_library to use NO_DEFAULT_PATH and search lib64 as well
- Fix spurious MPI_PATH logic
- Remove redundant double include of cmake/Dependencies.cmake
- Modernize target_link_libraries and compact target_include_directories calls
- Move PACKAGE_NAME/LIBRARY_NAME/CMAKE_RUNTIME_OUTPUT_DIRECTORY before add_executable
- parallel-jobs: add check_cxx_compiler_flag to detect support
- DISABLE_DMABUF → DISABLE_DMA_BUF: align CMake env var name
- AMD_SMI: add find_library/find_path for amd_smi

---------

Co-authored-by: Claude <claude@anthropic.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
…bes (#269)

Previously, pod communication support was gated on hipconfig reporting the HIP version, and AMD-SMI support was gated on the amd-smi CLI reporting the library version. Both approaches are fragile: they depend on external tools being in PATH and tie enablement to version numbers that may not reflect actual API availability.

Replace with build probes that call the exact functions used at runtime:
- HIP probe: hipMemFabricHandle_t, hipMemGenericAllocationHandle_t, hipMemExportToShareableHandle, hipMemImportFromShareableHandle
- AMD-SMI probe: amdsmi_get_processor_handle_from_bdf, amdsmi_get_gpu_fabric_info

The probes are applied consistently in both the Makefile and CMake. DISABLE_AMD_SMI / ENABLE_AMD_SMI controls are preserved as independent user overrides regardless of probe outcome.

Also fix the amdsmi_get_processor_handle_from_bdf call site in TransferBench.hpp to pass amdsmi_bdf_t instead of the removed char* BDF string argument, update the fabric info field path to fabric_info.fabric_version.v1.*, and guard against sscanf failure before populating the BDF struct.

---------

Co-authored-by: Claude <claude@anthropic.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
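The sscanf guard mentioned in the last paragraph can be sketched like this. It is a hedged illustration only: `BdfFields` and `ParseBdf` are hypothetical names standing in for the real `amdsmi_bdf_t` handling, and the `domain:bus:device.function` hex format is an assumption about how the BDF string is laid out.

```cpp
#include <cstdint>
#include <cstdio>

// Hypothetical stand-in for the AMD SMI BDF struct; field names are
// illustrative and not taken from the amd_smi headers.
struct BdfFields {
  uint64_t domain = 0;
  uint32_t bus = 0;
  uint32_t device = 0;
  uint32_t function = 0;
};

// Parses a PCI address of the assumed form "DDDD:BB:DD.F" (hex fields).
// Returns false and leaves `out` untouched unless all four fields match,
// so a malformed string never produces a half-filled struct.
inline bool ParseBdf(const char* text, BdfFields& out) {
  if (text == nullptr) return false;
  unsigned int domain = 0, bus = 0, device = 0, function = 0;
  int matched = std::sscanf(text, "%x:%x:%x.%x",
                            &domain, &bus, &device, &function);
  if (matched != 4) return false;  // guard: bail out before populating
  out.domain = domain;
  out.bus = bus;
  out.device = device;
  out.function = function;
  return true;
}
```

Parsing into locals first and copying only after the match count is checked keeps the caller's struct in its previous state on failure.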
- Guards for CPUs with no memory
- Fixing NUMA check ordering
- Fixing Topology display in client
…kages (#313)

Add candidate branch to PR triggers so pushes and PRs against candidate run the full build pipeline.

Gate all S3 upload steps on ref_name and base_ref not being 'candidate', so packages are not published until candidate is promoted to develop. Candidate builds are for validation only; no artifacts should be retained.

The verification steps (dpkg-deb, rpm -qip) still confirm that the packages get built, and ease the merge to develop.

---------

Co-authored-by: Claude <claude@anthropic.com>
…#315)

Adds support for marking RoCE/IB traffic with specific DSCP/QoS values.
- NIC_TRAFFIC_CLASS (default=0): sets the DSCP/traffic class byte in the RoCE GRH (grh.traffic_class) when transitioning QPs to RTR state.
- NIC_SERVICE_LEVEL (default=0): sets the IB service level (ah_attr.sl) on QPs. This applies to IB and RoCE connections.
- NicOptions: added uint8_t serviceLevel and uint8_t trafficClass fields
- TransitionQpToRtr(): accepts trafficClass and serviceLevel as parameters; sets grh.traffic_class (RoCE only) and ah_attr.sl (all QP types)

---------

Co-authored-by: Pak Nin Lui <paklui@smc300x-ccs-aus-gpuf2c9.prov.aus.ccs.cpe.ice.amd.com>
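A rough sketch of how the two options map onto the address-handle fields named above. `MockAhAttr` and `MockGrh` are hypothetical stand-ins that copy only the `sl` and `grh.traffic_class` field names from `ibv_ah_attr`, so the example builds without libibverbs; `ApplyQos` and `TrafficClassFromDscp` are illustrative helpers, not functions from this PR. The DSCP helper relies on the standard layout of the traffic class byte (DSCP in the upper six bits, ECN in the lower two).

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-ins mirroring the ibv_ah_attr fields named in the
// commit message (ah_attr.sl and ah_attr.grh.traffic_class).
struct MockGrh { uint8_t traffic_class = 0; };
struct MockAhAttr {
  uint8_t sl = 0;
  MockGrh grh;
};

// Same two fields the commit adds to NicOptions.
struct NicOptions {
  uint8_t serviceLevel = 0;
  uint8_t trafficClass = 0;
};

// Service level is applied to every QP type; the GRH traffic class is
// applied only for RoCE, matching the behavior described for
// TransitionQpToRtr().
inline void ApplyQos(MockAhAttr& ah, bool isRoCE, const NicOptions& opts) {
  ah.sl = opts.serviceLevel;
  if (isRoCE) {
    ah.grh.traffic_class = opts.trafficClass;
  }
}

// Converts a 6-bit DSCP code point into a full traffic class byte with
// the two ECN bits left at zero.
inline uint8_t TrafficClassFromDscp(uint8_t dscp) {
  assert(dscp < 64);
  return static_cast<uint8_t>(dscp << 2);
}
```

For example, DSCP 26 corresponds to a traffic class byte of 104, which is the kind of value one would expect to pass as NIC_TRAFFIC_CLASS if the option takes the whole byte, as the description suggests.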
Print git branch and short commit hash alongside the existing version number whenever any TransferBench command is run, e.g.:

TransferBench v1.67.00 (foo/my-branch:6f5ea52)
...

Co-authored-by: Claude <claude@anthropic.com>
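The banner format in the example can be reproduced with a small helper like the one below. `FormatVersionBanner` is a hypothetical name, and how the branch and hash reach the program (for instance as build-time definitions) is not stated in the commit message, so they are plain parameters here. Omitting the suffix when git information is missing is also an assumption.

```cpp
#include <cassert>
#include <string>

// Builds "TransferBench v<version> (<branch>:<shortHash>)".
// Assumption: if either git field is empty (e.g. a build from a tarball),
// only the version part is printed.
inline std::string FormatVersionBanner(const std::string& version,
                                       const std::string& branch,
                                       const std::string& shortHash) {
  assert(!version.empty());
  std::string banner = "TransferBench v" + version;
  if (!branch.empty() && !shortHash.empty()) {
    banner += " (" + branch + ":" + shortHash + ")";
  }
  return banner;
}
```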
- Without a hipSetDevice call before allocating memory, fabric-handle based transfers can end up using an unintended deviceIdx
- Reset numa_set_preferred(-1) before ERR_FATAL early return in the non-POD_COMM_ENABLED path; without this the NUMA policy stays dirty for subsequent CPU allocations in the same process
- Use memDevice.memIndex directly in the top-level hipSetDevice call instead of deviceIdx, which is NUMA-remapped for CPU types only; documents that the MEM_CPU_CLOSEST remapping does not apply to GPU
- Remove now-redundant hipSetDevice inside the POD_COMM GPU memHandle branch; device was already set at the top of AllocateMemory
- Guard CollectTopology GPU agent probe loop with hipSetDevice(i) so each AllocateMemory call targets the correct device

---------

Co-authored-by: Claude <claude@anthropic.com>
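The commit fixes the dirty NUMA policy by adding an explicit reset before the early return. One alternative way to get the same guarantee on every exit path is a scope guard, sketched below. This is not what the commit does: `ScopedNumaPreferred` is a hypothetical class, and the setter is injected as a callback (in real code it would wrap libnuma's `numa_set_preferred`) so the sketch builds without libnuma.

```cpp
#include <cassert>
#include <functional>
#include <utility>

// Sets a preferred NUMA node on construction and resets the policy with -1
// on destruction, so early returns cannot leave the policy dirty.
class ScopedNumaPreferred {
 public:
  ScopedNumaPreferred(int node, std::function<void(int)> setPreferred)
      : setPreferred_(std::move(setPreferred)) {
    assert(setPreferred_);  // a callable setter is required
    setPreferred_(node);
  }
  ~ScopedNumaPreferred() { setPreferred_(-1); }

  // Non-copyable: exactly one reset per guard.
  ScopedNumaPreferred(const ScopedNumaPreferred&) = delete;
  ScopedNumaPreferred& operator=(const ScopedNumaPreferred&) = delete;

 private:
  std::function<void(int)> setPreferred_;
};
```

A guard like this only helps on paths that unwind the stack; if an error macro terminates the process directly, the explicit reset used in the commit is still the relevant fix.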
51d8ebc to b729d1b
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
gilbertlee-amd approved these changes Jun 1, 2026
Motivation
TransferBench v1.67.0 release
Technical Details
Test Plan
Test Result
Submission Checklist