TransferBench v1.67.0 by nileshnegi · Pull Request #273 · ROCm/TransferBench

nileshnegi · 2026-04-27T05:43:03Z

Motivation

TransferBench v1.67.0 release

Technical Details

Initial pod communication support (#235)
cuda + MNNVL update & pod presets (#241)
Increase CQ size for high qps (#244)
fix hang when NVML is present but fabricmanager isnt (#246)
Adding nica2a preset (#248)
Adding HBM read bandwidth preset (#250)
Pod Ring preset (#251)
gfxsweep preset (gfxsweep draft #254) (Modifying the gfxsweep preset #256)
Adding Batched DMA support (hipMemcpyBatchAsync), and bmasweep preset (#255)
Adding a wallclock consistency detection preset (#258)
Adding smoketest preset for simple correctness tests (#266)
Help / envvars / presets presets (Help preset #267)
Modernize CMake build (#268)
Replace version-based pod/amd-smi detection with compile-time API probes (#269)
Fix collective mismatch hangs in multi-rank error paths (#270)
Fix SHOW_ITERATIONS table truncation with multiple transfers per executor (#271)
Reformat a2asweep output to match gfxsweep style (#272)
Gfx sweep update (#274)
Increasing flush frequency in smoketest (#275)
Adding new experimental copy-only GFX kernel, gfxsweep update (#277)
Fixes for cuMem compilation and invalid device ordinal (#278)
Simplifying socket connect, allow for using host address (#279)
Updating podring to run on single node without need to force single pod (#280)
Adding SHOW_PERCENTILES to show extra per-iteration statistics (#281)
Adding LaunchTransferBench helper script (#294)
Adding 'empty' kernel launch preset (#297)
Adding ability to remove barrier, mask off XCCs (#298)
Add NIC_TRAFFIC_CLASS and NIC_SERVICE_LEVEL env vars for DSCP marking (#315)
Embed git branch and commit hash in version string (#312)

Test Plan

Test Result

Submission Checklist

Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

* fix qpCount storage limit to allow 256+ (#237)
* fix function header GetClosestGpusToNic (#238)
  fix the function header GetClosestGpusToNic to match the function definition and function calls
* Fixed CQ size for high QPs cases and poll CQ in batch
  CQ Size: max(100, qpCount), dynamically sized
  This avoids hangs at large QP counts, notably experienced with small message sizes (ex: 256 QPs, 8M message size)
  Polling: Up to 32 completions per poll call to reduce poll calls
* improve DMABUF zcat check
  improve DMABUF zcat check, similar to ROCM-2855
* add NIC_CQ_POLL_BATCH option as CQ poll batch size
  Add NIC_CQ_POLL_BATCH as an option to ibv_poll_cq for CQ poll batch size
  set a default value of `4`, which appears to be the current RCCL default
  replace fixed wc_array with vector wc.data
  Files changed:
  - `src/header/TransferBench.hpp`
  - `src/client/EnvVars.hpp`
* align with develop
* wc_array move out of the while loop from PR review
* Update CHANGELOG.md
* Revert "fix function header GetClosestGpusToNic (#238)"
  This reverts commit a8cf384.
* Revert "improve DMABUF zcat check"
  This reverts commit 6d88473.

---------

Co-authored-by: Pak Nin Lui <pak.lui@amd.com>
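
The CQ sizing rule and batched polling described in this commit message can be sketched roughly as follows. This is a minimal, self-contained illustration and not the TransferBench implementation: `WorkCompletion`, `PollFn`, `ComputeCqSize`, and `DrainCompletions` are hypothetical names, and the poll callback only mimics the return convention of `ibv_poll_cq` (number of completions written, negative on error) so that the loop can run without an RDMA device.

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical stand-in for ibv_wc; only a status field is modeled.
struct WorkCompletion { int status = 0; };

// CQ depth rule quoted in the commit message: max(100, qpCount).
inline int ComputeCqSize(int qpCount) {
  return std::max(100, qpCount);
}

// Callback with the same return convention as ibv_poll_cq:
// number of completions written to `out`, or a negative value on error.
using PollFn = std::function<int(int maxEntries, WorkCompletion* out)>;

// Busy-polls until `expected` completions have arrived, requesting at most
// `batchSize` (>= 1) entries per call. Returns the number of poll calls made,
// or -1 if the poll callback reported an error.
inline int DrainCompletions(PollFn const& poll, int expected, int batchSize) {
  // Buffer is allocated once, outside the polling loop.
  std::vector<WorkCompletion> wc(static_cast<std::size_t>(batchSize));
  int received = 0;
  int pollCalls = 0;
  while (received < expected) {
    int const n = poll(batchSize, wc.data());
    ++pollCalls;
    if (n < 0) return -1;
    received += n;
  }
  return pollCalls;
}
```

With 256 completions already pending, a batch size of 32 needs 8 poll calls in this model, while a batch size of 4 needs 64, which is the trade-off the batch-size option exposes.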

- Move all toolchain logic (ROCM_PATH detection, compiler selection, per-config build flags) into a pre-project() block in CMakeLists, where it executes unconditionally on every configure step. Detection priority is preserved.
- Replace plain set(CMAKE_CXX_FLAGS_DEBUG ...) calls with CMake-idiomatic CMAKE_<LANG>_FLAGS_<CONFIG>_INIT variables.
- Bump cmake_minimum_required to 3.16 (for MPI::MPI_CXX and hip:: targets)
- Fix GPU_TARGETS seeding and respect AMDGPU_TARGETS
- Add cmake_push_check_state/pop around check_symbol_exists calls to prevent CMAKE_REQUIRED_* leaking between checks
- Fix HSA find_library to use NO_DEFAULT_PATH and search lib64 as well
- Fix spurious MPI_PATH logic
- Remove redundant double include of cmake/Dependencies.cmake
- Modernize target_link_libraries and compact target_include_directories calls
- Move PACKAGE_NAME/LIBRARY_NAME/CMAKE_RUNTIME_OUTPUT_DIRECTORY before add_executable
- parallel-jobs: add check_cxx_compiler_flag to detect support
- DISABLE_DMABUF → DISABLE_DMA_BUF: align CMake env var name
- AMD_SMI: add find_library/find_path for amd_smi

---------

Co-authored-by: Claude <claude@anthropic.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

…bes (#269)

Previously, pod communication support was gated on hipconfig reporting HIP version, and AMD-SMI support was gated on the amd-smi CLI reporting library version. Both approaches are fragile: they depend on external tools being in PATH and tie enablement to version numbers that may not reflect actual API availability.

Replace with build probes that call the exact functions used at runtime:
- HIP probe: hipMemFabricHandle_t, hipMemGenericAllocationHandle_t, hipMemExportToShareableHandle, hipMemImportFromShareableHandle
- AMD-SMI probe: amdsmi_get_processor_handle_from_bdf, amdsmi_get_gpu_fabric_info

The probes are applied consistently in both the Makefile and CMake. DISABLE_AMD_SMI / ENABLE_AMD_SMI controls are preserved as independent user overrides regardless of probe outcome.

Also fix amdsmi_get_processor_handle_from_bdf call site in TransferBench.hpp to pass amdsmi_bdf_t instead of the removed char* BDF string argument, update the fabric info field path to fabric_info.fabric_version.v1.*, and guard sscanf failure before populating the BDF struct.

---------

Co-authored-by: Claude <claude@anthropic.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
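
The "guard sscanf failure before populating the BDF struct" step might look something like the sketch below. It assumes a textual PCI address of the form `domain:bus:device.function` in hexadecimal; `PciBdf` and `ParsePciBdf` are hypothetical names used so the example stays self-contained, and the real code would fill an `amdsmi_bdf_t` rather than this local struct.

```cpp
#include <cstdio>

// Hypothetical local mirror of a PCI domain/bus/device/function address.
struct PciBdf {
  unsigned int domain = 0;
  unsigned int bus = 0;
  unsigned int device = 0;
  unsigned int function = 0;
};

// Parses "DDDD:BB:DD.F" (hexadecimal fields). Returns false and leaves `out`
// untouched unless all four fields were successfully converted.
inline bool ParsePciBdf(char const* text, PciBdf& out) {
  if (text == nullptr) return false;

  unsigned int domain = 0, bus = 0, device = 0, function = 0;
  int const matched =
      std::sscanf(text, "%x:%x:%x.%x", &domain, &bus, &device, &function);
  if (matched != 4) return false;  // Guard: do not populate on partial parse

  out.domain = domain;
  out.bus = bus;
  out.device = device;
  out.function = function;
  return true;
}
```

Parsing into locals first means a malformed string can never leave the destination struct half-filled, which is the failure mode the guard is meant to rule out.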

…kages (#313)

Add candidate branch to PR triggers so pushes and PRs against candidate run the full build pipeline. Gate all S3 upload steps on ref_name and base_ref not being 'candidate', so packages are not published until candidate is promoted to develop. Candidate builds are for validation only; no artifacts should be retained. The verification steps (dpkg-deb, rpm -qip) still confirm that the packages get built, and ease the merge to develop.

---------

Co-authored-by: Claude <claude@anthropic.com>

…#315)

Adds support for marking RoCE/IB traffic with specific DSCP/QoS values.
- NIC_TRAFFIC_CLASS (default=0): sets the DSCP/traffic class byte in the RoCE GRH (grh.traffic_class) when transitioning QPs to RTR state.
- NIC_SERVICE_LEVEL (default=0): sets the IB service level (ah_attr.sl) on QPs. This applies to IB and RoCE connections.
- NicOptions: added uint8_t serviceLevel and uint8_t trafficClass fields
- TransitionQpToRtr(): accepts trafficClass and serviceLevel as parameters; sets grh.traffic_class (RoCE only) and ah_attr.sl (all QP types)

---------

Co-authored-by: Pak Nin Lui <paklui@smc300x-ccs-aus-gpuf2c9.prov.aus.ccs.cpe.ice.amd.com>
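
As a rough illustration of how two such environment variables could be read and validated, consider the sketch below. The helper names (`ParseBoundedU8`, `EnvU8OrDefault`, `TrafficClassFromDscp`) are hypothetical and are not taken from TransferBench. The sketch relies on two general facts: the DSCP code point occupies the upper six bits of the 8-bit traffic class field (the lower two bits carry ECN), and the InfiniBand service level is a 4-bit value. Whether TransferBench expects a raw traffic class byte or a DSCP value is not stated beyond the commit text above, so the conversion helper is shown only for reference.

```cpp
#include <cstdint>
#include <cstdlib>

// Parses a decimal value in [0, maxValue]. Returns false and leaves `out`
// untouched on empty, non-numeric, trailing-garbage, or out-of-range input.
inline bool ParseBoundedU8(char const* text, unsigned long maxValue,
                           std::uint8_t& out) {
  if (text == nullptr || *text == '\0') return false;
  char* end = nullptr;
  unsigned long const value = std::strtoul(text, &end, 10);
  if (end == text || *end != '\0' || value > maxValue) return false;
  out = static_cast<std::uint8_t>(value);
  return true;
}

// Reads an environment variable, falling back to `defaultValue` when the
// variable is unset or does not hold a valid value.
inline std::uint8_t EnvU8OrDefault(char const* name, std::uint8_t defaultValue,
                                   unsigned long maxValue) {
  std::uint8_t value = defaultValue;
  ParseBoundedU8(std::getenv(name), maxValue, value);
  return value;
}

// DSCP sits in the upper six bits of the traffic class byte; ECN bits stay 0.
inline std::uint8_t TrafficClassFromDscp(std::uint8_t dscp) {
  return static_cast<std::uint8_t>((dscp & 0x3F) << 2);
}
```

Under these assumptions, a DSCP value of 26 corresponds to a traffic class byte of 104, and a service level would be read with an upper bound of 15, for example `EnvU8OrDefault("NIC_SERVICE_LEVEL", 0, 15)`.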

- Allocating memory without first calling hipSetDevice can use an unintended deviceIdx when executing fabric-handle based transfers
- Reset numa_set_preferred(-1) before ERR_FATAL early return in the non-POD_COMM_ENABLED path; without this the NUMA policy stays dirty for subsequent CPU allocations in the same process
- Use memDevice.memIndex directly in the top-level hipSetDevice call instead of deviceIdx, which is NUMA-remapped for CPU types only; documents that the MEM_CPU_CLOSEST remapping does not apply to GPU
- Remove now-redundant hipSetDevice inside the POD_COMM GPU memHandle branch; device was already set at the top of AllocateMemory
- Guard CollectTopology GPU agent probe loop with hipSetDevice(i) so each AllocateMemory call targets the correct device

---------

Co-authored-by: Claude <claude@anthropic.com>
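
The commit resets the NUMA preference explicitly before the early return. A different technique that guards against the same class of bug is a small scope guard that restores state on every exit path. The sketch below shows that alternative only; it is not what the commit does. `ScopeExit` and `AllocateWithPreferredNode` are hypothetical, and an `int` variable stands in for the process NUMA policy so the example runs without libnuma.

```cpp
#include <functional>
#include <utility>

// Minimal scope guard: runs the stored callback when it goes out of scope.
class ScopeExit {
 public:
  explicit ScopeExit(std::function<void()> fn) : fn_(std::move(fn)) {}
  ~ScopeExit() { if (fn_) fn_(); }
  ScopeExit(ScopeExit const&) = delete;
  ScopeExit& operator=(ScopeExit const&) = delete;

 private:
  std::function<void()> fn_;
};

// Hypothetical allocation routine. `preferredNode` models the process-wide
// NUMA preference, where -1 means "no preference" (as with
// numa_set_preferred(-1)).
inline bool AllocateWithPreferredNode(int node, bool simulateFailure,
                                      int& preferredNode) {
  preferredNode = node;  // Stands in for numa_set_preferred(node)
  ScopeExit restore([&preferredNode] { preferredNode = -1; });

  if (simulateFailure) {
    return false;  // Early return: the guard still resets the preference
  }
  return true;  // Normal return: the guard resets the preference as well
}
```

Either approach leaves the policy clean for later CPU allocations in the same process; the guard simply removes the need to remember the reset on each new error path.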

Copilot

Pull request overview

Copilot reviewed 39 out of 40 changed files in this pull request and generated 3 comments.

gilbertlee-amd and others added 30 commits February 19, 2026 17:32

Initial pod communication support (#235)

0252788

* Adding support for GFX/DMA executors accessing remote memory via UALoE

Adjust min HIP version in Makefile for pod support

9edaae8

Adding TB_DUMP_CFG_FILE and fixing a deallocation bug

2b17e62

Adding gfx1250 to CMakeFiles

2b707b7

Adjusting how HIP headers are included

2bb8302

Updating a2asweep and scaling presets

060abc2

Fixing logging to prevent recursive error

6c2ecf7

Fixing fabric handle bug

794bcf7

Changing table formatting to make it easier to paste

8de0154

Showing num iterations when running in timed mode

4a0f390

cuda + MNNVL update & pod presets (#241)

bf49ba4

* CUDA Driver API addition
* NVML initiation
* MNNVL support
* pod presets

Changing NIC_FILTER to TB_NIC_FILTER

5e61666

prefixing remaining env vars with TB_, fixing potential filesystem check issue for VMs

bec2c5e

Fixing TB_PAUSE issue

561e2f7

Merge pull request #245 from ROCm/develop

94cf3c9

merge develop changes

fix hang when NVML is present but fabricmanager isnt (#246)

275998b

Adding HBM read bandwidth preset (#250)

168cdc1

Adding TB_WALLCLOCK_RATE in case wallclock rate is reported as 0

fdec7d5

Fixing numeric limits from min to lowest for doubles

bae804c

Fixing CMakeLists missing rename of ENABLE_DMA_BUF

a03b06e

Adding XCC detection for GFX12, increasing max GFX unroll to 16

1ef9c51

gfxsweep preset (#254)

2900b4e

Added a preset which sweeps all combination of tuning parameters for a single transfer

Adding Batched DMA support (hipMemcpyBatchAsync), and bmasweep preset (#255)

2dba07f

Modifying the gfxsweep preset (#256)

2aa036c

Co-authored-by: Tim <43156029+AtlantaPepsi@users.noreply.github.com> Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

Adding a wallclock consistency detection preset (#258)

b57f2e2

Adding smoketest preset for simple correctness tests (#266)

a4fc836

Help / envvars / presets presets (#267)

3744d24

gilbertlee-amd and others added 16 commits May 11, 2026 21:39

Adding 'empty' kernel launch preset (#297)

fa10775

Adding ability to remove barrier, mask off XCCs (#298)

297e00c

improve limit reached message (#302)

e8edacf

Fixing NUMA checks (set_mempolicy) (#303)

24cfdc7

- Guards for CPUs with no memory
- Fixing NUMA check ordering
- Fixing Topology display in client

Fixing nearest GPU numa detection (#305)

fcb2a3c

[smoketest] Adding BDMA, A2A-remoteread, dma,gfx,fast testlists (#306)

c1c3561

[empty] Adding ability to switch between hipExtLaunch and default mode, empty batch (#307)

48d4fb5

[wallclock] Adding average usec cost for timestamp collection on GPU (#309)

6f5ea52

[empty] Adding SHOW_PERCENTILES support (#310)

d777426

Merge branch 'develop' into candidate

0c9b70f

removing secondary reordering (#314)

e923e24

Minor change to output format (#317)

479408d

Embed git branch and commit hash in version string (#312)

9b22b00

Print git branch and short commit hash alongside the existing version number whenever any TransferBench command is run, e.g.:

TransferBench v1.67.00 (foo/my-branch:6f5ea52) ...

Co-authored-by: Claude <claude@anthropic.com>
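
A banner in the format shown in this commit message could be assembled along these lines. This is only a sketch: `FormatVersionBanner` is a hypothetical helper, and how the branch and hash actually reach the binary (for example through compiler definitions set by the build system) is an assumption not confirmed by the commit message.

```cpp
#include <string>

// Builds "TransferBench v<version> (<branch>:<shortHash>)". When the git
// information is unavailable (either string empty), only the version is shown.
inline std::string FormatVersionBanner(std::string const& version,
                                       std::string const& branch,
                                       std::string const& shortHash) {
  std::string banner = "TransferBench v" + version;
  if (!branch.empty() && !shortHash.empty()) {
    banner += " (" + branch + ":" + shortHash + ")";
  }
  return banner;
}
```

Falling back to the bare version keeps the output sensible for builds made from a source tarball, where no git metadata exists.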

Copilot AI review requested due to automatic review settings June 1, 2026 18:18

nileshnegi force-pushed the merge/TransferBench-v1.67.0 branch from 51d8ebc to b729d1b Compare June 1, 2026 18:18

Copilot started reviewing on behalf of nileshnegi June 1, 2026 18:19

Copilot AI reviewed Jun 1, 2026

Comment thread src/client/Topology.hpp

Comment thread src/client/Utilities.hpp

Comment thread src/client/Utilities.hpp Outdated

AtlantaPepsi and others added 2 commits June 1, 2026 14:55

disable pinned host memory for pod (#318)

26c9cf8

Potential fix for pull request finding

6f73586

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

Copilot AI review requested due to automatic review settings June 1, 2026 23:15

Copilot AI reviewed Jun 1, 2026

Potential fix for pull request finding

6943b68

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

gilbertlee-amd self-requested a review June 1, 2026 23:15

gilbertlee-amd approved these changes Jun 1, 2026

nileshnegi merged commit 2bc42cd into develop Jun 2, 2026
12 of 13 checks passed

nileshnegi merged 72 commits into develop from merge/TransferBench-v1.67.0

6 participants
