
TransferBench v1.67.0 #273

Merged
nileshnegi merged 72 commits into develop from merge/TransferBench-v1.67.0 on Jun 2, 2026

@nileshnegi commented Apr 27, 2026

Motivation

TransferBench v1.67.0 release

Technical Details

Test Plan

Test Result

Submission Checklist

gilbertlee-amd and others added 30 commits February 19, 2026 17:32
* Adding support for GFX/DMA executors accessing remote memory via UALoE
* CUDA Driver API addition

* NVML initialization

* MNNVL support

* pod presets
* fix qpCount storage limit to allow 256+ (#237)

* fix function header GetClosestGpusToNic (#238)

fix the function header GetClosestGpusToNic to match the function
definition and function calls

* Fixed CQ size for high-QP cases and poll CQ in batches

CQ size: max(100, qpCount), sized dynamically.
This avoids hangs with large QP counts, notably seen with small message sizes (ex: 256 QPs, 8M message size).
Polling: up to 32 completions per poll call, to reduce the number of poll calls.

* improve DMABUF zcat check

improve DMABUF zcat check, similar to ROCM-2855

* add NIC_CQ_POLL_BATCH option as CQ poll batch size

Add NIC_CQ_POLL_BATCH as an option that sets the batch size passed to ibv_poll_cq
Set the default value to `4`, which appears to be the current RCCL default
Replace the fixed wc_array with a vector, passed as wc.data()

Files changed:
- `src/header/TransferBench.hpp`
- `src/client/EnvVars.hpp`

* align with develop

* move wc_array out of the while loop (from PR review)

* Update CHANGELOG.md

* Revert "fix function header GetClosestGpusToNic (#238)"

This reverts commit a8cf384.

* Revert "improve DMABUF zcat check"

This reverts commit 6d88473.

---------

Co-authored-by: Pak Nin Lui <pak.lui@amd.com>
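A minimal sketch of the CQ sizing and batched polling described above, assuming a mock completion queue in place of libibverbs. `MockCq`, `WorkCompletion`, `CqSize`, `PollBatchFromEnv`, and `DrainCq` are hypothetical names, not TransferBench code. The completion buffer is allocated once outside the loop, and NIC_CQ_POLL_BATCH falls back to 4 when unset or invalid.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdlib>
#include <vector>

// Hypothetical stand-in for a work completion entry (not the real ibv_wc).
struct WorkCompletion {
  int qpIndex;
  bool success;
};

// Hypothetical mock CQ: poll() drains up to maxEntries pending completions.
struct MockCq {
  std::vector<WorkCompletion> pending;
  int poll(int maxEntries, WorkCompletion* out) {
    int n = std::min<int>(maxEntries, static_cast<int>(pending.size()));
    for (int i = 0; i < n; ++i) out[i] = pending[i];
    pending.erase(pending.begin(), pending.begin() + n);
    return n;
  }
};

// CQ depth grows with the QP count, with a floor of 100 entries.
int CqSize(int qpCount) { return std::max(100, qpCount); }

// Reads NIC_CQ_POLL_BATCH, falling back to 4 when unset or not positive.
int PollBatchFromEnv() {
  const char* s = std::getenv("NIC_CQ_POLL_BATCH");
  if (!s) return 4;
  int v = std::atoi(s);
  return v > 0 ? v : 4;
}

// Polls until `expected` completions arrive and returns the number of poll calls.
// Assumes the expected completions eventually arrive.
int DrainCq(MockCq& cq, int expected, int batch) {
  assert(batch > 0);
  std::vector<WorkCompletion> wc(batch);  // allocated once, outside the loop
  int received = 0;
  int calls = 0;
  while (received < expected) {
    int n = cq.poll(batch, wc.data());
    ++calls;
    if (n < 0) break;  // a real poll can report an error; stop in that case
    received += n;
  }
  return calls;
}
```

For example, draining 256 completions with a batch of 32 takes 8 poll calls, while `CqSize(8)` still returns the 100-entry floor.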
Added a preset that sweeps all combinations of tuning parameters for a single transfer
Co-authored-by: Tim <43156029+AtlantaPepsi@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
- Move all toolchain logic (ROCM_PATH detection, compiler selection, per-config
  build flags) into a pre-project() block in CMakeLists, where it executes
  unconditionally on every configure step. Detection priority is preserved.
- Replace plain set(CMAKE_CXX_FLAGS_DEBUG ...) calls with CMake-idiomatic
  CMAKE_<LANG>_FLAGS_<CONFIG>_INIT variables.
- Bump cmake_minimum_required to 3.16 (for MPI::MPI_CXX and hip:: targets)
- Fix GPU_TARGETS seeding and respect AMDGPU_TARGETS
- Add cmake_push_check_state/pop around check_symbol_exists calls to prevent
  CMAKE_REQUIRED_* leaking between checks
- Fix HSA find_library to use NO_DEFAULT_PATH and search lib64 as well
- Fix spurious MPI_PATH logic
- Remove redundant double include of cmake/Dependencies.cmake
- Modernize target_link_libraries and compact target_include_directories calls
- Move PACKAGE_NAME/LIBRARY_NAME/CMAKE_RUNTIME_OUTPUT_DIRECTORY before add_executable
- parallel-jobs: add check_cxx_compiler_flag to detect support
- DISABLE_DMABUF → DISABLE_DMA_BUF: align CMake env var name
- AMD_SMI: add find_library/find_path for amd_smi

---------

Co-authored-by: Claude <claude@anthropic.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
…bes (#269)

Previously, pod communication support was gated on the HIP version
reported by hipconfig, and AMD-SMI support was gated on the library
version reported by the amd-smi CLI. Both approaches are fragile: they
depend on external tools being in PATH and tie enablement to version
numbers that may not reflect actual API availability.

Replace with build probes that call the exact functions used at runtime:
- HIP probe: hipMemFabricHandle_t, hipMemGenericAllocationHandle_t,
  hipMemExportToShareableHandle, hipMemImportFromShareableHandle
- AMD-SMI probe: amdsmi_get_processor_handle_from_bdf,
  amdsmi_get_gpu_fabric_info

The probes are applied consistently in both the Makefile and CMake.
DISABLE_AMD_SMI / ENABLE_AMD_SMI controls are preserved as
independent user overrides regardless of probe outcome.

Also fix amdsmi_get_processor_handle_from_bdf call site in
TransferBench.hpp to pass amdsmi_bdf_t instead of the removed char*
BDF string argument, update the fabric info field path to
fabric_info.fabric_version.v1.*, and guard sscanf failure before
populating the BDF struct.

---------

Co-authored-by: Claude <claude@anthropic.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
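A sketch of the sscanf guard mentioned above, assuming a hypothetical `PciBdf` struct rather than the real `amdsmi_bdf_t`, whose layout is not reproduced here. The struct is populated only when all four fields of a `DDDD:BB:DD.F` string parse and fit the PCI field widths (8-bit bus, 5-bit device, 3-bit function).

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>
#include <optional>

// Hypothetical BDF holder; the real amdsmi_bdf_t layout may differ.
struct PciBdf {
  uint64_t domain;
  uint32_t bus;
  uint32_t device;
  uint32_t function;
};

// Parses "DDDD:BB:DD.F" (hex fields) and returns nothing unless every field matched.
std::optional<PciBdf> ParseBdf(const char* text) {
  unsigned int domain = 0, bus = 0, device = 0, function = 0;
  if (std::sscanf(text, "%x:%x:%x.%x", &domain, &bus, &device, &function) != 4)
    return std::nullopt;  // guard: sscanf failed, so do not populate the BDF
  if (bus > 0xFF || device > 0x1F || function > 0x7)
    return std::nullopt;  // values do not fit the PCI bus/device/function widths
  return PciBdf{domain, bus, device, function};
}
```

A caller would convert the parsed fields into whatever struct the AMD-SMI lookup expects only after `ParseBdf` succeeds.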
gilbertlee-amd and others added 16 commits May 11, 2026 21:39
- Add guards for CPUs with no memory
- Fix NUMA check ordering
- Fix Topology display in client
…kages (#313)

Add candidate branch to PR triggers so pushes and PRs against
candidate run the full build pipeline. Gate all S3 upload steps on
ref_name and base_ref not being 'candidate', so packages are not 
published until candidate is promoted to develop.

Candidate builds are for validation only; no artifacts should be
retained. The verification steps (dpkg-deb, rpm -qip) still confirm
that the packages get built, and ease the merge to develop.

---------

Co-authored-by: Claude <claude@anthropic.com>
…#315)

Adds support for marking RoCE/IB traffic with specific DSCP/QoS values.

- NIC_TRAFFIC_CLASS (default=0): sets the DSCP/traffic class byte in the
  RoCE GRH (grh.traffic_class) when transitioning QPs to RTR state.
- NIC_SERVICE_LEVEL (default=0): sets the IB service level (ah_attr.sl)
  on QPs. This applies to IB and RoCE connections.
- NicOptions: added uint8_t serviceLevel and uint8_t trafficClass fields
- TransitionQpToRtr(): accepts trafficClass and serviceLevel as parameters;
  sets grh.traffic_class (RoCE only) and ah_attr.sl (all QP types)

---------

Co-authored-by: Pak Nin Lui <paklui@smc300x-ccs-aus-gpuf2c9.prov.aus.ccs.cpe.ice.amd.com>
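The QoS plumbing above can be sketched as follows, assuming simplified stand-in structs rather than the real `ibv_qp_attr`/`ibv_ah_attr`. `ReadU8Env`, `ApplyQos`, and `DscpOf` are hypothetical helpers. The service level is applied to every QP and the traffic class only on the RoCE (GRH) path. IB service levels are 4-bit (0-15), and DSCP occupies the upper 6 bits of the traffic-class byte, with the low 2 bits used for ECN.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdlib>

// Hypothetical, trimmed-down stand-ins for the attributes touched at RTR.
struct GrhAttr {
  uint8_t trafficClass = 0;
};
struct AddressAttr {
  uint8_t serviceLevel = 0;
  GrhAttr grh;
};

// Reads a decimal env var; returns `fallback` if unset, non-numeric, or above maxValue.
uint8_t ReadU8Env(const char* name, uint8_t fallback, unsigned maxValue) {
  const char* s = std::getenv(name);
  if (!s || !*s) return fallback;
  char* end = nullptr;
  unsigned long v = std::strtoul(s, &end, 10);
  if (*end != '\0' || v > maxValue) return fallback;
  return static_cast<uint8_t>(v);
}

// Service level on every QP; traffic class only where a GRH is used (RoCE).
void ApplyQos(AddressAttr& ah, bool isRoce, uint8_t trafficClass, uint8_t serviceLevel) {
  ah.serviceLevel = serviceLevel;
  if (isRoce) ah.grh.trafficClass = trafficClass;
}

// DSCP is the upper 6 bits of the traffic-class byte.
uint8_t DscpOf(uint8_t trafficClass) { return static_cast<uint8_t>(trafficClass >> 2); }
```

For instance, a traffic class of 104 corresponds to DSCP 26 (104 >> 2), and the service level could be read with `ReadU8Env("NIC_SERVICE_LEVEL", 0, 15)` so out-of-range values fall back to the default of 0.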
Print git branch and short commit hash alongside the existing version
number whenever any TransferBench command is run, e.g.:
  TransferBench v1.67.00 (foo/my-branch:6f5ea52) ...

Co-authored-by: Claude <claude@anthropic.com>
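A minimal sketch of the banner format, assuming the branch and short hash are injected at build time through hypothetical `TB_GIT_BRANCH`/`TB_GIT_HASH` macros; how TransferBench actually obtains them is not shown here.

```cpp
#include <cassert>
#include <string>

// Hypothetical build-time macros; a build system could inject these with -D.
#ifndef TB_GIT_BRANCH
#define TB_GIT_BRANCH "unknown"
#endif
#ifndef TB_GIT_HASH
#define TB_GIT_HASH "0000000"
#endif

// Formats "TransferBench v<version> (<branch>:<hash>)".
std::string VersionBanner(const std::string& version,
                          const std::string& branch = TB_GIT_BRANCH,
                          const std::string& hash = TB_GIT_HASH) {
  return "TransferBench v" + version + " (" + branch + ":" + hash + ")";
}
```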
- Call hipSetDevice before allocating memory; otherwise fabric-handle-based
  transfers can use an unintended deviceIdx
- Reset numa_set_preferred(-1) before ERR_FATAL early return in the
  non-POD_COMM_ENABLED path; without this the NUMA policy stays dirty
  for subsequent CPU allocations in the same process
- Use memDevice.memIndex directly in the top-level hipSetDevice call
  instead of deviceIdx, which is NUMA-remapped for CPU types only;
  document that the MEM_CPU_CLOSEST remapping does not apply to GPUs
- Remove now-redundant hipSetDevice inside the POD_COMM GPU memHandle
  branch; device was already set at the top of AllocateMemory
- Guard CollectTopology GPU agent probe loop with hipSetDevice(i) so
  each AllocateMemory call targets the correct device

---------

Co-authored-by: Claude <claude@anthropic.com>
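The NUMA cleanup above can also be expressed with a scope guard. This is a swapped-in technique: the commit resets `numa_set_preferred(-1)` explicitly before the `ERR_FATAL` return, while this sketch restores the policy on every exit path. `g_preferredNode`, `SetPreferredNode`, `ScopeExit`, and `AllocateCpuMemory` are hypothetical stand-ins; libnuma itself is not called.

```cpp
#include <cassert>
#include <functional>
#include <utility>

// Hypothetical stand-in for the process-wide NUMA preference (-1 = default policy).
static int g_preferredNode = -1;
void SetPreferredNode(int node) { g_preferredNode = node; }

// Runs the stored cleanup when the guard goes out of scope, on any exit path.
class ScopeExit {
 public:
  explicit ScopeExit(std::function<void()> fn) : fn_(std::move(fn)) {}
  ~ScopeExit() {
    if (fn_) fn_();
  }
  ScopeExit(const ScopeExit&) = delete;
  ScopeExit& operator=(const ScopeExit&) = delete;

 private:
  std::function<void()> fn_;
};

// Hypothetical CPU allocation path: prefer `node`, bail out early when unsupported.
bool AllocateCpuMemory(int node, bool featureEnabled) {
  SetPreferredNode(node);
  ScopeExit reset([] { SetPreferredNode(-1); });  // restore the default policy
  if (!featureEnabled) return false;  // early return no longer leaves the policy dirty
  // ... the actual allocation would happen here ...
  return true;
}
```

Either way, later CPU allocations in the same process start from the default NUMA policy.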
Copilot AI review requested due to automatic review settings June 1, 2026 18:18
@nileshnegi force-pushed the merge/TransferBench-v1.67.0 branch from 51d8ebc to b729d1b June 1, 2026 18:18
Copilot AI left a comment

Pull request overview

Copilot reviewed 39 out of 40 changed files in this pull request and generated 3 comments.

Comment thread src/client/Topology.hpp
Comment thread src/client/Utilities.hpp
Comment thread src/client/Utilities.hpp Outdated
AtlantaPepsi and others added 2 commits June 1, 2026 14:55
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings June 1, 2026 23:15
Copilot AI left a comment

Copilot was unable to review this pull request because the user who requested the review is ineligible. To be eligible to request a review, you need a paid Copilot license, or your organization must enable Copilot code review.

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

@gilbertlee-amd gilbertlee-amd self-requested a review June 1, 2026 23:15
@nileshnegi nileshnegi merged commit 2bc42cd into develop Jun 2, 2026
12 of 13 checks passed
6 participants