Skip to content

Add NIC_TRAFFIC_CLASS_FIFO env var for ctrl-path DSCP (like NCCL_IB_FIFO_TC)#316

Open
paklui wants to merge 12 commits into
ROCm:candidatefrom
paklui:candidate-fifo-tc
Open

Add NIC_TRAFFIC_CLASS_FIFO env var for ctrl-path DSCP (like NCCL_IB_FIFO_TC)#316
paklui wants to merge 12 commits into
ROCm:candidatefrom
paklui:candidate-fifo-tc

Conversation

@paklui
Copy link
Copy Markdown
Contributor

@paklui paklui commented May 29, 2026

Motivation

We can configure network switches to steer traffic to different priority queues based on DSCP values. Like in the related PR #315 which we added the env var NIC_TRAFFIC_CLASS to allow marking the bulk data QP traffic. With that we used a single traffic class for both data and control paths, which means control/signaling traffic competes in the same queue as large data transfers, which has a potential to causing head-of-line blocking at larger scale, when there are victim flow which caused a lot of ranks to compete to send back the control path while the same same traffic class is congested.

Reference in the section "Receiver-driven traffic admission", because of that NCCL/RCCL addresses this with NCCL_IB_FIFO_TC which separates the traffic class for its control QPs.

This PR adds the NIC_TRAFFIC_CLASS_FIFO env var to TransferBench, which is the NCCL/RCCL equivalent of NCCL_IB_FIFO_TC.

Technical Details

We introduce a new env var: NIC_TRAFFIC_CLASS_FIFO which sets the GRH traffic_class for a 0-byte signaling message per ctrl QP, which separate from the data QPs, defined by NIC_TRAFFIC_CLASS (introduced in #315).
Each ExecuteNicTransfer() call fires one zero-byte inline IBV_WR_SEND per ctrl QP, which goes on the wire with the specified DSCP, then polls for completion before the data send, and NIC_TRAFFIC_CLASS_FIFO=0 (default) is a no-op: no ctrl QPs are created, no overhead per iteration

Test Plan

I verified the DSCP value set for QoS on the NIC to confirm:
ack_dscp=46 (TC=184, TC = DSCP << 2)
data_dscp=24 (TC=96, TC = DSCP << 2)
Then I confirm on the (SONiC) switch ports the right DSCP-TC are mapped, and the interfaces are using the right qos-map profile

sonic-cli
show qos map dscp-tc
show running-configuration interface Eth

Then I reset the counters for those interfaces

clear counters interface Eth 1/18
show queue counters interface Eth 1/18

then use show queue counters on leaf switch(es) that shows ctrl packets land on the correct priority queue and no drops on either queue and bandwidth on the data not being affected

I verified on the smc300x cluster (8x MI300X, RoCE NICs) that without NIC_TRAFFIC_CLASS_FIFO, the control traffic went on same priority/queue as data QPs.

make clean && make GPU_TARGETS=native SINGLE_KERNEL=1 MPI_PATH=/usr/lib/x86_64-linux-gnu/openmpi
NIC_TRAFFIC_CLASS=96 \
NUM_ITERATIONS=5 \
NUM_WARMUPS=2 \
mpirun -np 2 ./TransferBench nicp2p

With NIC_TRAFFIC_CLASS_FIFO: ctrl traffic steered to separate queue (DSCP=46/TC=184)

NIC_TRAFFIC_CLASS=96 \
NIC_TRAFFIC_CLASS_FIFO=184 \
NUM_ITERATIONS=5 \
NUM_WARMUPS=2 \
mpirun -np 2 ./TransferBench nicp2p

Test Result

Verified on smc300x cluster (8x MI300X, ionic RoCE NICs, leaf switch with RoCE QoS profile):

  • Without NIC_TRAFFIC_CLASS_FIFO: all traffic (ctrl + data) lands on the same switch queue
  • With NIC_TRAFFIC_CLASS_FIFO=184 (DSCP 46): ctrl packets steered to the correct priority queue, data packets remain on their queue; zero drops on both; data bandwidth unaffected

Submission Checklist

paklui and others added 6 commits May 28, 2026 19:03
Mirrors NCCL_IB_FIFO_TC: stamps control/signaling QP traffic with a
separate DSCP value from bulk data traffic so network switches can
steer ctrl and data packets into different priority queues.

NIC_TRAFFIC_CLASS_FIFO=0 (default) is a complete no-op: no ctrl QPs
are created and no extra work is done per iteration.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Hoisted QpTransitionResult struct definition above the data QP
exchange loop so it is also visible to the ctrl QP exchange loop
that follows it.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Gate was using fifoTrafficClass != trafficClass which activated the ctrl
path unintentionally when only NIC_TRAFFIC_CLASS was set. Changed all
three gate points to fifoTrafficClass != 0 so default NIC_TRAFFIC_CLASS_FIFO=0
is a true no-op.

Removed MPI barrier inside ExecuteNicTransfer() — dst rank never calls
that function so the barrier deadlocked. Pre-posting recv WRs during
setup (where both ranks participate) avoids the need for any per-iteration
synchronisation.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…d recvs

The ctrl QP on the DST rank pre-posts all recv WRs during setup to
avoid per-iteration barriers. However the QP was created with
max_recv_wr = cfg.nic.maxRecvWorkReq (default 16), which is smaller
than ctrlNumPrepost (numWarmups + numIterations + 5 = 18 with defaults),
causing ibv_post_recv to fail.

Fix by computing ctrlNumPrepost before ctrl QP creation and passing it
as max_recv_wr for the dst ctrl QPs. The optional maxRecvWr parameter
is added to CreateQueuePair (-1 means use cfg value, preserving
behaviour for all existing data QP callers).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
ExecuteNicTransfer() is called once per sub-iteration inside the loop in RunNicExecutor()
The pre-posted recv WR count should be:
(numWarmups + numIterations) * numSubIterations

Previously the formula omitted the numSubIterations multiplier, which would cause ibv_post_recv failures when NUM_SUBITERS > 1.
Copilot AI review requested due to automatic review settings May 29, 2026 19:54
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR adds a new NIC control-path QoS knob (NIC_TRAFFIC_CLASS_FIFO) to let TransferBench mark a lightweight signaling path with a different DSCP/traffic-class than the bulk data QPs (similar to NCCL/RCCL’s NCCL_IB_FIFO_TC). The intent is to reduce head-of-line blocking by steering control traffic into a different switch priority queue.

Changes:

  • Added NIC_TRAFFIC_CLASS_FIFO env var parsing/printing and plumbed it into ConfigOptions::nic.
  • Added “ctrl/FIFO” QP/CQ setup and a per-iteration 0-byte inline IBV_WR_SEND before the data-path RDMA posts.
  • Updated changelog entries to document the new env var and equivalence to NCCL knobs.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 5 comments.

File Description
src/header/TransferBench.hpp Adds NIC ctrl-path QPs/CQs and per-iteration ctrl signaling send before data RDMA posts.
src/client/EnvVars.hpp Adds env var NIC_TRAFFIC_CLASS_FIFO parsing/validation/help text and maps it into cfg.nic.fifoTrafficClass.
CHANGELOG.md Documents the new env var and related QoS options.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread src/header/TransferBench.hpp Outdated
Comment thread src/header/TransferBench.hpp
Comment thread src/header/TransferBench.hpp
Comment thread src/header/TransferBench.hpp Outdated
Comment thread CHANGELOG.md Outdated
Copilot AI review requested due to automatic review settings May 29, 2026 23:54
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 6 comments.

Comment thread src/header/TransferBench.hpp Outdated
Comment thread src/header/TransferBench.hpp
Comment thread src/header/TransferBench.hpp
Comment thread src/header/TransferBench.hpp
Comment thread src/header/TransferBench.hpp
Comment thread src/header/TransferBench.hpp Outdated
paklui added 2 commits May 29, 2026 17:59
Set wr_id = qpIdx before each ibv_post_send so completions can be identified regardless of arrival order

Handle ibv_poll_cq returning < 0 (CQ error) to avoid an infinite spin

Use wc.wr_id instead of qpIdx in the error message for identifying QP
src ctrl CQ needs qpCount slots (drained each ExecuteNicTransfer call)
dst ctrl CQ needs ctrlNumPrepost * qpCount slots since it is never polled and completions accumulate over the full run
Copilot AI review requested due to automatic review settings May 30, 2026 01:13
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

Comment thread src/header/TransferBench.hpp Outdated
… modes

ctrlNumPrepost assumes a fixed iteration count. When NUM_ITERATIONS <= 0,
such as in timed or infinite mode, when the actual send count is unbounded,
it would exhaust the pre-posted recv WRs, which can cause RNR retries or hangs.

To workaroundm, exit with an error instead if NIC_TRAFFIC_CLASS_FIFO is set in these modes.
@paklui
Copy link
Copy Markdown
Contributor Author

paklui commented Jun 2, 2026

After the latest changes above, and I rerun again on a few nodes, and I was able to verify that NIC_TRAFFIC_CLASS_FIFO is able to marks the ctrl QP traffic with a separate DSCP that is specified, therefore we can move forward with this PR from the draft state to be reviewed next.

@paklui paklui marked this pull request as ready for review June 2, 2026 02:13
@paklui paklui requested review from a team as code owners June 2, 2026 02:13
Copilot AI review requested due to automatic review settings June 2, 2026 02:13
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.

Comment thread src/header/TransferBench.hpp Outdated
Comment thread src/header/TransferBench.hpp Outdated
Comment thread src/header/TransferBench.hpp Outdated
- Gate ctrl QP creation and exchange/connect blocks on rss.srcIsExeNic;
  RDMA_READ transfers never use ctrl QPs so creating them wastes resources
  and can fail near device QP/CQ limits
- Compute ctrlNumPrepost in int64_t to avoid signed overflow for large
  iteration counts; drop std::abs (NIC_TRAFFIC_CLASS_FIFO already rejects
  numIterations<=0); clamp to INT_MAX before casting back to int

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants