Add NIC_TRAFFIC_CLASS_FIFO env var for ctrl-path DSCP (like NCCL_IB_FIFO_TC)#316
Add NIC_TRAFFIC_CLASS_FIFO env var for ctrl-path DSCP (like NCCL_IB_FIFO_TC)#316paklui wants to merge 12 commits into
Conversation
Mirrors NCCL_IB_FIFO_TC: stamps control/signaling QP traffic with a separate DSCP value from bulk data traffic so network switches can steer ctrl and data packets into different priority queues. NIC_TRAFFIC_CLASS_FIFO=0 (default) is a complete no-op: no ctrl QPs are created and no extra work is done per iteration. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Hoisted QpTransitionResult struct definition above the data QP exchange loop so it is also visible to the ctrl QP exchange loop that follows it. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Gate was using fifoTrafficClass != trafficClass which activated the ctrl path unintentionally when only NIC_TRAFFIC_CLASS was set. Changed all three gate points to fifoTrafficClass != 0 so default NIC_TRAFFIC_CLASS_FIFO=0 is a true no-op. Removed MPI barrier inside ExecuteNicTransfer() — dst rank never calls that function so the barrier deadlocked. Pre-posting recv WRs during setup (where both ranks participate) avoids the need for any per-iteration synchronisation. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…d recvs The ctrl QP on the DST rank pre-posts all recv WRs during setup to avoid per-iteration barriers. However the QP was created with max_recv_wr = cfg.nic.maxRecvWorkReq (default 16), which is smaller than ctrlNumPrepost (numWarmups + numIterations + 5 = 18 with defaults), causing ibv_post_recv to fail. Fix by computing ctrlNumPrepost before ctrl QP creation and passing it as max_recv_wr for the dst ctrl QPs. The optional maxRecvWr parameter is added to CreateQueuePair (-1 means use cfg value, preserving behaviour for all existing data QP callers). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
ExecuteNicTransfer() is called once per sub-iteration inside the loop in RunNicExecutor() The pre-posted recv WR count should be: (numWarmups + numIterations) * numSubIterations Previously the formula omitted the numSubIterations multiplier, which would cause ibv_post_recv failures when NUM_SUBITERS > 1.
There was a problem hiding this comment.
Pull request overview
This PR adds a new NIC control-path QoS knob (NIC_TRAFFIC_CLASS_FIFO) to let TransferBench mark a lightweight signaling path with a different DSCP/traffic-class than the bulk data QPs (similar to NCCL/RCCL’s NCCL_IB_FIFO_TC). The intent is to reduce head-of-line blocking by steering control traffic into a different switch priority queue.
Changes:
- Added
NIC_TRAFFIC_CLASS_FIFOenv var parsing/printing and plumbed it intoConfigOptions::nic. - Added “ctrl/FIFO” QP/CQ setup and a per-iteration 0-byte inline
IBV_WR_SENDbefore the data-path RDMA posts. - Updated changelog entries to document the new env var and equivalence to NCCL knobs.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
src/header/TransferBench.hpp |
Adds NIC ctrl-path QPs/CQs and per-iteration ctrl signaling send before data RDMA posts. |
src/client/EnvVars.hpp |
Adds env var NIC_TRAFFIC_CLASS_FIFO parsing/validation/help text and maps it into cfg.nic.fifoTrafficClass. |
CHANGELOG.md |
Documents the new env var and related QoS options. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Set wr_id = qpIdx before each ibv_post_send so completions can be identified regardless of arrival order Handle ibv_poll_cq returning < 0 (CQ error) to avoid an infinite spin Use wc.wr_id instead of qpIdx in the error message for identifying QP
src ctrl CQ needs qpCount slots (drained each ExecuteNicTransfer call) dst ctrl CQ needs ctrlNumPrepost * qpCount slots since it is never polled and completions accumulate over the full run
… modes ctrlNumPrepost assumes a fixed iteration count. When NUM_ITERATIONS <= 0, such as in timed or infinite mode, when the actual send count is unbounded, it would exhaust the pre-posted recv WRs, which can cause RNR retries or hangs. To workaroundm, exit with an error instead if NIC_TRAFFIC_CLASS_FIFO is set in these modes.
|
After the latest changes above, and I rerun again on a few nodes, and I was able to verify that |
- Gate ctrl QP creation and exchange/connect blocks on rss.srcIsExeNic; RDMA_READ transfers never use ctrl QPs so creating them wastes resources and can fail near device QP/CQ limits - Compute ctrlNumPrepost in int64_t to avoid signed overflow for large iteration counts; drop std::abs (NIC_TRAFFIC_CLASS_FIFO already rejects numIterations<=0); clamp to INT_MAX before casting back to int Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Motivation
We can configure network switches to steer traffic to different priority queues based on DSCP values. Like in the related PR #315 which we added the env var
NIC_TRAFFIC_CLASSto allow marking the bulk data QP traffic. With that we used a single traffic class for both data and control paths, which means control/signaling traffic competes in the same queue as large data transfers, which has a potential to causing head-of-line blocking at larger scale, when there are victim flow which caused a lot of ranks to compete to send back the control path while the same same traffic class is congested.Reference in the section "Receiver-driven traffic admission", because of that NCCL/RCCL addresses this with
NCCL_IB_FIFO_TCwhich separates the traffic class for its control QPs.This PR adds the
NIC_TRAFFIC_CLASS_FIFOenv var to TransferBench, which is the NCCL/RCCL equivalent ofNCCL_IB_FIFO_TC.Technical Details
We introduce a new env var:
NIC_TRAFFIC_CLASS_FIFOwhich sets the GRH traffic_class for a 0-byte signaling message per ctrl QP, which separate from the data QPs, defined byNIC_TRAFFIC_CLASS(introduced in #315).Each
ExecuteNicTransfer()call fires one zero-byte inline IBV_WR_SEND per ctrl QP, which goes on the wire with the specified DSCP, then polls for completion before the data send, andNIC_TRAFFIC_CLASS_FIFO=0(default) is a no-op: no ctrl QPs are created, no overhead per iterationTest Plan
I verified the DSCP value set for QoS on the NIC to confirm:
ack_dscp=46 (TC=184, TC = DSCP << 2)
data_dscp=24 (TC=96, TC = DSCP << 2)
Then I confirm on the (SONiC) switch ports the right DSCP-TC are mapped, and the interfaces are using the right qos-map profile
Then I reset the counters for those interfaces
then use show queue counters on leaf switch(es) that shows ctrl packets land on the correct priority queue and no drops on either queue and bandwidth on the data not being affected
I verified on the smc300x cluster (8x MI300X, RoCE NICs) that without
NIC_TRAFFIC_CLASS_FIFO, the control traffic went on same priority/queue as data QPs.With NIC_TRAFFIC_CLASS_FIFO: ctrl traffic steered to separate queue (DSCP=46/TC=184)
Test Result
Verified on smc300x cluster (8x MI300X, ionic RoCE NICs, leaf switch with RoCE QoS profile):
NIC_TRAFFIC_CLASS_FIFO: all traffic (ctrl + data) lands on the same switch queueNIC_TRAFFIC_CLASS_FIFO=184(DSCP 46): ctrl packets steered to the correct priority queue, data packets remain on their queue; zero drops on both; data bandwidth unaffectedSubmission Checklist