[JACCL] Fix silent data corruption from unchecked RDMA work completion status#3152
[JACCL] Fix silent data corruption from unchecked RDMA work completion status#31520xDaizz wants to merge 1 commit intoml-explore:mainfrom
Conversation
|
That looks great thanks! I will run some benchmarks (it shouldn't affect them) and then will merge. |
|
@0xDaizz I 'll check after the tests pass. |
|
@angeloskath Just fixed the lint error! |
Add check_wc_status() to validate ibv_wc.status after every poll() call. Previously, RDMA transport errors (e.g. IBV_WC_RETRY_EXC_ERR) were silently ignored, leading to corrupted wr_id parsing and potential data corruption. Changes: - utils.h: add wc_status_name() and check_wc_status() helpers - mesh_impl.h: add status checks at 4 poll sites - ring_impl.h: add status checks at 5 poll sites Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
3226a89 to
9fdfd73
Compare
|
@angeloskath Rebased onto the latest Changes (rebased)
Total: 3 files, +61 lines. Single commit on top of Contributing checklist
|
|
Ran the benchmarks and it does have a consistent impact for latency sensitive message sizes. ~10% on 16KB 4-way all reduce (or about 1μs extra). The slowdown obviously is not from the condition but could be a code gen issue or similar. I am not saying we shouldn't merge this but given that we treat the TB as a reliable channel I am inclined to wait for a refactor/optimization pass on all of JACCL to remedy this. I will leave it open for now. |
|
Thanks for the detailed benchmarks. Noted on the ~1μs regression for latency-sensitive sizes. Happy to keep this open and fold it into a broader JACCL optimization pass. Let me know if I can help when the time comes. |
Summary
The JACCL RDMA backend polls
ibv_wcwork completions to track in-flight RDMA operations but never checkswc[i].status. When an RDMA operation fails (e.g., due to memory pressure), the failure is silently ignored and the receive buffer—still containing stale or uninitialized data—is used as the result of the collective operation. This can cause silent data corruption in JACCL-based distributed workloads when RDMA operations fail. This PR addswc[i].statusvalidation to all 9 poll loops acrossring.cppandmesh.cpp, converting silent corruption into immediate, descriptive errors.Problem
All 9 completion-polling loops in the JACCL backend (5 in
ring.cpp, 4 inmesh.cpp) follow the same pattern:The
wc[i].statusfield is never checked. Per the IBV specification, a polled completion withstatus != IBV_WC_SUCCESSindicates that the corresponding RDMA operation failed. The data in the associated receive buffer is undefined.Additionally, the return value of
ibv_poll_cq(wrapped bypoll()) is not checked for negative values, which indicate a polling error.Affected functions:
RingGroup::all_gather(1 loop)RingGroup::send(1 loop)RingGroup::recv(1 loop)RingGroup::all_reduce_impl(2 loops: reduce-scatter phase + all-gather phase)MeshGroup::all_gather(1 loop)MeshGroup::send(1 loop)MeshGroup::recv(1 loop)MeshGroup::all_reduce(1 loop)Impact
Fix
Centralized helpers in
utils.h— instead of duplicating error-handling code in each file:wc_status_name()— maps allibv_wc_statusenum values to human-readable strings (full coverage including UNKNOWN fallback)check_wc_status(const ibv_wc&)— single-line status check that throws a descriptivestd::runtime_errorpoll()helpers inutils.hfixed to checkibv_poll_cqnegative returns internally:Connection::poll()— now throws on negative returnpoll(vector, ...)— now throws on negative return from any CQError messages include
qp_num— valid on error per IBV spec, useful for identifying which connection failed in multi-peer topologiesCall sites simplified — each of the 9 poll loops in
ring.cppandmesh.cppnow uses justcheck_wc_status(wc[i])instead of inline error handlingExample error message:
Testing