[None][fix] Sync disagg KV transfer error handling under ADP across all executor loops #13902
Open
qiaoxj07 wants to merge 1 commit into NVIDIA:feat/deepseek_v4 from
In disaggregated serving with attention DP, a NIXL/UCX KV transfer
failure is a rank-local event: the failing rank sets the request state
to DISAGG_TRANS_ERROR and proceeds to _check_cache_transfer_errors ->
_handle_errors -> _enqueue_responses -> tp_gather, while peer ranks
without a local error continue to _can_queue -> tp_allgather. Two
different TP collectives on the same group cause an MPI deadlock that
the hang detector eventually surfaces after 300s.
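The mismatched-collective hang can be illustrated with a standalone toy model (not TensorRT-LLM code): each TP collective is modeled as a `threading.Barrier` that every rank in the group must enter, and a short timeout plays the role of the 300s hang detector. The names `tp_gather`/`tp_allgather` mirror the description above; everything else is an assumption for illustration.

```python
import threading

# Toy model of the deadlock: each "collective" is a Barrier that all
# ranks in the TP group must enter. When ranks enter *different*
# collectives, neither barrier ever fills, and every rank blocks until
# the timeout fires -- the analogue of the hang detector after 300s.

WORLD_SIZE = 2
tp_gather = threading.Barrier(WORLD_SIZE)     # entered only by the erroring rank
tp_allgather = threading.Barrier(WORLD_SIZE)  # entered only by the healthy rank
results = {}

def iteration(rank: int, has_local_error: bool) -> None:
    # The erroring rank goes into the error path's collective; the
    # healthy rank goes into the scheduling path's collective.
    collective = tp_gather if has_local_error else tp_allgather
    try:
        collective.wait(timeout=0.2)  # stand-in for the blocking MPI collective
        results[rank] = "ok"
    except threading.BrokenBarrierError:
        results[rank] = "hang"

threads = [threading.Thread(target=iteration, args=(r, r == 0))
           for r in range(WORLD_SIZE)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(results)  # {0: 'hang', 1: 'hang'} -- both ranks stuck in mismatched collectives
```

With more ranks the outcome is the same: any split of the group across two collectives leaves both collectives short of participants.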
Add _handle_disagg_cache_errors_synced, called at the top of every
executor iteration. It tp_allgather's a single bool ("do I have any
local errors") so all TP ranks agree, then enters _handle_errors
together when any rank reports errors. Each rank passes its own
(possibly empty) list of error requests, keeping the downstream
tp_gather in lockstep.
The synced call is wired into all three executor loops -
_executor_loop, _executor_loop_overlap, and _executor_loop_pp - so the
fix covers the overlap path where the hang was originally observed in
dep16 disagg gen. _check_cache_transfer_errors becomes a no-op under
ADP to avoid double-handling.
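The wiring pattern can be sketched with a toy loop: the synced check runs first in every iteration, before any other TP collective, so all ranks agree on whether to enter the error path. The real executor loops in py_executor.py are far more involved; the method bodies here are stubs.

```python
# Toy sketch of the wiring: the synced error check is the first call in
# each loop iteration. Method names mirror the PR description; the
# bodies are stubs that just record the call order.

class LoopSketch:
    def __init__(self, iterations: int):
        self.iterations = iterations
        self.trace = []

    def _handle_disagg_cache_errors_synced(self):
        self.trace.append("synced_error_check")

    def _step(self):
        self.trace.append("schedule_and_forward")

    def _executor_loop(self):  # same shape for _executor_loop_overlap / _pp
        for _ in range(self.iterations):
            self._handle_disagg_cache_errors_synced()  # top of every iteration
            self._step()

loop = LoopSketch(iterations=2)
loop._executor_loop()
print(loop.trace)
# ['synced_error_check', 'schedule_and_forward',
#  'synced_error_check', 'schedule_and_forward']
```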
Per-iteration cost: one tp_allgather of a single bool (no payload
allocation in the common no-error path).
Signed-off-by: Xianjie <5410381+qiaoxj07@users.noreply.github.com>
Summary
Adds _handle_disagg_cache_errors_synced and wires it into the top of all three executor loops (_executor_loop, _executor_loop_overlap, _executor_loop_pp).

Problem
With ADP enabled in disaggregated serving, a NIXL/UCX KV transfer failure is rank-local: only the failing DP rank sees DISAGG_TRANS_ERROR and enters _check_cache_transfer_errors → _handle_errors → _enqueue_responses → tp_gather. Peer ranks without a local error continue to _can_queue → tp_allgather. The two different TP collectives on the same group hang MPI; the hang detector then dumps stacks after 300s.

Repro log (dep16, batch4, eplb0, mtp3): only one rank in tp_gather from _check_cache_transfer_errors, the other 15 in tp_allgather from _can_queue (_executor_loop_overlap).

Approach
A new _handle_disagg_cache_errors_synced:

- tp_allgathers a single bool ("does this rank have local error requests?").
- When any rank reports errors, all ranks enter _handle_errors(...) together, each passing its own (possibly empty) local_error_requests.

It is wired into the top of every iteration of _executor_loop, _executor_loop_overlap, and _executor_loop_pp so the downstream tp_gather in _enqueue_responses is reached by all ranks in lockstep. _check_cache_transfer_errors becomes a no-op under ADP to avoid double-handling; the non-ADP path is unchanged.

Relationship to PR #13900
This is the same root cause and same approach as #13900 by @Shixiaowei02, narrowed to the deadlock fix and extended in two ways:

- #13900 applies the synced handler only in _executor_loop. Our hang was observed in _executor_loop_overlap, so this PR also covers the overlap loop and _executor_loop_pp.
- This PR omits #13900's broader refactors (_pending_transfer_responses for the success path, the _adp_dummy_role refactor, the _count_schedulable_active_requests GENERATION_TO_COMPLETE exclusion, the _pad_attention_dp_dummy_request token_nums change, NIXL diagnostic logging) so it can land independently.

Happy to drop in favor of "[None][fix] Fix dis-agg with ADP and avoid TP collective deadlock" (#13900) if its scope is rebased to cover the overlap loop.

Cost
Per iteration: one tp_allgather(bool) (~30 bytes pickled). The slow path only runs when at least one rank reports errors.

Test plan
tests/unittest/_torch/executor/test_py_executor.py:

- test_synced_error_handler_no_op_when_disagg_disabled
- test_synced_error_handler_no_op_when_attention_dp_disabled
- test_synced_error_handler_skips_when_no_rank_has_errors
- test_synced_error_handler_invokes_handle_errors_when_local_has_errors
- test_synced_error_handler_runs_on_silent_rank_when_peer_has_errors (deadlock reproducer)
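A hedged sketch of the "silent rank" case from the list above: the real test exercises the actual executor, whereas here FakePeerDist and MiniExecutor are stand-ins that fake a 2-rank TP group where only the peer rank reports a KV transfer error.

```python
# Sketch of the deadlock-reproducer test: a rank with *no* local errors
# must still enter _handle_errors when a peer reports one, so the
# downstream tp_gather stays in lockstep. All class names here are
# illustrative stand-ins, not the real test fixtures.

class FakePeerDist:
    def tp_allgather(self, value):
        return [value, True]  # the other rank reports a local error

class MiniExecutor:
    def __init__(self, dist):
        self.dist = dist
        self.handle_errors_calls = []

    def _handle_errors(self, error_requests):
        self.handle_errors_calls.append(error_requests)

    def _handle_disagg_cache_errors_synced(self, local_error_requests):
        if any(self.dist.tp_allgather(bool(local_error_requests))):
            self._handle_errors(local_error_requests)

def test_synced_error_handler_runs_on_silent_rank_when_peer_has_errors():
    ex = MiniExecutor(FakePeerDist())
    ex._handle_disagg_cache_errors_synced(local_error_requests=[])
    # The error-free rank still enters _handle_errors (with an empty
    # list) instead of diverging into a different collective.
    assert ex.handle_errors_calls == [[]]

test_synced_error_handler_runs_on_silent_rank_when_peer_has_errors()
print("ok")
```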