[None][fix] PyExecutor Hang in Disagg TP Prefill #14020
Conversation
Signed-off-by: jthomson04 <jwillthomson19@gmail.com>
Walkthrough
The PR adds Tensor-Parallelism-wide allreduce consensus to disaggregated context-cache status checking, replacing rank-local conditions in the PP executor loop and batch preparation. Each rank computes locally whether it needs the check; all ranks then synchronize via ReduceOp.MAX, and if any rank needs it, each rank calls the blocking or non-blocking variant based on both the consensus and its local state.
/bot run --disable-fail-fast
PR_Github #47868 [ run ] triggered by Bot. Commit:
In disaggregated serving on TP ≥ 2, the CTX executor can deadlock because entry into _check_disagg_ctx_cache_transfer_status (which performs a TP-collective allgather inside CacheTransceiver::checkContextTransferStatus) is gated on rank-local values from the local scheduler. When per-rank KV-block free counts drift even by a single block (which happens under load because UCX send completion and CUDA event sync timing vary per rank), num_fitting_reqs flips from >0 to 0 on a subset of ranks. The "0 fits" ranks enter the conditional and call the allgather; the "1+ fits" ranks skip it and proceed to the next phase, which has its own TP collective (MLA attention's MNNVL allreduce on the GPU; with a KV connector, an mpi_broadcast in prepare_resources). Half the ranks wait at one collective and the other half at another; neither completes, and hang_detector trips at 300s.
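To make the split concrete, here is a schematic of the pre-fix control flow. It is a paraphrase, not the literal executor code: the stubs and executor_iteration are invented for illustration, while num_fitting_reqs and the at-least-one-request gating come from the description above.

```python
# Schematic only: stubs stand in for the real executor internals
# (CacheTransceiver is C++, MLA attention is a CUDA kernel).
def check_context_transfer_status(at_least_request_num: int) -> None:
    """Stub for collective A: the TP allgather inside the transceiver."""

def mla_attention_forward() -> None:
    """Stub for collective B: the next phase's TP collective."""

def executor_iteration(num_fitting_reqs: int, has_inflight: bool) -> None:
    # Rank-local gate: depends on this rank's free KV-block count.
    if num_fitting_reqs == 0 and not has_inflight:
        check_context_transfer_status(at_least_request_num=1)  # collective A
    mla_attention_forward()                                    # collective B

# On a divergent iteration (rank0: num_fitting_reqs=1, rank4: 0), rank4
# blocks in collective A while rank0 blocks in collective B; neither
# collective ever reaches full TP quorum.
```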
In a reproduction, the per-iteration "[batchmgr] Capacity scheduler allows N requests" log shows ranks agreeing for hundreds of iterations and then diverging on a single iteration (e.g. rank0 = 1, rank4 = 0). In the hang dump from that iteration, ranks with 1 are stuck in attention.mla_custom_op_inplace and ranks with 0 are stuck in kv_cache_transceiver.check_context_transfer_status. This change OR-reduces the gating condition across TP ranks: if any rank locally wants the call, all ranks call it. Ranks that locally don't need it use the non-blocking variant (atLeastRequestNum=0), so the internal allgather still has full quorum without making them wait on futures they don't have. The same OR pattern is applied defensively to the mirror site in _pp_schedule_and_propagate (has_any_inflight_requests there is rank-local even though num_fitting_reqs is broadcast).
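The consensus itself is a one-element MAX-allreduce. A minimal sketch follows, using torch.distributed as a stand-in for the executor's TP communicator; the helper names and the at_least_request_num keyword are illustrative assumptions, not the PR's actual API:

```python
import torch
import torch.distributed as dist

def any_rank_needs_check(local_need: bool, tp_group) -> bool:
    # OR across ranks is MAX over {0, 1}: if any rank contributes 1, all see 1.
    # Use a CUDA tensor instead if the TP group runs on the NCCL backend.
    flag = torch.tensor([1 if local_need else 0], dtype=torch.int32)
    dist.all_reduce(flag, op=dist.ReduceOp.MAX, group=tp_group)
    return bool(flag.item())

def maybe_check_transfer_status(kv_cache_transceiver, num_fitting_reqs,
                                has_any_inflight_requests, tp_group):
    local_need = num_fitting_reqs == 0 and not has_any_inflight_requests
    if any_rank_needs_check(local_need, tp_group):
        # Blocking only where futures exist; elsewhere the non-blocking
        # variant still joins the internal allgather, so quorum is full.
        kv_cache_transceiver.check_context_transfer_status(
            at_least_request_num=1 if local_need else 0)
```

MAX rather than SUM keeps the reduced value a plain 0/1 flag regardless of TP size.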
check_context_transfer_status is idempotent on a rank with no pending sender futures: the allgather contributes an empty set, the "complete on every rank" intersection is empty, and the future-iteration loop is a no-op. markComplete=true semantics are unchanged; a request is still only marked kDISAGG_CONTEXT_COMPLETE when every rank reports it complete. The added cost is one extra empty allgather per iteration.
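For intuition, an illustrative Python rendering of that idempotent path follows; the real logic is C++ in CacheTransceiver::checkContextTransferStatus, and the tp_group allgather and mark_request_complete callback here are invented for the sketch:

```python
import torch.distributed as dist

def check_transfer_status_sketch(sender_futures, tp_group, mark_request_complete):
    # Locally complete sender futures; an empty dict yields an empty list.
    locally_done = [rid for rid, fut in sender_futures.items() if fut.done()]
    # The allgather always runs, so even an idle rank supplies quorum.
    gathered = [None] * dist.get_world_size(group=tp_group)
    dist.all_gather_object(gathered, locally_done, group=tp_group)
    # "Complete on every rank": intersect the per-rank completion sets.
    done_everywhere = set(gathered[0]).intersection(*map(set, gathered[1:]))
    # With no pending futures this loop is a no-op, so the extra call's
    # only cost is the one empty allgather per iteration noted above.
    for rid in done_everywhere:
        mark_request_complete(rid)  # -> kDISAGG_CONTEXT_COMPLETE
```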