
[https://nvbugs/6104831][draft] PR 13713 rebased onto v1.3.0rc13 — v7#13913

Draft
yifjiang wants to merge 9 commits into NVIDIA:main from yifjiang:pr13713-on-rc13-v7

yifjiang commented May 8, 2026

Copy of PR #13713 rebased onto v1.3.0rc13, branched off PR #13727's HEAD 66746088b for additional iteration.

Same 8 commits as PR #13713 (HEAD 5719008e22ec6b5fa6612525c556802ce593f937):

  1. `[fix] Disagg request cancellation fix`
  2. `[test] Add reproducers for broken-promise on disagg cancellation`
  3. `[test] Add disagg cancellation regression tests`
  4. `Fail closed on unquiesced disagg KV transfer`
  5. `[fix] Incorporate PR#13728's improvements`
  6. `Narrow Python disagg transfer timeout handling`
  7. `Defer context transfer cleanup after timeout cancel`
  8. `[fix] Complete deferred disagg cleanup after transfer` (the latest fix landing the deferred-cleanup retry path)

Maintained as a working branch for additional changes on top. PR #13727 stays as the pristine "PR 13713 rebased" reference.

Status: draft, do not merge.

chienchunhung and others added 8 commits May 7, 2026 17:26
Signed-off-by: Chien-Chun Hung <2679986+chienchunhung@users.noreply.github.com>
…disagg cancellation.

Signed-off-by: Chien-Chun Hung <2679986+chienchunhung@users.noreply.github.com>
(cherry picked from commit 944561d)
Signed-off-by: Chien-Chun Hung <2679986+chienchunhung@users.noreply.github.com>
Signed-off-by: Chien-Chun Hung <2679986+chienchunhung@users.noreply.github.com> 1777930306 -0700
Signed-off-by: Chien-Chun Hung <2679986+chienchunhung@users.noreply.github.com>
Signed-off-by: Chien-Chun Hung <2679986+chienchunhung@users.noreply.github.com>
Signed-off-by: Chien-Chun Hung <2679986+chienchunhung@users.noreply.github.com>

# Conflicts:
#	tensorrt_llm/_torch/pyexecutor/py_executor.py
…transfer

Signed-off-by: Chien-Chun Hung <2679986+chienchunhung@users.noreply.github.com>
svc-trtllm-gh-bot added the "Community want to contribute" label (PRs initiated from Community) May 8, 2026
…state to Python

Adds public accessors `isRecvPoolPoisoned()` / `isSendPoolPoisoned()` on
BaseTransBufferManager and BaseCacheTransceiver, surfacing the existing
`ConcurrenceResource::mPoisoned` flag.

Once any underlying pool's mPoisoned flag is set, BaseTransBufferManager::
assignBufferIndex throws unconditionally for the lifetime of the process
(see baseTransBuffer.cpp:350 and the matching message: "The process must
restart before these memory ranges can be safely reused"). Until now the
only signal surfaced to callers was a per-request RequestError carrying
the C++ exception text, which forces higher layers to string-match the
message to distinguish a permanent worker-fatal state from genuinely
transient per-request KV-transfer errors.

This commit gives Python (the dynamo request handler / readiness probe)
a structured, contract-stable way to ask "is this worker still able to
serve disagg KV transfers, or must it be restarted?" without parsing
error strings.
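
As a rough illustration of the probe described above, the sketch below decides between "ready" and "restart required" from the two accessors instead of matching exception text. Only the method names `is_recv_pool_poisoned` / `is_send_pool_poisoned` come from this commit; `WorkerHealth`, `FakeTransceiver`, and `probe_worker` are hypothetical names invented for the example, and the real dynamo handler may be structured differently.

```python
from dataclasses import dataclass
from enum import Enum


class WorkerHealth(Enum):
    """Hypothetical probe result; not part of the PR."""
    READY = "ready"
    RESTART_REQUIRED = "restart_required"


@dataclass
class FakeTransceiver:
    """Hypothetical stand-in exposing the two accessors named in the commit."""
    recv_poisoned: bool = False
    send_poisoned: bool = False

    def is_recv_pool_poisoned(self) -> bool:
        return self.recv_poisoned

    def is_send_pool_poisoned(self) -> bool:
        return self.send_poisoned


def probe_worker(transceiver) -> WorkerHealth:
    """Report a permanent, worker-fatal state if either pool is poisoned.

    A poisoned pool never recovers within the process, so the probe can
    answer without inspecting any per-request error message.
    """
    if transceiver.is_recv_pool_poisoned() or transceiver.is_send_pool_poisoned():
        return WorkerHealth.RESTART_REQUIRED
    return WorkerHealth.READY
```

A per-request KV-transfer error seen while `probe_worker` still reports `READY` can then be treated as transient, under the assumption that poisoning is the only permanent failure mode of interest here.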

Changes:
- baseTransBuffer.h: two inline `noexcept` accessors that read the
  per-direction `mPoisoned` atomics (relaxed load mirrors the producer
  in poisonBufferIndex).
- cacheTransceiver.h: virtual `isRecvPoolPoisoned()` /
  `isSendPoolPoisoned()` on BaseCacheTransceiver with a `return false`
  default so non-disagg subclasses (test mocks, generation-first Python
  transceiver) need not override. CacheTransceiver overrides walk the
  KV + RNN buffer-manager pointers and OR the per-manager state.
- nanobind: bind the two methods on BaseCacheTransceiver. No trampoline
  size change required (defaults are concrete).
- _torch/pyexecutor/kv_cache_transceiver.py: matching `is_recv_pool_poisoned`
  / `is_send_pool_poisoned` on the Python ABC (default False) and
  delegating implementations on `BindKvCacheTransceiver`.
- tests/unittest/others/test_kv_cache_transceiver.py: regression test
  asserting both accessors return False on a fresh transceiver pair
  (NIXL + UCX backends).

Signed-off-by: Yifan Jiang <19356972+yifjiang@users.noreply.github.com>
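
The layering listed under "Changes" can be pictured with a small self-contained Python model: a base class whose accessors default to False, a native-side object that ORs per-manager state, and a wrapper that delegates to it. This is a sketch only. Every class name below is hypothetical, the real accessors live in C++ behind nanobind, and skipping `None` managers is an assumption about how absent KV or RNN buffer managers are handled.

```python
from abc import ABC
from typing import Optional, Sequence


class TransceiverBase(ABC):
    """Hypothetical model of the Python ABC: accessors default to False."""

    def is_recv_pool_poisoned(self) -> bool:
        return False

    def is_send_pool_poisoned(self) -> bool:
        return False


class FakeBufferManager:
    """Hypothetical stand-in for one buffer manager's per-direction flags."""

    def __init__(self, recv_poisoned: bool = False, send_poisoned: bool = False):
        self.recv_poisoned = recv_poisoned
        self.send_poisoned = send_poisoned


class FakeNativeTransceiver:
    """Hypothetical stand-in for the bound C++ object.

    ORs the state of every manager that is present (assumed: None = absent).
    """

    def __init__(self, managers: Sequence[Optional[FakeBufferManager]]):
        self._managers = [m for m in managers if m is not None]

    def is_recv_pool_poisoned(self) -> bool:
        return any(m.recv_poisoned for m in self._managers)

    def is_send_pool_poisoned(self) -> bool:
        return any(m.send_poisoned for m in self._managers)


class DelegatingTransceiver(TransceiverBase):
    """Forwards both accessors to the wrapped native object."""

    def __init__(self, impl: FakeNativeTransceiver):
        self._impl = impl

    def is_recv_pool_poisoned(self) -> bool:
        return self._impl.is_recv_pool_poisoned()

    def is_send_pool_poisoned(self) -> bool:
        return self._impl.is_send_pool_poisoned()


class NonDisaggTransceiver(TransceiverBase):
    """Subclass that inherits the False defaults without overriding them."""


# Fresh transceiver: both accessors report False, as the regression test asserts.
fresh = DelegatingTransceiver(FakeNativeTransceiver([FakeBufferManager(), None]))
# One poisoned manager is enough to flip the aggregated recv state.
mixed = DelegatingTransceiver(
    FakeNativeTransceiver([FakeBufferManager(), FakeBufferManager(recv_poisoned=True)])
)
```

Keeping the defaults concrete means subclasses such as `NonDisaggTransceiver` need no changes, which matches the commit's note that test mocks and the generation-first transceiver need not override.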