Fix Stage 0 + Ulysses crash: make bwc_tensor_model_parallel_rank() resilient to MP API absence by nathon-lee · Pull Request #7888 · deepspeedai/DeepSpeed

nathon-lee · 2026-03-06T06:59:13Z

Title

Fix Stage 0 + Ulysses crash: make bwc_tensor_model_parallel_rank() resilient to MP API absence

Summary

This PR fixes a hard crash when using Ulysses sequence parallelism with ZeRO Stage 0 (BF16_Optimizer).
In this configuration, DeepSpeed calls deepspeed.utils.bwc.bwc_tensor_model_parallel_rank(mpu=...), and the passed mpu object can be deepspeed.runtime.sequence_parallel.parallel_state_sp, which does not implement the deprecated get_model_parallel_rank() API. The current fallback path unconditionally calls mpu.get_model_parallel_rank(), raising AttributeError.

The fix adds a defensive capability check before calling the deprecated API. If the provided mpu does not expose any known tensor/model-parallel rank API, we treat it as “no tensor model parallelism” and return rank 0.

Motivation / Context

Affected scenario: Ulysses sequence parallel + ZeRO Stage 0
Failure mode: AttributeError: ... parallel_state_sp has no attribute get_model_parallel_rank
Root cause: bwc_tensor_model_parallel_rank() falls back to a deprecated API without a hasattr() check.

This change keeps the original priority order intact:

1. get_tensor_model_parallel_rank()
2. get_slice_parallel_rank()
3. get_model_parallel_rank() (deprecated)
4. Fall back to 0 if none of the above exist
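
The priority order above can be sketched as a standalone function. This is a minimal illustration and not the code in deepspeed/utils/bwc.py: the name resolve_tp_rank, the _RANK_ACCESSORS tuple, and the stand-in objects are hypothetical, and returning 0 when mpu is None is an assumption made for the sketch.

```python
from types import SimpleNamespace

# Accessor names in the priority order listed above; the last one is deprecated.
_RANK_ACCESSORS = (
    "get_tensor_model_parallel_rank",
    "get_slice_parallel_rank",
    "get_model_parallel_rank",
)


def resolve_tp_rank(mpu=None):
    """Hypothetical stand-in for the guarded rank lookup described in this PR."""
    if mpu is None:
        # Assumption for this sketch: no mpu means no tensor parallelism.
        return 0
    for name in _RANK_ACCESSORS:
        accessor = getattr(mpu, name, None)
        if callable(accessor):
            # First accessor found wins, which preserves the priority order.
            return accessor()
    # None of the known APIs exist: treat as "no tensor model parallelism".
    return 0


# Stand-ins for illustration only (not real DeepSpeed or Megatron objects).
megatron_like = SimpleNamespace(get_tensor_model_parallel_rank=lambda: 3)
legacy_like = SimpleNamespace(get_model_parallel_rank=lambda: 1)
sp_only = SimpleNamespace(get_sequence_parallel_world_size=lambda: 2)
both = SimpleNamespace(
    get_tensor_model_parallel_rank=lambda: 5,
    get_model_parallel_rank=lambda: 9,
)
```

With these stand-ins, an object exposing any of the three accessors keeps its reported rank, and only an object exposing none of them (such as sp_only) takes the new path that returns 0.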

Changes

deepspeed/utils/bwc.py
- Update bwc_tensor_model_parallel_rank() to check hasattr(mpu, "get_model_parallel_rank") before calling it.
- If mpu provides none of the expected tensor/model-parallel rank APIs, return 0 (no TP).

Why this is safe

For Megatron / DeepSpeed Topology / any existing MPU that already implements get_tensor_model_parallel_rank() or get_slice_parallel_rank() or get_model_parallel_rank(), behavior is unchanged.
The new code path only affects the previously-crashing case where the mpu object does not provide any of these methods.

Reproduction

Using the Ulysses ALST tutorial flow, switching the ZeRO stage from 3 to 0 triggers the crash during the optimizer step, when the gradient norm is computed.

Testing

Existing unit tests should continue to pass.
Minimal repro: calling bwc_tensor_model_parallel_rank(mpu=deepspeed.runtime.sequence_parallel.parallel_state_sp) should no longer raise.

References

DeepSpeed Issue #7833: [BUG] Ulysses crashes in Stage 0

tohtana · 2026-03-06T20:41:00Z

Hi @nathon-lee,
Thank you for reporting!

I found that we already have a fallback from get_model_parallel_world_size to get_sequence_parallel_world_size.
This was introduced in #7649. Can you make sure that the latest version still raises the error?

nathon-lee · 2026-03-08T03:27:32Z

Hi @tohtana, thanks for looking into this!

I double-checked master and I think #7649 addresses a different fallback path than the one crashing in Stage 0.

The crash in #7833 ([BUG] Ulysses crashes in Stage 0) happens in deepspeed/utils/bwc.py:bwc_tensor_model_parallel_rank() (called from BF16_Optimizer.get_grads_for_norm() in ZeRO Stage 0).
In current master, bwc_tensor_model_parallel_rank() falls back to the deprecated API unconditionally:

else:
    return mpu.get_model_parallel_rank()

When mpu is deepspeed.runtime.sequence_parallel.parallel_state_sp (Ulysses), that module does not implement get_model_parallel_rank(), so we still hit:

AttributeError: ... parallel_state_sp has no attribute 'get_model_parallel_rank'
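
The failure can be illustrated without importing DeepSpeed. The sketch below assumes the control flow described in this thread (two guarded branches followed by an unguarded else); unguarded_rank and sp_only_mpu are hypothetical stand-ins, not the real function or the real parallel_state_sp module.

```python
from types import SimpleNamespace


def unguarded_rank(mpu):
    """Hypothetical copy of the pre-fix control flow: the last branch has no guard."""
    if hasattr(mpu, "get_tensor_model_parallel_rank"):
        return mpu.get_tensor_model_parallel_rank()
    elif hasattr(mpu, "get_slice_parallel_rank"):
        return mpu.get_slice_parallel_rank()
    else:
        # Called unconditionally, even when the attribute does not exist.
        return mpu.get_model_parallel_rank()


# Stand-in for an mpu that only exposes sequence-parallel APIs.
sp_only_mpu = SimpleNamespace(get_sequence_parallel_world_size=lambda: 2)

try:
    unguarded_rank(sp_only_mpu)
    raised = False
    message = ""
except AttributeError as exc:
    raised = True
    message = str(exc)
```

Here raised ends up True and message names the missing get_model_parallel_rank attribute, which mirrors the error reported for Stage 0.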

My understanding is that #7649 added/adjusted APIs and fallbacks around world size (e.g., get_model_parallel_world_size -> get_sequence_parallel_world_size), but it doesn't protect the rank fallback above.

That’s why I opened #7888: it makes bwc_tensor_model_parallel_rank() defensive (checks for get_model_parallel_rank before calling it, and otherwise treats it as “no tensor MP” and returns rank 0).

If you have a preferred behavior (e.g., should we fall back to a sequence-parallel rank instead of 0?), I’m happy to adjust the PR.

tohtana · 2026-03-08T23:27:36Z

Hi @nathon-lee,
Thank you for the reply. I still don't see why this issue happens with the latest master.
Can you share a small repro?

Copilot AI and others added 7 commits February 27, 2026 06:30

Initial plan

001f77c

Revert "fix: update 1 file reformatted."

b90aee5

This reverts commit ff88670. Co-authored-by: nathon-lee <248585198+nathon-lee@users.noreply.github.com>

Merge pull request #5 from nathon-lee/copilot/git-revert-ff886701

b6da9af

Revert "fix: update 1 file reformatted." (ff88670)

Merge branch 'deepspeedai:master' into master

bb7f64f

Initial plan

cbc816c

Reapply "fix: update 1 file reformatted."

5fcc9a7

This reverts commit b90aee5.

Merge pull request #6 from nathon-lee/copilot/remove-commits-from-master

f7c5d75

Revert accidental Muon optimizer code re-introduction from copilot PRs

nathon-lee requested review from tjruwase and tohtana as code owners March 6, 2026 06:59

nathon-lee changed the title ~~Fix iss 7833~~ Fix issue 7833 Mar 6, 2026

nathon-lee changed the title ~~Fix issue 7833~~ Fix issue #7833 Mar 6, 2026

nathon-lee changed the title ~~Fix issue #7833~~ Fix Stage 0 + Ulysses crash: make bwc_tensor_model_parallel_rank() resilient to MP API absence Mar 6, 2026

nathon-lee mentioned this pull request Mar 6, 2026

fix: Validate fp16.loss_scale is finite and non-negative #7889

Merged

Enhance tensor model parallel rank retrieval

d11732f

Add check for model parallel rank in mpu. Signed-off-by: nathon-lee <leejianwoo@gmail.com>

nathon-lee force-pushed the fix_iss_7833 branch from 5b1f8c8 to d11732f Compare March 8, 2026 03:35

nathon-lee wants to merge 8 commits into deepspeedai:master from nathon-lee:fix_iss_7833
