@Beichen-Ma

The ring attention implementation does not support Flash Attention 3 yet, so running with `--attn-implementation flash_attention_3` and `--context-parallel-size > 1` silently produces NaN loss during training. This change adds an early validation check that rejects the combination up front and raises a clear error with actionable guidance.
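
The diff itself is not shown in this conversation, but the check described could look roughly like the sketch below. The function name `validate_args`, the attribute names `attn_implementation` and `context_parallel_size`, and the `flash_attention_2` fallback suggestion are assumptions for illustration; only the two CLI flags come from the description above.

```python
import argparse


def validate_args(args: argparse.Namespace) -> None:
    # Hypothetical early validation, run right after argument parsing.
    # Ring attention is the backend used when context parallelism is
    # enabled; it has no Flash Attention 3 support yet, so reject the
    # combination before training starts instead of letting it proceed
    # to a silent NaN loss.
    if (args.attn_implementation == "flash_attention_3"
            and args.context_parallel_size > 1):
        raise ValueError(
            "--attn-implementation flash_attention_3 is not supported with "
            "--context-parallel-size > 1 (ring attention does not support "
            "Flash Attention 3 yet). Use --attn-implementation "
            "flash_attention_2 or set --context-parallel-size 1."
        )
```

Failing fast at parse time like this is the point of the change: the bad configuration is reported with the exact flags to adjust, rather than surfacing much later as an unexplained NaN loss mid-training.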

