Fix CUDA block-sparse matrix-vector product contribution routing#2813
Closed
LwhJesse wants to merge 1 commit into
Closed
Fix CUDA block-sparse matrix-vector product contribution routing#2813LwhJesse wants to merge 1 commit into
LwhJesse wants to merge 1 commit into
Conversation
Author
|
Closing this draft because the scope has changed. This PR attempted to fix the routing issue inside the existing custom CUDA matvec kernel. After further testing, I found that the issue is better addressed by replacing the custom kernel with cuSPARSE BSR SpMV for the square-block CUDA path. I opened a new draft PR with the replacement approach and validation results here: |
6 tasks
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
This draft PR fixes a CUDA block-sparse matrix-vector product contribution-routing issue in:
The current CUDA matvec kernel derives the output block row from the fixed lane/thread id, while the matrix entry being processed is selected by a loop index that changes by
activeThreads.For some valid block-sparse inputs, this can route a contribution to the wrong output component.
This first draft keeps the existing atomic accumulation structure and only fixes the routing of each contribution. A runnable reproducer, full CUDA smoke testing, and performance cleanup will follow.
Current source-level issue
The current CUDA kernel computes:
and then loops over matrix entries using:
At the end, the accumulated value is added to:
The problem is that
blockRowis derived from the fixedthreadNo, while the actual dense-block entry is determined by the currentindex.For a block-sparse matrix entry:
the contribution must go to:
Therefore the correct dense-block local row for each processed matrix entry is determined by the current
index:If
trueBlockRow != blockRow_from_threadNo, the current kernel can route the contribution to the wrong output component.Minimal source-level counterexample
Consider this valid block-sparse input:
Set all values to zero except:
Interpretation:
The expected block-sparse matvec result is:
For
nVar = nEqn = 5:For
threadNo = 0, the current kernel computes:The loop executed by that lane visits:
For
index = 30:So the nonzero contribution is:
By block-sparse matvec semantics, this belongs to:
However, the current kernel routes the lane's accumulated result to:
Therefore the expected result is:
while the current CUDA mapping can produce:
This is a deterministic output-component routing error. It is not a floating-point roundoff issue and not an
atomicAdd()nondeterminism issue.Fix in this draft
This draft computes the dense-block local row from each current
index:and routes each contribution to:
The existing atomic accumulation structure is kept for now in order to keep this first draft focused strictly on correctness.
This may increase the number of atomic operations compared with the previous accumulation pattern. Performance cleanup is intentionally left for a later revision after the correctness issue is pinned down.
Scope
This draft PR only touches the CUDA block-sparse matvec path in:
It does not change:
Draft status
PR Checklist
pre-commit run --allto format old commits.