Skip to content

Fix CUDA block-sparse matrix-vector product contribution routing#2813

Closed
LwhJesse wants to merge 1 commit into
su2code:developfrom
LwhJesse:fix/cuda-block-sparse-matvec-routing
Closed

Fix CUDA block-sparse matrix-vector product contribution routing#2813
LwhJesse wants to merge 1 commit into
su2code:developfrom
LwhJesse:fix/cuda-block-sparse-matvec-routing

Conversation

@LwhJesse
Copy link
Copy Markdown

Summary

This draft PR fixes a CUDA block-sparse matrix-vector product contribution-routing issue in:

Common/src/linear_algebra/CSysMatrixGPU.cu

The current CUDA matvec kernel derives the output block row from the fixed lane/thread id, while the matrix entry being processed is selected by a loop index that changes by activeThreads.

For some valid block-sparse inputs, this can route a contribution to the wrong output component.

This first draft keeps the existing atomic accumulation structure and only fixes the routing of each contribution. A runnable reproducer, full CUDA smoke testing, and performance cleanup will follow.

Current source-level issue

The current CUDA kernel computes:

threadNo = threadIdx.x % 32
activeThreads = nVar * (32 / nVar)
blockRow = (threadNo / nVar) % nVar

and then loops over matrix entries using:

index += activeThreads

At the end, the accumulated value is added to:

prod[row * nVar + blockRow]

The problem is that blockRow is derived from the fixed threadNo, while the actual dense-block entry is determined by the current index.

For a block-sparse matrix entry:

matrix[block*nVar*nEqn + blockRow*nEqn + blockCol]

the contribution must go to:

prod[row*nVar + blockRow]

Therefore the correct dense-block local row for each processed matrix entry is determined by the current index:

localIndex   = index % (nVar * nEqn)
trueBlockRow = localIndex / nEqn
blockCol     = localIndex % nEqn

If trueBlockRow != blockRow_from_threadNo, the current kernel can route the contribution to the wrong output component.

Minimal source-level counterexample

Consider this valid block-sparse input:

nPointDomain = 1
nVar = 5
nEqn = 5

row_ptr = [0, 2]
col_ind = [0, 1]

matrix length = 2 * 5 * 5 = 50
vec length    = 2 * 5     = 10
prod length   = 1 * 5     = 5

Set all values to zero except:

matrix[30] = 7
vec[5]     = 2

Interpretation:

matrix[30]
  = block 1, local offset 5
  = blockRow 1, blockCol 0

vec[5]
  = column block 1, blockCol 0

The expected block-sparse matvec result is:

prod = [0, 14, 0, 0, 0]

For nVar = nEqn = 5:

activeThreads = nVar * floor(32 / nVar)
              = 5 * 6
              = 30

For threadNo = 0, the current kernel computes:

blockRow_from_threadNo = (0 / 5) % 5 = 0

The loop executed by that lane visits:

index = 0
index = 30

For index = 30:

dense block size = nVar * nEqn = 25

blockNo      = 30 / 25 = 1
localIndex   = 30 % 25 = 5
trueBlockRow = 5 / 5   = 1
blockCol     = 5 % 5   = 0

So the nonzero contribution is:

matrix[30] * vec[5] = 7 * 2 = 14

By block-sparse matvec semantics, this belongs to:

prod[row * nVar + trueBlockRow] = prod[1]

However, the current kernel routes the lane's accumulated result to:

prod[row * nVar + blockRow_from_threadNo] = prod[0]

Therefore the expected result is:

prod = [0, 14, 0, 0, 0]

while the current CUDA mapping can produce:

prod = [14, 0, 0, 0, 0]

This is a deterministic output-component routing error. It is not a floating-point roundoff issue and not an atomicAdd() nondeterminism issue.

Fix in this draft

This draft computes the dense-block local row from each current index:

trueBlockRow = (index % (nVar * nEqn)) / nEqn

and routes each contribution to:

prod[row*nVar + trueBlockRow]

The existing atomic accumulation structure is kept for now in order to keep this first draft focused strictly on correctness.

This may increase the number of atomic operations compared with the previous accumulation pattern. Performance cleanup is intentionally left for a later revision after the correctness issue is pinned down.

Scope

This draft PR only touches the CUDA block-sparse matvec path in:

Common/src/linear_algebra/CSysMatrixGPU.cu

It does not change:

solver-level matrix upload lifetime
HtDTransfer / DtHTransfer policy
preconditioners
FGMRES
residual storage
non-CUDA matvec paths
CFD solver logic
physics models

Draft status

- Source-level correctness issue documented
- Minimal counterexample documented
- Minimal contribution-routing fix implemented
- Runnable reproducer pending
- Full SU2 CUDA smoke test pending
- Performance measurement pending

PR Checklist

  • I am submitting my contribution to the develop branch.
  • My contribution generates no new compiler warnings.
  • My contribution is commented and consistent with SU2 style.
  • I used the pre-commit hook to prevent dirty commits and used pre-commit run --all to format old commits.
  • I have added a test case that demonstrates my contribution, if necessary.
  • I have updated appropriate documentation, if necessary.

@LwhJesse
Copy link
Copy Markdown
Author

Closing this draft because the scope has changed.

This PR attempted to fix the routing issue inside the existing custom CUDA matvec kernel. After further testing, I found that the issue is better addressed by replacing the custom kernel with cuSPARSE BSR SpMV for the square-block CUDA path.

I opened a new draft PR with the replacement approach and validation results here:

#2816

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant