Skip to content

Replace custom CUDA CSysMatrix matvec with cuSPARSE BSR SpMV#2816

Open
LwhJesse wants to merge 1 commit into
su2code:developfrom
LwhJesse:replace-cuda-matvec-with-cusparse-bsr
Open

Replace custom CUDA CSysMatrix matvec with cuSPARSE BSR SpMV#2816
LwhJesse wants to merge 1 commit into
su2code:developfrom
LwhJesse:replace-cuda-matvec-with-cusparse-bsr

Conversation

@LwhJesse
Copy link
Copy Markdown

@LwhJesse LwhJesse commented May 17, 2026

Summary

This PR replaces the custom CUDA CSysMatrix block-sparse matrix-vector product kernel with cuSPARSE BSR SpMV for the CUDA square-block path.

This is not only a performance or maintenance refactor. It is primarily a correctness fix.

The previous custom CUDA kernel has two independent numerical correctness problems:

  1. It silently produces incorrect results for some valid square-block inputs.
  2. It also silently executes on rectangular block layouts (nVar != nEqn), but that path was never semantically correct and can produce wrong results.

Because of these issues, the previous CUDA kernel was not semantically safe for all inputs accepted by the interface. For the square-block CUDA path, the matrix storage is already BSR-like, so this PR replaces the custom kernel with cuSPARSE BSR SpMV instead of extending a path with known correctness mismatches.

What changed

  • Removed the custom CUDA GPUMatrixVectorProductAdd kernel.
  • Replaced CSysMatrix<ScalarType>::GPUMatrixVectorProduct with cuSPARSE BSR SpMV for the CUDA path.
  • Added an explicit CUDA-side guard for nVar != nEqn.
  • Added cuSPARSE as a CUDA build dependency in Meson.
  • Kept the CPU matvec path unchanged.

Why the previous CUDA kernel is being replaced

The motivation here is not only simplification. The previous CUDA kernel had correctness risks that were reproducible at the matrix-vector-product level.

Problem 1: square-block routing bug

The old kernel derives the output component from a fixed threadNo-based blockRow, while the actual matrix entry being processed changes with the loop index:

threadNo = threadIdx.x % 32
blockRow = (threadNo / nVar) % nVar
index += activeThreads

But the true dense-block row of the current matrix entry is determined by the current index, not by the fixed thread id.

That means a single thread can process entries belonging to different local block rows, while still accumulating everything into one fixed output component. This creates an algorithmic routing mismatch.

A minimal synthetic square-block case demonstrates it directly:

nPointDomain = 1
nPoint       = 2
nVar = nEqn  = 5

row_ptr = [0, 2]
col_ind = [0, 1]

matrix[0]  = 1
matrix[30] = 10
all other matrix entries = 0

vec[:] = 1

Expected CPU reference output:

[1, 10, 0, 0, 0]

Observed outputs:

CPU reference   = [1, 10, 0, 0, 0]
old CUDA kernel = [11, 0, 0, 0, 0]
cuSPARSE BSR    = [1, 10, 0, 0, 0]

In this example, one CUDA lane processes both index = 0 and index = 30. Those two entries belong to different local block rows, but the old kernel routes both contributions to the same output component.

Problem 2: rectangular block layouts were accepted, but not semantically consistent with the CPU path

CSysMatrix has a general block interface with shape nVar x nEqn.

The CPU path uses the correct input-vector stride:

vec[col_j * nEqn]

However, the old CUDA kernel uses:

vec[d_col_ind[blockNo] * nVar + blockCol]

That is not semantically consistent with the CPU rectangular-block matvec definition when nVar != nEqn.

So although the old CUDA kernel would execute for rectangular blocks, that execution was not actually equivalent to the CPU rectangular-block matvec semantics. In other words, the rectangular CUDA path appeared to be accepted by the implementation, but its indexing was not consistent with the CPU rectangular-block definition.

A minimal synthetic rectangular case shows this clearly:

nPointDomain = 1
nPoint       = 2
nVar         = 3
nEqn         = 6

row_ptr = [0, 2]
col_ind = [0, 1]

matrix[0]  = 5
matrix[30] = 7

vec[0] = 1
vec[3] = 100
vec[6] = 1
all other vec entries = 0

Observed outputs:

CPU rectangular reference = [5, 0, 7]
old CUDA kernel           = [705, 0, 0]

This is not just a corner-case API detail. It shows that the previous CUDA kernel could silently accept a rectangular block layout and produce a wrong result without any explicit failure.

Why explicit rejection is better than silent execution

This PR intentionally supports only the CUDA square-block case:

nVar == nEqn

For nVar != nEqn, the new CUDA path now fails explicitly instead of silently running an incorrect algorithm.

So this PR does not reduce the set of rectangular CUDA cases that were correctly supported before, because the old kernel did not provide a semantically correct rectangular CUDA implementation in the first place.

The behavior change is:

before: rectangular CUDA input may run and silently compute the wrong answer
after:  rectangular CUDA input is rejected explicitly

That is a correctness improvement, not a compatibility regression.

Why cuSPARSE BSR is the right replacement here

For the CUDA square-block path, the existing storage is BSR-like in the sense relevant here:

  • row_ptr
  • col_ind
  • matrix as dense block payloads

The cuSPARSE path uses the matching BSR block direction for the existing block payload layout and provides a well-tested implementation of the required operation.

Given the observed correctness problems in both square and rectangular cases, using cuSPARSE for the valid square-block CUDA path is more robust than preserving a separate custom implementation for that path.

Validation

Build

  • Serial CUDA build completed locally.
  • MPI-enabled CUDA build could not be evaluated on my local Arch/GCC 16.1.1 environment because nvcc failed Meson's CUDA compiler sanity check before SU2 compilation. This happened before building the modified source files and was not a cuSPARSE link failure.

1. Synthetic square-block validation

Compared:

  • CPU reference BSR matvec
  • old custom CUDA kernel
  • new cuSPARSE BSR path

Result:

CPU reference   = [1, 10, 0, 0, 0]
old CUDA kernel = [11, 0, 0, 0, 0]
cuSPARSE BSR    = [1, 10, 0, 0, 0]

2. Synthetic rectangular-block validation

Compared:

  • CPU rectangular-block reference
  • old custom CUDA kernel
  • new CUDA-path behavior

Result:

CPU rectangular reference = [5, 0, 7]
old CUDA kernel           = [705, 0, 0]
new CUDA path             = explicit rejection via square-block guard

3. Real SU2 integration cases

Ran 6 existing runnable implicit Krylov cases with four-way comparison:

develop CPU
develop CUDA
new branch CPU
new branch CUDA

Tested cases:

periodic2d_sector
udf_lam_flatplate_s
udf_lam_flatplate_m
udf_lam_flatplate_l
udf_test_11_probes_s
udf_test_11_probes_m

Summary:

  • new CPU matched develop CPU in 6/6 cases.
  • new CUDA matched new CPU in 6/6 cases.
  • develop CUDA differed from CPU / new CUDA in 5/6 cases.
  • No GPUassert, cuSPARSEassert, illegal memory access, or segmentation fault was observed.
  • One case showed NaN flags in all four runs, so that behavior was not new-CUDA-specific.

For cases without standard residual columns, the final history.csv row was compared across all common numeric fields.

Scope

This PR changes only the CUDA matrix-vector product implementation for the square-block CSysMatrix path.

It does not change:

  • CPU matvec semantics
  • Krylov solver logic
  • preconditioner logic
  • CFD solver physics
  • matrix assembly semantics

Follow-up work

This PR focuses on correctness first.

The current implementation creates/destroys cuSPARSE descriptors and allocates/frees the SpMV workspace inside each matvec call. That can be optimized later by caching:

  • the cuSPARSE handle
  • BSR matrix descriptor
  • dense vector descriptors
  • SpMV workspace
  • optional cusparseSpMV_preprocess() state

Follow-up draft

I have also prepared a follow-up draft PR in my fork to show the next intended step after this correctness PR:

That draft is based on this PR branch, not directly on develop, so it shows only the incremental resource-lifecycle change: caching the cuSPARSE handle, BSR matrix descriptor, and SpMV workspace after the cuSPARSE BSR path introduced here.

Related Work

This replaces the earlier narrower draft PR #2813, which attempted to fix only the contribution-routing issue inside the existing custom CUDA kernel. After additional testing, the scope changed to replacing the square-block CUDA path with cuSPARSE BSR SpMV and explicitly rejecting unsupported rectangular CUDA blocks.

PR Checklist

  • I am submitting my contribution to the develop branch.
  • My contribution generates no new compiler warnings (try with --warnlevel=3 when using meson).
  • My contribution is commented and consistent with SU2 style (https://su2code.github.io/docs_v7/Style-Guide/).
  • I used the pre-commit hook to prevent dirty commits and used pre-commit run --all to format old commits.
  • I have added a test case that demonstrates my contribution, if necessary.
  • I have updated appropriate documentation (Tutorials, Docs Page, config_template.cpp), if necessary.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant