Replace custom CUDA CSysMatrix matvec with cuSPARSE BSR SpMV by LwhJesse · Pull Request #2816 · su2code/SU2

LwhJesse · 2026-05-17T08:23:28Z

Summary

This PR replaces the custom CUDA CSysMatrix block-sparse matrix-vector product kernel with cuSPARSE BSR SpMV for the CUDA square-block path.

This is not only a performance or maintenance refactor. It is primarily a correctness fix.

The previous custom CUDA kernel has two independent numerical correctness problems:

It silently produces incorrect results for some valid square-block inputs.
It also silently executes on rectangular block layouts (nVar != nEqn), but that path was never semantically correct and can produce wrong results.

Because of these issues, the previous CUDA kernel was not semantically safe for all inputs accepted by the interface. For the square-block CUDA path, the matrix storage is already BSR-like, so this PR replaces the custom kernel with cuSPARSE BSR SpMV instead of extending a path with known correctness mismatches.

What changed

Removed the custom CUDA GPUMatrixVectorProductAdd kernel.
Replaced CSysMatrix<ScalarType>::GPUMatrixVectorProduct with cuSPARSE BSR SpMV for the CUDA path.
Added an explicit CUDA-side guard for nVar != nEqn.
Added cuSPARSE as a CUDA build dependency in Meson.
Kept the CPU matvec path unchanged.

Why the previous CUDA kernel is being replaced

The motivation here is not only simplification. The previous CUDA kernel had correctness risks that were reproducible at the matrix-vector-product level.

Problem 1: square-block routing bug

The old kernel derives the output component from a fixed threadNo-based blockRow, while the actual matrix entry being processed changes with the loop index:

threadNo = threadIdx.x % 32
blockRow = (threadNo / nVar) % nVar
index += activeThreads

But the true dense-block row of the current matrix entry is determined by the current index, not by the fixed thread id.

That means a single thread can process entries belonging to different local block rows, while still accumulating everything into one fixed output component. This creates an algorithmic routing mismatch.

A minimal synthetic square-block case demonstrates it directly:

nPointDomain = 1
nPoint       = 2
nVar = nEqn  = 5

row_ptr = [0, 2]
col_ind = [0, 1]

matrix[0]  = 1
matrix[30] = 10
all other matrix entries = 0

vec[:] = 1

Expected CPU reference output:

[1, 10, 0, 0, 0]

Observed outputs:

CPU reference   = [1, 10, 0, 0, 0]
old CUDA kernel = [11, 0, 0, 0, 0]
cuSPARSE BSR    = [1, 10, 0, 0, 0]

In this example, one CUDA lane processes both index = 0 and index = 30. Those two entries belong to different local block rows, but the old kernel routes both contributions to the same output component.

Problem 2: rectangular block layouts were accepted, but not semantically consistent with the CPU path

CSysMatrix has a general block interface with shape nVar x nEqn.

The CPU path uses the correct input-vector stride:

vec[col_j * nEqn]

However, the old CUDA kernel uses:

vec[d_col_ind[blockNo] * nVar + blockCol]

That is not semantically consistent with the CPU rectangular-block matvec definition when nVar != nEqn.

So although the old CUDA kernel would execute for rectangular blocks, that execution was not actually equivalent to the CPU rectangular-block matvec semantics. In other words, the rectangular CUDA path appeared to be accepted by the implementation, but its indexing was not consistent with the CPU rectangular-block definition.

A minimal synthetic rectangular case shows this clearly:

nPointDomain = 1
nPoint       = 2
nVar         = 3
nEqn         = 6

row_ptr = [0, 2]
col_ind = [0, 1]

matrix[0]  = 5
matrix[30] = 7

vec[0] = 1
vec[3] = 100
vec[6] = 1
all other vec entries = 0

Observed outputs:

CPU rectangular reference = [5, 0, 7]
old CUDA kernel           = [705, 0, 0]

This is not just a corner-case API detail. It shows that the previous CUDA kernel could silently accept a rectangular block layout and produce a wrong result without any explicit failure.

Why explicit rejection is better than silent execution

This PR intentionally supports only the CUDA square-block case:

nVar == nEqn

For nVar != nEqn, the new CUDA path now fails explicitly instead of silently running an incorrect algorithm.

So this PR does not reduce the set of rectangular CUDA cases that were correctly supported before, because the old kernel did not provide a semantically correct rectangular CUDA implementation in the first place.

The behavior change is:

before: rectangular CUDA input may run and silently compute the wrong answer
after:  rectangular CUDA input is rejected explicitly

That is a correctness improvement, not a compatibility regression.

Why cuSPARSE BSR is the right replacement here

For the CUDA square-block path, the existing storage is BSR-like in the sense relevant here:

row_ptr
col_ind
matrix as dense block payloads

The cuSPARSE path uses the matching BSR block direction for the existing block payload layout and provides a well-tested implementation of the required operation.

Given the observed correctness problems in both square and rectangular cases, using cuSPARSE for the valid square-block CUDA path is more robust than preserving a separate custom implementation for that path.

Validation

Build

Serial CUDA build completed locally.
MPI-enabled CUDA build could not be evaluated on my local Arch/GCC 16.1.1 environment because nvcc failed Meson's CUDA compiler sanity check before SU2 compilation. This happened before building the modified source files and was not a cuSPARSE link failure.

1. Synthetic square-block validation

Compared:

CPU reference BSR matvec
old custom CUDA kernel
new cuSPARSE BSR path

Result:

CPU reference   = [1, 10, 0, 0, 0]
old CUDA kernel = [11, 0, 0, 0, 0]
cuSPARSE BSR    = [1, 10, 0, 0, 0]

2. Synthetic rectangular-block validation

Compared:

CPU rectangular-block reference
old custom CUDA kernel
new CUDA-path behavior

Result:

CPU rectangular reference = [5, 0, 7]
old CUDA kernel           = [705, 0, 0]
new CUDA path             = explicit rejection via square-block guard

3. Real SU2 integration cases

Ran 6 existing runnable implicit Krylov cases with four-way comparison:

develop CPU
develop CUDA
new branch CPU
new branch CUDA

Tested cases:

periodic2d_sector
udf_lam_flatplate_s
udf_lam_flatplate_m
udf_lam_flatplate_l
udf_test_11_probes_s
udf_test_11_probes_m

Summary:

new CPU matched develop CPU in 6/6 cases.
new CUDA matched new CPU in 6/6 cases.
develop CUDA differed from CPU / new CUDA in 5/6 cases.
No GPUassert, cuSPARSEassert, illegal memory access, or segmentation fault was observed.
One case showed NaN flags in all four runs, so that behavior was not new-CUDA-specific.

For cases without standard residual columns, the final history.csv row was compared across all common numeric fields.

Scope

This PR changes only the CUDA matrix-vector product implementation for the square-block CSysMatrix path.

It does not change:

CPU matvec semantics
Krylov solver logic
preconditioner logic
CFD solver physics
matrix assembly semantics

Follow-up work

This PR focuses on correctness first.

The current implementation creates/destroys cuSPARSE descriptors and allocates/frees the SpMV workspace inside each matvec call. That can be optimized later by caching:

the cuSPARSE handle
BSR matrix descriptor
dense vector descriptors
SpMV workspace
optional cusparseSpMV_preprocess() state

Follow-up draft

I have also prepared a follow-up draft PR in my fork to show the next intended step after this correctness PR:

Cache cuSPARSE SpMV resources after #2816 LwhJesse/SU2#1

That draft is based on this PR branch, not directly on develop, so it shows only the incremental resource-lifecycle change: caching the cuSPARSE handle, BSR matrix descriptor, and SpMV workspace after the cuSPARSE BSR path introduced here.

Related Work

This replaces the earlier narrower draft PR #2813, which attempted to fix only the contribution-routing issue inside the existing custom CUDA kernel. After additional testing, the scope changed to replacing the square-block CUDA path with cuSPARSE BSR SpMV and explicitly rejecting unsupported rectangular CUDA blocks.

PR Checklist

I am submitting my contribution to the develop branch.
My contribution generates no new compiler warnings (try with --warnlevel=3 when using meson).
My contribution is commented and consistent with SU2 style (https://su2code.github.io/docs_v7/Style-Guide/).
I used the pre-commit hook to prevent dirty commits and used pre-commit run --all to format old commits.
I have added a test case that demonstrates my contribution, if necessary.
I have updated appropriate documentation (Tutorials, Docs Page, config_template.cpp), if necessary.

Use cuSPARSE BSR SpMV for CUDA matrix-vector products

c4cbdc8

LwhJesse mentioned this pull request May 17, 2026

Fix CUDA block-sparse matrix-vector product contribution routing #2813

Closed

6 tasks

LwhJesse marked this pull request as ready for review May 17, 2026 10:15

This was referenced May 17, 2026

Address review comments from PR #2539 - LU_SGS Preconditioner #2751

Open

Cache cuSPARSE SpMV resources after #2816 LwhJesse/SU2#1

Draft

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Replace custom CUDA CSysMatrix matvec with cuSPARSE BSR SpMV#2816

Replace custom CUDA CSysMatrix matvec with cuSPARSE BSR SpMV#2816
LwhJesse wants to merge 1 commit into
su2code:developfrom
LwhJesse:replace-cuda-matvec-with-cusparse-bsr

LwhJesse commented May 17, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

LwhJesse commented May 17, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

What changed

Why the previous CUDA kernel is being replaced

Problem 1: square-block routing bug

Problem 2: rectangular block layouts were accepted, but not semantically consistent with the CPU path

Why explicit rejection is better than silent execution

Why cuSPARSE BSR is the right replacement here

Validation

Build

1. Synthetic square-block validation

2. Synthetic rectangular-block validation

3. Real SU2 integration cases

Scope

Follow-up work

Follow-up draft

Related Work

PR Checklist

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

LwhJesse commented May 17, 2026 •

edited

Loading