End-to-end SpMV optimization spanning CPU (cache-tiled, AVX2 + software prefetching) and GPU (CUDA) implementations, benchmarked on Northeastern's Explorer cluster against synthetic matrices and four real-world matrices from the SuiteSparse Matrix Collection.
The hand-written CUDA vector kernel beats NVIDIA's cuSPARSE library by 24-56 percent on regular real-world matrices and reaches 98-110 percent of A100 HBM2 peak (effective) bandwidth. The GPU is 8-273x faster than the best CPU implementation, depending on matrix size.
| Matrix | rows | nnz | vector kernel | cuSPARSE | vector vs cuSPARSE |
|---|---|---|---|---|---|
| cant | 62K | 4.0M | 1531 GB/s | 1172 GB/s | +30.6% |
| consph | 83K | 6.0M | 1557 GB/s | 1252 GB/s | +24.4% |
| pdb1HYS | 36K | 4.3M | 1718 GB/s | 1102 GB/s | +55.8% |
| webbase-1M | 1.0M | 3.1M | 203 GB/s | 847 GB/s | cuSPARSE wins |
On highly irregular web-graph matrices (max nnz/row >> avg), the kernel selector correctly skips ELL and cuSPARSE's adaptive scheduling wins.
GPU achieves 8-273x speedup over the best CPU implementation (best of baseline / cache-tiled / AVX2 SIMD with software prefetching). Speedup scales with matrix size: tiny matrices (10K rows) get modest 8-12x because GPU launch overhead dominates; larger matrices (100K rows) get 200-273x as parallelism takes over.
CPU implementations:

- Baseline CSR SpMV (naive row-major)
- Cache-tiled SpMV (blocked iteration, L2 reuse)
- AVX2 SIMD with software prefetching (256-bit vectors, configurable pfd)

GPU kernels:

- CSR scalar: 1 thread per row (uncoalesced, divergent on irregular rows)
- CSR vector: 1 warp per row, coalesced loads, warp-shuffle reduction (see the sketch after this list)
- ELL: column-major padded layout, fully coalesced
- cuSPARSE CSR_ALG2: NVIDIA library reference
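As a rough sketch of the warp-per-row CSR vector kernel listed above (the actual implementation lives in src/gpu/spmv_csr_vector.cu and may differ in block size, types, and edge handling; names here are illustrative): each warp owns one row, lanes stride across its nonzeros so the loads of vals and col_idx are coalesced, and a warp-shuffle reduction combines the partial sums.

```cuda
// Illustrative warp-per-row CSR SpMV sketch, not the exact project kernel.
__global__ void spmv_csr_vector_sketch(int nrows,
                                       const int    *__restrict__ row_ptr,
                                       const int    *__restrict__ col_idx,
                                       const double *__restrict__ vals,
                                       const double *__restrict__ x,
                                       double       *__restrict__ y)
{
    int row  = (blockIdx.x * blockDim.x + threadIdx.x) / 32;  // global warp id = row
    int lane = threadIdx.x & 31;
    if (row >= nrows) return;

    double sum = 0.0;
    // Lanes stride across the row's nonzeros -> coalesced loads of vals/col_idx.
    for (int j = row_ptr[row] + lane; j < row_ptr[row + 1]; j += 32)
        sum += vals[j] * x[col_idx[j]];

    // Warp-shuffle tree reduction of the 32 partial sums.
    for (int off = 16; off > 0; off >>= 1)
        sum += __shfl_down_sync(0xffffffffu, sum, off);

    if (lane == 0) y[row] = sum;   // lane 0 writes the row result
}
```

With 128-thread blocks the launch would be spmv_csr_vector_sketch<<<(nrows * 32 + 127) / 128, 128>>>(...), i.e. four rows per block.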
The harness automatically skips ELL when (a) it would consume more than half of free GPU memory, or (b) the row distribution is too irregular (max_nnz_per_row / avg > 50 and max > 100). This avoids OOM on large graph matrices and avoids running a strictly-worse algorithm.
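A host-side sketch of that check, using the thresholds quoted above (function and variable names are illustrative, not the harness's actual code; free GPU memory would come from cudaMemGetInfo):

```c
// Illustrative ELL skip heuristic: skip when the padded ELL arrays would not
// comfortably fit, or when one very long row would force excessive padding.
// Thresholds mirror the description above; the real harness may differ.
int should_skip_ell(long long nrows, long long nnz,
                    long long max_nnz_per_row, size_t free_gpu_bytes)
{
    // ELL pads every row to max_nnz_per_row: one double value + one int index per slot.
    size_t ell_bytes = (size_t)nrows * (size_t)max_nnz_per_row
                     * (sizeof(double) + sizeof(int));
    if (ell_bytes > free_gpu_bytes / 2)
        return 1;                                   // (a) >half of free GPU memory

    double avg = (double)nnz / (double)nrows;
    if ((double)max_nnz_per_row > 50.0 * avg && max_nnz_per_row > 100)
        return 1;                                   // (b) row distribution too irregular

    return 0;
}
```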
make cpu                    # gcc -O3 -march=native -ftree-vectorize
make gpu                    # nvcc -arch=sm_80 (A100, default)
make gpu CUDA_ARCH=sm_70    # V100
make gpu CUDA_ARCH=sm_90    # H100

The GPU build needs CUDA 12.x. Tested with nvcc 12.3 on Explorer.
./src/spmv_gpu --kernel vector \
  --matrix data/suitesparse/cant.mtx \
  --validate --iters 20

make full                        # default A100
sbatch --gres=gpu:v100-sxm2:1 \
  --export=ALL,CUDA_ARCH=sm_70 \
  scripts/run_all.sh             # V100 fallback

Skip phases via env vars: SKIP_CPU=1, SKIP_GPU=1, SKIP_CACHEGRIND=1.
bash scripts/cpu_bench.sh        # writes results/cpu/timings.csv

| flag | purpose |
|---|---|
| `--kernel scalar\|vector\|ell\|cusparse\|all` | which kernel(s) to run |
| `--matrix <path>` | accepts custom .csr or SuiteSparse .mtx |
| `--iters N` | timed iterations after 1 warmup (default 10) |
| `--csv <file>` | append timings as CSV |
| `--validate` | compare GPU output against CPU CSR reference |
data/*.csr: synthetic matrices generated by csr_generator.c at sizes of
10K, 50K, and 100K rows with target sparsity 1, 5, 10 percent. Note: the
percent label is a generator parameter, not actual density (sparsity 5
produces ~2 nnz per row). These matrices are useful for correctness checks
but too small and too uniform to expose real GPU performance.

data/suitesparse/*.mtx: real-world matrices from the SuiteSparse Matrix
Collection. See data/suitesparse/README.md for download instructions. This
is where the kernel comparison is meaningful.
python3 scripts/roofline.py --csv results/gpu/timings.csv --peak 1555
python3 plot/generate_gpu_plots.py
python3 plot/generate_cpu_gpu_compare.py
python3 plot/generate_plots.py              # CPU perf plots

roofline.py prints per-(matrix, kernel) achieved bandwidth, percent of
peak, and the best kernel per matrix. Override --peak for V100 (900) or
H100 (3350).
- Baseline CSR is memory-latency dominated, highest CPI
- Cache tiling cuts runtime ~18 percent via L2 reuse
- AVX2 + software prefetching at pfd=16 yields ~30 percent improvement over baseline by overlapping memory latency with vectorized compute (see the sketch after this list)
- Tile sizes 128 to 256 are most stable across matrix sizes
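A minimal sketch of that AVX2 + prefetch inner loop for one CSR row (assuming double values, 32-bit column indices, and FMA support; the real code in src/spmv_prefetch_simd.c may organize this differently):

```c
#include <immintrin.h>

// Illustrative AVX2 row kernel with software prefetching: the irregular x[col]
// gathers are prefetched pfd nonzeros ahead so their latency overlaps with the
// vectorized multiply-accumulate. Not the exact project code.
static double spmv_row_avx2(const double *vals, const int *cols,
                            int start, int end, const double *x, int pfd)
{
    __m256d acc = _mm256_setzero_pd();
    int j = start;
    for (; j + 4 <= end; j += 4) {
        if (j + pfd < end)                              // prefetch a future x operand
            _mm_prefetch((const char *)&x[cols[j + pfd]], _MM_HINT_T0);

        __m256d v  = _mm256_loadu_pd(&vals[j]);         // 4 matrix values
        __m128i c  = _mm_loadu_si128((const __m128i *)&cols[j]);
        __m256d xv = _mm256_i32gather_pd(x, c, 8);      // 4 gathered x[col] values
        acc = _mm256_fmadd_pd(v, xv, acc);
    }
    double t[4];
    _mm256_storeu_pd(t, acc);
    double sum = t[0] + t[1] + t[2] + t[3];
    for (; j < end; ++j)                                // scalar tail
        sum += vals[j] * x[cols[j]];
    return sum;
}
```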
- `vector` kernel beats `cuSPARSE` by 24 to 56 percent on regular matrices, hitting 98 to 110 percent of A100 HBM2 peak bandwidth
- `ELL` is competitive on uniform-row matrices (consph: 1553 GB/s, within 0.3 percent of vector) but loses on matrices with varied row lengths (pdb1HYS: 667 GB/s, 39 percent of vector's bandwidth)
- `scalar` reaches reasonable performance only when rows are uniform and few; on real-world matrices it loses 1.5 to 2x to vector due to warp divergence and uncoalesced loads
- The ELL skip heuristic correctly fired on webbase-1M (max nnz/row >> avg), avoiding a multi-GB allocation that would OOM and a kernel that would be strictly worse than CSR
- `cuSPARSE` wins on webbase-1M because its adaptive scheduling handles the heavy-tail row-length distribution that breaks all three custom kernels
Effective bandwidth in bandwidth_gbs() counts every x[col] read as
uncached. Real DRAM bandwidth is lower because the dense vector x has
locality and is largely L2-resident. Reporting >100 percent of peak
indicates cache reuse in the formula, not measurement error.
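For concreteness, a plausible version of that counting (assuming double values and 32-bit indices; the project's bandwidth_gbs() may differ in minor details):

```c
// Illustrative effective-bandwidth formula: every x[col] read is charged as a
// DRAM access, so matrices with good cache reuse of x can report >100% of peak.
double bandwidth_gbs(long long nnz, long long nrows, double ms)
{
    double bytes = (double)nnz * (sizeof(double) + sizeof(int))  // vals + col_idx
                 + (double)nnz * sizeof(double)                  // x[col], counted as uncached
                 + (double)(nrows + 1) * sizeof(int)             // row_ptr
                 + (double)nrows * sizeof(double);               // y writes
    return bytes / (ms * 1e6);   // bytes / (ms/1000 s) / 1e9 B/GB = GB/s
}
```

Percent of peak, as reported by roofline.py, then follows as 100 * achieved / --peak (1555 GB/s for the A100).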
cuda_SpMV/
data/
matrix_*.csr # synthetic (gitignored, regen via csr_generator)
suitesparse/ # SuiteSparse .mtx (gitignored, download separately)
src/
csr.{c,h}, csr_generator.c
spmv_{baseline,tiled,prefetch_simd}.c
gpu/
spmv_csr_{scalar,vector}.cu
spmv_ell.cu
spmv_cusparse.cu
csr_to_ell.cu
mtx_loader.{h,cpp}
bench_gpu.cu
spmv_kernels.h
gpu_utils.h
Makefile
scripts/
run_perf.sh, run_cachegrind.sh, summarize_perf.sh # CPU sweeps
cpu_bench.sh # kernel-only CPU CSV
run_gpu.sh # SLURM: GPU sweep
run_nsight.sh # SLURM: ncu profile
run_all.sh # SLURM: full pipeline
roofline.py # bandwidth analysis
plot/
generate_plots.py # CPU perf plots
generate_gpu_plots.py # GPU bandwidth, % peak
generate_cpu_gpu_compare.py # CPU vs GPU speedup
plot_*.png # rendered figures
results/
perf/, cachegrind/ # CPU raw (gitignored)
perf_summary.txt # CPU summary
cpu/timings.csv # CPU kernel-only
gpu/timings.csv, gpu/roofline.txt # GPU outputs
nsight/ # ncu .ncu-rep files
Makefile
- CPU experiments: Intel Xeon Gold (Explorer compute nodes)
- GPU experiments: NVIDIA A100 80GB (Explorer node d1026), CUDA 12.3
- All scripts also support V100 SXM2/PCIe (sm_70) and H100/H200 (sm_90) via CUDA_ARCH override