Cache-Aware and GPU-Accelerated Sparse Matrix-Vector Multiplication (SpMV)

End-to-end SpMV optimization spanning CPU (cache-tiled, AVX2 + software prefetching) and GPU (CUDA) implementations, benchmarked on Northeastern's Explorer cluster against synthetic matrices and four real-world matrices from the SuiteSparse Matrix Collection.

Headline result

The hand-written CUDA vector kernel beats NVIDIA's cuSPARSE library by 24-56 percent on regular real-world matrices and reaches 98-110 percent of A100 HBM2 peak bandwidth. The GPU is 8-273x faster than the best CPU implementation, depending on matrix size.

GPU bandwidth on SuiteSparse matrices (A100, peak 1555 GB/s)

(figure: GPU SpMV bandwidth)

Matrix       rows   nnz    vector kernel   cuSPARSE    vector vs cuSPARSE
cant         62K    4.0M   1531 GB/s       1172 GB/s   +30.6%
consph       83K    6.0M   1557 GB/s       1252 GB/s   +24.4%
pdb1HYS      36K    4.3M   1718 GB/s       1102 GB/s   +55.8%
webbase-1M   1.0M   3.1M    203 GB/s        847 GB/s   cuSPARSE wins

(figure: vector kernel speedup vs cuSPARSE)

On highly irregular web-graph matrices (max nnz/row >> avg), the kernel selector correctly skips ELL and cuSPARSE's adaptive scheduling wins.

CPU vs GPU (kernel-only) on synthetic matrices

(figure: CPU vs GPU speedup)

The GPU achieves an 8-273x speedup over the best CPU implementation (best of baseline / cache-tiled / AVX2 SIMD with software prefetching). The speedup scales with matrix size: tiny matrices (10K rows) see a modest 8-12x because GPU launch overhead dominates, while larger matrices (100K rows) reach 200-273x as parallelism takes over.

Implementations

CPU

  • Baseline CSR SpMV (naive row-major; see the sketch after this list)
  • Cache-tiled SpMV (blocked iteration, L2 reuse)
  • AVX2 SIMD with software prefetching (256-bit vectors, configurable pfd)
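
For orientation, the baseline CSR loop has roughly the shape below. This is an illustrative sketch, not the repository's spmv_baseline.c verbatim; the function name, field names (row_ptr, col_idx, vals), and the use of double precision are assumptions.

/* Baseline CSR SpMV sketch: y = A * x.
 * Assumed layout: row_ptr[rows+1], col_idx[nnz], vals[nnz]. */
void spmv_csr_baseline(int rows, const int *row_ptr, const int *col_idx,
                       const double *vals, const double *x, double *y)
{
    for (int i = 0; i < rows; i++) {
        double sum = 0.0;
        for (int k = row_ptr[i]; k < row_ptr[i + 1]; k++)
            sum += vals[k] * x[col_idx[k]];   /* irregular gather of x dominates */
        y[i] = sum;
    }
}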

GPU (CUDA)

  • CSR scalar: 1 thread per row (uncoalesced, divergent on irregular rows)
  • CSR vector: 1 warp per row, coalesced loads, warp-shuffle reduction (sketched after this list)
  • ELL: column-major padded layout, fully coalesced
  • cuSPARSE CSR_ALG2: NVIDIA library reference
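
The warp-per-row idea is sketched below: each warp owns one row, lanes stride across the row's nonzeros so loads of vals and col_idx coalesce, and the partial sums are combined with warp shuffles. This is a minimal sketch under assumed names and double precision, not the exact spmv_csr_vector.cu.

// Warp-per-row ("vector") CSR SpMV sketch.
__global__ void spmv_csr_vector(int rows, const int *row_ptr,
                                const int *col_idx, const double *vals,
                                const double *x, double *y)
{
    int warp_id = (blockIdx.x * blockDim.x + threadIdx.x) / 32;  // one warp per row
    int lane    = threadIdx.x & 31;
    if (warp_id >= rows) return;

    double sum = 0.0;
    // Lanes stride through the row, so vals/col_idx loads are coalesced.
    for (int k = row_ptr[warp_id] + lane; k < row_ptr[warp_id + 1]; k += 32)
        sum += vals[k] * x[col_idx[k]];

    // Warp-shuffle tree reduction; no shared memory needed.
    for (int offset = 16; offset > 0; offset >>= 1)
        sum += __shfl_down_sync(0xffffffff, sum, offset);

    if (lane == 0)
        y[warp_id] = sum;
}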

The harness automatically skips ELL when (a) it would consume more than half of free GPU memory, or (b) the row distribution is too irregular (max_nnz_per_row / avg > 50 and max > 100). This avoids OOM on large graph matrices and avoids running a strictly-worse algorithm.
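
In code, that check has roughly the shape below. The thresholds are the ones stated above; the function and variable names are illustrative, not the harness's actual code.

/* Sketch of the ELL skip heuristic (illustrative names and types). */
int should_skip_ell(size_t rows, size_t max_nnz_per_row,
                    double avg_nnz_per_row, size_t free_gpu_bytes)
{
    /* Padded ELL storage: rows * max_nnz_per_row values + column indices. */
    size_t ell_bytes = rows * max_nnz_per_row * (sizeof(double) + sizeof(int));
    int too_big       = ell_bytes > free_gpu_bytes / 2;
    int too_irregular = (max_nnz_per_row / avg_nnz_per_row > 50.0) &&
                        (max_nnz_per_row > 100);
    return too_big || too_irregular;
}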

Build

make cpu                       # gcc -O3 -march=native -ftree-vectorize
make gpu                       # nvcc -arch=sm_80 (A100, default)
make gpu CUDA_ARCH=sm_70       # V100
make gpu CUDA_ARCH=sm_90       # H100

GPU build needs CUDA 12.x. Tested with nvcc 12.3 on Explorer.

Run

Single kernel, single matrix

./src/spmv_gpu --kernel vector \
               --matrix data/suitesparse/cant.mtx \
               --validate --iters 20

Full sweep (CPU perf + cachegrind + GPU, all kernels, all matrices)

make full                                          # default A100
sbatch --gres=gpu:v100-sxm2:1 \
       --export=ALL,CUDA_ARCH=sm_70 \
       scripts/run_all.sh                          # V100 fallback

Skip phases via env vars: SKIP_CPU=1, SKIP_GPU=1, SKIP_CACHEGRIND=1.

Kernel-only CPU benchmark (for fair CPU vs GPU comparison)

bash scripts/cpu_bench.sh        # writes results/cpu/timings.csv

CLI flags (spmv_gpu)

flag                                       purpose
--kernel scalar|vector|ell|cusparse|all    which kernel(s) to run
--matrix <path>                            accepts custom .csr or SuiteSparse .mtx
--iters N                                  timed iterations after 1 warmup (default 10)
--csv <file>                               append timings as CSV
--validate                                 compare GPU output against the CPU CSR reference

Test matrices

data/*.csr: synthetic matrices generated by csr_generator.c at 10K, 50K, and 100K rows with target sparsity 1, 5, and 10 percent. Note: the percent label is a generator parameter, not the actual density (sparsity 5 produces ~2 nnz per row). These matrices are useful for correctness checks but are too small and too uniform to expose real GPU performance.

data/suitesparse/*.mtx: real-world matrices from the SuiteSparse Matrix Collection. See data/suitesparse/README.md for download instructions. This is where the kernel comparison is meaningful.

Analysis and plots

python3 scripts/roofline.py --csv results/gpu/timings.csv --peak 1555
python3 plot/generate_gpu_plots.py
python3 plot/generate_cpu_gpu_compare.py
python3 plot/generate_plots.py                # CPU perf plots

roofline.py prints per-(matrix, kernel) achieved bandwidth, percent of peak, and the best kernel per matrix. Override --peak for V100 (900) or H100 (3350).

Findings

CPU (Linux perf, synthetic matrices)

(figure: CPU execution time)

  • Baseline CSR is memory-latency dominated, highest CPI
  • Cache tiling cuts runtime ~18 percent via L2 reuse
  • AVX2 + software prefetching at pfd=16 yields ~30 percent improvement over baseline by overlapping memory latency with vectorized compute (see the sketch after this list)
  • Tile sizes 128 to 256 are most stable across matrix sizes
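
A plausible shape of the prefetching inner loop is sketched below. It is an assumption-laden reconstruction (double precision, AVX2 gather, FMA, no tiling, one row at a time), not the exact spmv_prefetch_simd.c; pfd is the prefetch distance in nonzeros.

#include <immintrin.h>

/* One CSR row with AVX2 gather, FMA, and software prefetch of x. */
void spmv_row_avx2(const int *col_idx, const double *vals, const double *x,
                   int start, int end, int pfd, double *out)
{
    __m256d acc = _mm256_setzero_pd();
    int k = start;
    for (; k + 4 <= end; k += 4) {
        if (k + pfd < end)   /* prefetch the x entry we will gather pfd nonzeros later */
            _mm_prefetch((const char *)&x[col_idx[k + pfd]], _MM_HINT_T0);
        __m256d v  = _mm256_loadu_pd(&vals[k]);
        __m128i ci = _mm_loadu_si128((const __m128i *)&col_idx[k]);
        __m256d xv = _mm256_i32gather_pd(x, ci, 8);   /* gather x[col_idx[k..k+3]] */
        acc = _mm256_fmadd_pd(v, xv, acc);
    }
    double tmp[4];
    _mm256_storeu_pd(tmp, acc);
    double sum = tmp[0] + tmp[1] + tmp[2] + tmp[3];
    for (; k < end; k++)                              /* scalar remainder */
        sum += vals[k] * x[col_idx[k]];
    *out = sum;
}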

GPU (A100, SuiteSparse matrices)

(figure: GPU percent of peak HBM bandwidth)

  • vector kernel beats cuSPARSE by 24 to 56 percent on regular matrices, hitting 98 to 110 percent of A100 HBM2 peak bandwidth
  • ELL is competitive on uniform-row matrices (consph: 1553 GB/s, within 0.3 percent of vector) but loses on matrices with varied row lengths (pdb1HYS: 667 GB/s, 39 percent of vector's bandwidth)
  • scalar reaches reasonable performance only when rows are uniform and few; on real-world matrices it loses 1.5 to 2x to vector due to warp divergence and uncoalesced loads
  • The ELL skip heuristic correctly fired on webbase-1M (max nnz/row >> avg), avoiding a multi-GB allocation that would OOM and a kernel that would be strictly worse than CSR
  • cuSPARSE wins on webbase-1M because its adaptive scheduling handles the heavy-tail row-length distribution that breaks all three custom kernels

Methodology note

The effective bandwidth computed by bandwidth_gbs() counts every x[col] read as uncached. Real DRAM bandwidth is lower because the dense vector x has locality and is largely L2-resident. Figures above 100 percent of peak therefore reflect cache reuse that the formula does not model, not measurement error.
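
The byte count behind that figure is roughly the following. This is a sketch consistent with the description above, not the exact bandwidth_gbs() implementation; the double/int element sizes are assumptions.

/* Effective-bandwidth sketch: charge every x gather as a full DRAM read. */
double bandwidth_gbs(size_t rows, size_t nnz, double ms)
{
    size_t bytes = nnz  * (sizeof(double) + sizeof(int))   /* vals + col_idx        */
                 + (rows + 1) * sizeof(int)                 /* row_ptr               */
                 + nnz  * sizeof(double)                    /* x gathers, as uncached */
                 + rows * sizeof(double);                    /* y writes              */
    return (double)bytes / (ms * 1e6);                       /* bytes / ms -> GB/s    */
}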

Project layout

cuda_SpMV/
  data/
    matrix_*.csr            # synthetic (gitignored, regen via csr_generator)
    suitesparse/            # SuiteSparse .mtx (gitignored, download separately)
  src/
    csr.{c,h}, csr_generator.c
    spmv_{baseline,tiled,prefetch_simd}.c
    gpu/
      spmv_csr_{scalar,vector}.cu
      spmv_ell.cu
      spmv_cusparse.cu
      csr_to_ell.cu
      mtx_loader.{h,cpp}
      bench_gpu.cu
      spmv_kernels.h
      gpu_utils.h
    Makefile
  scripts/
    run_perf.sh, run_cachegrind.sh, summarize_perf.sh   # CPU sweeps
    cpu_bench.sh                                         # kernel-only CPU CSV
    run_gpu.sh                                           # SLURM: GPU sweep
    run_nsight.sh                                        # SLURM: ncu profile
    run_all.sh                                           # SLURM: full pipeline
    roofline.py                                          # bandwidth analysis
  plot/
    generate_plots.py                                    # CPU perf plots
    generate_gpu_plots.py                                # GPU bandwidth, % peak
    generate_cpu_gpu_compare.py                          # CPU vs GPU speedup
    plot_*.png                                           # rendered figures
  results/
    perf/, cachegrind/                                   # CPU raw (gitignored)
    perf_summary.txt                                     # CPU summary
    cpu/timings.csv                                      # CPU kernel-only
    gpu/timings.csv, gpu/roofline.txt                    # GPU outputs
    nsight/                                              # ncu .ncu-rep files
  Makefile

Hardware

  • CPU experiments: Intel Xeon Gold (Explorer compute nodes)
  • GPU experiments: NVIDIA A100 80GB (Explorer node d1026), CUDA 12.3
  • All scripts also support V100 SXM2/PCIe (sm_70) and H100/H200 (sm_90) via CUDA_ARCH override
