End-to-end SpMV optimization spanning CPU (cache-tiled, AVX2 + software prefetching) and GPU (CUDA) implementations, benchmarked on Northeastern's Explorer cluster against synthetic matrices and four real-world matrices from the SuiteSparse Matrix Collection.
The hand-written CUDA vector kernel beats NVIDIA's cuSPARSE library by 24-56 percent on regular real-world matrices and reaches 98-110 percent of A100 HBM2 peak (effective) bandwidth. The GPU is 8-273x faster than the best CPU implementation, depending on matrix size.
| Matrix | rows | nnz | vector kernel | cuSPARSE | vector vs cuSPARSE |
|---|---|---|---|---|---|
| cant | 62K | 4.0M | 1531 GB/s | 1172 GB/s | +30.6% |
| consph | 83K | 6.0M | 1557 GB/s | 1252 GB/s | +24.4% |
| pdb1HYS | 36K | 4.3M | 1718 GB/s | 1102 GB/s | +55.8% |
| webbase-1M | 1.0M | 3.1M | 203 GB/s | 847 GB/s | cuSPARSE wins |
On highly irregular web-graph matrices (max nnz/row >> avg), the kernel selector correctly skips ELL and cuSPARSE's adaptive scheduling wins.
GPU achieves 8-273x speedup over the best CPU implementation (best of baseline / cache-tiled / AVX2 SIMD with software prefetching). Speedup scales with matrix size: tiny matrices (10K rows) get modest 8-12x because GPU launch overhead dominates; larger matrices (100K rows) get 200-273x as parallelism takes over.
CPU implementations:

- Baseline CSR SpMV (naive row-major)
- Cache-tiled SpMV (blocked iteration, L2 reuse)
- AVX2 SIMD with software prefetching (256-bit vectors, configurable pfd)

GPU kernels:

- CSR scalar: 1 thread per row (uncoalesced, divergent on irregular rows)
- CSR vector: 1 warp per row, coalesced loads, warp-shuffle reduction (see the sketch after this list)
- ELL: column-major padded layout, fully coalesced
- cuSPARSE CSR_ALG2: NVIDIA library reference
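As a rough sketch of the warp-per-row CSR vector kernel listed above (the actual implementation lives in src/gpu/spmv_csr_vector.cu and may differ in block size, types, and edge handling; names here are illustrative): each warp owns one row, lanes stride across its nonzeros so the loads of vals and col_idx are coalesced, and a warp-shuffle reduction combines the partial sums.

```cuda
// Illustrative warp-per-row CSR SpMV sketch, not the exact project kernel.
__global__ void spmv_csr_vector_sketch(int nrows,
                                       const int    *__restrict__ row_ptr,
                                       const int    *__restrict__ col_idx,
                                       const double *__restrict__ vals,
                                       const double *__restrict__ x,
                                       double       *__restrict__ y)
{
    int row  = (blockIdx.x * blockDim.x + threadIdx.x) / 32;  // global warp id = row
    int lane = threadIdx.x & 31;
    if (row >= nrows) return;

    double sum = 0.0;
    // Lanes stride across the row's nonzeros -> coalesced loads of vals/col_idx.
    for (int j = row_ptr[row] + lane; j < row_ptr[row + 1]; j += 32)
        sum += vals[j] * x[col_idx[j]];

    // Warp-shuffle tree reduction of the 32 partial sums.
    for (int off = 16; off > 0; off >>= 1)
        sum += __shfl_down_sync(0xffffffffu, sum, off);

    if (lane == 0) y[row] = sum;   // lane 0 writes the row result
}
```

With 128-thread blocks the launch would be spmv_csr_vector_sketch<<<(nrows * 32 + 127) / 128, 128>>>(...), i.e. four rows per block.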
The harness automatically skips ELL when (a) it would consume more than half of free GPU memory, or (b) the row distribution is too irregular (max_nnz_per_row / avg > 50 and max > 100). This avoids OOM on large graph matrices and avoids running a strictly-worse algorithm.
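A host-side sketch of that check, using the thresholds quoted above (function and variable names are illustrative, not the harness's actual code; free GPU memory would come from cudaMemGetInfo):

```c
// Illustrative ELL skip heuristic: skip when the padded ELL arrays would not
// comfortably fit, or when one very long row would force excessive padding.
// Thresholds mirror the description above; the real harness may differ.
int should_skip_ell(long long nrows, long long nnz,
                    long long max_nnz_per_row, size_t free_gpu_bytes)
{
    // ELL pads every row to max_nnz_per_row: one double value + one int index per slot.
    size_t ell_bytes = (size_t)nrows * (size_t)max_nnz_per_row
                     * (sizeof(double) + sizeof(int));
    if (ell_bytes > free_gpu_bytes / 2)
        return 1;                                   // (a) >half of free GPU memory

    double avg = (double)nnz / (double)nrows;
    if ((double)max_nnz_per_row > 50.0 * avg && max_nnz_per_row > 100)
        return 1;                                   // (b) row distribution too irregular

    return 0;
}
```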
make cpu                    # gcc -O3 -march=native -ftree-vectorize
make gpu                    # nvcc -arch=sm_80 (A100, default)
make gpu CUDA_ARCH=sm_70    # V100
make gpu CUDA_ARCH=sm_90    # H100

The GPU build needs CUDA 12.x. Tested with nvcc 12.3 on Explorer.
./src/spmv_gpu --kernel vector \
  --matrix data/suitesparse/cant.mtx \
  --validate --iters 20

make full                        # default A100
sbatch --gres=gpu:v100-sxm2:1 \
  --export=ALL,CUDA_ARCH=sm_70 \
  scripts/run_all.sh             # V100 fallback

Skip phases via env vars: SKIP_CPU=1, SKIP_GPU=1, SKIP_CACHEGRIND=1.
bash scripts/cpu_bench.sh        # writes results/cpu/timings.csv

| flag | purpose |
|---|---|
| `--kernel scalar\|vector\|ell\|cusparse\|all` | which kernel(s) to run |
| `--matrix <path>` | accepts custom .csr or SuiteSparse .mtx |
| `--iters N` | timed iterations after 1 warmup (default 10) |
| `--csv <file>` | append timings as CSV |
| `--validate` | compare GPU output against CPU CSR reference |
data/*.csr: synthetic matrices generated by csr_generator.c at sizes of
10K, 50K, and 100K rows with target sparsity 1, 5, 10 percent. Note: the
percent label is a generator parameter, not actual density (sparsity 5
produces ~2 nnz per row). These matrices are useful for correctness checks
but too small and too uniform to expose real GPU performance.

data/suitesparse/*.mtx: real-world matrices from the SuiteSparse Matrix
Collection. See data/suitesparse/README.md for download instructions. This
is where the kernel comparison is meaningful.
python3 scripts/roofline.py --csv results/gpu/timings.csv --peak 1555
python3 plot/generate_gpu_plots.py
python3 plot/generate_cpu_gpu_compare.py
python3 plot/generate_plots.py              # CPU perf plots

roofline.py prints per-(matrix, kernel) achieved bandwidth, percent of
peak, and the best kernel per matrix. Override --peak for V100 (900) or
H100 (3350).
- Baseline CSR is memory-latency dominated, highest CPI
- Cache tiling cuts runtime ~18 percent via L2 reuse
- AVX2 + software prefetching at pfd=16 yields ~30 percent improvement over baseline by overlapping memory latency with vectorized compute (see the sketch after this list)
- Tile sizes 128 to 256 are most stable across matrix sizes
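A minimal sketch of that AVX2 + prefetch inner loop for one CSR row (assuming double values, 32-bit column indices, and FMA support; the real code in src/spmv_prefetch_simd.c may organize this differently):

```c
#include <immintrin.h>

// Illustrative AVX2 row kernel with software prefetching: the irregular x[col]
// gathers are prefetched pfd nonzeros ahead so their latency overlaps with the
// vectorized multiply-accumulate. Not the exact project code.
static double spmv_row_avx2(const double *vals, const int *cols,
                            int start, int end, const double *x, int pfd)
{
    __m256d acc = _mm256_setzero_pd();
    int j = start;
    for (; j + 4 <= end; j += 4) {
        if (j + pfd < end)                              // prefetch a future x operand
            _mm_prefetch((const char *)&x[cols[j + pfd]], _MM_HINT_T0);

        __m256d v  = _mm256_loadu_pd(&vals[j]);         // 4 matrix values
        __m128i c  = _mm_loadu_si128((const __m128i *)&cols[j]);
        __m256d xv = _mm256_i32gather_pd(x, c, 8);      // 4 gathered x[col] values
        acc = _mm256_fmadd_pd(v, xv, acc);
    }
    double t[4];
    _mm256_storeu_pd(t, acc);
    double sum = t[0] + t[1] + t[2] + t[3];
    for (; j < end; ++j)                                // scalar tail
        sum += vals[j] * x[cols[j]];
    return sum;
}
```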
- `vector` kernel beats `cuSPARSE` by 24 to 56 percent on regular matrices, hitting 98 to 110 percent of A100 HBM2 peak bandwidth
- `ELL` is competitive on uniform-row matrices (consph: 1553 GB/s, within 0.3 percent of vector) but loses on matrices with varied row lengths (pdb1HYS: 667 GB/s, 39 percent of vector's bandwidth)
- `scalar` reaches reasonable performance only when rows are uniform and few; on real-world matrices it loses 1.5 to 2x to vector due to warp divergence and uncoalesced loads
- The ELL skip heuristic correctly fired on webbase-1M (max nnz/row >> avg), avoiding a multi-GB allocation that would OOM and a kernel that would be strictly worse than CSR
- `cuSPARSE` wins on webbase-1M because its adaptive scheduling handles the heavy-tail row-length distribution that breaks all three custom kernels
Effective bandwidth in bandwidth_gbs() counts every x[col] read as
uncached. Real DRAM bandwidth is lower because the dense vector x has
locality and is largely L2-resident. Reporting >100 percent of peak
indicates cache reuse in the formula, not measurement error.
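For concreteness, a plausible version of that counting (assuming double values and 32-bit indices; the project's bandwidth_gbs() may differ in minor details):

```c
// Illustrative effective-bandwidth formula: every x[col] read is charged as a
// DRAM access, so matrices with good cache reuse of x can report >100% of peak.
double bandwidth_gbs(long long nnz, long long nrows, double ms)
{
    double bytes = (double)nnz * (sizeof(double) + sizeof(int))  // vals + col_idx
                 + (double)nnz * sizeof(double)                  // x[col], counted as uncached
                 + (double)(nrows + 1) * sizeof(int)             // row_ptr
                 + (double)nrows * sizeof(double);               // y writes
    return bytes / (ms * 1e6);   // bytes / (ms/1000 s) / 1e9 B/GB = GB/s
}
```

Percent of peak, as reported by roofline.py, then follows as 100 * achieved / --peak (1555 GB/s for the A100).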
cuda_SpMV/
data/
matrix_*.csr # synthetic (gitignored, regen via csr_generator)
suitesparse/ # SuiteSparse .mtx (gitignored, download separately)
src/
csr.{c,h}, csr_generator.c
spmv_{baseline,tiled,prefetch_simd}.c
gpu/
spmv_csr_{scalar,vector}.cu
spmv_ell.cu
spmv_cusparse.cu
csr_to_ell.cu
mtx_loader.{h,cpp}
bench_gpu.cu
spmv_kernels.h
gpu_utils.h
Makefile
scripts/
run_perf.sh, run_cachegrind.sh, summarize_perf.sh # CPU sweeps
cpu_bench.sh # kernel-only CPU CSV
run_gpu.sh # SLURM: GPU sweep
run_nsight.sh # SLURM: ncu profile
run_all.sh # SLURM: full pipeline
roofline.py # bandwidth analysis
plot/
generate_plots.py # CPU perf plots
generate_gpu_plots.py # GPU bandwidth, % peak
generate_cpu_gpu_compare.py # CPU vs GPU speedup
plot_*.png # rendered figures
results/
perf/, cachegrind/ # CPU raw (gitignored)
perf_summary.txt # CPU summary
cpu/timings.csv # CPU kernel-only
gpu/timings.csv, gpu/roofline.txt # GPU outputs
nsight/ # ncu .ncu-rep files
Makefile
- CPU experiments: Intel Xeon Gold (Explorer compute nodes)
- GPU experiments: NVIDIA A100 80GB (Explorer node d1026), CUDA 12.3
- All scripts also support V100 SXM2/PCIe (sm_70) and H100/H200 (sm_90) via CUDA_ARCH override