Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
175 changes: 169 additions & 6 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -3,12 +3,175 @@ CUDA Stream Compaction

**University of Pennsylvania, CIS 565: GPU Programming and Architecture, Project 2**

* (TODO) YOUR NAME HERE
* (TODO) [LinkedIn](), [personal website](), [twitter](), etc.
* Tested on: (TODO) Windows 22, i7-2222 @ 2.22GHz 22GB, GTX 222 222MB (Moore 2222 Lab)
* Christina Qiu
* [LinkedIn](https://www.linkedin.com/in/christina-qiu-6094301b6/), [personal website](https://christinaqiu3.github.io/), [twitter](), etc.
* Tested on: Windows 11, Intel Core i7-13700H @ 2.40GHz, 16GB RAM, NVIDIA GeForce RTX 4060 Laptop GPU (Personal laptop)

### (TODO: Your README)
## Overview

This is an implementation of scan (prefix sum) and stream compaction algorithms on both the CPU and GPU using CUDA.

This project includes four parts:

### Part 1: CPU Scan & Stream Compaction
* Implements a basic exclusive prefix sum (scan) on the CPU.
* Two stream compaction implementations:
* Without scan: simple loop that filters non-zero values.
* With scan and scatter: mimics the parallel approach by mapping, scanning, and scattering.
* Used for correctness testing and performance comparison against GPU implementations.
* Runtime O(n)

### Part 2: Naive GPU Scan Algorithm
* Implements a naive parallel scan on the GPU using CUDA.
* Iteratively applies scan logic for each depth level (d) in multiple kernel launches.
* Not work-efficient and not in-place.
* Demonstrates basic GPU memory handling and parallel loop structure.
* Runtime O(log n) kernel launches, total work O(n log n)
* At each iteration, a full kernel with n threads is launched

### Part 3: Work-Efficient GPU Scan & Stream Compaction
* Implements the Blelloch (work-efficient) scan algorithm using the upsweep and downsweep phases.
* Handles non-power-of-two input sizes by padding to the next power of two.
* Adds GPU stream compaction using:
* Map step (0/1 flags for zero vs. non-zero),
* Scan on flags,
* Scatter to final output.
* Much faster and scalable compared to the naive implementation.
* Runtime O(log n) kernel launches, total work O(n)
* This is because each kernel does fewer threads of work as d increases.
* At step (d = 0), launched (threads = n/2), per thread (work = constant)
* At step (d = 1), launched (threads = n/4), per thread (work = constant)
* At step (d = logn-1), launched (threads = 1), per thread (work = constant)
* Thus the total work across all kernels: (n/2) + (n/4) + (n/8) + ... + 1 = O(n)

### Part 4: Using Thrust's Implementation
* Leverages Thrust, a high-performance parallel algorithms library built on CUDA.
* Implements scan using thrust::exclusive_scan.
* Simplifies GPU programming and enables performance benchmarking against custom implementations.

### Output of Scan & Stream Compaction Tests
(Array Size = 2^24)
```
****************
** SCAN TESTS **
****************
[ 7 32 41 34 34 45 20 39 29 27 7 38 38 ... 10 0 ]
==== cpu scan, power-of-two ====
elapsed time: 28.6697ms (std::chrono Measured)
[ 0 7 39 80 114 148 193 213 252 281 308 315 353 ... 410882306 410882316 ]
==== cpu scan, non-power-of-two ====
elapsed time: 30.8242ms (std::chrono Measured)
[ 0 7 39 80 114 148 193 213 252 281 308 315 353 ... 410882253 410882290 ]
passed
==== naive scan, power-of-two ====
elapsed time: 14.7507ms (CUDA Measured)
passed
==== naive scan, non-power-of-two ====
elapsed time: 14.6412ms (CUDA Measured)
passed
==== work-efficient scan, power-of-two ====
elapsed time: 6.67194ms (CUDA Measured)
passed
==== work-efficient scan, non-power-of-two ====
elapsed time: 6.49674ms (CUDA Measured)
passed
==== thrust scan, power-of-two ====
elapsed time: 1.34758ms (CUDA Measured)
passed
==== thrust scan, non-power-of-two ====
elapsed time: 1.26874ms (CUDA Measured)
passed

*****************************
** STREAM COMPACTION TESTS **
*****************************
[ 0 2 2 1 2 2 3 0 0 3 3 3 3 ... 1 0 ]
==== cpu compact without scan, power-of-two ====
elapsed time: 38.4214ms (std::chrono Measured)
[ 2 2 1 2 2 3 3 3 3 3 2 2 2 ... 2 1 ]
passed
==== cpu compact without scan, non-power-of-two ====
elapsed time: 38.1998ms (std::chrono Measured)
[ 2 2 1 2 2 3 3 3 3 3 2 2 2 ... 3 2 ]
passed
==== cpu compact with scan ====
elapsed time: 88.2723ms (std::chrono Measured)
[ 2 2 1 2 2 3 3 3 3 3 2 2 2 ... 2 1 ]
passed
==== work-efficient compact, power-of-two ====
elapsed time: 11.8516ms (CUDA Measured)
passed
==== work-efficient compact, non-power-of-two ====
elapsed time: 11.6095ms (CUDA Measured)
passed
```

## Runtime and Performance Analysis

(note: I tested Array Sizes up to 2^24 because 2^25 is where computation breaks down on my laptop)

1. Hypothesis: The GPU implementations of scan (Naive, Work-Efficient, and Thrust) will outperform the serial CPU scan as the array size increases. The Thrust library implementation will be the fastest on large data sizes due to its highly optimized CUDA backend. The naive GPU scan will be slower than the work-efficient implementation due to redundant work and synchronization overhead.

### Size of Array v. Runtime of Scan (Power of 2) Graph

blockSize = 128

![Data](img/graph1_t.png)
![Graph](img/graph1_v.png)

### Size of Array v. Runtime of Scan (Non Power of 2) Graph

blockSize = 128

![Data](img/graph2_t.png)
![Graph](img/graph2_v.png)

Conclusion:

CPU scan runtime increases exponentially as the array size increases, due to larger data sizes needing to access memory beyond the L1/L2 caches.

For smaller inputs, Naive GPU scan can actually outperform the more complex approaches because it uses fewer threads and has less overhead. However, as the input size grows, the naive approach becomes inefficient due to poor memory access patterns. Specifically, threads must access data locations increasingly farther apart, resulting in uncoalesced memory transactions that hurt bandwidth and performance. Additionally, many threads become idle in later stages, reducing hardware utilization.

Work Efficient GPU scan performs two kernels per iteration (upsweep and downsweep). Despite more frequent kernel invocations, this method excels for larger inputs because it minimizes idle threads and accesses memory in a way that favors coalescing. The upsweep and downsweep phases ensure that each element is processed only a logarithmic number of times with well-structured memory access, improving throughput and scaling much better than the naive approach. As a result, this approach ultimately surpasses both the naive GPU scan and the serial CPU scan in speed for large array sizes.

##

2. Hypothesis: The CPU implementation of stream compaction is expected to become significantly slower as the array size increases due to its sequential nature and growing cache/memory pressure. On the other hand, the work-efficient GPU implementation should demonstrate much better scalability, especially on larger arrays, due to parallel processing and improved memory access patterns.

### Size of Array v. Runtime of Stream Compaction (Power of 2) Graph

blockSize = 128

![Data](img/graph3_t.png)
![Graph](img/graph3_v.png)

### Size of Array v. Runtime of Stream Compaction (Non Power of 2) Graph

blockSize = 128

![Data](img/graph4_t.png)
![Graph](img/graph4_v.png)

Conclusion: As expected, the CPU implementation’s runtime increases rapidly with array size (particularly beyond 2^21). In contrast, the work-efficient GPU implementation shows much better scaling, with runtimes increasing much more slowly as array size grows. For smaller sizes (e.g. 2^18 - 2^20), the CPU is actually slightly faster, likely due to lower kernel launch overhead. But this quickly flips as data size increases. This shows why GPU acceleration is critical for real-time or high-volume applications involving stream compaction.

##

2. Hypothesis: The choice of block size in a GPU kernel greatly affects performance due to how it maps to the hardware's available threads and memory resources. Smaller block sizes may lead to underutilized GPU cores, while overly large blocks can cause increased register and shared memory pressure, reducing occupancy. We expect optimal performance at moderate block sizes, such as 128 or 256, which strike a balance between these factors.

### Blocksize v. Runtime of Scan (Power of 2) Graph

Array Size = 2^20

![Data](img/graph5_t.png)
![Graph](img/graph5_v.png)

### Blocksize v. Runtime of Scan (Non Power of 2) Graph

Array Size = 2^20

![Data](img/graph6_t.png)
![Graph](img/graph6_v.png)

Conclusion: The optimal block size for scan is mid-range (128). Runtimes degrade with both very small and very large blocks, due to either insufficient parallelism or limited warp scheduling capacity. Non-power-of-two inputs introduce negligible overhead, indicating that the scan logic (including padding and bounds checking) is both correct and efficient.

Include analysis, etc. (Remember, this is public, so don't put
anything here that you don't want to share with the world.)

Binary file added img/graph1_t.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Binary file added img/graph1_v.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Binary file added img/graph2_t.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Binary file added img/graph2_v.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Binary file added img/graph3_t.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Binary file added img/graph3_v.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Binary file added img/graph4_t.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Binary file added img/graph4_v.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Binary file added img/graph5_t.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Binary file added img/graph5_v.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Binary file added img/graph6_t.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Binary file added img/graph6_v.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
2 changes: 1 addition & 1 deletion src/main.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -13,7 +13,7 @@
#include <stream_compaction/thrust.h>
#include "testing_helpers.hpp"

const int SIZE = 1 << 8; // feel free to change the size of array
const int SIZE = 1 << 20; // feel free to change the size of array
const int NPOT = SIZE - 3; // Non-Power-Of-Two
int *a = new int[SIZE];
int *b = new int[SIZE];
Expand Down
10 changes: 10 additions & 0 deletions stream_compaction/common.cu
Original file line number Diff line number Diff line change
Expand Up @@ -24,6 +24,11 @@ namespace StreamCompaction {
*/
__global__ void kernMapToBoolean(int n, int *bools, const int *idata) {
// TODO
int i = threadIdx.x + blockIdx.x * blockDim.x;
if (i >= n) return;
if (i < n) {
bools[i] = (idata[i] != 0) ? 1 : 0;
}
}

/**
Expand All @@ -33,6 +38,11 @@ namespace StreamCompaction {
__global__ void kernScatter(int n, int *odata,
const int *idata, const int *bools, const int *indices) {
// TODO
int i = threadIdx.x + blockIdx.x * blockDim.x;
if (i >= n) return;
if (i < n && bools[i]) {
odata[indices[i]] = idata[i];
}
}

}
Expand Down
44 changes: 39 additions & 5 deletions stream_compaction/cpu.cu
Original file line number Diff line number Diff line change
Expand Up @@ -19,7 +19,10 @@ namespace StreamCompaction {
*/
void scan(int n, int *odata, const int *idata) {
timer().startCpuTimer();
// TODO
odata[0] = 0;
for (int i = 1; i < n; i++) {
odata[i] = idata[i - 1] + odata[i - 1];
}
timer().endCpuTimer();
}

Expand All @@ -30,9 +33,15 @@ namespace StreamCompaction {
*/
int compactWithoutScan(int n, int *odata, const int *idata) {
timer().startCpuTimer();
// TODO
int count = 0;
for (int i = 0; i < n; i++) {
if (idata[i] != 0) {
odata[count] = idata[i];
count++;
}
}
timer().endCpuTimer();
return -1;
return count;
}

/**
Expand All @@ -41,10 +50,35 @@ namespace StreamCompaction {
* @returns the number of elements remaining after compaction.
*/
int compactWithScan(int n, int *odata, const int *idata) {

int* map = new int[n];
int* scanArr = new int[n];
timer().startCpuTimer();
// TODO

// MAP
for (int i = 0; i < n; i++) {
map[i] = (idata[i] != 0) ? 1 : 0;
}

// SCAN
scanArr[0] = 0;
for (int i = 1; i < n; i++) {
scanArr[i] = scanArr[i - 1] + map[i - 1];
}

// SCATTER
int count = 0;
for (int i = 0; i < n; i++) {
if (map[i] != 0 && scanArr[i] < n) {
odata[scanArr[i]] = idata[i];
count++;
}
}
delete[] map;
delete[] scanArr;

timer().endCpuTimer();
return -1;
return count;
}
}
}
Loading