Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Binary file added ArraySize.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
138 changes: 132 additions & 6 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -3,12 +3,138 @@ CUDA Stream Compaction

**University of Pennsylvania, CIS 565: GPU Programming and Architecture, Project 2**

* (TODO) YOUR NAME HERE
* (TODO) [LinkedIn](), [personal website](), [twitter](), etc.
* Tested on: (TODO) Windows 22, i7-2222 @ 2.22GHz 22GB, GTX 222 222MB (Moore 2222 Lab)
* Sirui Zhu
* [LinkedIn](https://www.linkedin.com/in/sirui-zhu-28a24a260/)
* Tested on: Windows 11, i7-13620H, RTX 4060 (Personal)

### (TODO: Your README)
# Project Description

Include analysis, etc. (Remember, this is public, so don't put
anything here that you don't want to share with the world.)
This project implements different approaches to **exclusive prefix-sum (scan)** and **stream compaction** using CUDA.
The goal is to explore the efficiency of various parallel algorithms compared to a sequential CPU baseline,and each method is tested on both power-of-two and non-power-of-two array sizes.

## Features

### CPU Scan
A simple sequential implementation of exclusive prefix-sum used for correctness validation and as a baseline for performance.
- Runs in **O(n)** work and **O(n)** step complexity.
- Serves as the “ground truth” against which GPU implementations are compared.

### Naive GPU Scan
The Naive scan is a straightforward parallel implementation that computes prefix sums using multiple passes with increasing strides.
- At each iteration d, threads compute partial sums at increasing step (2^(d-1)).
- Uses **two device buffers** (devA, devB) that swap after each iteration to hold intermediate results.
- Using cudaDeviceSynchronize() to ensure correctness across kernel launches.
- After all iterations, results are copied back to the host, with the first element set to 0 to ensure exclusivity.
- Work complexity is **O(n log n)**, since all elements are rewritten at every depth.

### Work-Efficient GPU Scan
A more optimized implementation of scan using the **Blelloch parallel scan algorithm**.
- Pads the input to the next power-of-two size for efficient binary tree traversal.
- **Up-sweep** builds partial sums across the tree until the root contains the total sum.
- Sets the root to 0, then performs the **down-sweep** to distribute prefix sums back down the tree.
- Runs in **O(n)** work and **O(log n)** depth, making it much more efficient than the naive scan.
- Handles both **power-of-two and non-power-of-two** input sizes correctly.

### Stream Compaction
Builds on the work-efficient scan to remove unwanted elements (e.g., zeros) from an array.
- Step 1: **Map to boolean** → transforms input into a 0/1 array indicating valid elements.
- Step 2: **Scan** → exclusive prefix-sum over the boolean array produces write indices for valid elements.
- Step 3: **Scatter** → writes only the valid elements to the output buffer using the computed indices.
- Returns the count of remaining elements by combining the last values of the boolean and index arrays.
- Efficiently compresses arrays in parallel with correct ordering preserved.

### Thrust Scan
Uses the high-level Thrust library to run thrust::exclusive_scan on the GPU.
- Provides a highly optimized, production-ready implementation.
- Serves as a performance and correctness comparison against the custom naive and work-efficient implementations.

## Performance Analysis

### Block Size (with Array Size = 1 << 23)

| Implementation | Block Size = 128 | Block Size = 64 | Block Size = 32 |
|------------------------|------------------|-----------------|-----------------|
| **Naive Scan (Power of 2)** | 7.71331 ms | 8.90992 ms | 13.3062 ms |
| **Naive Scan (Non-Power of 2)**| 7.60579 ms | 7.78397 ms | 13.5380 ms |
| **Work-Efficient Scan (Power of 2)** | 2.99248 ms | 3.06944 ms | 4.31466 ms |
| **Work-Efficient Scan (Non-Power of 2)** | 2.02752 ms | 2.16781 ms | 2.08486 ms |

- **Naive Scan**: Block size 128 achieved the lowest runtime for both power-of-two (7.71 ms) and non-power-of-two (7.60 ms) inputs.
- **Work-Efficient Scan**: Block size 128 also performed best, with 2.99 ms (power-of-two) and 2.03 ms (non-power-of-two).
- Smaller block sizes (64, 32) resulted in noticeably slower runtimes.
- Therefore, the optimized block size for this GPU is **128 threads** per block.

### Array Size
![Scan Performance vs Array Size](ArraySize.png)

When comparing performance across increasing array sizes (2^16 to 2^23), each scan implementation demonstrates distinct scaling behavior and bottlenecks:

#### CPU Scan
The CPU version is the simplest one: it just loops through the array one element at a time. This means it does **O(n)** work but has no parallelism. As the array size grows, the runtime grows almost perfectly linearly, reaching ~15 ms at 2^23. The main bottleneck here is computation since everything happens sequentially. This works fine for small arrays, but it quickly falls behind compared to the GPU versions.

#### Naive GPU Scan
The Naive GPU scan improves over the CPU by letting many threads work in parallel, but the algorithm itself is inefficient. At each pass, every element is rewritten, and this happens for **log₂(n)** passes. This adds up to **O(n log n)** total work. Two buffers are swapped back and forth every pass, and synchronization is needed between iterations. The result is lots of redundant computation and heavy global memory traffic. It can beat the CPU for smaller arrays, but as the size grows, the wasted work makes it much slower than more optimized scans.

#### Work-Efficient GPU Scan
The Work-Efficient scan improves on the Naive version by avoiding a lot of wasted work. It only does about O(n) total operations, which makes it scale much better for large arrays. In practice, it runs faster than the Naive scan once the input gets big enough. The main drawback is that its memory access pattern is not always ideal — threads sometimes read and write from scattered locations instead of lined-up ones. Since the actual math per thread is very small, the performance ends up limited by memory bandwidth, rather than by computation. This makes it efficient, but still not as fast as Thrust.

#### Thrust Scan
Thrust is by far the fastest method. Even with very large arrays like 2^23, it finishes in under 1 ms. It’s a library function that is highly optimized for the GPU. It seems to make very efficient use of memory and parallelism, so the runtime ends up being limited only by how fast data can be moved in and out of GPU memory. The math itself is so quick that memory bandwidth is the main bottleneck.

## Test Output (blocksize = 128, arraysize = 1 << 23)

```
****************
** SCAN TESTS **
****************
[ 43 33 42 12 32 37 32 13 10 20 37 6 17 ... 2 0 ]
==== cpu scan, power-of-two ====
elapsed time: 13.384ms (std::chrono Measured)
[ 0 43 76 118 130 162 199 231 244 254 274 311 317 ... 205512735 205512737 ]
==== cpu scan, non-power-of-two ====
elapsed time: 14.9413ms (std::chrono Measured)
[ 0 43 76 118 130 162 199 231 244 254 274 311 317 ... 205512673 205512693 ]
passed
==== naive scan, power-of-two ====
elapsed time: 8.56035ms (CUDA Measured)
passed
==== naive scan, non-power-of-two ====
elapsed time: 8.88022ms (CUDA Measured)
passed
==== work-efficient scan, power-of-two ====
elapsed time: 2.5047ms (CUDA Measured)
passed
==== work-efficient scan, non-power-of-two ====
elapsed time: 2.1801ms (CUDA Measured)
passed
==== thrust scan, power-of-two ====
elapsed time: 0.925696ms (CUDA Measured)
passed
==== thrust scan, non-power-of-two ====
elapsed time: 1.0856ms (CUDA Measured)
passed

*****************************
** STREAM COMPACTION TESTS **
*****************************
[ 1 3 2 2 2 3 2 1 2 2 1 2 1 ... 0 0 ]
==== cpu compact without scan, power-of-two ====
elapsed time: 21.1336ms (std::chrono Measured)
[ 1 3 2 2 2 3 2 1 2 2 1 2 1 ... 1 1 ]
passed
==== cpu compact without scan, non-power-of-two ====
elapsed time: 20.2159ms (std::chrono Measured)
[ 1 3 2 2 2 3 2 1 2 2 1 2 1 ... 2 1 ]
passed
==== cpu compact with scan ====
elapsed time: 50.2771ms (std::chrono Measured)
[ 1 3 2 2 2 3 2 1 2 2 1 2 1 ... 1 1 ]
passed
==== work-efficient compact, power-of-two ====
elapsed time: 11.0327ms (CUDA Measured)
passed
==== work-efficient compact, non-power-of-two ====
elapsed time: 5.73254ms (CUDA Measured)
passed

```
6 changes: 3 additions & 3 deletions src/main.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -13,7 +13,7 @@
#include <stream_compaction/thrust.h>
#include "testing_helpers.hpp"

const int SIZE = 1 << 8; // feel free to change the size of array
const int SIZE = 1 << 23; // feel free to change the size of array
const int NPOT = SIZE - 3; // Non-Power-Of-Two
int *a = new int[SIZE];
int *b = new int[SIZE];
Expand Down Expand Up @@ -69,14 +69,14 @@ int main(int argc, char* argv[]) {

zeroArray(SIZE, c);
printDesc("work-efficient scan, power-of-two");
StreamCompaction::Efficient::scan(SIZE, c, a);
StreamCompaction::Efficient::scan(SIZE, c, a, true);
printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
//printArray(SIZE, c, true);
printCmpResult(SIZE, b, c);

zeroArray(SIZE, c);
printDesc("work-efficient scan, non-power-of-two");
StreamCompaction::Efficient::scan(NPOT, c, a);
StreamCompaction::Efficient::scan(NPOT, c, a, true);
printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
//printArray(NPOT, c, true);
printCmpResult(NPOT, b, c);
Expand Down
5 changes: 5 additions & 0 deletions stream_compaction/common.cu
Original file line number Diff line number Diff line change
Expand Up @@ -24,6 +24,9 @@ namespace StreamCompaction {
*/
__global__ void kernMapToBoolean(int n, int *bools, const int *idata) {
// TODO
int idx = blockIdx.x * blockDim.x + threadIdx.x;
if (idata[idx] != 0) bools[idx] = 1;
else bools[idx] = 0;
}

/**
Expand All @@ -33,6 +36,8 @@ namespace StreamCompaction {
__global__ void kernScatter(int n, int *odata,
const int *idata, const int *bools, const int *indices) {
// TODO
int idx = blockIdx.x * blockDim.x + threadIdx.x;
if (bools[idx] == 1) odata[indices[idx]] = idata[idx];
}

}
Expand Down
35 changes: 33 additions & 2 deletions stream_compaction/cpu.cu
Original file line number Diff line number Diff line change
Expand Up @@ -20,6 +20,10 @@ namespace StreamCompaction {
void scan(int n, int *odata, const int *idata) {
timer().startCpuTimer();
// TODO
odata[0] = 0;
for (int k = 1; k < n; ++k) {
odata[k] = odata[k - 1] + idata[k-1];
}
timer().endCpuTimer();
}

Expand All @@ -31,8 +35,15 @@ namespace StreamCompaction {
int compactWithoutScan(int n, int *odata, const int *idata) {
timer().startCpuTimer();
// TODO
int counter = 0;
for (int k = 0; k < n; ++k) {
if (idata[k] != 0) {
odata[counter] = idata[k];
counter++;
}
}
timer().endCpuTimer();
return -1;
return counter;
}

/**
Expand All @@ -43,8 +54,28 @@ namespace StreamCompaction {
int compactWithScan(int n, int *odata, const int *idata) {
timer().startCpuTimer();
// TODO
int counter = 0;
int* boolArray = new int[n];
for (int k = 0; k < n; ++k) {
if (idata[k] != 0) boolArray[k] = 1;
else boolArray[k] = 0;
}
int* scanArray = new int[n];
scanArray[0] = 0;
for (int k = 1; k < n; ++k) {
scanArray[k] = scanArray[k - 1] + boolArray[k - 1];
}
for (int k = 0; k < n; ++k) {
if (idata[k] != 0) {
int idx = scanArray[k];
odata[idx] = idata[k];
}
}
counter = scanArray[n - 1] + boolArray[n - 1];
delete[] boolArray;
delete[] scanArray;
timer().endCpuTimer();
return -1;
return counter;
}
}
}
93 changes: 88 additions & 5 deletions stream_compaction/efficient.cu
Original file line number Diff line number Diff line change
Expand Up @@ -15,10 +15,59 @@ namespace StreamCompaction {
/**
* Performs prefix-sum (aka scan) on idata, storing the result into odata.
*/
void scan(int n, int *odata, const int *idata) {
timer().startGpuTimer();
__global__ void UpSweepKernel(int n, int d, int* data) {
int idx = blockIdx.x * blockDim.x + threadIdx.x;
int step = powf(2, (d+1));
int prevStep = powf(2, (d));
int k = idx * step;
int targetK = k + step - 1;
if (targetK < n) {
data[targetK] += data[k + prevStep - 1];
}
}
__global__ void DownSweepKernel(int n, int d, int* data) {
int idx = blockIdx.x * blockDim.x + threadIdx.x;
int step = 1 << (d + 1);
int prevStep = powf(2, (d));
int k = idx * step;
int targetK = k + step - 1;
if (targetK < n) {
int t = data[k + prevStep - 1];
data[k + prevStep - 1] = data[targetK];
data[targetK] += t;
}
}

void scan(int n, int *odata, const int *idata, bool shouldTime) {
// TODO
timer().endGpuTimer();
int nextPowOf2 = powf(2, ilog2ceil(n)); //next power of 2 of n
int* devA;
cudaMalloc((void**)&devA, nextPowOf2 * sizeof(int));
cudaMemcpy(devA, idata, n * sizeof(int), cudaMemcpyHostToDevice);
cudaMemset(devA + n, 0, (nextPowOf2 - n) * sizeof(int));

dim3 blockSize(128);

if(shouldTime) timer().startGpuTimer();
for (int d = 0; d <= ilog2ceil(nextPowOf2); ++d) {
int newN = nextPowOf2 / (int)powf(2, d + 1);
dim3 blocks((newN + blockSize.x - 1) / blockSize.x);
UpSweepKernel << <blocks, blockSize >> > (nextPowOf2, d, devA);
cudaDeviceSynchronize();
}

cudaMemset(devA + nextPowOf2 - 1, 0, sizeof(int));

for (int d = ilog2ceil(nextPowOf2) - 1; d >= 0; --d) {
int newN = nextPowOf2 / (int)powf(2, d + 1);
dim3 blocks((newN + blockSize.x - 1) / blockSize.x);
DownSweepKernel << <blocks, blockSize >> > (nextPowOf2, d, devA);
cudaDeviceSynchronize();
}
if (shouldTime) timer().endGpuTimer();
cudaMemcpy(odata, devA, n * sizeof(int), cudaMemcpyDeviceToHost);

cudaFree(devA);
}

/**
Expand All @@ -31,10 +80,44 @@ namespace StreamCompaction {
* @returns The number of elements remaining after compaction.
*/
int compact(int n, int *odata, const int *idata) {
timer().startGpuTimer();
// TODO
int* devA;
int* devB;
int* dev_bool;
int* dev_indices;
cudaMalloc((void**)&devA, n * sizeof(int));
cudaMalloc((void**)&devB, n * sizeof(int));
cudaMalloc((void**)&dev_bool, n * sizeof(int));
cudaMalloc((void**)&dev_indices, n * sizeof(int));
cudaMemcpy(devA, idata, n * sizeof(int), cudaMemcpyHostToDevice);

dim3 blockSize(128);
dim3 blocks((n + blockSize.x - 1) / blockSize.x);
timer().startGpuTimer();
//make array of bools
StreamCompaction::Common::kernMapToBoolean << <blocks, blockSize >> > (n, dev_bool, devA);

//scan
scan(n, dev_indices, dev_bool, false);

//scatter
StreamCompaction::Common::kernScatter << <blocks, blockSize >> > (n, devB, devA, dev_bool, dev_indices);

timer().endGpuTimer();
return -1;

int dev_bool_last_element, dev_indices_last_element;
cudaMemcpy(&dev_bool_last_element, dev_bool + n - 1, sizeof(int), cudaMemcpyDeviceToHost);
cudaMemcpy(&dev_indices_last_element, dev_indices + n - 1, sizeof(int), cudaMemcpyDeviceToHost);
int counter = dev_bool_last_element + dev_indices_last_element;

cudaMemcpy(odata, devB, counter * sizeof(int), cudaMemcpyDeviceToHost);

cudaFree(devA);
cudaFree(devB);
cudaFree(dev_bool);
cudaFree(dev_indices);

return counter;
}
}
}
2 changes: 1 addition & 1 deletion stream_compaction/efficient.h
Original file line number Diff line number Diff line change
Expand Up @@ -6,7 +6,7 @@ namespace StreamCompaction {
namespace Efficient {
StreamCompaction::Common::PerformanceTimer& timer();

void scan(int n, int *odata, const int *idata);
void scan(int n, int *odata, const int *idata, bool shouldTime);

int compact(int n, int *odata, const int *idata);
}
Expand Down
Loading