CIS5650-Fall-2025 · AngelaSiruiZhu · Sep 16, 2025 · Sep 17, 2025 · Sep 17, 2025 · Sep 17, 2025
diff --git a/ArraySize.png b/ArraySize.png
diff --git a/README.md b/README.md
@@ -3,12 +3,138 @@ CUDA Stream Compaction
 
 **University of Pennsylvania, CIS 565: GPU Programming and Architecture, Project 2**
 
-* (TODO) YOUR NAME HERE
-  * (TODO) [LinkedIn](), [personal website](), [twitter](), etc.
-* Tested on: (TODO) Windows 22, i7-2222 @ 2.22GHz 22GB, GTX 222 222MB (Moore 2222 Lab)
+* Sirui Zhu
+  * [LinkedIn](https://www.linkedin.com/in/sirui-zhu-28a24a260/)
+* Tested on: Windows 11, i7-13620H, RTX 4060 (Personal)
 
-### (TODO: Your README)
+# Project Description
 
-Include analysis, etc. (Remember, this is public, so don't put
-anything here that you don't want to share with the world.)
+This project implements different approaches to **exclusive prefix-sum (scan)** and **stream compaction** using CUDA.  
+The goal is to explore the efficiency of various parallel algorithms compared to a sequential CPU baseline,and each method is tested on both power-of-two and non-power-of-two array sizes.
 
+## Features
+
+### CPU Scan
+A simple sequential implementation of exclusive prefix-sum used for correctness validation and as a baseline for performance.  
+- Runs in **O(n)** work and **O(n)** step complexity.  
+- Serves as the “ground truth” against which GPU implementations are compared.  
+
+### Naive GPU Scan
+The Naive scan is a straightforward parallel implementation that computes prefix sums using multiple passes with increasing strides.  
+- At each iteration d, threads compute partial sums at increasing step (2^(d-1)).  
+- Uses **two device buffers** (devA, devB) that swap after each iteration to hold intermediate results.  
+- Using cudaDeviceSynchronize() to ensure correctness across kernel launches.  
+- After all iterations, results are copied back to the host, with the first element set to 0 to ensure exclusivity.  
+- Work complexity is **O(n log n)**, since all elements are rewritten at every depth.  
+
+### Work-Efficient GPU Scan
+A more optimized implementation of scan using the **Blelloch parallel scan algorithm**.  
+- Pads the input to the next power-of-two size for efficient binary tree traversal.  
+- **Up-sweep** builds partial sums across the tree until the root contains the total sum.  
+- Sets the root to 0, then performs the **down-sweep** to distribute prefix sums back down the tree.  
+- Runs in **O(n)** work and **O(log n)** depth, making it much more efficient than the naive scan.  
+- Handles both **power-of-two and non-power-of-two** input sizes correctly.  
+
+### Stream Compaction
+Builds on the work-efficient scan to remove unwanted elements (e.g., zeros) from an array.  
+- Step 1: **Map to boolean** → transforms input into a 0/1 array indicating valid elements.  
+- Step 2: **Scan** → exclusive prefix-sum over the boolean array produces write indices for valid elements.  
+- Step 3: **Scatter** → writes only the valid elements to the output buffer using the computed indices.  
+- Returns the count of remaining elements by combining the last values of the boolean and index arrays.  
+- Efficiently compresses arrays in parallel with correct ordering preserved.  
+
+### Thrust Scan
+Uses the high-level Thrust library to run thrust::exclusive_scan on the GPU.  
+- Provides a highly optimized, production-ready implementation.  
+- Serves as a performance and correctness comparison against the custom naive and work-efficient implementations.  
+
+## Performance Analysis
+
+### Block Size (with Array Size = 1 << 23)
+
+| Implementation         | Block Size = 128 | Block Size = 64 | Block Size = 32 |
+|------------------------|------------------|-----------------|-----------------|
+| **Naive Scan (Power of 2)**    | 7.71331 ms       | 8.90992 ms      | 13.3062 ms     |
+| **Naive Scan (Non-Power of 2)**| 7.60579 ms       | 7.78397 ms      | 13.5380 ms     |
+| **Work-Efficient Scan (Power of 2)** | 2.99248 ms       | 3.06944 ms      | 4.31466 ms     |
+| **Work-Efficient Scan (Non-Power of 2)** | 2.02752 ms       | 2.16781 ms      | 2.08486 ms     |
+
+- **Naive Scan**: Block size 128 achieved the lowest runtime for both power-of-two (7.71 ms) and non-power-of-two (7.60 ms) inputs.  
+- **Work-Efficient Scan**: Block size 128 also performed best, with 2.99 ms (power-of-two) and 2.03 ms (non-power-of-two).  
+- Smaller block sizes (64, 32) resulted in noticeably slower runtimes.
+- Therefore, the optimized block size for this GPU is **128 threads** per block.
+
+### Array Size
+![Scan Performance vs Array Size](ArraySize.png)
+
+When comparing performance across increasing array sizes (2^16 to 2^23), each scan implementation demonstrates distinct scaling behavior and bottlenecks:
+
+#### CPU Scan  
+The CPU version is the simplest one: it just loops through the array one element at a time. This means it does **O(n)** work but has no parallelism. As the array size grows, the runtime grows almost perfectly linearly, reaching ~15 ms at 2^23. The main bottleneck here is computation since everything happens sequentially. This works fine for small arrays, but it quickly falls behind compared to the GPU versions.  
+
+#### Naive GPU Scan  
+The Naive GPU scan improves over the CPU by letting many threads work in parallel, but the algorithm itself is inefficient. At each pass, every element is rewritten, and this happens for **log₂(n)** passes. This adds up to **O(n log n)** total work. Two buffers are swapped back and forth every pass, and synchronization is needed between iterations. The result is lots of redundant computation and heavy global memory traffic. It can beat the CPU for smaller arrays, but as the size grows, the wasted work makes it much slower than more optimized scans.  
+
+#### Work-Efficient GPU Scan  
+The Work-Efficient scan improves on the Naive version by avoiding a lot of wasted work. It only does about O(n) total operations, which makes it scale much better for large arrays. In practice, it runs faster than the Naive scan once the input gets big enough. The main drawback is that its memory access pattern is not always ideal — threads sometimes read and write from scattered locations instead of lined-up ones. Since the actual math per thread is very small, the performance ends up limited by memory bandwidth, rather than by computation. This makes it efficient, but still not as fast as Thrust.
+
+#### Thrust Scan  
+Thrust is by far the fastest method. Even with very large arrays like 2^23, it finishes in under 1 ms. It’s a library function that is highly optimized for the GPU. It seems to make very efficient use of memory and parallelism, so the runtime ends up being limited only by how fast data can be moved in and out of GPU memory. The math itself is so quick that memory bandwidth is the main bottleneck.
+
+## Test Output (blocksize = 128, arraysize = 1 << 23)
+
+```
+****************
+** SCAN TESTS **
+****************
+    [  43  33  42  12  32  37  32  13  10  20  37   6  17 ...   2   0 ]
+==== cpu scan, power-of-two ====
+   elapsed time: 13.384ms    (std::chrono Measured)
+    [   0  43  76 118 130 162 199 231 244 254 274 311 317 ... 205512735 205512737 ]
+==== cpu scan, non-power-of-two ====
+   elapsed time: 14.9413ms    (std::chrono Measured)
+    [   0  43  76 118 130 162 199 231 244 254 274 311 317 ... 205512673 205512693 ]
+    passed
+==== naive scan, power-of-two ====
+   elapsed time: 8.56035ms    (CUDA Measured)
+    passed
+==== naive scan, non-power-of-two ====
+   elapsed time: 8.88022ms    (CUDA Measured)
+    passed
+==== work-efficient scan, power-of-two ====
+   elapsed time: 2.5047ms    (CUDA Measured)
+    passed
+==== work-efficient scan, non-power-of-two ====
+   elapsed time: 2.1801ms    (CUDA Measured)
+    passed
+==== thrust scan, power-of-two ====
+   elapsed time: 0.925696ms    (CUDA Measured)
+    passed
+==== thrust scan, non-power-of-two ====
+   elapsed time: 1.0856ms    (CUDA Measured)
+    passed
+
+*****************************
+** STREAM COMPACTION TESTS **
+*****************************
+    [   1   3   2   2   2   3   2   1   2   2   1   2   1 ...   0   0 ]
+==== cpu compact without scan, power-of-two ====
+   elapsed time: 21.1336ms    (std::chrono Measured)
+    [   1   3   2   2   2   3   2   1   2   2   1   2   1 ...   1   1 ]
+    passed
+==== cpu compact without scan, non-power-of-two ====
+   elapsed time: 20.2159ms    (std::chrono Measured)
+    [   1   3   2   2   2   3   2   1   2   2   1   2   1 ...   2   1 ]
+    passed
+==== cpu compact with scan ====
+   elapsed time: 50.2771ms    (std::chrono Measured)
+    [   1   3   2   2   2   3   2   1   2   2   1   2   1 ...   1   1 ]
+    passed
+==== work-efficient compact, power-of-two ====
+   elapsed time: 11.0327ms    (CUDA Measured)
+    passed
+==== work-efficient compact, non-power-of-two ====
+   elapsed time: 5.73254ms    (CUDA Measured)
+    passed
+
+```
diff --git a/src/main.cpp b/src/main.cpp
@@ -13,7 +13,7 @@
 #include <stream_compaction/thrust.h>
 #include "testing_helpers.hpp"
 
-const int SIZE = 1 << 8; // feel free to change the size of array
+const int SIZE = 1 << 23; // feel free to change the size of array
 const int NPOT = SIZE - 3; // Non-Power-Of-Two
 int *a = new int[SIZE];
 int *b = new int[SIZE];
@@ -69,14 +69,14 @@ int main(int argc, char* argv[]) {
 
     zeroArray(SIZE, c);
     printDesc("work-efficient scan, power-of-two");
-    StreamCompaction::Efficient::scan(SIZE, c, a);
+    StreamCompaction::Efficient::scan(SIZE, c, a, true);
     printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
     //printArray(SIZE, c, true);
     printCmpResult(SIZE, b, c);
 
     zeroArray(SIZE, c);
     printDesc("work-efficient scan, non-power-of-two");
-    StreamCompaction::Efficient::scan(NPOT, c, a);
+    StreamCompaction::Efficient::scan(NPOT, c, a, true);
     printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
     //printArray(NPOT, c, true);
     printCmpResult(NPOT, b, c);

diff --git a/stream_compaction/common.cu b/stream_compaction/common.cu
@@ -24,6 +24,9 @@ namespace StreamCompaction {
          */
         __global__ void kernMapToBoolean(int n, int *bools, const int *idata) {
             // TODO
+            int idx = blockIdx.x * blockDim.x + threadIdx.x;
+            if (idata[idx] != 0) bools[idx] = 1;
+            else bools[idx] = 0;
         }
 
         /**
@@ -33,6 +36,8 @@ namespace StreamCompaction {
         __global__ void kernScatter(int n, int *odata,
                 const int *idata, const int *bools, const int *indices) {
             // TODO
+            int idx = blockIdx.x * blockDim.x + threadIdx.x;
+            if (bools[idx] == 1) odata[indices[idx]] = idata[idx];
         }
 
     }

diff --git a/stream_compaction/cpu.cu b/stream_compaction/cpu.cu
@@ -20,6 +20,10 @@ namespace StreamCompaction {
         void scan(int n, int *odata, const int *idata) {
             timer().startCpuTimer();
             // TODO
+            odata[0] = 0;
+            for (int k = 1; k < n; ++k) {
+                odata[k] = odata[k - 1] + idata[k-1];
+            }
             timer().endCpuTimer();
         }
 
@@ -31,8 +35,15 @@ namespace StreamCompaction {
         int compactWithoutScan(int n, int *odata, const int *idata) {
             timer().startCpuTimer();
             // TODO
+            int counter = 0;
+            for (int k = 0; k < n; ++k) {
+                if (idata[k] != 0) {
+                    odata[counter] = idata[k];
+                    counter++;
+                }
+            }
             timer().endCpuTimer();
-            return -1;
+            return counter;
         }
 
         /**
@@ -43,8 +54,28 @@ namespace StreamCompaction {
         int compactWithScan(int n, int *odata, const int *idata) {
             timer().startCpuTimer();
             // TODO
+            int counter = 0;
+            int* boolArray = new int[n];
+            for (int k = 0; k < n; ++k) {
+                if (idata[k] != 0) boolArray[k] = 1;
+                else boolArray[k] = 0;
+            }
+            int* scanArray = new int[n];
+            scanArray[0] = 0;
+            for (int k = 1; k < n; ++k) {
+                scanArray[k] = scanArray[k - 1] + boolArray[k - 1];
+            }
+            for (int k = 0; k < n; ++k) {
+                if (idata[k] != 0) {
+                    int idx = scanArray[k];
+                    odata[idx] = idata[k];
+                }
+            }
+            counter = scanArray[n - 1] + boolArray[n - 1];
+            delete[] boolArray;
+            delete[] scanArray;
             timer().endCpuTimer();
-            return -1;
+            return counter;
         }
     }
 }
diff --git a/stream_compaction/efficient.cu b/stream_compaction/efficient.cu
@@ -15,10 +15,59 @@ namespace StreamCompaction {
         /**
          * Performs prefix-sum (aka scan) on idata, storing the result into odata.
          */
-        void scan(int n, int *odata, const int *idata) {
-            timer().startGpuTimer();
+        __global__ void UpSweepKernel(int n, int d, int* data) {
+            int idx = blockIdx.x * blockDim.x + threadIdx.x;
+            int step = powf(2, (d+1));
+            int prevStep = powf(2, (d));
+            int k = idx * step;
+            int targetK = k + step - 1;
+            if (targetK < n) {
+                data[targetK] += data[k + prevStep - 1];
+            }
+        }
+        __global__ void DownSweepKernel(int n, int d, int* data) {
+            int idx = blockIdx.x * blockDim.x + threadIdx.x;
+            int step = 1 << (d + 1);
+            int prevStep = powf(2, (d));
+            int k = idx * step;
+            int targetK = k + step - 1;
+            if (targetK < n) {
+                int t = data[k + prevStep - 1];
+                data[k + prevStep - 1] = data[targetK];
+                data[targetK] += t;
+            }
+        }
+
+        void scan(int n, int *odata, const int *idata, bool shouldTime) {
             // TODO
-            timer().endGpuTimer();
+            int nextPowOf2 = powf(2, ilog2ceil(n)); //next power of 2 of n
+            int* devA;
+            cudaMalloc((void**)&devA, nextPowOf2 * sizeof(int));
+            cudaMemcpy(devA, idata, n * sizeof(int), cudaMemcpyHostToDevice);
+            cudaMemset(devA + n, 0, (nextPowOf2 - n) * sizeof(int));
+
+            dim3 blockSize(128);
+
+            if(shouldTime) timer().startGpuTimer();
+            for (int d = 0; d <= ilog2ceil(nextPowOf2); ++d) {
+                int newN = nextPowOf2 / (int)powf(2, d + 1);
+                dim3 blocks((newN + blockSize.x - 1) / blockSize.x);
+                UpSweepKernel << <blocks, blockSize >> > (nextPowOf2, d, devA);
+                cudaDeviceSynchronize();
+            }
+
+            cudaMemset(devA + nextPowOf2 - 1, 0, sizeof(int));
+
+            for (int d = ilog2ceil(nextPowOf2) - 1; d >= 0; --d) {
+                int newN = nextPowOf2 / (int)powf(2, d + 1);
+                dim3 blocks((newN + blockSize.x - 1) / blockSize.x);
+                DownSweepKernel << <blocks, blockSize >> > (nextPowOf2, d, devA);
+                cudaDeviceSynchronize();
+            }
+            if (shouldTime) timer().endGpuTimer();
+            cudaMemcpy(odata, devA, n * sizeof(int), cudaMemcpyDeviceToHost);
+
+            cudaFree(devA);
         }
 
         /**
@@ -31,10 +80,44 @@ namespace StreamCompaction {
          * @returns      The number of elements remaining after compaction.
          */
         int compact(int n, int *odata, const int *idata) {
-            timer().startGpuTimer();
             // TODO
+            int* devA;
+            int* devB;
+            int* dev_bool;
+            int* dev_indices;
+            cudaMalloc((void**)&devA, n * sizeof(int));
+            cudaMalloc((void**)&devB, n * sizeof(int));
+            cudaMalloc((void**)&dev_bool, n * sizeof(int));
+            cudaMalloc((void**)&dev_indices, n * sizeof(int));
+            cudaMemcpy(devA, idata, n * sizeof(int), cudaMemcpyHostToDevice);
+
+            dim3 blockSize(128);
+            dim3 blocks((n + blockSize.x - 1) / blockSize.x);
+            timer().startGpuTimer();
+            //make array of bools
+            StreamCompaction::Common::kernMapToBoolean << <blocks, blockSize >> > (n, dev_bool, devA);
+
+            //scan
+            scan(n, dev_indices, dev_bool, false);
+
+            //scatter
+            StreamCompaction::Common::kernScatter << <blocks, blockSize >> > (n, devB, devA, dev_bool, dev_indices);
+
             timer().endGpuTimer();
-            return -1;
+
+            int dev_bool_last_element, dev_indices_last_element;
+            cudaMemcpy(&dev_bool_last_element, dev_bool + n - 1, sizeof(int), cudaMemcpyDeviceToHost);
+            cudaMemcpy(&dev_indices_last_element, dev_indices + n - 1, sizeof(int), cudaMemcpyDeviceToHost);
+            int counter = dev_bool_last_element + dev_indices_last_element;
+
+            cudaMemcpy(odata, devB, counter * sizeof(int), cudaMemcpyDeviceToHost);
+
+            cudaFree(devA);
+            cudaFree(devB);
+            cudaFree(dev_bool);
+            cudaFree(dev_indices);
+
+            return counter;
         }
     }
 }
diff --git a/stream_compaction/efficient.h b/stream_compaction/efficient.h
@@ -6,7 +6,7 @@ namespace StreamCompaction {
     namespace Efficient {
         StreamCompaction::Common::PerformanceTimer& timer();
 
-        void scan(int n, int *odata, const int *idata);
+        void scan(int n, int *odata, const int *idata, bool shouldTime);
 
         int compact(int n, int *odata, const int *idata);
     }