CIS5650-Fall-2024 · hazard-10 · Sep 17, 2024 · Sep 17, 2024 · Sep 19, 2024 · Sep 19, 2024
diff --git a/README.md b/README.md
@@ -3,12 +3,62 @@ CUDA Stream Compaction
 
 **University of Pennsylvania, CIS 565: GPU Programming and Architecture, Project 2**
 
-* (TODO) YOUR NAME HERE
-  * (TODO) [LinkedIn](), [personal website](), [twitter](), etc.
-* Tested on: (TODO) Windows 22, i7-2222 @ 2.22GHz 22GB, GTX 222 222MB (Moore 2222 Lab)
+* Jiahang Mao
+  * [LinkedIn](https://www.linkedin.com/in/jay-jiahang-m-b05608192/)
+* Tested on: Windows 11, i5-13600kf @ 5.0GHz 64GB, RTX 4090 24GB, Personal Computer
 
-### (TODO: Your README)
+### Project Features
+* CPU implementations:
+  * Simple sequential scan
+  * Stream compaction without scan
+  * Stream compaction with scan
+* GPU implementations:
+  * Naive parallel scan
+  * Work-efficient parallel scan
+  * Thrust library-based scan
+  * Work-efficient stream compaction
+  * Performance timing for both CPU and GPU implementations
+* Support for both power-of-two and non-power-of-two input sizes
+
+
+### Questions
+
+* Roughly optimize the block sizes of each of your implementations for minimal
+  run time on your GPU.
+
+  With array size set to 256. I have tried block size to 256, 512 ,1024. Of which 512 block size showed best results on all cuda-enabled functions. Showing anywhere from 10% ( Naive po2) to nearly 250% (work efficient po2) improvement over blocksize == 256. 1024 Block size showed results on par with 512.
+
+* Compare all of these GPU Scan implementations (Naive, Work-Efficient, and
+  Thrust) to the serial CPU version of Scan. Plot a graph of the comparison
+  (with array size on the independent axis).
+   ![Performance Graph](img/perf5.png)
+
+* Write a brief explanation of the phenomena you see here.
+  * CUDA Work Eficient being not efficient
+    The most obvious surprising result is that the supposedly work efficient is significantly slower than Naive or thrust. Diving into the profiler it seems to be due to too many kernel invokations. Both upsweep and downsweep have similar computation and memory throughput to naive kernels, but consistig twice the kernel invokations greatly hinder the performance. The gap could be narrowed with much larger array size.
+
+  *  Thrust implementation 
+    ![Performance Graph](img/nsight.png)
+    The memory throughput is between single naive kernel and work efficient up/down sweep kernel. It invoked the block size that match the input ( 256 in this case). The primary performance jump should come from a single kernel call and reduced back and forth communication with main thread.
+
+
+* Paste the output of the test program into a triple-backtick block in your
+  README.
+
+    Config: Block size 256, Test Array size 2048 
+
+    ![Performance Graph](img/perf1.png)
+
+    Config: Block size 256, Test Array size 256
+
+    ![Performance Graph](img/perf2.png)
+
+    Config: Block size 512, Test Array size 256
+
+    ![Performance Graph](img/perf3.png)
+
+    Config: Block size 1024, Test Array size 256
+
+    ![Performance Graph](img/perf4.png) 
 
-Include analysis, etc. (Remember, this is public, so don't put
-anything here that you don't want to share with the world.)
 
diff --git a/img/nsight.png b/img/nsight.png
diff --git a/img/perf1.png b/img/perf1.png
diff --git a/img/perf2.png b/img/perf2.png
diff --git a/img/perf3.png b/img/perf3.png
diff --git a/img/perf4.png b/img/perf4.png
diff --git a/img/perf5.png b/img/perf5.png
diff --git a/stream_compaction/common.cu b/stream_compaction/common.cu
@@ -23,7 +23,11 @@ namespace StreamCompaction {
          * which map to 0 will be removed, and elements which map to 1 will be kept.
          */
         __global__ void kernMapToBoolean(int n, int *bools, const int *idata) {
-            // TODO
+            int index = threadIdx.x + (blockIdx.x * blockDim.x);
+            if (index >= n) {
+                return;
+            }
+            bools[index] = idata[index] == 0 ? 0 : 1;
         }
 
         /**
@@ -32,7 +36,13 @@ namespace StreamCompaction {
          */
         __global__ void kernScatter(int n, int *odata,
                 const int *idata, const int *bools, const int *indices) {
-            // TODO
+            int index = threadIdx.x + (blockIdx.x * blockDim.x);
+            if (index >= n) {
+                return;
+            }
+            if (bools[index] == 1) {
+                odata[indices[index]] = idata[index];
+            }
         }
 
     }

diff --git a/stream_compaction/cpu.cu b/stream_compaction/cpu.cu
@@ -19,7 +19,10 @@ namespace StreamCompaction {
          */
         void scan(int n, int *odata, const int *idata) {
             timer().startCpuTimer();
-            // TODO
+            // assert idata[0] == 0
+            for (int i = 1; i < n; i++) {
+                odata[i] = odata[i - 1] + idata[i - 1];
+            }
             timer().endCpuTimer();
         }
 
@@ -30,9 +33,15 @@ namespace StreamCompaction {
          */
         int compactWithoutScan(int n, int *odata, const int *idata) {
             timer().startCpuTimer();
-            // TODO
+            int count = 0;
+            for (int i = 0; i < n; i++) {
+                if (idata[i] != 0) {
+                    odata[count] = idata[i];
+                    count += 1;
+                }
+            }
             timer().endCpuTimer();
-            return -1;
+            return count;
         }
 
         /**
@@ -42,9 +51,33 @@ namespace StreamCompaction {
          */
         int compactWithScan(int n, int *odata, const int *idata) {
             timer().startCpuTimer();
-            // TODO
+            int* binary = new int[n];
+            // construct 0 / 1 array
+            for (int i = 0; i < n; i++) {
+                if (idata[i] != 0) {
+                    binary[i] = 1;
+                }
+                else {
+                    binary[i] = 0;
+                }
+            }
+            // scan
+            int* scanArray = new int[n];
+            scanArray[0] = 0;
+            for (int i = 1; i < n; i++) {
+                scanArray[i] = scanArray[i - 1] + binary[i - 1];
+            }
+            // scatter
+            int count = 0;
+            for (int i = 0; i < n; i++) {
+                if (binary[i] == 1) {
+                    odata[scanArray[i]] = idata[i];
+                    count += 1;
+                }
+            }
+
             timer().endCpuTimer();
-            return -1;
+            return count;
         }
     }
 }
diff --git a/stream_compaction/efficient.cu b/stream_compaction/efficient.cu
@@ -12,13 +12,57 @@ namespace StreamCompaction {
             return timer;
         }
 
+        __global__ void upSweepKernel(int n, int d, int *data){
+            int index = threadIdx.x + (blockIdx.x * blockDim.x);
+            if (index >= n){
+                return;
+            }
+            if ((index + 1) % (1 << (d + 1)) == 0){
+                data[index] += data[index - (1 << d)];
+            }
+        }
+
+        __global__ void downSweepKernel(int n, int d, int *data){
+            int index = threadIdx.x + (blockIdx.x * blockDim.x);
+            if (index >= n){
+                return;
+            }
+            if ((index + 1) % (1 << (d + 1)) == 0){
+                int root = data[index];
+                int left_index = index - (1 << d);
+                data[index] += data[left_index];
+                data[left_index] = root;
+            }
+
+        }
+
         /**
          * Performs prefix-sum (aka scan) on idata, storing the result into odata.
          */
         void scan(int n, int *odata, const int *idata) {
+            int d_round = ilog2ceil(n);
+            int full_size = 1 << d_round;
+            int block_size = 512;
+            dim3 fullBlocksPerGrid((block_size + full_size - 1) / block_size);
+
+            int *d_data;
+            cudaMalloc((void **)&d_data, full_size * sizeof(int));
+            cudaMemset(d_data, 0, full_size * sizeof(int));
+            cudaMemcpy(d_data, idata, n * sizeof(int), cudaMemcpyHostToDevice);
+            // Up-sweep
             timer().startGpuTimer();
-            // TODO
+            for (int d = 0; d < d_round; d++){
+                upSweepKernel<<<fullBlocksPerGrid, block_size>>>(full_size, d, d_data);
+            }
+            // Down-sweep
+            cudaMemset(d_data + full_size - 1, 0, sizeof(int));
+            for (int d = d_round - 1; d >= 0; d--){
+                downSweepKernel<<<fullBlocksPerGrid, block_size>>>(full_size, d, d_data);
+            }
             timer().endGpuTimer();
+            // Copy result to odata
+            cudaMemcpy(odata, d_data, n * sizeof(int), cudaMemcpyDeviceToHost);
+            cudaFree(d_data);
         }
 
         /**
@@ -31,10 +75,44 @@ namespace StreamCompaction {
          * @returns      The number of elements remaining after compaction.
          */
         int compact(int n, int *odata, const int *idata) {
+            int *scatter_result = new int[n];
+            int *d_idata, *d_bools, *d_odata;
+            cudaMalloc((void **)&d_idata, n * sizeof(int));
+            cudaMalloc((void **)&d_bools, n * sizeof(int));
+            cudaMalloc((void **)&d_odata, n * sizeof(int));
+            cudaMemcpy(d_idata, idata, n * sizeof(int), cudaMemcpyHostToDevice);
+            int block_size = 512;
+            dim3 fullBlocksPerGrid((block_size + n - 1) / block_size);
+            // efficient scan
+            int d_round = ilog2ceil(n);
+            int full_size = 1 << d_round;
+            dim3 scanBlocksPerGrid((block_size + full_size - 1) / block_size);
+            int *d_scan_buffer;
+            cudaMalloc((void **)&d_scan_buffer, full_size * sizeof(int));
+            cudaMemset(d_scan_buffer, 0, full_size * sizeof(int));
+
             timer().startGpuTimer();
-            // TODO
+            Common::kernMapToBoolean<<<fullBlocksPerGrid, block_size>>>(n, d_bools, d_idata);
+            cudaMemcpy(d_scan_buffer, d_bools, n * sizeof(int), cudaMemcpyDeviceToDevice);
+            for (int d = 0; d < d_round; d++){
+                upSweepKernel<<<scanBlocksPerGrid, block_size>>>(full_size, d, d_scan_buffer);
+            }
+            cudaMemset(d_scan_buffer + full_size - 1, 0, sizeof(int));
+            for (int d = d_round - 1; d >= 0; d--){
+                downSweepKernel<<<scanBlocksPerGrid, block_size>>>(full_size, d, d_scan_buffer);
+            }
+            // scatter
+            Common::kernScatter<<<fullBlocksPerGrid, block_size>>>(n, d_odata, d_idata, d_bools, d_scan_buffer);
             timer().endGpuTimer();
-            return -1;
+            // copy result
+            cudaMemcpy(odata, d_odata, n * sizeof(int), cudaMemcpyDeviceToHost);
+            cudaMemcpy(scatter_result, d_scan_buffer, n * sizeof(int), cudaMemcpyDeviceToHost);
+            cudaFree(d_idata);
+            cudaFree(d_bools);
+            cudaFree(d_odata);
+            cudaFree(d_scan_buffer);
+
+            return scatter_result[n - 1] + (idata[n - 1] != 0);
         }
     }
 }
diff --git a/stream_compaction/naive.cu b/stream_compaction/naive.cu
@@ -12,14 +12,66 @@ namespace StreamCompaction {
             return timer;
         }
         // TODO: __global__
+        __global__ void naiveScanKernel(int n, int offset, int *odata, const int *idata){
+            int index = threadIdx.x + (blockIdx.x * blockDim.x);
+            if (index >= n){
+                return;
+            }
+
+            if (index >= offset){
+                odata[index] = idata[index - offset] + idata[index];
+            }
+            else{
+                odata[index] = idata[index];
+            }
+        }
+
+         __global__ void naiveScanFirstRound(int n, int *odata, const int *idata){
+            int index = threadIdx.x + (blockIdx.x * blockDim.x);
+            if (index >= n){
+                return;
+            }
+            if (index == 0){
+                odata[index] = 0;
+            }
+            else if (index == 1){
+                odata[index] = idata[index - 1];
+            }
+            else{
+                odata[index] = idata[index - 1] + idata[index - 2];
+            }
+         }
 
         /**
          * Performs prefix-sum (aka scan) on idata, storing the result into odata.
          */
-        void scan(int n, int *odata, const int *idata) {
-            timer().startGpuTimer();
+        void scan(int n, int *odata, const int *idata){
+            int block_size = 512;
+            dim3 fullBlocksPerGrid((block_size + n - 1) / block_size);
+
             // TODO
+            int d_round = ilog2ceil(n);
+            int *dstFirst;
+            int *dstSecond;
+            cudaMalloc((void **)&dstFirst, n * sizeof(int));
+            cudaMalloc((void **)&dstSecond, n * sizeof(int));
+
+            cudaMemcpy(dstFirst, idata, n * sizeof(int), cudaMemcpyHostToDevice);
+
+            timer().startGpuTimer();
+            naiveScanFirstRound<<<fullBlocksPerGrid, block_size>>>(n, dstSecond, dstFirst);
+            std::swap(dstFirst, dstSecond);
+
+            for (int d = 1; d < d_round; d++){
+                int d_offset = 1 << d; // 2, 4, 8
+                naiveScanKernel<<<fullBlocksPerGrid, block_size>>>(n, d_offset, dstSecond, dstFirst);
+                std::swap(dstFirst, dstSecond);
+            }
             timer().endGpuTimer();
+            // setFirstAsZero<<<1, 1>>>(dstFirst);
+            cudaMemcpy(odata, dstFirst, n * sizeof(int), cudaMemcpyDeviceToHost);
+            cudaFree(dstFirst);
+            cudaFree(dstSecond);
         }
     }
 }
diff --git a/stream_compaction/thrust.cu b/stream_compaction/thrust.cu
@@ -18,11 +18,17 @@ namespace StreamCompaction {
          * Performs prefix-sum (aka scan) on idata, storing the result into odata.
          */
         void scan(int n, int *odata, const int *idata) {
+            // Create device vectors from input and output data
+            thrust::device_vector<int> d_in(idata, idata + n);
+            thrust::device_vector<int> d_out(n);
+
             timer().startGpuTimer();
-            // TODO use `thrust::exclusive_scan`
-            // example: for device_vectors dv_in and dv_out:
-            // thrust::exclusive_scan(dv_in.begin(), dv_in.end(), dv_out.begin());
+            // Perform exclusive scan using Thrust
+            thrust::exclusive_scan(d_in.begin(), d_in.end(), d_out.begin());
             timer().endGpuTimer();
+
+            // Copy result back to output array
+            thrust::copy(d_out.begin(), d_out.end(), odata);
         }
     }
 }