Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
77 changes: 71 additions & 6 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -3,12 +3,77 @@ CUDA Stream Compaction

**University of Pennsylvania, CIS 565: GPU Programming and Architecture, Project 2**

* (TODO) YOUR NAME HERE
* (TODO) [LinkedIn](), [personal website](), [twitter](), etc.
* Tested on: (TODO) Windows 22, i7-2222 @ 2.22GHz 22GB, GTX 222 222MB (Moore 2222 Lab)
* Maya Diaz Huizar
* Tested on: Windows 10, R7-5800X @ 3.8GHz 32GB, RTX 3080 10GB

### (TODO: Your README)
### Questions
* Roughly optimize the block sizes of each of your implementations for minimal
run time on your GPU.
* Graphs:
* The optimal block size for the CPU implementation is N/A.
* ![image](<img/Naive GPU - Time (ms) vs Block Size (lower is better).png>)
* ![image](<img/Efficient GPU - Time (ms) vs Block Size (lower is better).png>)

Include analysis, etc. (Remember, this is public, so don't put
anything here that you don't want to share with the world.)
* Compare all of these GPU Scan implementations (Naive, Work-Efficient, and Thrust) to the serial CPU version of Scan. Plot a graph of the comparison (with array size on the independent axis).
* ![image](<img/Various Scans - Time (ms) vs Element Count (lower is better).png>)
* ![image](<img/Stream Compaction - Time (ms) vs Element Count (lower is better).png>)

* Write a brief explanation of the phenomena you see here.
* This generally makes sense, the efficient GPU scan and compact is much more efficient and more parallelizable, when compared to the naive approach. The CPU method is fast for small arrays and scales linearly, and thus is much worse at very large arrays when compared to the GPU implementation. Thrust almost certainly takes different approaches based on the size of the array, ensuring that it yields the best of both worlds, with fast small and large arrays. I also am wholly and entirely confident that the developers of the thrust library are more than capable of writing a faster library when compared to an undergrad CMPE major.
* Paste the output of the test program into a triple-backtick block in your README.
```
The below tests results are from scanning and steam compacting 2^29 element arrays.
****************
** SCAN TESTS **
****************
[ 42 34 32 22 8 16 34 39 37 30 7 2 14 ... 1 0 ]
==== cpu scan, power-of-two ====
elapsed time: 246.098ms (std::chrono Measured)
[ 0 42 76 108 130 138 154 188 227 264 294 301 303 ... 264144619 264144620 ]
==== cpu scan, non-power-of-two ====
elapsed time: 245.254ms (std::chrono Measured)
[ 0 42 76 108 130 138 154 188 227 264 294 301 303 ... 264144559 264144572 ]
passed
==== naive scan, power-of-two ====
elapsed time: 378.275ms (CUDA Measured)
passed
==== naive scan, non-power-of-two ====
elapsed time: 377.967ms (CUDA Measured)
passed
==== work-efficient scan, power-of-two ====
elapsed time: 17.0086ms (CUDA Measured)
passed
==== work-efficient scan, non-power-of-two ====
elapsed time: 16.9234ms (CUDA Measured)
passed
==== thrust scan, power-of-two ====
elapsed time: 7.19872ms (CUDA Measured)
passed
==== thrust scan, non-power-of-two ====
elapsed time: 7.27962ms (CUDA Measured)
passed

*****************************
** STREAM COMPACTION TESTS **
*****************************
[ 3 0 2 1 3 2 0 0 3 3 0 3 2 ... 0 0 ]
==== cpu compact without scan, power-of-two ====
elapsed time: 743.234ms (std::chrono Measured)
passed
==== cpu compact without scan, non-power-of-two ====
elapsed time: 743.523ms (std::chrono Measured)
passed
==== cpu compact with scan ====
elapsed time: 2002.52ms (std::chrono Measured)
passed
==== work-efficient compact, power-of-two ====
elapsed time: 1130.99ms (CUDA Measured)
passed
==== work-efficient compact, non-power-of-two ====
elapsed time: 875.497ms (CUDA Measured)
passed
```

* Extra Credit
* My efficient GPU scan was efficient from the onset, but I also wasn't following the slides very closely. (5pt GPU approach)
* I also implemented improvements for memory access to better align and thus prevent bank conflicts, based upon the overview provided by GPU Gems 3 Ch 39.2.3.
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
17 changes: 11 additions & 6 deletions src/main.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -13,7 +13,7 @@
#include <stream_compaction/thrust.h>
#include "testing_helpers.hpp"

const int SIZE = 1 << 8; // feel free to change the size of array
const int SIZE = 1 << 29; // feel free to change the size of array
const int NPOT = SIZE - 3; // Non-Power-Of-Two
int *a = new int[SIZE];
int *b = new int[SIZE];
Expand All @@ -34,6 +34,7 @@ int main(int argc, char* argv[]) {
// initialize b using StreamCompaction::CPU::scan you implement
// We use b for further comparison. Make sure your StreamCompaction::CPU::scan is correct.
// At first all cases passed because b && c are all zeroes.

zeroArray(SIZE, b);
printDesc("cpu scan, power-of-two");
StreamCompaction::CPU::scan(SIZE, b, a);
Expand All @@ -46,6 +47,7 @@ int main(int argc, char* argv[]) {
printElapsedTime(StreamCompaction::CPU::timer().getCpuElapsedTimeForPreviousOperation(), "(std::chrono Measured)");
printArray(NPOT, b, true);
printCmpResult(NPOT, b, c);


zeroArray(SIZE, c);
printDesc("naive scan, power-of-two");
Expand All @@ -67,6 +69,7 @@ int main(int argc, char* argv[]) {
//printArray(SIZE, c, true);
printCmpResult(NPOT, b, c);


zeroArray(SIZE, c);
printDesc("work-efficient scan, power-of-two");
StreamCompaction::Efficient::scan(SIZE, c, a);
Expand Down Expand Up @@ -94,7 +97,7 @@ int main(int argc, char* argv[]) {
printElapsedTime(StreamCompaction::Thrust::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
//printArray(NPOT, c, true);
printCmpResult(NPOT, b, c);

printf("\n");
printf("*****************************\n");
printf("** STREAM COMPACTION TESTS **\n");
Expand All @@ -110,42 +113,44 @@ int main(int argc, char* argv[]) {

// initialize b using StreamCompaction::CPU::compactWithoutScan you implement
// We use b for further comparison. Make sure your StreamCompaction::CPU::compactWithoutScan is correct.

zeroArray(SIZE, b);
printDesc("cpu compact without scan, power-of-two");
count = StreamCompaction::CPU::compactWithoutScan(SIZE, b, a);
printElapsedTime(StreamCompaction::CPU::timer().getCpuElapsedTimeForPreviousOperation(), "(std::chrono Measured)");
expectedCount = count;
printArray(count, b, true);
//printArray(count, b, true);
printCmpLenResult(count, expectedCount, b, b);

zeroArray(SIZE, c);
printDesc("cpu compact without scan, non-power-of-two");
count = StreamCompaction::CPU::compactWithoutScan(NPOT, c, a);
printElapsedTime(StreamCompaction::CPU::timer().getCpuElapsedTimeForPreviousOperation(), "(std::chrono Measured)");
expectedNPOT = count;
printArray(count, c, true);
//printArray(count, c, true);
printCmpLenResult(count, expectedNPOT, b, c);

zeroArray(SIZE, c);
printDesc("cpu compact with scan");
count = StreamCompaction::CPU::compactWithScan(SIZE, c, a);
printElapsedTime(StreamCompaction::CPU::timer().getCpuElapsedTimeForPreviousOperation(), "(std::chrono Measured)");
printArray(count, c, true);
//printArray(count, c, true);
printCmpLenResult(count, expectedCount, b, c);

zeroArray(SIZE, c);
printDesc("work-efficient compact, power-of-two");
count = StreamCompaction::Efficient::compact(SIZE, c, a);
printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
//printArray(count, b, true);
//printArray(count, c, true);
printCmpLenResult(count, expectedCount, b, c);

zeroArray(SIZE, c);
printDesc("work-efficient compact, non-power-of-two");
count = StreamCompaction::Efficient::compact(NPOT, c, a);
printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
//printArray(count, c, true);
printCmpLenResult(count, expectedNPOT, b, c);


system("pause"); // stop Win32 console from closing on exit
delete[] a;
Expand Down
22 changes: 17 additions & 5 deletions stream_compaction/common.cu
Original file line number Diff line number Diff line change
Expand Up @@ -22,18 +22,30 @@ namespace StreamCompaction {
* Maps an array to an array of 0s and 1s for stream compaction. Elements
* which map to 0 will be removed, and elements which map to 1 will be kept.
*/
__global__ void kernMapToBoolean(int n, int *bools, const int *idata) {
// TODO
__global__ void kernMapToBoolean(int n, int* bools, const int* idata) {
int idx = threadIdx.x + blockIdx.x * blockDim.x;

if (idx < n) {
// Map to 1 if idata[idx] is non-zero, else map to 0
bools[idx] = (idata[idx] != 0) ? 1 : 0;
}
}

/**
* Performs scatter on an array. That is, for each element in idata,
* if bools[idx] == 1, it copies idata[idx] to odata[indices[idx]].
*/
__global__ void kernScatter(int n, int *odata,
const int *idata, const int *bools, const int *indices) {
// TODO
__global__ void kernScatter(int n, int* odata, const int* idata, const int* bools, const int* indices) {
int idx = threadIdx.x + blockIdx.x * blockDim.x;

if (idx < n) {
// Perform scatter only if bools[idx] is 1
if (bools[idx] == 1) {
odata[indices[idx]] = idata[idx];
}
}
}


}
}
77 changes: 66 additions & 11 deletions stream_compaction/cpu.cu
Original file line number Diff line number Diff line change
Expand Up @@ -19,32 +19,87 @@ namespace StreamCompaction {
*/
void scan(int n, int *odata, const int *idata) {
timer().startCpuTimer();
// TODO

// make sure elements exist
if (n <= 0) {
timer().endCpuTimer();
return;
}

// add identity for exclusive scan
odata[0] = 0;
for (int i = 1; i < n; i++) {
// do scan in one big for loop :(
odata[i] = odata[i - 1] + idata[i - 1];
}

timer().endCpuTimer();
}

/**
* CPU stream compaction without using the scan function.
*
* @param n number of elements in initial array
* @param idata input array, not modified
* @param odata output array
* @returns the number of elements remaining after compaction.
*/
int compactWithoutScan(int n, int *odata, const int *idata) {
timer().startCpuTimer();
// TODO

int count = 0;
// loop over entire array as one big loop,
for (int i = 0; i < n; i++) {
// check if they are zero (throw out) or non-zero (keep)
if (idata[i] != 0) {
odata[count] = idata[i];
count++;
}
}
timer().endCpuTimer();
return -1;
return count;
}

/**
* CPU stream compaction using scan and scatter, like the parallel version.
*
* @returns the number of elements remaining after compaction.
*/
int compactWithScan(int n, int *odata, const int *idata) {
* CPU stream compaction using scan and scatter, like the parallel version.
*
* @returns the number of elements remaining after compaction.
*/
int compactWithScan(int n, int* odata, const int* idata) {
timer().startCpuTimer();
// TODO

// create temporary array
int* temp = new int[n];

// loop over creating boolean array
for (int i = 0; i < n; ++i) {
temp[i] = (idata[i] != 0) ? 1 : 0;
}

// create array for scan result
int* scanResult = new int[n];
scanResult[0] = 0;
// loop, exclusive scan
for (int i = 1; i < n; ++i) {
scanResult[i] = scanResult[i - 1] + temp[i - 1];
}

// final loop, use scan result and boolean result to generate new array
int count = 0;
for (int i = 0; i < n; ++i) {
if (temp[i] == 1) {
odata[scanResult[i]] = idata[i];
count++;
}
}

// cleanup
delete[] temp;
delete[] scanResult;

timer().endCpuTimer();
return -1;

return count;
}

}
}
Loading