{
"cells": [
{
"cell_type": "code",
"execution_count": 1,
"id": "87c7d8e5-c701-4120-a224-220b5eb50e0e",
"metadata": {},
"outputs": [],
"source": [
"#undef __noinline__"
]
},
{
"cell_type": "markdown",
"id": "4f2f841a-d999-41ba-9b76-1bce79012ac2",
"metadata": {},
"source": [
"# Introduction to CUDA\n",
"\n",
"This notebook walks through beginner-friendly examples of CUDA syntax.\n",
"\n",
"We start by declaring a `__global__` kernel, a function that runs on the GPU, which adds two numbers."
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "2d31952b-098b-4bae-ab94-51c7022f30a1",
"metadata": {},
"outputs": [],
"source": [
"__global__ void add( int a, int b, int *c ) {\n",
" *c = a + b;\n",
"}"
]
},
{
"cell_type": "markdown",
"id": "69a6525c-f859-422f-8f18-40ac78634d97",
"metadata": {},
"source": [
"Next we have to launch this kernel. The `<<< >>>` syntax does that: the first number is how many blocks and the second how many threads per block we are going to use.\n",
"\n",
"Before the launch we need to allocate device memory (VRAM) on the GPU with `cudaMalloc()` to store the sum. Then we launch the kernel with the arguments 2 and 7.\n",
"\n",
"After the kernel finishes we copy the result from device memory back to the host with `cudaMemcpy()`. Finally we print the result and free the device memory with `cudaFree()`."
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "c33a2e6a-c940-4132-a9b5-2b18bda17059",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"2 + 7 = 9\n"
]
}
],
"source": [
"#include <cstdio>\n",
"\n",
"int c;\n",
"int *d_c;\n",
"\n",
"cudaMalloc( (void**)&d_c, sizeof(int) );\n",
"\n",
"add<<<1,1>>>( 2, 7, d_c );\n",
"\n",
"cudaMemcpy( &c, d_c, sizeof(int), cudaMemcpyDeviceToHost );\n",
"\n",
"printf( \"2 + 7 = %d\\n\", c );\n",
"cudaFree( d_c );"
]
},
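{
"cell_type": "markdown",
"id": "err-check-0001-4000-8000-000000000001",
"metadata": {},
"source": [
"A quick aside (a sketch added for illustration, not part of the original flow): every CUDA runtime call such as `cudaMalloc()` or `cudaMemcpy()` returns a `cudaError_t`, which can be turned into a readable message with `cudaGetErrorString()`. The pointer `d_tmp` below is hypothetical and exists only for this demo."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "err-check-0001-4000-8000-000000000002",
"metadata": {},
"outputs": [],
"source": [
"// hypothetical pointer, allocated only to demonstrate error checking\n",
"int *d_tmp;\n",
"cudaError_t err = cudaMalloc( (void**)&d_tmp, sizeof(int) );\n",
"if (err != cudaSuccess)\n",
"    printf( \"CUDA error: %s\\n\", cudaGetErrorString(err) );\n",
"else\n",
"    cudaFree( d_tmp );"
]
},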
{
"cell_type": "markdown",
"id": "4304a044-2d1e-4cfd-b798-001bfcbf1cd6",
"metadata": {},
"source": [
"Next we are going to try something a little more difficult: adding two vectors index by index and storing the result in a third vector.\n",
"\n",
"The image below visualizes the addition.\n",
"\n",
"![Vector_Add_Model](assets/vectoradd.png)\n",
"\n",
"First we need the `__global__` function. As you can see, the kernel performs only one addition, so to add the whole vector we need multiple blocks running at the same time."
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "ac62ec4b-9ac0-40a3-99eb-bc58bc33e947",
"metadata": {},
"outputs": [],
"source": [
"const int N = 20;\n",
"\n",
"__global__ void vector_add( int *a, int *b, int *c ) {\n",
" int tid = blockIdx.x; // handle the data at this index\n",
" if (tid < N)\n",
" c[tid] = a[tid] + b[tid];\n",
"}"
]
},
{
"cell_type": "markdown",
"id": "76c604c9-ff55-418c-8d3f-5053a3ec898e",
"metadata": {},
"source": [
"Parallel programming is what makes the GPU great at performing a large number of identical operations simultaneously on different pieces of data.\n",
"\n",
"Here we launch **N** kernel instances, each performing a single addition. With only 20 elements we don't save much time, but parallelism works in our favour once we reach millions of operations, where it is far faster than adding the elements one by one on the CPU.\n",
"\n",
"Below you can see how blocks and threads are laid out in the grid.\n",
"\n",
"![CudaGrid](assets/CudaGrid.png)"
]
},
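{
"cell_type": "markdown",
"id": "grid-idx-0001-4000-8000-000000000001",
"metadata": {},
"source": [
"As a sketch of the layout above (an illustrative variant, not the kernel used below): blocks and threads can be combined into one global index with `threadIdx.x + blockIdx.x * blockDim.x`. A launch such as `vector_add_bt<<<(N+127)/128,128>>>( dev_a, dev_b, dev_c );` then covers vectors larger than a single block."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "grid-idx-0001-4000-8000-000000000002",
"metadata": {},
"outputs": [],
"source": [
"__global__ void vector_add_bt( int *a, int *b, int *c ) {\n",
"    // combine the block and thread coordinates into one global index\n",
"    int tid = threadIdx.x + blockIdx.x * blockDim.x;\n",
"    if (tid < N)\n",
"        c[tid] = a[tid] + b[tid];\n",
"}"
]
},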
{
"cell_type": "code",
"execution_count": 5,
"id": "e46fa669-ada5-4b9c-9fc7-3ea26f1225dc",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"0 + 0 = 0\n",
"1 + 1 = 2\n",
"2 + 4 = 6\n",
"3 + 9 = 12\n",
"4 + 16 = 20\n",
"5 + 25 = 30\n",
"6 + 36 = 42\n",
"7 + 49 = 56\n",
"8 + 64 = 72\n",
"9 + 81 = 90\n",
"10 + 100 = 110\n",
"11 + 121 = 132\n",
"12 + 144 = 156\n",
"13 + 169 = 182\n",
"14 + 196 = 210\n",
"15 + 225 = 240\n",
"16 + 256 = 272\n",
"17 + 289 = 306\n",
"18 + 324 = 342\n",
"19 + 361 = 380\n"
]
}
],
"source": [
"int v_a[N], v_b[N], v_c[N];\n",
"int *dev_a, *dev_b, *dev_c;\n",
"\n",
"// allocate the memory on the GPU\n",
"cudaMalloc( (void**)&dev_a, N * sizeof(int) );\n",
"cudaMalloc( (void**)&dev_b, N * sizeof(int) );\n",
"cudaMalloc( (void**)&dev_c, N * sizeof(int) );\n",
"\n",
"// fill the arrays 'a' and 'b' on the CPU\n",
"for (int i=0; i<N; i++) {\n",
" v_a[i] = i;\n",
" v_b[i] = i * i;\n",
"}\n",
"\n",
"// copy the arrays 'a' and 'b' to the GPU\n",
"cudaMemcpy( dev_a, v_a, N * sizeof(int), cudaMemcpyHostToDevice );\n",
"cudaMemcpy( dev_b, v_b, N * sizeof(int), cudaMemcpyHostToDevice );\n",
"\n",
"vector_add<<<N,1>>>( dev_a, dev_b, dev_c );\n",
"\n",
"// copy the array 'c' back from the GPU to the CPU\n",
"cudaMemcpy( v_c, dev_c, N * sizeof(int), cudaMemcpyDeviceToHost );\n",
"\n",
"// display the results\n",
"for (int i=0; i<N; i++) {\n",
" printf( \"%d + %d = %d\\n\", v_a[i], v_b[i], v_c[i] );\n",
"}\n",
"\n",
"// free the memory allocated on the GPU\n",
"cudaFree( dev_a );\n",
"cudaFree( dev_b );\n",
"cudaFree( dev_c );"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "C++23 CUDA",
"language": "cpp",
"name": "xcpp23-cuda"
},
"language_info": {
"codemirror_mode": "text/x-c++src",
"file_extension": ".cpp",
"mimetype": "text/x-c++src",
"name": "C++",
"nbconvert_exporter": "",
"pygments_lexer": "",
"version": "23"
}
},
"nbformat": 4,
"nbformat_minor": 5
}