Minimal kernel with Google Colab: an addition
All the exercises of this module are to be done in the Google Colab
environment, so start by logging in on
the Google Colab
webpage.
Google Colab allows you to write and execute code in an interactive
environment called a Colab notebook. You can write different kinds of
code, including CUDA code. To tell Google Colab that you
want to use a GPU, you have to change the default runtime in the
menu Runtime > Change Runtime Type and set Runtime type
to Python 3 and Hardware accelerator
to GPU.
You will find here a notebook in which:
- the first cell corresponds to your program, which is saved as an add.cu file in the Colab environment thanks to the %%writefile magic;
- the second cell compiles add.cu with nvcc and generates an executable called add. Note that in a notebook, shell commands start with ! . Also note that the option -arch=sm_75 is specific to the Colab environment, for compatibility reasons;
- the third cell launches the CUDA program add.
The given program performs an addition on the GPU, using a kernel executed by a single thread.
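For illustration, here is a minimal sketch of what the first cell could contain (the notebook linked above may differ in its details):

%%writefile add.cu
#include <cstdio>

// Kernel executed by a single GPU thread: it adds two integers.
__global__ void add(int a, int b, int *c) {
    *c = a + b;
}

int main() {
    int h_c = 0;
    int *d_c;
    cudaMalloc((void **)&d_c, sizeof(int));                      // the result lives in GPU memory
    add<<<1, 1>>>(2, 2, d_c);                                    // launch with 1 block of 1 thread
    cudaMemcpy(&h_c, d_c, sizeof(int), cudaMemcpyDeviceToHost);  // copy the result back to the host
    printf("2 + 2 = %d\n", h_c);
    cudaFree(d_c);
    return 0;
}

The two following cells then compile and run it:

!nvcc -arch=sm_75 add.cu -o add
!./add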
- Read the program.
- Load, compile and then launch the program using the "play" buttons on the left of the code blocks.
Remarks:
- Do not forget to specify the use of a GPU in the Google Colab environment.
Minimal kernel with Error management
CUDA calls on the GPU fail silently. In
this notebook,
we highlight the value of protecting your CUDA calls in order to detect
errors:
- In the first section, named "Raw code", you can read a CUDA code written
without any precaution. Run the code and observe the result. You
may not agree with the result
2 + 2 = 0.
- In the second section, we show how to debug this code with the cuda-gdb debugger.
For this purpose, you need to:
- compile with the options -g -G added, so that debugging
symbols are included;
- write in a file the sequence of instructions to be followed by
the debugger. Indeed, cuda-gdb is interactive (you are
expected to type commands as you go along), but running
programs in the Colab environment is not. Typical
commands would go like this:
- set the debugger up to check lots of possible errors:
- memory checks: memcheck on,
- stop in case of API failures: api_failures stop,
- stop on exceptions: catch throw,
- run the program (possibly with command line options): r option1 option2,
- show the kernel call stack (GPU): bt,
- print all local variables: info locals,
- switch to the host thread: thread 1,
- and show the host program call stack (CPU): bt.
- call the debugger with your program and execute the
commands from debug_instructions.txt. If your program terminates fine,
cuda-gdb will complain that there is no stack (since the
program has finished).
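For instance, assuming the executable is called add, the corresponding cells could look like the following sketch (cuda-gdb is built on gdb, so it can normally be driven non-interactively with the usual -batch and -x options; the cuda-gdb syntax for the first two settings listed above is set cuda memcheck on and set cuda api_failures stop):

%%writefile debug_instructions.txt
set cuda memcheck on
set cuda api_failures stop
catch throw
r
bt
info locals
thread 1
bt

!nvcc -g -G -arch=sm_75 add.cu -o add
!cuda-gdb -batch -x debug_instructions.txt ./add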
After running all cells of the "Debugging" notebook section, you
should get an exception and lots of information: an illegal address
is detected at line 5 of add.cu, which is in the kernel add. You
could identify and fix the problem by hand, but it should have been
caught by the CUDA error management, which is the subject of the next section.
Note: if you do use printf to debug, be sure to flush the buffer by
adding a line break at the end; this applies to any C
program. Example: printf("Works up to here\n"); .
Nevertheless, the interface between the Jupyter notebook and the
executed program is a little fragile, so if your program crashes,
there might not be ANY output at all, even if you have printf calls
everywhere.
- In the third section, "Code with error management", we
instrument the code to check the error code returned by each call to
CUDA. The program should now fail explicitly (instead of crashing or
silently giving a wrong result).
As CUDA calls on the GPU fail silently, you have to retrieve and
check the error code of every one of them. Since kernel calls do not
have a return value, you can first check for invalid launch arguments
with the error code of cudaPeekAtLastError(), and then check whether
errors occurred during the kernel execution thanks to the error code
of cudaDeviceSynchronize(), which forces the kernel to complete.
Note that most of the time, a developer will use cudaMemcpy
as the synchronization primitive (the cudaDeviceSynchronize would
then be redundant). In this case, the cudaMemcpy call can return
either errors which occurred during the kernel execution or
those from the memory copy itself.
- In the last section, we have outsourced the error management
code so that you can reuse it easily in the rest of the
exercises (a sketch of such a helper is given after this list).
Notice that the first line of the cell has changed: each cell is now
saved as a file, and the compilation and execution are launched
explicitly in two additional cells with a shell command. Remember
that in a notebook, shell commands start with ! .
- Last but not least, it is up to you to fix the problem.
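The helper provided in the notebook is not reproduced here, but a common way of outsourcing such checks is a macro in the spirit of the following sketch (the name CUDA_CHECK is illustrative):

#include <cstdio>
#include <cstdlib>

// Abort with a readable message whenever a CUDA runtime call returns an error.
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "CUDA error at %s:%d: %s\n",              \
                    __FILE__, __LINE__, cudaGetErrorString(err_));    \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Typical use around a kernel launch:
//   my_kernel<<<blocks, threads>>>(...);
//   CUDA_CHECK(cudaPeekAtLastError());     // catches invalid launch arguments
//   CUDA_CHECK(cudaDeviceSynchronize());   // catches errors raised during kernel execution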
First parallel computation kernel: SAXPY
We implement here the operation y = ax + y on a
large vector, whose CPU code is:
void saxpy(float *x, float *y, int len, float a) {
    for (int i = 0; i < len; ++i)
        y[i] = a * x[i] + y[i];
}
- Open a new notebook whose starting point can
be this
one.
- First of all, set up the error management.
- Complete the program to:
- copy the data to the GPU memory,
- implement a computation kernel, which you will call saxpy, that processes one element of an array passed as a parameter, the processed element being identified from the current thread identifiers,
- run the saxpy kernel so as to process all elements of the vector,
- copy the result from the GPU memory back to the host memory,
- display the first 10 and last 10 results for verification.
A sketch of such a kernel and of its launch is given at the end of this exercise.
- After you have modified your program so that the vector size can be
set, experiment with a vector of 10,000 elements and then with
100,000 elements.
- Test several combinations of "number of blocks" / "number of
threads per block", for arrays of different sizes and
observe the achieved performance.
- Also compare them with the times obtained on the CPU, using timers.
Warning! Launch your experiments within the same program
so that the different tests can be performed on the same hardware
and are therefore comparable.
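Once you have written your own version, you can compare it with the following sketch of a possible GPU implementation (error management with the helper of the previous exercise is omitted for brevity; the 256-thread blocks are just one combination to test, and the CPU reference time can be measured around the sequential loop with, for example, std::chrono):

#include <cstdio>
#include <cstdlib>

__global__ void saxpy(float *x, float *y, int len, float a) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // global index of this thread
    if (i < len)                                     // the grid may be larger than the vector
        y[i] = a * x[i] + y[i];
}

int main() {
    const int len = 10000;                           // try 10,000 and then 100,000 elements
    const float a = 2.0f;
    const size_t size = len * sizeof(float);

    float *h_x = (float *)malloc(size), *h_y = (float *)malloc(size);
    for (int i = 0; i < len; ++i) { h_x[i] = 1.0f; h_y[i] = 2.0f; }

    float *d_x, *d_y;
    cudaMalloc((void **)&d_x, size);
    cudaMalloc((void **)&d_y, size);
    cudaMemcpy(d_x, h_x, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_y, h_y, size, cudaMemcpyHostToDevice);

    int threads = 256;                               // threads per block
    int blocks = (len + threads - 1) / threads;      // enough blocks to cover the vector

    cudaEvent_t start, stop;                         // GPU timers for the performance comparison
    cudaEventCreate(&start); cudaEventCreate(&stop);
    cudaEventRecord(start);
    saxpy<<<blocks, threads>>>(d_x, d_y, len, a);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);

    cudaMemcpy(h_y, d_y, size, cudaMemcpyDeviceToHost);
    printf("y[0] = %f, y[%d] = %f, kernel time = %f ms\n", h_y[0], len - 1, h_y[len - 1], ms);

    cudaFree(d_x); cudaFree(d_y); free(h_x); free(h_y);
    return 0;
}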
Reduction
This exercise consists of implementing the sum of the elements of an
array of integers, following the different parallel versions
presented in these slides.
- Recursive call to the reduction kernel on the host:
To begin with, implement the recursive call in the host code to
enable the different levels of reduction.
In this version, each block of threads will reduce a maximum of 1024
elements. The number of stages in your reduction is therefore determined
by the size of the array and the size of your blocks. Make sure you
do not forget to synchronize between calls, so as to wait for the
previous kernel to finish executing.
- Divergent parallel version: write a parallel version that
calculates the reduction with the divergent scheme, where at each
step only the threads whose index is a multiple of the current stride
do any work (a sketch of this kernel and of the host recursion is
given after this list).
Compare the execution times.
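To fix ideas, here is a sketch of what the divergent kernel and the host-side recursion could look like (the ping-pong between two device buffers is one possible choice; error management is again omitted):

// Divergent version: at step s, only the threads whose index is a multiple
// of 2*s do an addition, so many threads of each warp stay idle.
__global__ void reduce_divergent(int *in, int *out, int n) {
    extern __shared__ int sdata[];
    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;

    sdata[tid] = (i < n) ? in[i] : 0;        // one element per thread, padded with 0
    __syncthreads();

    for (unsigned int s = 1; s < blockDim.x; s *= 2) {
        if (tid % (2 * s) == 0)
            sdata[tid] += sdata[tid + s];
        __syncthreads();                     // each step must complete before the next one
    }
    if (tid == 0)
        out[blockIdx.x] = sdata[0];          // one partial sum per block
}

// Host-side recursion: one kernel launch per level, until a single value remains.
// d_in holds the n input integers; d_out must hold at least (n + 1023) / 1024 integers.
void reduce(int *d_in, int *d_out, int n) {
    const int threads = 1024;                // each block reduces at most 1024 elements
    while (n > 1) {
        int blocks = (n + threads - 1) / threads;
        reduce_divergent<<<blocks, threads, threads * sizeof(int)>>>(d_in, d_out, n);
        cudaDeviceSynchronize();             // wait for this level before launching the next
        int *tmp = d_in; d_in = d_out; d_out = tmp;   // the partial sums become the new input
        n = blocks;
    }
    // The total is now in the first element of the buffer written last
    // (d_in after the final swap); copy it back to the host with cudaMemcpy.
}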