CUDA - Labs
Bootstrap
The lab has to be done on the Arcadia GPU cluster.
Here is the configuration to add to your ~/.ssh/config file in order
to connect to it:
Host tsp-client
User <your_login>
HostName IP_ADDRESS
ServerAliveInterval 60
ServerAliveCountMax 2
ProxyJump <LAB_GATEWAY>
In the configuration you set up during the MPI lab, LAB_GATEWAY is normally
"tsp". IP_ADDRESS is given by your teacher during the session.
Once connected to this front-end node via SSH,
reserve a GPU through the Slurm job scheduling system:
srun --time=1:00:00 --gres=gpu:1 --cpus-per-task=2 --mem=8G --pty bash
If a GPU is available, you will be connected directly to the node you were allocated.
Use nvidia-smi if you want to take a look at the machine topology.
Minimal kernel with error management
In bug_add.cu, you will find CUDA
code written without any precautions. Run the code and observe the
result. You may not agree that 2 + 2 = 0.
The objective of this exercise is to highlight the value of protecting
your CUDA calls in order to detect silent errors.
Use the CUDA debugger to understand what
happens. For this purpose, compile with the options "-g
-G" so that debugging symbols are included, and run the program within
cuda-gdb.
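For instance (assuming the executable is named bug_add):

nvcc -g -G -o bug_add bug_add.cu
cuda-gdb ./bug_add
(cuda-gdb) run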
Now, in order to prevent this kind of error, instrument the code to
check the error code returned by calls to CUDA.
Since CUDA calls on the GPU fail silently, the error code of every
one of them must be retrieved and checked. Kernel launches do not
have a return value, so you can first check for invalid
launch arguments with the error code of cudaPeekAtLastError(), and
then check whether errors occurred during the kernel execution
with the error code of cudaDeviceSynchronize(), which forces the
kernel to complete. Note that most of the time a developer will use
cudaMemcpy as the synchronization primitive (the cudaDeviceSynchronize
would then be redundant). In this case, the cudaMemcpy call can
return either errors that occurred during the kernel execution or
errors from the memory copy itself.
The program should now fail with an explicit error instead of silently giving a wrong result.
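For instance, a minimal sketch of this pattern around a kernel launch (the kernel name and arguments are illustrative, not necessarily those of bug_add.cu) could look like:

add<<<1, 1>>>(gpu_a, gpu_b, gpu_c);          // illustrative launch
cudaError_t err = cudaPeekAtLastError();     // catches invalid launch configurations/arguments
if (err != cudaSuccess)
    fprintf(stderr, "launch error: %s\n", cudaGetErrorString(err));
err = cudaDeviceSynchronize();               // catches errors raised during kernel execution
if (err != cudaSuccess)
    fprintf(stderr, "execution error: %s\n", cudaGetErrorString(err));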
To lighten the code, we suggest factoring the error
management out into the following macro:

/** Error checking,
 * taken from https://stackoverflow.com/questions/14038589/what-is-the-canonical-way-to-check-for-errors-using-the-cuda-runtime-api
 */
#define gpuErrCheck(ans) { gpuAssert((ans), __FILE__, __LINE__); }
inline void gpuAssert(cudaError_t code, const char *file, int line, bool abort=true)
{
    if (code != cudaSuccess)
    {
        fprintf(stderr,"GPUassert: %s %s %d\n", cudaGetErrorString(code), file, line);
        if (abort) exit(code);
    }
}

and to wrap each CUDA call like this:

gpuErrCheck(cudaMemcpy(C, gpu_C, nb_LineA * nb_ColB * sizeof(float), cudaMemcpyDeviceToHost));

Your program should still fail, similarly to the previous question.
Finally, fix the program.
First parallel computation kernel: SAXPY
We implement here the linear operation y = a*x + y on a
large vector, whose CPU code is:
void saxpy(float *x, float *y, int len, float a){
    for (int i = 0; i < len; ++i)
        y[i] = a*x[i] + y[i];
}
Starting from saxpy.cu,
- allocate the data on the GPU
- implement a GPU kernel that processes a single element of an array given as a parameter, the processed element being identified using the current thread indices (see the sketch after this list),
- copy the array to the GPU memory,
- launch the saxpy calculation kernel so that all elements of the vector are processed,
- copy the result from GPU memory back to host memory,
- display the first 10 and last 10 results for checking.
- Finally, set up the error management.
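As a rough guide, a sketch of the kernel and its launch could look like this (variable names are illustrative, not necessarily those of saxpy.cu):

__global__ void saxpy_kernel(float *x, float *y, int len, float a)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // global thread index
    if (i < len)                                     // guard against the last, partially filled block
        y[i] = a * x[i] + y[i];
}

/* Host side, each call wrapped with gpuErrCheck():
 *   gpuErrCheck(cudaMalloc(&gpu_x, len * sizeof(float)));
 *   gpuErrCheck(cudaMemcpy(gpu_x, x, len * sizeof(float), cudaMemcpyHostToDevice));
 *   int threads = 1024;
 *   int blocks  = (len + threads - 1) / threads;
 *   saxpy_kernel<<<blocks, threads>>>(gpu_x, gpu_y, len, a);
 *   gpuErrCheck(cudaPeekAtLastError());
 *   gpuErrCheck(cudaMemcpy(y, gpu_y, len * sizeof(float), cudaMemcpyDeviceToHost));
 */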
Begin by running your program with a vector size limited to the
block size, i.e. 1024 elements, before switching to a larger one.
Using timers, compare the performance with that obtained on the CPU.
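One possible way to time the kernel is with CUDA events, reusing the names from the sketch above (a host-side timer such as clock_gettime works just as well for the CPU version):

cudaEvent_t start, stop;
float ms = 0.0f;
gpuErrCheck(cudaEventCreate(&start));
gpuErrCheck(cudaEventCreate(&stop));
gpuErrCheck(cudaEventRecord(start));
saxpy_kernel<<<blocks, threads>>>(gpu_x, gpu_y, len, a);
gpuErrCheck(cudaEventRecord(stop));
gpuErrCheck(cudaEventSynchronize(stop));              // wait for the kernel to finish
gpuErrCheck(cudaEventElapsedTime(&ms, start, stop));  // elapsed time in milliseconds
printf("GPU saxpy: %f ms\n", ms);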
Square Matrix Multiplication
Here you have to implement the multiplication of square matrices C = A x B. In sgemm.cu, you will find a code skeleton that will allow you to implement and compare the different matrix multiplication strategies we target.
Basic version: implement a version in which each element of C is
computed independently, by walking sequentially over the corresponding
row of A and column of B in global memory.
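A minimal sketch of such a kernel, assuming square matrices of width n stored in row-major order (names are illustrative, not necessarily those of sgemm.cu):

__global__ void sgemm_basic(const float *A, const float *B, float *C, int n)
{
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < n && col < n) {
        float acc = 0.0f;
        for (int k = 0; k < n; ++k)                  // walk the row of A and the column of B
            acc += A[row * n + k] * B[k * n + col];
        C[row * n + col] = acc;
    }
}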
Tiled version:
following the explanations given in the course,
implement the tiled version using shared memory.
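A possible structure for the tiled kernel, under the same assumptions as above and additionally assuming n is a multiple of TILE_WIDTH, is sketched below:

__global__ void sgemm_tiled(const float *A, const float *B, float *C, int n)
{
    __shared__ float tileA[TILE_WIDTH][TILE_WIDTH];
    __shared__ float tileB[TILE_WIDTH][TILE_WIDTH];
    int row = blockIdx.y * TILE_WIDTH + threadIdx.y;
    int col = blockIdx.x * TILE_WIDTH + threadIdx.x;
    float acc = 0.0f;
    for (int t = 0; t < n / TILE_WIDTH; ++t) {       // walk tile by tile along the row of A / column of B
        tileA[threadIdx.y][threadIdx.x] = A[row * n + t * TILE_WIDTH + threadIdx.x];
        tileB[threadIdx.y][threadIdx.x] = B[(t * TILE_WIDTH + threadIdx.y) * n + col];
        __syncthreads();                             // the whole tile must be loaded before use
        for (int k = 0; k < TILE_WIDTH; ++k)
            acc += tileA[threadIdx.y][k] * tileB[k][threadIdx.x];
        __syncthreads();                             // all threads done with the tile before reloading
    }
    C[row * n + col] = acc;
}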
Compile with the nvcc compiler and the -lcublas option to link the library we use in
the next question.
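For instance (the executable name is illustrative):

nvcc -O3 -o sgemm sgemm.cu -lcublas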
Compare the execution times of your implementations. Also
compare the execution times obtained with different tile sizes (e.g. TILE_WIDTH=16 and
TILE_WIDTH=32). Note: in order to obtain stable measurements, run the
computation several times and use the average execution time for
your comparisons.
cuBLAS version: the call to the cublasSgemm
function from the cuBLAS library that performs the matrix multiplication has already been
written for you. Measure its execution time, taking care to exclude the library
initialization time (warm up with 5 sgemm calls that you exclude from the time measurements)
and taking the average of several runs.
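A sketch of this measurement scheme, reusing the cublasSgemm call already provided in sgemm.cu and the start/stop events from the timing sketch above (the handle, matrix pointers and argument order below are illustrative, not necessarily those of the skeleton):

const float alpha = 1.0f, beta = 0.0f;
for (int i = 0; i < 5; ++i)                          // warm-up, excluded from the measurements
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                &alpha, gpu_B, n, gpu_A, n, &beta, gpu_C, n);
gpuErrCheck(cudaDeviceSynchronize());

float total_ms = 0.0f, ms;
const int runs = 10;
for (int i = 0; i < runs; ++i) {
    gpuErrCheck(cudaEventRecord(start));
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                &alpha, gpu_B, n, gpu_A, n, &beta, gpu_C, n);
    gpuErrCheck(cudaEventRecord(stop));
    gpuErrCheck(cudaEventSynchronize(stop));
    gpuErrCheck(cudaEventElapsedTime(&ms, start, stop));
    total_ms += ms;
}
printf("cuBLAS sgemm: %f ms on average over %d runs\n", total_ms / runs, runs);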
Compare the execution times obtained with those of the algorithms you wrote.