CUDA - Labs
Bootstrap
The lab is to be carried out on the jromero cluster, on which you have
an account (your login is your name). Here is the configuration to
add to your ~/.ssh/config file in order to connect:
Host tsp-client
user <your_login>
Hostname 157.159.104.229
ServerAliveInterval 60
ServerAliveCountMax 2
ProxyJump <LAB_GATEWAY>
Once connected to this front-end node via ssh (use the -X option to
forward EasyPAP's graphical windows), request a GPU through the Slurm
job scheduling system:
srun --x11 --time=2:00:00 --gres=gpu:1 --pty bash
You will then be connected directly to the node that has been allocated to you.
Use nvidia-smi if you want to take a look at the machine topology.
Minimal kernel with error management
In bug_add.cu, you will find CUDA
code written without any precautions. Run the code and observe the
result. You may not agree that 2 + 2 = 0.
The objective of this exercise is to highlight the value of protecting
your CUDA calls in order to detect silent errors.
Use the CUDA debugger to understand what
happens. For this purpose, compile with the "-g -G" options so that
debugging symbols are included, then run the program within cuda-gdb.
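For instance, one possible way to do this (assuming the executable is built from bug_add.cu; adapt the names to your setup):
nvcc -g -G -o bug_add bug_add.cu
cuda-gdb ./bug_add
(cuda-gdb) run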
Now, in order to prevent this kind of error, instrument the code to
check the error code returned by each CUDA call.
Since CUDA calls fail silently on the GPU, the error code of every one
of them must be retrieved and checked. As kernel launches do not have a
return value, you can first check for invalid launch arguments with the
error code of cudaPeekAtLastError(), and then check whether errors
occurred during the kernel execution with the error code of
cudaDeviceSynchronize(), which forces kernel completion. Note that most
of the time, a developer will use cudaMemcpy as the synchronization
primitive (the cudaDeviceSynchronize() would then be redundant). In this
case, the cudaMemcpy call can return either errors that occurred during
the kernel execution or errors from the memory copy itself.
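As an illustration, checking a kernel launch could look like the following sketch (the kernel name and launch configuration are hypothetical, not taken from bug_add.cu):
my_kernel<<<nb_blocks, nb_threads>>>(args);
cudaError_t err = cudaPeekAtLastError ();      // invalid launch arguments are reported here
if (err != cudaSuccess)
  fprintf (stderr, "launch error: %s\n", cudaGetErrorString (err));
err = cudaDeviceSynchronize ();                // wait for the kernel and catch execution errors
if (err != cudaSuccess)
  fprintf (stderr, "execution error: %s\n", cudaGetErrorString (err));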
The program should now fail with an explicit error instead of crashing or giving a wrong result.
To lighten the code, we suggest moving the error management
into the following macro:
/** Error checking,
 * taken from https://stackoverflow.com/questions/14038589/what-is-the-canonical-way-to-check-for-errors-using-the-cuda-runtime-api
 */
#define gpuErrCheck(ans) { gpuAssert((ans), __FILE__, __LINE__); }
inline void gpuAssert(cudaError_t code, const char *file, int line, bool abort=true)
{
  if (code != cudaSuccess)
  {
    fprintf(stderr,"GPUassert: %s %s %d\n", cudaGetErrorString(code), file, line);
    if (abort) exit(code);
  }
}
and to wrap each CUDA call like this:
gpuErrCheck(cudaMemcpy(C, gpu_C, nb_LineA * nb_ColB * sizeof(float), cudaMemcpyDeviceToHost));
Your program should still fail, similarly to the previous question.
Finally, fix the program.
First parallel computation kernel: SAXPY
We implement here the linear operation y = ax + y on a
large vector, whose CPU code is:
void saxpy(float *x, float *y, int len, float a){
for (int i = 0; i < len; ++i)
y[i] = a*x[i] + y[i];
}
Starting from saxpy.cu,
- allocate the data on the GPU,
- implement a GPU kernel that processes a single element of an array given as a parameter, the processed element being identified from the current thread indices,
- copy the array to the GPU memory,
- run the saxpy computation kernel so as to process all elements of the vector,
- copy the result from the GPU memory back to the host memory,
- display the first 10 and the last 10 results as a check,
- finally, set up the error management.
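A possible shape for the kernel and its launch is sketched below (variable names are illustrative and the launch assumes a 1D grid; the gpuErrCheck macro is the one from the previous exercise):
__global__ void saxpy_kernel (float *x, float *y, int len, float a)
{
  int i = blockIdx.x * blockDim.x + threadIdx.x;  // global index of the element handled by this thread
  if (i < len)                                    // guard: the last block may go past the end of the vector
    y[i] = a * x[i] + y[i];
}

// Host side (illustrative):
// int threads = 1024;
// int blocks  = (len + threads - 1) / threads;
// saxpy_kernel<<<blocks, threads>>>(gpu_x, gpu_y, len, a);
// gpuErrCheck (cudaPeekAtLastError ());
// gpuErrCheck (cudaDeviceSynchronize ());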
Begin by running your program with a vector size limited to the
block size, i.e. 1024 elements, before switching to a larger one.
Use timers to compare the performance with that obtained on the CPU.
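One possible way to time the GPU part is with CUDA events, as in the sketch below (the saxpy.cu skeleton may already provide its own CPU timers):
cudaEvent_t start, stop;
float ms;
cudaEventCreate (&start);
cudaEventCreate (&stop);
cudaEventRecord (start);
// ... kernel launch to be measured ...
cudaEventRecord (stop);
cudaEventSynchronize (stop);
cudaEventElapsedTime (&ms, start, stop);   // elapsed time in milliseconds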
Square Matrix Multiplication
Here you have to implement the multiplication of square matrices C = A x B. In sgemm.cu, you will find a code skeleton that will allow you to implement and compare the different matrix multiplication strategies we target.
Basic version: implement a version in which each element of C is
computed independently, by a simple sequential traversal of the
corresponding row of A and column of B in global memory.
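A sketch of such a kernel, assuming square N x N matrices stored in row-major order (the names are illustrative; adapt them to the skeleton in sgemm.cu):
__global__ void sgemm_basic (const float *A, const float *B, float *C, int N)
{
  int row = blockIdx.y * blockDim.y + threadIdx.y;
  int col = blockIdx.x * blockDim.x + threadIdx.x;
  if (row < N && col < N) {
    float acc = 0.0f;
    for (int k = 0; k < N; k++)            // walk the row of A and the column of B in global memory
      acc += A[row * N + k] * B[k * N + col];
    C[row * N + col] = acc;
  }
}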
Tiled version: following the explanations given in the course,
implement the tiled version using shared memory.
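For reference, a sketch of the tiled approach, assuming square N x N row-major matrices with N a multiple of TILE_WIDTH (the general case needs additional bound checks):
__global__ void sgemm_tiled (const float *A, const float *B, float *C, int N)
{
  __shared__ float tileA[TILE_WIDTH][TILE_WIDTH];
  __shared__ float tileB[TILE_WIDTH][TILE_WIDTH];
  int row = blockIdx.y * TILE_WIDTH + threadIdx.y;
  int col = blockIdx.x * TILE_WIDTH + threadIdx.x;
  float acc = 0.0f;
  for (int t = 0; t < N / TILE_WIDTH; t++) {
    // each thread loads one element of the A tile and one element of the B tile into shared memory
    tileA[threadIdx.y][threadIdx.x] = A[row * N + t * TILE_WIDTH + threadIdx.x];
    tileB[threadIdx.y][threadIdx.x] = B[(t * TILE_WIDTH + threadIdx.y) * N + col];
    __syncthreads ();                      // wait until the whole tile is loaded
    for (int k = 0; k < TILE_WIDTH; k++)
      acc += tileA[threadIdx.y][k] * tileB[k][threadIdx.x];
    __syncthreads ();                      // wait before overwriting the tiles at the next iteration
  }
  C[row * N + col] = acc;
}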
Compile with nvcc and add the -lcublas option to link the library we use in
the next question.
Compare the execution times of your implementations. Also
compare the execution times obtained with different tile sizes (e.g.
TILE_WIDTH=16 and TILE_WIDTH=32). Note: in order to obtain stable
performance figures, run the computation several times and use the
average execution time for your comparisons.
cuBLAS version: the call to the cublasSgemm
function from the cuBLAS library that performs the matrix multiplication has been
written for you. Measure its execution time, taking care to exclude the library
initialization time (warm up with 5 sgemm calls that you exclude from the time measurements)
and taking the average of several runs.
Compare the execution times obtained with those of the algorithms you wrote.
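The measurement protocol could look like the sketch below; run_cublas_sgemm() and NB_RUNS are hypothetical placeholders standing for the cublasSgemm call already provided in sgemm.cu, and the events are the same as in the previous timing sketch:
for (int i = 0; i < 5; i++)                // warmup: library initialization, excluded from timing
  run_cublas_sgemm ();
cudaDeviceSynchronize ();

cudaEventRecord (start);
for (int i = 0; i < NB_RUNS; i++)          // measured runs
  run_cublas_sgemm ();
cudaEventRecord (stop);
cudaEventSynchronize (stop);
cudaEventElapsedTime (&ms, start, stop);
printf ("average cuBLAS time: %f ms\n", ms / NB_RUNS);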
Back to MPI : Mandelbrot In EasyPAP
NOTE: do not forget to load the MPICH environment if you are using the
3A401 room.
EasyPAP makes MPI easy to use by relieving you of the calls to MPI_Init()/MPI_Finalize(), of the Makefile and of mpirun. To pass parameters to the mpirun command, use the -mpi option followed by a string describing them. Note the debugging option -d M, which displays a window for each process.
./run -k mandel -v variante_mpi -mpi "-np 4" -d M
All processes take part in the computation. After computing its own share, the master process receives the data computed by the other processes and displays the result. The computation is performed in the mandel_compute_mpi() procedure. Communication is carried out by the mandel_refresh_img_mpi() function, which is called automatically by EasyPAP before each display.
Implement the MPI variant by integrating and completing the following code in
mandel.c.
As a first step, you can test the program without having completed the
mandel_refresh_img_mpi() function. In this case, the processes
compute their sub-image area independently, without
synchronisation. You can visually check that each process is
computing its own pixel area.
#ifdef MPI
static int rank, size;
void mandel_init_mpi ()
{
easypap_check_mpi (); // check if MPI was correctly configured
// TODO init rank and size
mandel_init ();
}
static int rankTop (int rank)
{
return 0; // index of the first line to be processed by rank
}
static int rankSize (int rank)
{
return 0; // number of lines to be computed by the process rank
}
void mandel_refresh_img_mpi ()
{
// master receives data
// others send the data
// &cur_img(line,0) is the address of a line in the image
}
////////// MPI basic variant
// Suggested cmdline:
// ./run -k mandel -v mpi -mpi "-np 4" -d M
unsigned mandel_compute_mpi (unsigned nb_iter)
{
for (unsigned it = 1; it <= nb_iter; it++) {
do_tile (0, rankTop (rank), DIM, rankSize (rank), 0);
zoom ();
}
return 0;
}
#endif
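As an illustration only, one possible shape for mandel_refresh_img_mpi() using point-to-point communication is sketched below. It assumes that each process computes rankSize(rank) contiguous lines starting at rankTop(rank), and that the image stores one unsigned value per pixel (as suggested by the &cur_img(line,0) hint above):
void mandel_refresh_img_mpi ()
{
  if (rank == 0) {
    // master: receive each worker's block of lines at the right place in the image
    for (int r = 1; r < size; r++)
      MPI_Recv (&cur_img (rankTop (r), 0), rankSize (r) * DIM, MPI_UNSIGNED,
                r, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
  } else {
    // workers: send the block of lines they have computed to the master
    MPI_Send (&cur_img (rankTop (rank), 0), rankSize (rank) * DIM, MPI_UNSIGNED,
              0, 0, MPI_COMM_WORLD);
  }
}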
Using a tiled version parallelized with OpenMP, write a
variant mandel_compute_mpi_omp(). Also write the functions
mandel_init_mpi_omp() and mandel_refresh_img_mpi_omp().
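A possible skeleton for the compute part, assuming the usual EasyPAP tiling macros TILE_W and TILE_H and that omp.h is included in mandel.c (a sketch, to be adapted):
unsigned mandel_compute_mpi_omp (unsigned nb_iter)
{
  int ytop = rankTop (rank);
  int yend = ytop + rankSize (rank);
  for (unsigned it = 1; it <= nb_iter; it++) {
#pragma omp parallel for collapse(2) schedule(runtime)
    for (int y = ytop; y < yend; y += TILE_H)
      for (int x = 0; x < (int) DIM; x += TILE_W)
        do_tile (x, y, TILE_W, TILE_H, omp_get_thread_num ());
    zoom ();
  }
  return 0;
}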
Replace the point-to-point communication with a collective
communication (a gather here) in mandel_refresh_img_mpi(). Compare the
execution times obtained with those of the point-to-point version.
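A minimal sketch of the gather-based version, assuming DIM is divisible by the number of processes so that every rank owns the same number of lines (the root uses MPI_IN_PLACE so that its own part of the image is not copied onto itself):
void mandel_refresh_img_mpi ()
{
  int count = rankSize (rank) * DIM;       // number of pixels owned by each process
  if (rank == 0)
    MPI_Gather (MPI_IN_PLACE, 0, MPI_UNSIGNED,
                &cur_img (0, 0), count, MPI_UNSIGNED, 0, MPI_COMM_WORLD);
  else
    MPI_Gather (&cur_img (rankTop (rank), 0), count, MPI_UNSIGNED,
                &cur_img (0, 0), count, MPI_UNSIGNED, 0, MPI_COMM_WORLD);
}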