CSC 5001– High Performance Systems

CUDA - Labs

Bootstrap

The lab is to be carried out on the jromero cluster on which you have an account (your login is your name), so here is the configuration to add to your ~/.ssh/config file in order to connect you : Host tsp-client user <your_login> Hostname 157.159.104.229 ServerAliveInterval 60 ServerAliveCountMax 2 ProxyJump <LAB_GATEWAY> Once connected on this frontal via ssh (use -X option in order to forward the graphic windows of EasyPAP), ask for a GPU thanks to slurm job scheduling system : srun --x11 --time=2:00:00 --gres=gpu:1 --pty bash You will be directly connected to the one you registered on. Use nvida-smi if you want to take a look on the machine topology.

Minimal kernel with error management

In bug_add.cu, you can read a CUDA code written without any precautions. Run the code and observe the result. You may not be agree with the 2 + 2 = 0. The objective of this exercice is to highlight the value of protecting your CUDA calls to detect silent errors.

Use the CUDA debugger in order to understand what happens. For this purpose, you need to compile adding the options "-g -G" so that debugging symbols are included and run the program within cuda-gdb.

Now, in order to prevent those error types, instrument the code to check the return error code of calls to CUDA. As CUDA calls on GPU fail with silent error, it is required to retrieve and check the error code of all of them. Since kernel calls do not have a return value, first of all, you can check for invalid launch argument with the error code of cudaPeekAtLastError(); and then, you can check if errors occurred during the kernel execution thanks to the error code of cudaDeviceSynchronize, that forces the kernel completion. Note that most of the time, a developper will use the cudaMemcpy as synchronization primitive (the cudaDeviceSynchronize would be in duplicate then). In this case, the cudaMemcpy call can return either errors which occurred during the kernel execution or those from the memory copy itself. The program should now fail, and no longer crash and give a wrong result.

To lighten the code, we propose you to outsource the error management code thanks to the following macro :

/** Error checking,
 *  taken from https://stackoverflow.com/questions/14038589/what-is-the-canonical-way-to-check-for-errors-using-the-cuda-runtime-api
 */
#define gpuErrCheck(ans) { gpuAssert((ans), __FILE__, __LINE__); }
inline void gpuAssert(cudaError_t code, const char *file, int line, bool abort=true)
{
   if (code != cudaSuccess) 
   {
      fprintf(stderr,"GPUassert: %s %s %d\n", cudaGetErrorString(code), file, line);
      if (abort) exit(code);
   }
}

and wrap each CUDA call like this :

gpuErrCheck(cudaMemcpy(C, gpu_C, nb_LineA * nb_ColB * sizeof(float), cudaMemcpyDeviceToHost));

Your program should still fail, similarly to the previous question.

Finally, fix the program.

First parallel computation kernel: SAXPY

We implement here the linear application y = ax + y on an large vector whose CPU code is:

  void saxpy(float *x, float *y, int len, float a){
    for (int i = 0; i < len; ++i)
      y[i] = a*x[i] + y[i];
  }

Starting from saxpy.cu,

allocate the data on the GPU
implement a GPU kernel that processes an only one element of an array given as a parameter, the processed element being identified thanks to the current thread identifiers,
copy the array to the GPU memory,
run the calculation kernel saxpy in order to treat all elements of the vector,
copy the result from the GPU memory to memory of the host,
display the first 10 and last 10 results for checking.
Finally, set up the error management.

Let begin by running your program with a vector size limited to the block size, i.e. 1024 elements, before switching to a large one.

Compare performances to those obtained using a CPU thanks to timers.

Square Matrix Multiplication

Here you have to implement the multiplication of square matrices C = A x B. In sgemm.cu, you will find a code skeleton that will allow you to implement and compare the different matrix multiplication strategies we target.

Basic version : implement a version with independant computation of each element done by doing a basic consecutive path on all elements of A-line and B-column in global memory.

Tiled version : By following the explanations given in the course, implement the tiled version using the shared memory.

Compile with nvcc compiler and -lcublas option for linking the library we use in the next question.

Compare the execution times of your implementations. Also compare the execution time by varying the tile size (eg when TILE_WIDTH=16 and TILE_WIDTH=32). Note: In order to obtain stable performance, run the computation several times and consider the average execution time for your comparisons.

cuBLAS version : Call on cublasSgemm function from cuBLAS library to process matrix multiplication has been made for you. Measures the execution time taking care to exclude the library initialization time (warmup with 5 sgemm you exclude from the time measurements) and by taking the average of several runs. Compare the execution times obtained with those of the algorithms you wrote.

Back to MPI : Mandelbrot In EasyPAP

NOTE : do not forget to load MPICH environment if you are using the 3A401 room.

EasyPAP makes an easy use of MPI by relieving you of calls to MPI_Init()/MPI_Finalize(), Makefile and to mpirun. To give parameters to the mpirun command, use the option -mpi followed by a string describing them. Note the debugging option -d M to display a window for each process. ./run -k mandel -v variante_mpi -mpi "-np 4" -d

All the processes attend to the computation. After its own one, the master process receives the data computed by the other processes and displays the result. The computation is performed in the mandel_compute_mpi() procedure. Communication is carried out by the mandel_refresh_img_mpi() function, which is called automatically by EasyPAP before each display.

Implement the MPI variant by integrating and completing the following code in mandel.c. In a first step, you can test the program without having completed the function mandel_refresh_img_mpi() function. In this case, the processes calculate their sub-image area independently, without synchronisation. You can visually check that each process is calculating its pixel area. #ifdef MPI static int rank, size; void mandel_init_mpi () { easypap_check_mpi (); // check if MPI was correctly configured // TODO init rank and size mandel_init (); } static int rankTop (int rank) { return 0; // index of the first line to be processed by rank } static int rankSize (int rank) { return 0; // number of lines to be computed by the process rank } void mandel_refresh_img_mpi () { // master receives data // others send the data // &cur_img(line,0) is the address of a line in the image } ////////// MPI basic variant // Suggested cmdline: // ./run -k mandel -v mpi -mpi "-np 4" -d M unsigned mandel_compute_mpi (unsigned nb_iter) { for (unsigned it = 1; it <= nb_iter; it++) { do_tile (0, rankTop (rank), DIM, rankSize (rank), 0); zoom (); } return 0; } #endif

Using a tiled version parallelized with OpenMP, write a variant mandel_compute_mpi_omp(). Also write the functions mandel_init_mpi_omp() and mandel_refresh_img_mpi_omp().

Replace point-to-point communication by collective communication (gather here) in mandel_refresh_img_mpi(). Compare the execution times obtained with the point-to-point version.