CSC 5001– High Performance Systems

Portail informatique

CUDA - Labs

Bootstrap

The lab has to be done on the Arcadia GPU cluster. Here is the configuration to add to your ~/.ssh/config file in order to connect to it:

Host tsp-client
    user <your_login>
    Hostname IP_ADDRESS
    ServerAliveInterval 60
    ServerAliveCountMax 2
    ProxyJump <LAB_GATEWAY>

In the config established during the MPI lab, LAB_GATEWAY is normally "tsp". IP_ADDRESS is given during the session by your teacher. Once connected to this front-end node via ssh, reserve a GPU through the Slurm job scheduling system:

srun --time=1:00:00 --gres=gpu:1 --cpus-per-task=2 --mem=8G --pty bash

If a GPU is available, you will be directly connected to the node you reserved. Use nvidia-smi if you want to take a look at the machine topology.

Minimal kernel with error management

In bug_add.cu, you can read a CUDA code written without any precautions. Run the code and observe the result. You may not agree that 2 + 2 = 0. The objective of this exercise is to highlight the value of protecting your CUDA calls in order to detect silent errors.

Use the CUDA debugger in order to understand what happens. For this purpose, compile with the "-g -G" options so that debugging symbols are included, and run the program within cuda-gdb.
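For example (the binary name is illustrative):

nvcc -g -G -o bug_add bug_add.cu
cuda-gdb ./bug_add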

Now, in order to prevent these types of errors, instrument the code to check the error code returned by each CUDA call. Since CUDA calls on the GPU fail silently, you need to retrieve and check the error code of all of them. As kernel launches do not return a value, first check for invalid launch arguments with the error code of cudaPeekAtLastError(), and then check whether errors occurred during the kernel execution with the error code of cudaDeviceSynchronize(), which forces the kernel to complete. Note that most of the time, a developer will use cudaMemcpy as the synchronization primitive (cudaDeviceSynchronize would then be redundant). In this case, the cudaMemcpy call can return either errors which occurred during the kernel execution or those from the memory copy itself. The program should now fail explicitly, instead of silently producing a wrong result.
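For instance, a kernel launch could be checked by hand like this (a sketch; the kernel name and launch configuration are illustrative):

add<<<1, 1>>>(gpu_a, gpu_b, gpu_c);
cudaError_t err = cudaPeekAtLastError();       // detects invalid launch configuration or arguments
if (err != cudaSuccess)
    fprintf(stderr, "launch error: %s\n", cudaGetErrorString(err));
err = cudaDeviceSynchronize();                 // forces completion and reports execution errors
if (err != cudaSuccess)
    fprintf(stderr, "execution error: %s\n", cudaGetErrorString(err));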

To lighten the code, we suggest you factor out the error-management code into the following macro:
/** Error checking,
 * taken from https://stackoverflow.com/questions/14038589/what-is-the-canonical-way-to-check-for-errors-using-the-cuda-runtime-api
 */
#define gpuErrCheck(ans) { gpuAssert((ans), __FILE__, __LINE__); }
inline void gpuAssert(cudaError_t code, const char *file, int line, bool abort=true)
{
    if (code != cudaSuccess)
    {
        fprintf(stderr, "GPUassert: %s %s %d\n", cudaGetErrorString(code), file, line);
        if (abort)
            exit(code);
    }
}
and wrap each CUDA call like this :
gpuErrCheck(cudaMemcpy(C, gpu_C, nb_LineA * nb_ColB * sizeof(float), cudaMemcpyDeviceToHost));
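Since a kernel launch has no return value, it cannot be wrapped directly; a usual pattern (sketch, kernel name illustrative) is to wrap the two checks that follow it:

add<<<1, 1>>>(gpu_a, gpu_b, gpu_c);
gpuErrCheck(cudaPeekAtLastError());       // invalid launch arguments
gpuErrCheck(cudaDeviceSynchronize());     // errors raised during kernel execution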
Your program should still fail, similarly to the previous question.

Finally, fix the program.

First parallel computation kernel: SAXPY

We implement here the operation y = ax + y on a large vector, whose CPU code is:
void saxpy(float *x, float *y, int len, float a)
{
    for (int i = 0; i < len; ++i)
        y[i] = a * x[i] + y[i];
}

Starting from saxpy.cu,
  • allocate the data on the GPU,
  • implement a GPU kernel in which each thread processes a single element of the array given as a parameter, the processed element being identified from the thread and block indices (a sketch is given after this list),
  • copy the array to the GPU memory,
  • launch the saxpy computation kernel so that all elements of the vector are processed,
  • copy the result from the GPU memory back to the host memory,
  • display the first 10 and last 10 results for checking.
  • Finally, set up the error management.
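As a reminder of the pattern seen in the course, here is a minimal sketch of what such a kernel can look like (names are illustrative, and a 1D grid is assumed):

__global__ void saxpy_kernel(float *x, float *y, int len, float a)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // global index: one element per thread
    if (i < len)                                     // guard threads beyond the end of the vector
        y[i] = a * x[i] + y[i];
}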

Begin by running your program with a vector size limited to the block size, i.e. 1024 elements, before switching to a larger one.
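For vectors larger than one block, the number of blocks has to be rounded up so that every element is covered; a possible launch (sketch, reusing the illustrative names above):

int block_size = 1024;
int num_blocks = (len + block_size - 1) / block_size;             // ceiling division
saxpy_kernel<<<num_blocks, block_size>>>(gpu_x, gpu_y, len, a);   // gpu_x, gpu_y are device pointers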

Using timers, compare the performance with that obtained on the CPU.
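On the GPU side, one way to obtain such timings is with CUDA events (a sketch, reusing the illustrative names above; cudaEventElapsedTime returns milliseconds):

cudaEvent_t start, stop;
gpuErrCheck(cudaEventCreate(&start));
gpuErrCheck(cudaEventCreate(&stop));
gpuErrCheck(cudaEventRecord(start));
saxpy_kernel<<<num_blocks, block_size>>>(gpu_x, gpu_y, len, a);
gpuErrCheck(cudaEventRecord(stop));
gpuErrCheck(cudaEventSynchronize(stop));     // wait until the stop event has been reached
float elapsed_ms = 0.0f;
gpuErrCheck(cudaEventElapsedTime(&elapsed_ms, start, stop));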

Square Matrix Multiplication

Here you have to implement the multiplication of square matrices C = A x B. In sgemm.cu, you will find a code skeleton that will allow you to implement and compare the different matrix multiplication strategies we target.

Basic version: implement a version in which each element of C is computed independently, by a straightforward traversal of the corresponding row of A and column of B in global memory.
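A possible structure for this kernel is sketched below (row-major storage is assumed; the skeleton in sgemm.cu may use different names):

__global__ void sgemm_basic(const float *A, const float *B, float *C, int n)
{
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < n && col < n) {
        float acc = 0.0f;
        for (int k = 0; k < n; ++k)                   // walk the row of A and the column of B
            acc += A[row * n + k] * B[k * n + col];
        C[row * n + col] = acc;
    }
}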

Tiled version: following the explanations given in the course, implement the tiled version using shared memory.
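The shared-memory scheme from the course loads one tile of A and one tile of B at a time; below is a sketch, assuming row-major storage and a matrix size that is a multiple of TILE_WIDTH:

#define TILE_WIDTH 16
__global__ void sgemm_tiled(const float *A, const float *B, float *C, int n)
{
    __shared__ float tileA[TILE_WIDTH][TILE_WIDTH];
    __shared__ float tileB[TILE_WIDTH][TILE_WIDTH];
    int row = blockIdx.y * TILE_WIDTH + threadIdx.y;
    int col = blockIdx.x * TILE_WIDTH + threadIdx.x;
    float acc = 0.0f;
    for (int t = 0; t < n / TILE_WIDTH; ++t) {
        // each thread loads one element of the current tile of A and one of B
        tileA[threadIdx.y][threadIdx.x] = A[row * n + t * TILE_WIDTH + threadIdx.x];
        tileB[threadIdx.y][threadIdx.x] = B[(t * TILE_WIDTH + threadIdx.y) * n + col];
        __syncthreads();                              // wait until both tiles are loaded
        for (int k = 0; k < TILE_WIDTH; ++k)
            acc += tileA[threadIdx.y][k] * tileB[k][threadIdx.x];
        __syncthreads();                              // wait before the tiles are overwritten
    }
    C[row * n + col] = acc;
}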

Compile with the nvcc compiler, adding the -lcublas option to link the library used in the next question.
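For example (the binary name is illustrative):

nvcc -O3 -o sgemm sgemm.cu -lcublas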

Compare the execution times of your implementations. Also compare execution times as the tile size varies (e.g. TILE_WIDTH=16 and TILE_WIDTH=32). Note: in order to obtain stable performance, run the computation several times and consider the average execution time for your comparisons.

cuBLAS version: the call to the cublasSgemm function from the cuBLAS library that performs the matrix multiplication has already been written for you. Measure its execution time, taking care to exclude the library initialization time (do a warmup of 5 sgemm calls that you exclude from the time measurements) and averaging over several runs. Compare the execution times obtained with those of the algorithms you wrote.
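A possible measurement structure (sketch; the handle and pointer names are assumptions, and the cublasSgemm call shown stands for the one already provided in the skeleton):

const float alpha = 1.0f, beta = 0.0f;
for (int i = 0; i < 5; ++i)                           // warmup, excluded from the measurements
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                &alpha, gpu_A, n, gpu_B, n, &beta, gpu_C, n);
gpuErrCheck(cudaDeviceSynchronize());

int runs = 10;
cudaEvent_t start, stop;
gpuErrCheck(cudaEventCreate(&start));
gpuErrCheck(cudaEventCreate(&stop));
gpuErrCheck(cudaEventRecord(start));
for (int i = 0; i < runs; ++i)
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                &alpha, gpu_A, n, gpu_B, n, &beta, gpu_C, n);
gpuErrCheck(cudaEventRecord(stop));
gpuErrCheck(cudaEventSynchronize(stop));
float total_ms = 0.0f;
gpuErrCheck(cudaEventElapsedTime(&total_ms, start, stop));
float average_ms = total_ms / runs;                   // average over the timed runs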