CUDA - Labs
Minimal kernel with Google Colab : An addition
All the exercises of this module are to be done in the Google Colab environment. So, start by logging in on the Google Colab webpage.
Google Colab allows you to write and execute code in an interactive environment called a Colab notebook. You can write different kinds of code, including CUDA code. To tell Google Colab that you want to use a GPU, you have to change the default runtime in the menu Runtime > Change Runtime Type and set Runtime type to Python 3 and Hardware accelerator to GPU.
You will find here a notebook in which:
- the first cell retrieves and installs the nvcc plugin for Jupyter, nvcc being the CUDA compiler;
- the second cell loads the nvcc plugin;
- the third cell corresponds to our CUDA program.
Read the program. Compile it with the nvcc compiler. Then run the program using the "play" buttons to the left of the code cells.
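The notebook's program is not reproduced here, but a minimal addition kernel is along these lines (a sketch, not necessarily identical to the code in the notebook):

```cuda
#include <cstdio>

// Minimal kernel: a single GPU thread adds two integers.
__global__ void add(int a, int b, int *c) {
    *c = a + b;
}

int main() {
    int *d_c, h_c;                              // device and host results
    cudaMalloc((void **)&d_c, sizeof(int));     // allocate GPU memory
    add<<<1, 1>>>(2, 2, d_c);                   // launch 1 block of 1 thread
    cudaMemcpy(&h_c, d_c, sizeof(int), cudaMemcpyDeviceToHost);
    printf("2 + 2 = %d\n", h_c);
    cudaFree(d_c);
    return 0;
}
```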
Minimal kernel with Error management
CUDA calls on the GPU fail silently. In this notebook, we highlight the value of protecting your CUDA calls in order to detect errors:
In the first section, named "Raw code", you can read CUDA code written without any precautions. Run the code and observe the result. You may not agree with the result 2 + 2 = 0.
In the second section, we show how to debug this code with the cuda-gdb debugger. For this purpose, you need to:
- compile with the options "-g -G" so that debugging symbols are included. To do this, you need to save your code to a file by beginning the cell with "%%writefile file_name.cu" (instead of "%%cu") and compile it explicitly in a separate cell. Note that in a notebook, shell commands start with !.
- write to a file the sequence of instructions to be followed by the debugger. Indeed, cuda-gdb is interactive (you are expected to type commands as you go along), but running programs in the Colab environment is not. Typical commands would go like this:
- set the debugger up to check lots of possible errors:
- memory checks : memcheck on,
- stop in case of API failures : api_failures stop,
- stop on exceptions : catch throw,
- run the program (possibly with command line options) : r option1 option2 ,
- show the kernel call stack (GPU) : bt,
- print all local variables : info locals,
- switch to the host thread : thread 1
- and show the host program call stack (CPU) : bt.
- call the debugger with your program and execute the commands from debug_instructions.txt. If your program terminates normally, cuda-gdb will complain that there is no stack (since the program has finished).
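As a sketch, the instruction file can be written from a notebook cell using the commands listed above (the exact "set cuda" syntax may vary with your cuda-gdb version, and the program name my_program below is a placeholder):

```
%%writefile debug_instructions.txt
set cuda memcheck on
set cuda api_failures stop
catch throw
r
bt
info locals
thread 1
bt
```

Then, in another cell, run the debugger in batch mode: !cuda-gdb -batch -x debug_instructions.txt ./my_program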
Note: if you use printf to debug, be sure to flush the buffer by adding a line break at the end of the string. This applies to any C program. Example: printf("Works up to here\n");. Nevertheless, the interface between the Jupyter notebook and the executed program is a little fragile, so if your program crashes, there might not be ANY output at all, even if you have printf calls everywhere.
In the third section, "Code with error management", we instrument the code to check the error code returned by each CUDA call. The program should now fail explicitly (instead of crashing and giving a wrong result).
As CUDA calls on the GPU fail silently, you must retrieve and check the error code of all of them. Since kernel launches do not have a return value, you can first check for invalid launch arguments with the error code of cudaPeekAtLastError(), and then check whether errors occurred during the kernel execution through the error code of cudaDeviceSynchronize(), which forces kernel completion. Note that most of the time, a developer will use cudaMemcpy as the synchronization primitive (cudaDeviceSynchronize would then be redundant). In this case, the cudaMemcpy call can return either errors that occurred during the kernel execution or errors from the memory copy itself.
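A common way to wrap these checks is a small macro, sketched below (the macro name CUDA_CHECK is ours, not from the notebook):

```cuda
#include <cstdio>
#include <cstdlib>

// Abort with a readable message if a CUDA call returned an error.
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err = (call);                                     \
        if (err != cudaSuccess) {                                     \
            fprintf(stderr, "CUDA error: %s at %s:%d\n",              \
                    cudaGetErrorString(err), __FILE__, __LINE__);     \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Usage after a kernel launch:
//   my_kernel<<<blocks, threads>>>(...);
//   CUDA_CHECK(cudaPeekAtLastError());    // invalid launch arguments
//   CUDA_CHECK(cudaDeviceSynchronize());  // errors during kernel execution
```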
In the last section, we have moved the error-management code into a separate file so that you can reuse it more easily in the rest of the exercises.
Notice that the first line of the cell has changed. Now each cell is saved as a file, and compilation and execution are launched explicitly in two additional cells with shell commands. Note that in a notebook, shell commands start with !.
Last but not least, it remains for you to fix the problem.
SAXPY
The goal here is to perform a SAXPY operation, i.e. multiply the vector X by a constant A, then add the result to the vector Y: Y = A * X + Y.
In a classic program with a single CPU, you have to use a loop and walk through all the elements of the two vectors one after the other. With the GPU, there will be only one operation per thread.
To do so, you need to:
- initialize the two vectors X and Y,
- transfer them to the GPU memory,
- perform the computation on the GPU,
- copy the result back to main memory.
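The kernel itself can be sketched as follows (a sketch under the "one operation per thread" scheme described above; the names are ours):

```cuda
// SAXPY kernel: each thread computes one element of Y = A * X + Y.
__global__ void saxpy(int n, float a, const float *x, float *y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n)                                      // guard: grid may exceed n
        y[i] = a * x[i] + y[i];
}

// Launch example, covering n elements with blocks of 256 threads:
//   saxpy<<<(n + 255) / 256, 256>>>(n, a, d_x, d_y);
```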
Matrix product
The goal of this exercise is to obtain from the GPU the product of two square matrices, C = A×B.
The most efficient matrix-product algorithms decompose the computation into tiles. The steps of the GPU algorithm are described at the end of this assignment. More detailed explanations will be given in class.
Steps of the tiled matrix-product algorithm:
First, write a version that only uses global memory, with two-dimensional blocks of size 32. To help you, you can rely on the following skeleton (skeleton, Makefile), which initializes the matrices. Compare its execution time with that of an OpenMP version that you must write.
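The global-memory version can be sketched as follows (a sketch; the skeleton's actual names and layout may differ, and a row-major layout for N×N matrices is assumed):

```cuda
// One thread computes one element of C, reading A and B from global memory.
__global__ void matmul(const float *A, const float *B, float *C, int N) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < N && col < N) {
        float sum = 0.0f;
        for (int k = 0; k < N; ++k)
            sum += A[row * N + k] * B[k * N + col];
        C[row * N + col] = sum;
    }
}

// Launch with 32x32 blocks, as requested:
//   dim3 block(32, 32);
//   dim3 grid((N + 31) / 32, (N + 31) / 32);
//   matmul<<<grid, block>>>(d_A, d_B, d_C, N);
```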
Write the tiled algorithm, then compare the execution times of the two implementations (with and without tiles). Also compare the execution times with TILE_WIDTH=16 and TILE_WIDTH=32.
- Each thread of each block computes one element of matrix C.
- Until the computation is finished, the thread updates a temporary variable that holds the result of the computation.
- In each block, one tile of matrix A and one tile of matrix B are stored in shared memory (in two-dimensional arrays [TILE_WIDTH][TILE_WIDTH]).
- Each thread copies its element of matrices A and B to shared memory, then performs the computation and accumulates it in the temporary variable.
- The temporary value is copied back to global memory, into matrix C, once all the threads of the block have finished.
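The steps above can be sketched as the following kernel (a sketch assuming N is a multiple of TILE_WIDTH, to keep it short):

```cuda
#define TILE_WIDTH 32

// Tiled matrix product: tiles of A and B are staged in shared memory.
__global__ void matmul_tiled(const float *A, const float *B, float *C, int N) {
    __shared__ float tileA[TILE_WIDTH][TILE_WIDTH];
    __shared__ float tileB[TILE_WIDTH][TILE_WIDTH];

    int row = blockIdx.y * TILE_WIDTH + threadIdx.y;
    int col = blockIdx.x * TILE_WIDTH + threadIdx.x;
    float acc = 0.0f;  // temporary variable holding the partial result

    for (int t = 0; t < N / TILE_WIDTH; ++t) {
        // Each thread copies one element of A and one of B to shared memory.
        tileA[threadIdx.y][threadIdx.x] = A[row * N + t * TILE_WIDTH + threadIdx.x];
        tileB[threadIdx.y][threadIdx.x] = B[(t * TILE_WIDTH + threadIdx.y) * N + col];
        __syncthreads();  // wait until the whole tile is loaded

        for (int k = 0; k < TILE_WIDTH; ++k)
            acc += tileA[threadIdx.y][k] * tileB[k][threadIdx.x];
        __syncthreads();  // wait before the tiles are overwritten
    }
    C[row * N + col] = acc;  // write back to global memory
}
```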
Compare the execution time of the code you wrote in question 2 with the execution time of the code provided here, which uses cuBLAS. Note that to compile this code, you must add -lcublas to the compilation command.
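For reference, the call at the heart of such a cuBLAS code looks roughly like this (a sketch; cuBLAS stores matrices in column-major order, and the function name gemm is ours):

```cuda
#include <cublas_v2.h>

// Sketch: C = 1.0 * A * B + 0.0 * C with cublasSgemm, for N x N matrices.
// d_A, d_B, d_C are assumed already allocated and filled on the device.
void gemm(cublasHandle_t handle, const float *d_A, const float *d_B,
          float *d_C, int N) {
    const float alpha = 1.0f, beta = 0.0f;
    // cuBLAS is column-major; with row-major data a common trick is to
    // compute B*A, which yields C^T in column-major, i.e. C in row-major.
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, N, N, N,
                &alpha, d_B, N, d_A, N, &beta, d_C, N);
}
```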