CSC 5001 – High Performance Systems


Inside MPI - Lab

First, download the tarball mpi-4.tgz that contains various programs for this lab.

bibw

Analyze the program bibw.c to understand what it does. Then, run the program and observe its behavior.

The program only works for small messages. Find the message size that causes a problem. Does the problem occur in the same way when you run the program with 2 MPI ranks on two machines, and with 2 MPI ranks on one machine?
$ make
mpicc bibw.c -o bibw
$ mpirun -np 2 ./bibw
#iter   size    latency bandwidth
300     1       2.612905
300     2       8.897391
300     4       18.367862
300     8       36.877406
300     16      73.910115
300     32      148.103009
300     64      285.321544
300     128     524.774767
300     256     1162.456294
300     512     2232.728515
300     1024    3974.371001
300     2048    6096.235988
300     4096    7895.041185
300     8192    9690.479171
300     16384   8975.177750
300     32768   10183.911576
150     65536   8124.230378
^C

The program does not work with messages larger than 64 KB.

This is due to this part of the program, where both MPI ranks first send a message and then receive the other rank's message:

for(i = 0; i < WARMUP; i++) {
    MPI_Send(main_buffer, len, MPI_CHAR, dest, 0, MPI_COMM_WORLD);
    MPI_Recv(main_buffer, len, MPI_CHAR, dest, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
}

Since MPI uses a rendez-vous protocol for large messages, both MPI ranks are stuck in MPI_Send, waiting for the other rank to reach MPI_Recv.

This can be fixed by sending the message with a non-blocking send:

for(i = 0; i < WARMUP; i++) {
    MPI_Request req;
    MPI_Isend(main_buffer, len, MPI_CHAR, dest, 0, MPI_COMM_WORLD, &req);
    MPI_Recv(main_buffer, len, MPI_CHAR, dest, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    MPI_Wait(&req, MPI_STATUS_IGNORE);
}
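An alternative fix is to let MPI perform the exchange itself with MPI_Sendrecv_replace, which sends and receives through the same buffer in a single call and therefore cannot deadlock. This is only a sketch based on the loop above (reusing main_buffer, len, dest and WARMUP), not the lab's reference solution:

for(i = 0; i < WARMUP; i++) {
    /* Send main_buffer to the peer and overwrite it with the peer's message.
       MPI orders the two operations internally, so no deadlock can occur. */
    MPI_Sendrecv_replace(main_buffer, len, MPI_CHAR,
                         dest, 0,              /* destination and send tag */
                         dest, 0,              /* source and receive tag */
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);
}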

pingpong

Analyze the program pingpong.c, and run it.

Explain the behavior of the program for large messages.

$ mpirun -np 2 ./pingpong
#iter   size    latency bandwidth
10      1       0.218350
10      2       1.338900
10      4       0.181100
10      8       0.542750
10      16      0.180300
10      32      0.212100
10      64      0.623850
10      128     0.209750
10      256     0.192150
10      512     0.568850
10      1024    0.229150
10      2048    0.286050
10      4096    2.289050
10      8192    4.014250
10      16384   8.560750
10      32768   16.012000
10      65536   29.484350
10      131072  50046.859050
10      262144  50046.776750
10      524288  50050.497650
10      1048576 50064.942150
10      2097152 50139.030150
10      4194304 50336.753450
10      8388608 50736.090900

The program measures the duration of MPI_Send. Small messages are sent in eager mode, so MPI_Send returns quickly. For large messages, MPI uses a rendez-vous protocol, which forces the sending process to synchronize with the receiving process. Since the receiving rank is busy computing, it does not "see" the rendez-vous request and only answers it when it reaches the MPI_Wait.
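The communication pattern at play can be sketched as follows. This is only an illustration of the likely structure: the buffer names and the compute() placeholder are assumptions, not the actual pingpong.c source.

/* rank 0: times each send */
double start = MPI_Wtime();
MPI_Send(buffer, len, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
double duration = MPI_Wtime() - start;

/* rank 1: posts the receive, then computes without calling MPI */
MPI_Request req;
MPI_Irecv(buffer, len, MPI_CHAR, 0, 0, MPI_COMM_WORLD, &req);
compute();                          /* long computation, during which MPI makes no progress */
MPI_Wait(&req, MPI_STATUS_IGNORE);  /* the rendez-vous request is only answered here */

For small (eager) messages, MPI_Send on rank 0 returns as soon as the data is buffered. For large messages it returns only once rank 1 has answered the rendez-vous inside MPI_Wait, which is why the measured time jumps by several orders of magnitude starting at 128 KB messages.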

stencil

Analyze and run (with 2 MPI ranks) the program stencil_mpi.c while varying the value of N.

  • With large values of N, the program stalls. Find the value of N that causes the problem.
  • Compute the size of MPI messages for this value N.
  • Explain the cause of the problem, and fix the program.
For some values of N, the application works fine:
$ mpirun -np 2 ./stencil_mpi 8695
Initialization (problem size: 8695)
Start computing (50 steps)
STEP 0...
STEP 1...
STEP 2...
STEP 3...
For larger values, the application is stuck during the first step:
$ mpirun -np 2 ./stencil_mpi 8696
Initialization (problem size: 8696)
Start computing (50 steps)
STEP 0...
This is due to the calls to MPI_Send during the computation:
for(i=1; i < N-1; i++) {
    for(j=1; j < N-1; j++) {
        next_step[i][j] = (cur_step[i-1][j] + cur_step[i+1][j] +
                           cur_step[i][j-1] + cur_step[i][j+1] +
                           cur_step[i][j]) / 5;
    }
    /* Send data as soon as possible */
    if(i == 1 && comm_rank > 0) {
        /* there's an upper neighbour */
        MPI_Send(next_step[1], N, MPI_DOUBLE, comm_rank-1, 0, MPI_COMM_WORLD);
    }
    if(i == N-2 && comm_rank < comm_size-1) {
        /* there's a lower neighbour */
        MPI_Send(next_step[N-2], N, MPI_DOUBLE, comm_rank+1, 0, MPI_COMM_WORLD);
    }
}
Each of these messages contains one row of the domain, i.e. N doubles: for N = 8696 this is 8696 × 8 = 69568 bytes (about 68 KB). For messages of this size, MPI uses a rendez-vous protocol, so both ranks block in MPI_Send, waiting for matching receives that are only posted after the computation loop. This can be fixed by using non-blocking sends:
MPI_Request req1, req2;
for(i=1; i < N-1; i++) {
    for(j=1; j < N-1; j++) {
        next_step[i][j] = (cur_step[i-1][j] + cur_step[i+1][j] +
                           cur_step[i][j-1] + cur_step[i][j+1] +
                           cur_step[i][j]) / 5;
    }
    /* Send data as soon as possible */
    if(i == 1 && comm_rank > 0) {
        /* there's an upper neighbour */
        MPI_Isend(next_step[1], N, MPI_DOUBLE, comm_rank-1, 0, MPI_COMM_WORLD, &req1);
    }
    if(i == N-2 && comm_rank < comm_size-1) {
        /* there's a lower neighbour */
        MPI_Isend(next_step[N-2], N, MPI_DOUBLE, comm_rank+1, 0, MPI_COMM_WORLD, &req2);
    }
}
DEBUG_PRINTF("rank #%d: compute done\n", comm_rank);

if(comm_rank > 0) {
    /* there's an upper neighbour */
    MPI_Recv(next_step[0], N, MPI_DOUBLE, comm_rank-1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
}
if(comm_rank < comm_size-1) {
    /* there's a lower neighbour */
    MPI_Recv(next_step[N-1], N, MPI_DOUBLE, comm_rank+1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
}
if(comm_rank > 0) {
    MPI_Wait(&req1, MPI_STATUS_IGNORE);
}
if(comm_rank < comm_size-1) {
    MPI_Wait(&req2, MPI_STATUS_IGNORE);
}

MPI+OpenMP

  • Parallelize the program stencil_mpi.c with OpenMP.
  • Since you are mixing MPI with threads, you should initialize MPI properly (see the sketch after this list).
  • Your program may run fine while still being incorrect (because of a race condition, for example).
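A possible structure is sketched below. It is one reasonable way of doing it, not the reference solution; it reuses the variables of stencil_mpi.c shown above and assumes the code is compiled with OpenMP support (e.g. mpicc -fopenmp). With MPI_THREAD_FUNNELED, only the thread that called MPI_Init_thread may call MPI, so the communication is moved after the parallel loop instead of being issued "as soon as possible" inside it:

int provided;
MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
if(provided < MPI_THREAD_FUNNELED) {
    /* the MPI library cannot guarantee the requested level of thread support */
    MPI_Abort(MPI_COMM_WORLD, 1);
}

/* ... */

/* Parallelize the computation of one step; i is private by default, j must be made private */
#pragma omp parallel for private(j)
for(i=1; i < N-1; i++) {
    for(j=1; j < N-1; j++) {
        next_step[i][j] = (cur_step[i-1][j] + cur_step[i+1][j] +
                           cur_step[i][j-1] + cur_step[i][j+1] +
                           cur_step[i][j]) / 5;
    }
}

/* Communication is performed by the master thread only, once the parallel loop is done */
if(comm_rank > 0) {
    MPI_Isend(next_step[1], N, MPI_DOUBLE, comm_rank-1, 0, MPI_COMM_WORLD, &req1);
}
if(comm_rank < comm_size-1) {
    MPI_Isend(next_step[N-2], N, MPI_DOUBLE, comm_rank+1, 0, MPI_COMM_WORLD, &req2);
}

If the MPI calls were kept inside the parallel loop, they could be executed by any thread, which would require the MPI_THREAD_MULTIPLE support level and protection of req1/req2 against concurrent access.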

MPI+CUDA

  • Parallelize the program stencil_mpi.c with CUDA.