Operating systems

Architecture

Branch prediction

Download the archive branch_prediction.tgz and extract it. Study it program branch_prediction.c.

Run (several times) the program with argument 0, then with argument 1.

Use perf stat in order to measure the number of mispredicted branches:

$ perf stat ./branch_prediction 0

In order to collect performance counters with perf, you may have to set /proc/sys/kernel/perf_event_paranoid to -1:

trahay $ perf stat ./branch_prediction 1
is random is set
100000000 iterations in 3978.398000 ms
result=199999998

 Performance counter stats for './branch_prediction 1':

   <not supported>      cpu-cycles:u                     #      nan GHz  cycles_frequency     

       3,980508175 seconds time elapsed

       3,943397000 seconds user
       0,003975000 seconds sys

trahay $ sudo su
root $ echo -1 > /proc/sys/kernel/perf_event_paranoid
is random is set
100000000 iterations in 3754.591000 ms
result=199999998

 Performance counter stats for './branch_prediction 1':

               746      context-switches                 #    200,0 cs/sec  cs_per_second     
                 3      cpu-migrations                   #      0,8 migrations/sec  migrations_per_second
                71      page-faults                      #     19,0 faults/sec  page_faults_per_second
          3 729,88 msec task-clock                       #      1,0 CPUs  CPUs_utilized       
        53 520 012      branch-misses                    #      2,1 %  branch_miss_rate         (88,92%)
     2 552 753 394      branches                         #    684,4 M/sec  branch_frequency     (88,92%)
     5 274 464 379      cpu-cycles                       #      1,4 GHz  cycles_frequency       (88,71%)
    10 160 172 118      instructions                     #      1,9 instructions  insn_per_cycle  (88,93%)
                        TopdownL1                        #      2,2 %  tma_backend_bound      
                                                         #     43,5 %  tma_bad_speculation      (88,93%)
                                                         #     21,3 %  tma_frontend_bound       (77,86%)
                                                         #     32,9 %  tma_retiring             (88,93%)

       3,757173081 seconds time elapsed

       3,721874000 seconds user
       0,007961000 seconds sys
	    

Compare the execution times, and the number of mispredicted branches. How does this number relates to the number of iterations ?

SIMD

Download the archive simd.tgz and extract it.

Study simd.c. Run (several times) the program by varying the argument from 0 to 2. Compare the execution times.

Some computers have heterogeneous cores, that means that some cores run at a lower frequency than others, and measuring the performance of a program may be complex. In this case, you should bind the simd program to a specific core (or set of cores) by running taskset --cpu-list 0 ./simd 0.

Let's play with the caches

Download the archive cache.tgz and extract it.

Cache lines

Study and run the cache_line.c program.

Estimate the number of memory accesses required for this program. Run the program with valgrind --tool=cachegrind. Is the program making efficient use of processor caches? How does the number of cache misses relates to the matrix size ?

If Valgring fails to collect cache misses counts, you may try to use perf instead:

perf stat -e cache-misses ./cache_line

The event (here cache-misses depends on your CPU micro-architecture. You can find the list of available performance counter by running perf list

Modify the program to reduce the number of cache misses.

False sharing

Run the lstopo command (included in the hwloc package) to see the number of cores in your machine as well as the different cache memory available.

Run the cache program by varying the placement of the threads and compare the execution times.

Modify the program so that the performance is the same for any thread placement. Estimate the size of the cache lines on the machine you are using.

Hyper-threading

Download the archive smt.tgz and extract it. Run the program by varying the placement of threads as well as the type of calculation unit exploited.

Optimization of a program

Download the archive program_to_optimize.tgz and extract it.

This program creates 4 threads that process a set of jobs. The result of each job is stored in an event structure. When all the jobs of a thread have been processed, a function (analyze_events()) analyzes the results.

Performance analysis

Compile the program without compiler optimization (with option -O0) and compare its execution time with the version optimized by the compiler (option -O3). The aim of the exercise is to optimize the program by hand in order to obtain performance approaching the performance of the version optimized by the compiler.

Improving the use of caches

Analyze the program and modify it to fix the cache problem.

There is a false sharing problem . We can correct it by adding padding.

program_to_optimize.corrige.5.2.c

Vectorization

Modify the analyze_events function to to vectorize (using AVX instructions by example) the loop.

You can help yourself for this from the guide Intel intrinsics which documents the intrinsics available according to your processor instruction sets (SSE, AVX, etc.)

program_to_optimize.corrige.5.3.c