Operating systems


Architecture

Branch prediction

Download the archive branch_prediction.tgz and extract it. Study the program branch_prediction.c.

Run the program several times with argument 0, then with argument 1.

Use perf stat in order to measure the number of mispredicted branches:

$ perf stat ./branch_prediction 0

Compare the execution times and the number of mispredicted branches. How does this number relate to the number of iterations?
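The contents of branch_prediction.c are not reproduced here, so the following is only a hypothetical sketch of the kind of loop that makes the effect visible. The names count_at_least, compare_int and fill are assumptions for illustration. The loop contains a data-dependent branch: on sorted data the branch outcome changes only once, while on unsorted data it is close to random.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical helper, not taken from branch_prediction.c:
 * counts the elements of data[] that are >= threshold.
 * The `if` is a data-dependent branch: its outcome is easy to predict
 * when data[] is sorted and close to random when data[] is not. */
size_t count_at_least(const int *data, size_t n, int threshold)
{
    size_t count = 0;
    for (size_t i = 0; i < n; i++) {
        if (data[i] >= threshold)
            count++;
    }
    return count;
}

/* Comparison callback for qsort(): ascending order of int. */
int compare_int(const void *a, const void *b)
{
    int x = *(const int *)a;
    int y = *(const int *)b;
    return (x > y) - (x < y);
}

/* Fills data[] with pseudo-random values in [0, 256).
 * Pass sorted != 0 to sort them, which makes the branch predictable. */
void fill(int *data, size_t n, int sorted)
{
    for (size_t i = 0; i < n; i++)
        data[i] = rand() % 256;
    if (sorted)
        qsort(data, n, sizeof data[0], compare_int);
}
```

Both variants return the same count for the same values; only the order in which the branch outcomes occur differs. With optimization enabled, the compiler may replace the branch with branchless code, which can hide the difference, so -O0 is probably the safer setting for this experiment.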

SIMD

Download the archive simd.tgz and extract it.

Study simd.c. Run the program several times, varying the argument from 0 to 2. Compare the execution times.

Let's play with the caches

Download the archive cache.tgz and extract it.

Cache lines

Study and run the cache_line.c program.

Estimate the number of memory accesses required for this program. Run the program with valgrind --tool=cachegrind. Is the program making efficient use of the processor caches? How does the number of cache misses relate to the matrix size?

Modify the program to reduce the number of cache misses.
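As a hint on the kind of change that can help, here is a hypothetical sketch that is not taken from cache_line.c. It assumes a matrix stored in row-major order (the layout C uses for two-dimensional arrays) and compares two traversal orders; the function names are made up for illustration.

```c
#include <stddef.h>

/* Hypothetical example, not taken from cache_line.c.
 * m points to a rows x cols matrix stored in row-major order,
 * so element (i, j) lives at m[i * cols + j]. */

/* Column-by-column traversal: consecutive accesses are cols elements
 * apart, so each access may touch a different cache line. */
long sum_by_columns(const int *m, size_t rows, size_t cols)
{
    long sum = 0;
    for (size_t j = 0; j < cols; j++)
        for (size_t i = 0; i < rows; i++)
            sum += m[i * cols + j];
    return sum;
}

/* Row-by-row traversal: consecutive accesses are adjacent in memory,
 * so one cache line serves several iterations. */
long sum_by_rows(const int *m, size_t rows, size_t cols)
{
    long sum = 0;
    for (size_t i = 0; i < rows; i++)
        for (size_t j = 0; j < cols; j++)
            sum += m[i * cols + j];
    return sum;
}
```

Both functions return the same result. Assuming 64-byte cache lines and 4-byte int values, the row-by-row version should miss roughly once every 16 elements, whereas the column-by-column version can miss on almost every access once the matrix no longer fits in the cache.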

False sharing

Run the lstopo command (included in the hwloc package) to see the number of cores in your machine as well as the different cache levels available.

Run the cache program by varying the placement of the threads and compare the execution times.

Modify the program so that the performance is the same for any thread placement. Estimate the size of the cache lines on the machine you are using.
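One common way to obtain this is to give each thread's data its own cache line. The sketch below is hypothetical and does not reproduce the code of the archive: the structure names, NB_THREADS and the value 64 for CACHE_LINE_SIZE are assumptions, and the actual line size has to be checked on your machine. It requires a C11 compiler.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>

/* Assumed cache line size in bytes; 64 is common on x86-64 but must be
 * checked on the target machine. */
#define CACHE_LINE_SIZE 64

/* Without padding: the counters of two threads are only
 * sizeof(unsigned long) bytes apart and very likely share a cache line. */
struct counter_packed {
    unsigned long value;
};

/* With alignment: each counter starts on its own cache line, and the
 * compiler pads sizeof() up to a multiple of the alignment. */
struct counter_padded {
    alignas(CACHE_LINE_SIZE) unsigned long value;
};

/* Hypothetical per-thread arrays, one slot per thread. */
#define NB_THREADS 4
struct counter_packed packed_counters[NB_THREADS];
struct counter_padded padded_counters[NB_THREADS];

static_assert(sizeof(struct counter_padded) == CACHE_LINE_SIZE,
              "each padded counter should occupy exactly one cache line");
```

With the padded layout, thread i writes only to padded_counters[i], so a write by one thread no longer invalidates the line holding another thread's counter. Trying several values of CACHE_LINE_SIZE and observing when the slowdown disappears is one way to estimate the real line size.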

Hyper-threading

Download the archive smt.tgz and extract it. Run the program, varying the placement of the threads as well as the type of execution unit used.

Optimization of a program

Download the archive program_to_optimize.tgz and extract it.

This program creates 4 threads that process a set of jobs. The result of each job is stored in an event structure. When all the jobs of a thread have been processed, a function (analyze_events()) analyzes the results.

Performance analysis

Compile the program without compiler optimization (with option -O0) and compare its execution time with the version optimized by the compiler (option -O3). The aim of the exercise is to optimize the program by hand in order to obtain performance approaching the performance of the version optimized by the compiler.

Improving the use of caches

Analyze the program and modify it to fix the cache problem.

There is a false sharing problem. It can be fixed by adding padding so that data written by different threads do not share a cache line.

Vectorization

Modify the analyze_events function to vectorize the loop (using AVX instructions, for example).

For this, you can refer to the Intel Intrinsics Guide, which documents the intrinsics available for the instruction sets supported by your processor (SSE, AVX, etc.).
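To illustrate the general shape of a hand-vectorized loop, here is a hypothetical sketch. The actual loop of analyze_events() is not shown in this document, so the function sum_floats below is only a stand-in that sums an array of floats. It uses AVX intrinsics when the compiler has AVX enabled (for example with -mavx) and falls back to a scalar loop otherwise.

```c
#include <stddef.h>
#ifdef __AVX__
#include <immintrin.h>
#endif

/* Hypothetical stand-in for the loop of analyze_events(): sums n floats.
 * The real loop in the archive may compute something else. */
float sum_floats(const float *v, size_t n)
{
    float total = 0.0f;
    size_t i = 0;

#ifdef __AVX__
    /* Process 8 floats per iteration in one 256-bit register. */
    __m256 acc = _mm256_setzero_ps();
    for (; i + 8 <= n; i += 8)
        acc = _mm256_add_ps(acc, _mm256_loadu_ps(v + i));

    /* Store the 8 partial sums to memory and reduce them. */
    float partial[8];
    _mm256_storeu_ps(partial, acc);
    for (int k = 0; k < 8; k++)
        total += partial[k];
#endif

    /* Scalar loop: handles the remaining elements, or the whole array
     * when the compiler was not invoked with AVX enabled. */
    for (; i < n; i++)
        total += v[i];
    return total;
}
```

The unaligned load _mm256_loadu_ps is used so that the array does not need to be 32-byte aligned. Because the vector version adds the elements in a different order, its result can differ slightly from the scalar result for general floating-point data.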