CSC5101 – Advanced Programming of Multicore Architectures

Parallel and Distributed Systems track

PAAM – Exam 2021-2022

Duration: 2 hours, any document allowed

Course questions (6 points)

NUMA (2 points)

We assume an NUMA architecture consisting of four nodes A, B, C, and D. Each node contains 10 cores and is connected to a memory with a maximum throughput of 5Gb/s. We suppose 4 interconnect links: between A and B, B and C, C and D, and D and A. Each link has a maximum throughput of 1Gb/s. We assume an application consisting of 40 threads, in which each thread generates memory accesses with a constant rate of 0.1Gb/s. The application allocates and initializes the memory used during computation in the initial thread. In this setting, which policy will give the best performance, first-touch or interleaved? Justify. For each policy, you can represent the throughput generated on the different links on a figure.

We suppose that memory access is uniformly distributed among the virtual address space. We have 40 threads that generate each a load of 0.1 GB/s. We have a total generated load of 4GB/s.
  • If we use an interleaved policy, we distribute equally the load of 4GB/s among each interconnect link, which means 1 GB/s per link, which do not saturate. The memory controllers, with their throughput of 5GB/s don't saturate either. The actual throughput is thus 4GB/s.
  • If we use a first touch policy, the maximal load possible generated by the application to node A (4GB/s) does not saturate the memory controller of node A. Since the maximum capacity of the links AB and AC is equal to 2GB/s, node A will only process 3GB/s: the 2GB/s received from the links AB and AC (from B, C and D), and 1GB/s generated by the node A itself (10 threads at 0.1GB/s). The total throughput of the application is thus 3GB/s, which is lower than the 4GB/s of the interleaved policy.

Modèle mémoire (2 points)

We consider four variables ok1, ok2, msg1 and msg2, all of which are initialized to 0. We also consider the following code in which reads and writes are atomic:

Thread 1 a. msg1 = 42; b. ok1 = 1;
Thread 2 c. while(ok1 == 0) { } d. msg2 = msg1; e. ok2 = 1
Thread 3 f. while(ok2 == 0) { } g. printf("%d %d\n", msg1, msg2);

If we consider a relaxed memory model, what are all the possible displays? What are also the possible displays if the machine does not reorder the writes between them, nor the reads between them?

With a machine that neither reorders reads between them and writes between them, if "<" means "executed before", we necessarily have a < b and d < e (a, b, d and e are writes). We also necessarily have f < g (f and g are reads). However, we can still have d < c and e < c because c is a read, and d and e are writes. For this reason, we can have:

  • "0 0" with d < e < f < g (then a < b < c)
  • "42 0" with d < a < e < f < g (then b < c)
  • "42 42" with a < d < e < f < g (then b < c)

With the relaxed memory model, we can have the same output.

Note that "0 42" is impossible: if msg2 = 42, then necessarily, msg1 = 42 because of line d.

Persistent memory (2 points, more difficult)

We assume that our computer has the latest eADR technology. This technology ensures that, in case of a failure, there is enough power to propagate the data from the caches into persistent memory. On such a machine, is it necessary to have the pwb and pfence instructions? Justify.

pwb is not required anymore: any dirty cache line will eventually be propagated to memory in case of failure. We still need pfence because eADR is not related to out of order execution.

The radix tree (14 points)

In this exercise, we implement a radix tree. A radix tree associates keys with values. It mainly provides an interface with two functions:

  • void* insert(char* key, void* value): associates the key with the value. If there is already a value associated with the key, insert returns the old value associated with the key and replaces the old value with the new value. Otherwise, insert creates an association between key and value, and returns NULL,
  • void* get(char* key): returns the value associated with the key if it exists, and NULL otherwise.

A node of a radix tree is represented by the following structure:

struct node { void* value; struct node* nexts[256]; };

When searching for a key in the tree, we go down the tree letter by letter starting from the root. Technically, the root of the tree represents the empty key (string with 0 letters). The nexts field of the root is used to find the nodes associated with single letter strings. For example, if root is a pointer to the root of the tree, then root->nexts['a'] points to the node representing the string "a". When root->nexts['a']->value is not NULL, it gives the value associated to "a". Otherwise, it means that the key "a" is not in the tree. The nexts field of root->nexts["a"] is used to retrieve two-letter strings starting with the string "a". For example, root->nexts['a']->nexts['b'] points to the node representing the string "ab", and root->nexts['a']->nexts['b']->value, if not NULL, gives its value.

Mono-threaded implementation

In this first part, you have to implement the radix tree without considering concurrent accesses: you can consider that there is only a single thread in your process.

Jumping one level in the tree (1 points)

To start, write a function struct node* get_next(struct node** p). If *p is NULL, the function has to allocate a struct node, to store it in *p, and to return it. If *p is not NULL, this function has simply to return *p.

The insert function (3 points)

Implement the insert function. In this question, you have to use get_next to go down the tree and create the intermediate nodes. In details, you can use a struct node** cur to traverse the tree. At the beginning of insert, you can simply initialize it to &root, where root is a global variable of type struct node* pointing to the root of the radix tree.

The get function (2 points)

Write the get function.

Multi-threaded implementation

Lock-based synchronization (3 points)

Write a main method that creates 10 threads and terminates once all 10 threads are finished. Each thread must insert 10 different words (threads can insert the same words 10 times) before retrieving them one after another by calling get. Don't forget to handle concurrent accesses.

Lock-free synchronization (4 points, more difficult)

Rewrite the get_next, insert and get methods in such a way that your code does not use locks anymore but remains correct with multiple threads. You may notice that, in get_next, if two threads create a node at the same time, the second writer has simply to use the node created by the first writer and to free its already allocated node.

You will have to make sure that you read and write the variables potentially accessed by several threads with atomic_load and atomic_store. You will also have to use atomic_compare_exchange_strong.

Transactional memory (1 point)

Explain briefly the changes that you should make to the single-threaded code of insert and get (questions 2.b and 2.c) to make it correct using transactional memory.