CSC5101 – Advanced Programming of Multicore Architectures

Parallel and Distributed Systems track

Non-blocking algorithms

You have to send the exercise 4 by email before October 14 at 11:59 pm to gael.thomas@inria.fr.

The lock free queue (not graded)

Write a main function that creates N threads, where N is a parameter of the application. The main function has to wait for the termination of the threads before exiting.

Implements a queue of integer protected by a pthread_mutex_lock (enqueue and dequeue functions). Modify your application in order to enqueue K nodes and then dequeue K nodes in each thread, where K is a parameter of the application. Runs the application with N = 10 and K = 1'000'000 and verify that the queue is empty at the end of the run.

Implements a lock-free queue of integer by using the algorithm presented during the lecture. Don't forget to identify the atomic operations and to use the adequate C constructs. Runs the application with N = 10 and K = 1'000'000 and verify that the lock-free queue is also empty at the end of the run.
For the moment, don't try to free the nodes when they are removed from the queue. You will see that freeing the nodes complicate the algorithm later.
In order to simplify debugging, you can add a function to print the state of the queue.

By using clock_gettime in CLOCK_MONOTONIC mode, identify which implementation (lock-free or protected by a pthread_mutex_lock) gives the best result.

Now, try to free a node in the lock-free queue implementation of dequeue. If your code is correct, the application should crash lamentably with a segmentation fault. Try to understand why by taking look at the ABA problem. When you think that you understand, check with your professor that your explanation is correct.

optional and difficult

In order to avoid the ABA problem, we have to ensure that a node A is not reallocated when a thread uses it. For that purpose, we will use hazard pointers (Maged Michael, "Hazard Pointers: Safe Memory Reclamation for Lock-Free Objects", TPDS 2004).

At high-level, when a thread reads a pointer X from a location L and wants to perform a CAS to change the value of L from X to another value, it has to ensure that X cannot be reallocated between the read and the CAS. For that purpose, the thread registers X in a thread-local variable called hazard while it uses X, and the algorithm has to ensure that X cannot be freed if one of the hazard variables points to X.

With the queue, when a thread reads the pointer X in head in dequeue, it writes thus X in hazard. Then, it re-reads head and if head is not equal to X, it restarts from the beginning. Finally, if the compare-and-swap succeed, instead of freeing the node, the thread writes 0 in its hazard pointer and adds X in a lock-free stack called free_nodes.

Similarly, in enqueue, the thread has also to ensure that the tail is not freed during one of the operations that modify tail. For this reason, similarly, when the thread reads tail, it saves its value in its hazard pointer, re-read tail in order to ensure that it was not yet potentially freed, and writes 0 in its hazard location after the CAS.

Finally, in enqueue, in order to allocate a node, a thread has now first to try to allocate the node from free_nodes before allocating a new node with malloc. In details:

  • the thread pops (with a lock-free algorithm) a pointer Z from the free_nodes stack,
  • If Z is null (i.e., the stack is empty), the thread allocates a node with malloc,
  • otherwise
    • if the thread finds the Z pointer in one of the hazard variables of the other threads, it means that Z is still used by another thread. In this case, it has to re-push Z and to use malloc to allocate a new node,
    • otherwise, the node is free and not used anymore, the thread can thus directly reuse it to enqueue a new value without calling malloc.

As an additional question, why a node can not appear twice in the free_nodes stack?

The radix tree (not graded, easy)

Try to implement a lock free version of a radix tree. You can find a description of a radix tree here. In this exercise, consider that you cannot remove elements from the radix tree.

The linked list (not graded, optional and difficult)

Restart the first exercice, but by considering a linked list, and without trying to free the nodes. If you still have time and was able to implement hazard pointers for the queue, you can try to use hazard pointers to handle the ABA problem for linked lists (this question is difficult).

A simple memory allocator (graded)

In this exercice, we will implement a memory allocator. Internally, the allocator manages chunks. A chunk represents a memory region, which can be allocated or not. A chunk consists of this structure:

struct chunk { size_t size; union { struct chunk* next; /* next node when the chunk is free */ char content[0]; /* the content of the chunk when allocated */ }; };

The interface consists then in two functions:

  • struct chunk* alloc_chunk(size_t size): allocate a chunk of size size,
  • void free_chunk(struct chunk* c): free the chunk c.

When a chunk c is free, it is stored in a global linked list named struct chunk* free_list. In this case, c->next points to the next free chunk in the list free_list. When a chunk c is allocated, it is not in free_list. In this case, the next field is not used, and the user can store the content of an object in c->content.

In order to allocate a chunk, alloc_chunk finds the first chunk c large enough in the list free_chunk, and remove c from the list. If c->chunk + sizeof(struct chunk) is greater than size, returning the whole chunk c would lead to a memory waste. In order to avoid the waste, in this case, alloc_chunk has to break the chunk in two by creating a new chunk inside the content c.

At a high level, the algorithm of alloc_chunk consists of:
  • Find the first chunk c in free_list such that size < c->size,
  • Remove c from the list
  • If size + sizeof(struct chunk) < c->size, then:
    • Creates a new chunk struct chunk* d = (struct chunk*)((uintptr_t)c + size),
    • Set d->size to c->size - size, and c->size to size,
    • Free the chunk d,
    • Return c.
  • Otherwise (i.e., if size + sizeof(struct chunk) >= c->size), simply returns c.
Lock-based version

In this first part of the exercice, you will implement a lock-based version of the allocator. For that, you need a global lock named pthread_mutex_t mutex.

free_chunk

Give the code of the free_chunk(struct chunk* c), which adds c to free_list.

create_large_chunk

Give the code of a struct chunk* create_large_chunk() function. This function allocate a large chunk of 64*1024*1024 bytes, adequatly initializes its size field and returns it. This function does not add the new chunk in the list: the function simply returns the new chunk. To allocate the chunk, you just have to use malloc.

alloc_chunk

Give the code of alloc_chunk. If free_list does not contain any chunk larger than c, then, alloc_chunk uses create_large_chunk to create a new large chunk. In this case, alloc_chunk adds the new large chunk to the free_list, and restarts to try to allocate a chunk.
Lock-free version

In this second part of the exercice, you will implement a lock-free version of the algorithm step by step. For that, we will start by simply considering free_list as a lock-free stack.

pop

Give the code of a struct chunk* pop() function. If free_list is NULL, pop return NULL, otherwise, the function pops the first element from free_list.

push

Give the code of a void push(struct chunk* head, struct chunk* tail) function. head is a list that ends with tail. The function pushes all the elements from head to tail (both included). Note that it should not be necessary to traverse the list from head to tail.

free_chunk

Give the code of the free_chunk function.

alloc_chunk

Give the code of the alloc_chunk function. By leveraging push, pop and create_large_chunk, you can avoid using any atomic operation in alloc_chunk.