Non volatile main memory
A simple write cache
This exercise has the goal of implementing a write cache in non volatile memory in order to boost IO performance. The code that you will implement is incomplete because it cannot handle reads.
At thigh level, we will implement the int my_pwrite(int fd, const void* buf, size_t n, off_t off) function. This function behaves as the classical pwrite function: it writes the n bytes at buf in the file fd at the offset off. However, instead of writing the buffer in the file fd, my_pwrite will instead add the write operation in a log stored in NVMM. Asynchronously, a thread, called the cleanup thread, will take the operation from the log and use the normal pwrite function to propagate them to the disk.
Since you probably don't have a NVMM, we will simulate a NVMM by mapping a persistent file in memory. Write an application that creates a file called nvmm.dat of 128 megabytes, maps this file in memory and increments the first byte of the file. Check that the first byte is actually incremented each time you run your application by launching hexdump -C nvmm.dat.
In order to use the NVMM instructions (pwb, pfence and psync), you first have to compile with the -march=native option of gcc. Then, you have to download the following file and to include it:
Add a my_pwrite function. For the moment, my_pwrite has simply to call the normal pwrite function. In the main function, adds a code that writes "hello" 1024 times at the offsets 0, 5 etc. of the file out.dat, and checks that your application behaves correctly by inspecting out.dat with hexdump -C out.dat.
At this step, we define the main structures and variables. For that purpose, you just have download and include this file:
The struct operation describes a write operation. In details, a struct operation contains two cache lines (the first one defined with a union, the second one is the field buf). In the first cache-line, we can find the meta data associated to the write:
- commit is true if the entry contains a write operation,
- fd, n and off describe the operation.
In the second cache line, we can find the written buffer.
Then, the NVMM consists in a log (struct nvmm). In this log, we can find:
- An array of operations (operation),
- A tail index (tail), which indicates where is the least recently added operation.
Finally, in volatile memory, you can find:
- A pointer to the nvmm (nvmm),
- A volatile head index (head) where the new operations are added.
Now that we have the main structures and variables, we can implement our booster. In my_pwrite, you have to add the new operations at head. If head + 1 is equal to tail, it means that the log is full. In this case, my_pwrite has to wait.
Additionally to the my_pwrite function, you also have to implement the cleanup thread. This thread inspects the commit flag of the operation at tail. When the commit flag becomes true, the cleanup thread has to call the normal pwrite function in order to propagate the write to the disk. Then, the cleanup thread has to call fsync in order to ensure that the data is actually written to disk. Finally, when the fsync returns, it means that the entry in the log becomes useless. The cleanup thread can thus increment tail, and then flushes it to NVMM.
By using clock_gettime in CLOCK_MONOTONIC mode, compare the performance of:
- a simple call to pwrite,
- a call to pwrite followed by a call to fsync in order to ensure that the write is durable,
- a call to my_pwrite.
Now, we want to handle crashes. In case of crash, when your application recover, it has to applies the last writes of the log. In details, the recovery procedure has to inspect all the operations by starting at tail and to apply all the committed operations. In order to simplify, since the old file descriptor does not make sense after the crash, you can consider that fd is necessarily a file descriptor associated to the file out.dat.
If the log contains multiple entries, calling fsync for each write in the cleanup thread is inefficient. Modify your code in order to call fsync only once when you propagate multiple entries.
Because of batching, our application contains a small bug. Let's imagine that the log is full, i.e., that head + 1 is equal to tail. If the cleanup thread consumes two entries, it will add 2 to tail. At this step, the main thread can thus create two new entries: e1 and e2. Unfortunately, a crash may happen while the two new entries are propagated to NVMM, but not the tail index. During recovery, the cleanup thread will thus consider that e2 is the oldest write (e1 is at head which is just before tail, while e2 is at tail). If e2 and another entry e0 in the log write at the same place, e0 will overwrite the write of e2, which is incorrect: e2 is the most recent write and should overwrite e0.
In order to correct this small bug, you should use two tail variables:
- a persistent_tail stored in NVMM and used by the cleanup thread and during recovery.
- a volatile_tail stored in volatile memory and only incremented if the incrementation of persistent_tail in the cleanup thread is actually propagated to NVMM (i.e., by using pwb/pfence). If the main thread waits when head + 1 is equal to volatile_tail, we are sure that the main thread can only start to add the entries e1 and e2 if the persistent tail is consistent (after e2), which corrects the bug.