CSC5101 – Advanced Programming of Multicore Architectures

Parallel and Distributed Systems track

Non-Uniform Memory Access Architecture

NUMA architecture

In this exercise, we will measure the time to access memory on a NUMA machine.

Before anything else, you will have to connect to the test machine "arya". For that, ask your professor.

Explore the NUMA topology with the tool lstopo-no-graphics. Try to understand the output in order to know the number of NUMA nodes, the memory on each NUMA node, and the number of processing units (logical processors) per NUMA node. What are the sizes of the L1, L2 and L3 caches?

In order to measure the time to read memory, a thread will read a large buffer several times. We call this thread the test thread. Then, we will also add some noise in order to measure the memory access performance when you have memory traffic generated by other threads, which we call the noisy threads. Write an application in C that takes the following arguments:
  • The first argument, which we call test_pu, is an integer that gives the processing unit on which the test thread is running,
  • The second argument, which we call test_node, is an integer that gives the NUMA node where we will allocate the buffer accessed by the test thread,
  • The third argument, which we call test_buffer, gives the size of the buffer accessed by the test thread in kilobytes,
  • The fourth argument, which we call test_workload, gives the number of kilobytes accessed by the test thread. For example, if test_buffer is equal to 1 and test_workload is equal to 2, the test thread will access the entire buffer twice.
  • The fifth argument, which we call noisy_config, configures the noisy threads. This argument can have the following values:
    • none: we don't have noisy threads,
    • spread: we run a noisy thread on each processing unit, except the one used by the test thread, and each noisy thread accesses a buffer located in the NUMA node (n + 1) % number_of_nodes, where n is the NUMA node of the processing unit of the noisy thread, and number_of_nodes the total number of nodes,
    • overload: as in spread, except that all the noisy threads access buffers located in a single given NUMA node (see the next argument below),
  • The sixth argument, which we call noisy_node, is an integer that gives the NUMA node accessed by the noisy threads in the overload configuration.

At this step, you just have to ensure that you have the adequate number of arguments and to extract their values.
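A minimal sketch of this step (the exact messages and variable names are ours):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(int argc, char **argv) {
  if (argc != 7) {
    fprintf(stderr, "usage: %s test_pu test_node test_buffer test_workload "
                    "noisy_config noisy_node\n", argv[0]);
    return EXIT_FAILURE;
  }
  int test_pu = atoi(argv[1]);
  int test_node = atoi(argv[2]);
  size_t test_buffer = strtoul(argv[3], NULL, 10);   /* in kilobytes */
  size_t test_workload = strtoul(argv[4], NULL, 10); /* in kilobytes */
  const char *noisy_config = argv[5];                /* none, spread or overload */
  int noisy_node = atoi(argv[6]);

  if (strcmp(noisy_config, "none") != 0 && strcmp(noisy_config, "spread") != 0
      && strcmp(noisy_config, "overload") != 0) {
    fprintf(stderr, "unknown noisy_config: %s\n", noisy_config);
    return EXIT_FAILURE;
  }

  /* ... the rest of the exercise uses test_pu, test_node, test_buffer,
     test_workload, noisy_config and noisy_node ... */
  return EXIT_SUCCESS;
}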

We will now extract the NUMA topology with the hwloc library. For that, you can reuse the code below. You have to link your executable with the hwloc library, i.e., with -lhwloc. If your application is correct, it should output your NUMA topology. The central data structure that you will use is the NUMA topology: it contains an array with the NUMA nodes and an array with the processing units. Each processing unit is associated with a NUMA node, and each NUMA node contains an array with its processing units.
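The original snippet is not reproduced here, but the following is a minimal sketch of what it can look like, assuming hwloc 2.x; the structure and function names are ours, chosen to be consistent with the numa->pus[i].obj and numa->pus[i].node expressions used in the rest of the subject:

#include <hwloc.h>
#include <stdlib.h>

struct numa_node;

struct numa_pu {
  hwloc_obj_t obj;        /* the hwloc processing unit (PU) object */
  struct numa_node *node; /* the NUMA node this PU belongs to */
};

struct numa_node {
  hwloc_obj_t obj;        /* the hwloc NUMA node object */
  struct numa_pu **pus;   /* the PUs located on this node */
  int nr_pus;
};

struct numa_topology {
  hwloc_topology_t topology;
  struct numa_node *nodes; /* array of NUMA nodes */
  int nr_nodes;
  struct numa_pu *pus;     /* array of processing units */
  int nr_pus;
};

struct numa_topology *numa_topology_create(void) {
  struct numa_topology *numa = calloc(1, sizeof(*numa));
  hwloc_topology_init(&numa->topology);
  hwloc_topology_load(numa->topology);

  numa->nr_nodes = hwloc_get_nbobjs_by_type(numa->topology, HWLOC_OBJ_NUMANODE);
  numa->nr_pus = hwloc_get_nbobjs_by_type(numa->topology, HWLOC_OBJ_PU);
  numa->nodes = calloc(numa->nr_nodes, sizeof(*numa->nodes));
  numa->pus = calloc(numa->nr_pus, sizeof(*numa->pus));

  for (int i = 0; i < numa->nr_nodes; i++) {
    numa->nodes[i].obj = hwloc_get_obj_by_type(numa->topology, HWLOC_OBJ_NUMANODE, i);
    /* over-allocated for simplicity: at most nr_pus PUs per node */
    numa->nodes[i].pus = calloc(numa->nr_pus, sizeof(*numa->nodes[i].pus));
  }

  for (int i = 0; i < numa->nr_pus; i++) {
    struct numa_pu *pu = &numa->pus[i];
    pu->obj = hwloc_get_obj_by_type(numa->topology, HWLOC_OBJ_PU, i);
    /* associate the PU with the node whose cpuset contains it */
    for (int j = 0; j < numa->nr_nodes; j++) {
      if (hwloc_bitmap_isincluded(pu->obj->cpuset, numa->nodes[j].obj->cpuset)) {
        pu->node = &numa->nodes[j];
        numa->nodes[j].pus[numa->nodes[j].nr_pus++] = pu;
      }
    }
  }
  return numa;
}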

Create the test thread and ensure that it runs on the test_pu processing unit. To identify the processing unit test_pu, you have to find it in the numa->pus array by comparing test_pu with numa->pus[i].obj->logical_index. With hwloc, you can bind a thread to a given processing unit by using the hwloc_set_cpubind function:

hwloc_set_cpubind(topology, numa->pus[0].obj->cpuset, HWLOC_CPUBIND_THREAD);
Instead of the processing unit at index 0, you have to use the one with the test_pu logical index.
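A possible sketch of the lookup and of the binding, assuming the structures sketched earlier (the pthread plumbing and names are ours):

#include <pthread.h>

/* find the PU whose logical index matches test_pu */
struct numa_pu *find_pu(struct numa_topology *numa, int test_pu) {
  for (int i = 0; i < numa->nr_pus; i++)
    if (numa->pus[i].obj->logical_index == (unsigned) test_pu)
      return &numa->pus[i];
  return NULL;
}

struct thread_arg {
  struct numa_topology *numa;
  struct numa_pu *pu; /* the PU on which the thread must run */
};

void *test_thread_main(void *p) {
  struct thread_arg *arg = p;
  /* bind the calling thread to its PU */
  hwloc_set_cpubind(arg->numa->topology, arg->pu->obj->cpuset, HWLOC_CPUBIND_THREAD);
  /* ... the measurement loop is implemented in the next questions ... */
  return NULL;
}

The test thread itself can then be created with pthread_create(&tid, NULL, test_thread_main, arg) and waited for with pthread_join.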

If noisy_config is equal to spread or overload, start the noisy threads on the different PUs (except the one used by the test thread).
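For example (a sketch; noisy_thread_main binds itself exactly like test_thread_main, and its access loop is written in the next questions):

pthread_t *tids = calloc(numa->nr_pus, sizeof(*tids));
int nr_noisy = 0;
for (int i = 0; i < numa->nr_pus; i++) {
  if (numa->pus[i].obj->logical_index == (unsigned) test_pu)
    continue; /* leave this PU to the test thread */
  struct thread_arg *arg = malloc(sizeof(*arg));
  arg->numa = numa;
  arg->pu = &numa->pus[i];
  pthread_create(&tids[nr_noisy++], NULL, noisy_thread_main, arg);
}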

At this step, we allocate the buffers accessed by the noisy threads. We consider that the noisy threads continuously and sequentially access buffers of 32 megabytes, because a buffer of 32 megabytes does not fit in an L3 cache, which ensures that a noisy thread will actually generate traffic to the memory controllers. The buffers have to be allocated as follows:

  • when the application runs in spread mode, on the node (N + 1) % number_of_nodes, where N is the node on which the noisy thread runs,
  • when the application runs in overload mode, on the noisy_node node.

To allocate memory from a given memory node, you can reuse this code:

char* buf = hwloc_alloc_membind(topology, 32 * 1024 * 1024, numa->pus[0].node->obj->nodeset, HWLOC_MEMBIND_BIND, HWLOC_MEMBIND_STRICT | HWLOC_MEMBIND_BYNODESET | HWLOC_MEMBIND_NOCPUBIND );
Instead of the node of the processing unit at index 0, you have to use the nodeset of the target node: (N + 1) % number_of_nodes in spread mode, or noisy_node in overload mode.
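One way to select the target node, assuming the structures sketched earlier (pointer arithmetic on the nodes array gives the index N of the PU's node):

/* node whose nodeset is passed to hwloc_alloc_membind for a noisy thread */
struct numa_node *noisy_target_node(struct numa_topology *numa,
                                    struct numa_pu *pu,
                                    int overload, int noisy_node) {
  if (overload)
    return &numa->nodes[noisy_node]; /* overload: all buffers on noisy_node */
  int n = (int) (pu->node - numa->nodes); /* N: node of the noisy thread */
  return &numa->nodes[(n + 1) % numa->nr_nodes]; /* spread */
}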

The noisy threads have to continuously access their buffers of 32 MB. For the moment, the program will thus not terminate. In order to ensure that the noisy threads actually access memory and not the L1 cache, a noisy thread has to access each cache line of its buffer only once per pass. For this reason, we consider the buffer as an array of struct cache_line, where this structure is defined as follows:

struct cache_line {
  union {
    char _content[64]; // 64 bytes in each cache line
    int value;
  };
};

A noisy thread thus has to execute, an infinite number of times, a loop that accesses the 32 * 1024 * 1024 / sizeof(struct cache_line) cache lines of the buffer. For that purpose, the thread can simply read the field value.

Note that if you compile your application in optimized mode (-O2 or -O3), the compiler will detect that the loop does not compute anything and will simply eliminate it. For this reason, you can either compile your code without any optimization, or you can compute something useful. For example, you can define a variable r and add the value of each cache line to r at each step of the loop. In this case, you will also have to write r to a global variable at the end of the loop to ensure that the compiler will not eliminate the computation of r itself.
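A sketch of the access loop of a noisy thread, using the trick described above; making the global volatile is our way of keeping the computation of r observable:

volatile int noise_sink; /* written after each pass so r is not dead code */

void noisy_loop(struct cache_line *lines) {
  size_t nr_lines = 32UL * 1024 * 1024 / sizeof(struct cache_line);
  int r = 0;
  for (;;) { /* the noisy threads never terminate */
    for (size_t i = 0; i < nr_lines; i++)
      r += lines[i].value; /* read each cache line once per pass */
    noise_sink = r;
  }
}

noisy_thread_main then simply binds itself to its PU, allocates its 32 MB buffer with hwloc_alloc_membind on the node returned by noisy_target_node, and calls noisy_loop.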

Now that our noisy threads are making noise, we can implement the test thread. The test thread has to allocate a buffer of test_buffer kilobytes from the NUMA node test_node. Modify your code accordingly (you can reuse part of the code given in the previous question).
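For example (a sketch, reusing the allocation flags from the previous question):

/* the test buffer, allocated on NUMA node test_node and seen as cache lines */
struct cache_line *lines = hwloc_alloc_membind(numa->topology,
    test_buffer * 1024, numa->nodes[test_node].obj->nodeset,
    HWLOC_MEMBIND_BIND,
    HWLOC_MEMBIND_STRICT | HWLOC_MEMBIND_BYNODESET | HWLOC_MEMBIND_NOCPUBIND);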

At this step, we want to actually access the buffer. First, instead of accessing each byte of the buffer, we want to access complete cache lines in order to measure the time to access memory. For this reason, we will reuse the struct cache_line defined in the previous question.

Then, we will consider the buffer as an array of struct cache_line and we will access the integer value of each cache line. We consider that by accessing value, we access 64 bytes: accessing the other bytes would only add noise to our measurements. For this reason, in order to access test_workload kilobytes, we have to access test_workload * 1024 / 64 cache lines. We define this value as num_access.

Modify your application in order to access num_access times the cache lines of the buffer sequentially (just read the value field of each cache line). Don't forget to restart from the beginning of the buffer once you have read all of its test_buffer * 1024 / 64 cache lines (which happens as soon as num_access is larger than test_buffer * 1024 / 64).

Also use clock_gettime with CLOCK_MONOTONIC to report the time to access a cache line. You can simply measure the time taken by the loop and divide this time by num_access.
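Putting the sequential loop and the timing together (a sketch; buf_lines is test_buffer * 1024 / 64, and the volatile sink again defeats dead-code elimination):

#include <time.h>

double measure_sequential(struct cache_line *lines, size_t buf_lines,
                          size_t num_access) {
  struct timespec start, end;
  volatile int sink = 0;
  size_t cur = 0;

  clock_gettime(CLOCK_MONOTONIC, &start);
  for (size_t i = 0; i < num_access; i++) {
    sink += lines[cur].value;
    if (++cur == buf_lines)
      cur = 0; /* restart from the beginning of the buffer */
  }
  clock_gettime(CLOCK_MONOTONIC, &end);

  double ns = (end.tv_sec - start.tv_sec) * 1e9
            + (end.tv_nsec - start.tv_nsec);
  return ns / (double) num_access; /* average time per access, in nanoseconds */
}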

Currently, we are not really measuring the time to access memory because we access the buffer sequentially. Because of this sequential access, the processor quickly detects the pattern and starts to prefetch the cache lines in advance.

In order to avoid the prefetching mechanism, we will randomly access different locations inside the buffer. However, if we execute the rand() function in the loop that accesses the buffer, instead of measuring the time to access memory, we will mainly measure the performance of rand(). For this reason, we will pre-compute the random locations and directly store them in the buffer, in the value field. Then, in the access loop, you will start with the cache line at index cur = 0, and at each step of the loop, you will continue with the cache line at index cache_lines[cur].value, where cache_lines is a pointer to the buffer.

When you prepare the buffer with the random indices, you have to ensure that each cache line is visited exactly once.
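One way to build such a permutation is a Fisher-Yates shuffle turned into a single cycle: the line at index perm[i] stores perm[i + 1], so the chase visits every cache line exactly once before coming back to its starting point (a sketch, names ours):

#include <stdlib.h>

/* store a random cyclic permutation in the value fields */
void prepare_random_chain(struct cache_line *lines, size_t nr_lines) {
  size_t *perm = malloc(nr_lines * sizeof(*perm));
  for (size_t i = 0; i < nr_lines; i++)
    perm[i] = i;
  for (size_t i = nr_lines - 1; i > 0; i--) { /* Fisher-Yates shuffle */
    size_t j = rand() % (i + 1);
    size_t tmp = perm[i]; perm[i] = perm[j]; perm[j] = tmp;
  }
  for (size_t i = 0; i < nr_lines; i++)
    lines[perm[i]].value = (int) perm[(i + 1) % nr_lines];
  free(perm);
}

Inside the timed loop of measure_sequential, the body then becomes a pointer chase:

size_t cur = 0;
for (size_t i = 0; i < num_access; i++)
  cur = (size_t) lines[cur].value; /* the value read is the next index */
sink = (int) cur; /* keep the chase alive under optimization */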

You can now measure the memory access performance of the machine (example invocations are given after this list):

  • For the L1 cache, you have to ensure that test_buffer is much smaller than the L1 cache (for example 2 kilobytes), and configure the noisy threads in none mode,
  • For the L2 cache, you can use the same configuration, but with a larger buffer (128 kilobytes for example),
  • For the L3 cache, you can use the same configuration, but with a larger buffer (1 megabyte for example),
  • For a local access, you can use the same configuration, but with a larger buffer (32 megabytes for example), and you have to ensure that the memory accessed by the test thread comes from the memory node of the test thread,
  • For a remote access, you just have to change the allocation node of the buffer used by the test thread,
  • If you want to add noise, you can activate the noisy threads in spread or overload modes.
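For example, assuming the executable is called numa_bench (the name and the exact sizes are ours, and PU 0 is assumed to be on node 0):

./numa_bench 0 0 2 1048576 none 0          # L1: 2 KB buffer, no noise
./numa_bench 0 0 128 1048576 none 0        # L2: 128 KB buffer
./numa_bench 0 0 1024 1048576 none 0       # L3: 1 MB buffer
./numa_bench 0 0 32768 1048576 none 0      # local access: 32 MB on the local node
./numa_bench 0 1 32768 1048576 none 0      # remote access: 32 MB on node 1
./numa_bench 0 0 32768 1048576 overload 0  # local access under noise on node 0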
You can find an approximate solution that directly relies on pthread_setaffinity_np and mbind instead of hwloc here (last exercise).