CSC4508 – Operating Systems

François Trahay

Gaël Thomas

2024

Presentation of the class

Presentation of the class


Organization


Kernel sessions: XV6

During the [K]sessions, you will develop an OS


Evaluation


Evaluation of the class


Threads

Execution context of a process


Duplicating a process


Execution flows


Multithreaded process

In a multi-threaded process, each thread has a context (registers + stack). The rest of the memory (code, data, etc.) and resources (open files, etc.) are shared between threads.

The stacks of the different threads are located in memory so that they can grow. However, if a thread’s stack grows too much, it might overflow onto the stack of another thread. To avoid this problem, the size of the stack is limited (the command ulimit -s gives the maximum stack size). This size limit can be changed using command line (by example ulimit -s 32768), or from a program (in using the setrlimit function).


Creating a Pthread

int pthread_create(pthread_t *thread, const pthread_attr_t *attr, void *(*start_routine) (void *), void *arg);

We present here the Pthread API (POSIX thread) which is the most used in C. The C11 standard defines another interface for manipulating threads. However, there are only few implementations of this interface. The de facto standard therefore remains Pthread.

Unlike the creation of processes which generates a hierarchy (ie. each process has a parent process), there is no hierarchy between threads.


Other Pthread functions

int pthread_exit(void* retval);
int pthread_join(pthread_t tid, void **retval);

Sharing data

Technically, all the memory space is shared between the threads. It is therefore possible to share all the variables, including local variables.


Thread-safe source code


Reentrant source code

Example: strtok

Another example of a non-reentrant function is the char *strtok(char * str, char * delim) function. This function extracts substrings from a string.

For example, the following code displays the different directories of the PATH variable:

#include <stdlib.h>
#include <stdio.h>
#include <string.h>

void extract_path() {
  char* string = getenv("PATH");
  printf("Parsing '%s'\n", string);

  
  for(char* token = strtok(string, ":") ;
      token ;
      token = strtok(NULL, ":") ){
    printf("\t %s\n", token);
  }
}

int main(int argc, char**argv) {
  extract_path();
  return 0;
}

Here is an example of result obtained with this program:

Parsing '/usr/local/bin:/usr/bin:/bin:/usr/local/games:/usr/games'
         /usr/local/bin
         /usr/bin
         /bin
         /usr/local/games
         /usr/games

The strtok function is not reentrant because it is based on a previous state (a pointer to the last character tested in the string). Thus, in this example, the processing applied to each token cannot use strtok. For example:

#include <stdlib.h>
#include <stdio.h>
#include <string.h>

void extract_path() {
  char* string = getenv("PATH");
  printf("Parsing '%s'\n", string);
  // string should contain a list of directories separated with :
  // eg. /usr/local/bin:/usr/bin:/bin:/usr/local/games:/usr/games
  
  // Extract the directories
  // eg. /usr/local/bin, /usr/bin, /bin, /usr/local/games, /usr/games
  for(char* token = strtok(string, ":") ;
      token ;
      token = strtok(NULL, ":") ){
    // token contains a directory (eg. /usr/local/bin)
    printf("\t %s contains: ", token);
    
    // Extract the subdirectories
    // eg. usr, local, bin
    for(char* word = strtok(token, "/ ") ;
	word ;
	word = strtok(NULL, "/") ){
      printf("%s ", word);
    }
    printf("\n");
  }
}

int main(int argc, char**argv) {
  extract_path();
  return 0;
}

Will result in:

Parsing '/usr/local/bin:/usr/bin:/bin:/usr/local/games:/usr/games'
         /usr/local/bin contains: usr local bin

Here the first token (/usr/local/bin) is split into words (usr, local, bin) by successive calls to strtok which modify the previous state of strtok, which prevents subsequent calls to token = strtok (NULL, ":") to iterate over the string string.


Making a function reentrant

It is possible to make a non-reentrant function reentrant by adding a parameter corresponding to the state of the function. For example, the reentrant version of char* strtok(char *str, const char *delim); is char* strtok_r(char *str, const char *delim, char **saveptr );

Thus, the previous program can be corrected:

#include <stdlib.h>
#include <stdio.h>
#include <string.h>

void extract_path() {
  char* string = getenv("PATH");
  char* saveptr = NULL;
  printf("Parsing '%s'\n", string);

  
  for(char* token = strtok_r(string, ":", &saveptr) ;
      token ;
      token = strtok_r(NULL, ":", &saveptr) ){
    printf("\t %s contains: ", token);

    char* saveptr_word = NULL;
    for(char* word = strtok_r(token, "/ ", &saveptr_word) ;
	word ;
	word = strtok_r(NULL, "/", &saveptr_word) ){
      printf("%s ", word);
    }
    printf("\n");
  }
}

int main(int argc, char**argv) {
  extract_path();
  return 0;
}

Which will result in:

Parsing '/usr/local/bin:/usr/bin:/bin:/usr/local/games:/usr/games'
         /usr/local/bin contains: usr local bin 
         /usr/bin contains: usr bin 
         /bin contains: bin 
         /usr/local/games contains: usr local games 
         /usr/games contains: usr games

TLS – Thread-Local Storage

TLS variables in C99

pthread_key


Synchronization

The following program illustrates the problem of simultaneous access to shared variables. Here, two threads each increment 1 000 000 000 times the same variable:

/*
 * compteurBOOM.c
 *
 * Synchronization problem
 *
 *
 */

#include <error.h>
#include <unistd.h>
#include <stdlib.h>
#include <stdio.h>
#include <pthread.h>

/* INT_MAX / 2 */
#define NBITER 1000000000

int counter = 0;

void *start_routine(void *arg) {
  int i;

  for (i = 0; i < NBITER; i++) {
      /* OOPS: WRONG ! Access to an unprotected shared variable */
      counter ++;
    }
  pthread_exit(NULL);
}

int main (int argc, char *argv[]) {
  int rc;
  pthread_t thread1, thread2;
  
  rc = pthread_create(&thread1, NULL, start_routine, NULL);
  if (rc)
    error(EXIT_FAILURE, rc, "pthread_create");

  rc = pthread_create(&thread2, NULL, start_routine, NULL);
  if (rc)
    error(EXIT_FAILURE, rc, "pthread_create");

  rc = pthread_join(thread1, NULL);
  if (rc)
    error(EXIT_FAILURE, rc, "pthread_join");
  rc = pthread_join(thread2, NULL);
  if (rc)
    error(EXIT_FAILURE, rc, "pthread_join");

  if (counter != 2 * NBITER)
    printf("BOOM! counter = %d\n", counter);
  else
    printf("OK counter = %d\n", counter);

  exit(EXIT_SUCCESS);
}

While the counter should be 2 * 1 000 000 000 = 2 000 000 000, running this program gives another result, for example:

$ ./compteurBOOM 
BOOM! compteur = 1076588402

Mutex

Using a mutex, we can correct the BOOM counter program by ensuring that the counter increments are done in mutual exclusion:

/*
 * compteurBOOM.c
 *
 * Synchronization problem
 *
 *
 */

#include <error.h>
#include <unistd.h>
#include <stdlib.h>
#include <stdio.h>
#include <pthread.h>

/* INT_MAX / 2 */
#define NBITER 1000000000

int counter = 0;
pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;

void *start_routine(void *arg) {
  int i;

  for (i = 0; i < NBITER; i++) {
    pthread_mutex_lock(&mutex);
    counter ++;
    pthread_mutex_unlock(&mutex);
  }
  pthread_exit(NULL);
}

int main (int argc, char *argv[]) {
  int rc;
  pthread_t thread1, thread2;
  
  rc = pthread_create(&thread1, NULL, start_routine, NULL);
  if (rc)
    error(EXIT_FAILURE, rc, "pthread_create");

  rc = pthread_create(&thread2, NULL, start_routine, NULL);
  if (rc)
    error(EXIT_FAILURE, rc, "pthread_create");

  rc = pthread_join(thread1, NULL);
  if (rc)
    error(EXIT_FAILURE, rc, "pthread_join");
  rc = pthread_join(thread2, NULL);
  if (rc)
    error(EXIT_FAILURE, rc, "pthread_join");

  if (counter != 2 * NBITER)
    printf("BOOM! counter = %d\n", counter);
  else
    printf("OK counter = %d\n", counter);

  exit(EXIT_SUCCESS);
}

While the result is correct, the use of a mutex significantly slows down the program (144s with mutex, against 4.1s without mutex).


Atomic operations

We can fix the counterBOOM program by using atomic operations. To do this, all we have to do is declare the counter like _Atomic int. The counter increment then uses the atomic operation atomic_fetch_add.

/*
 * compteurBOOM.c
 *
 * Synchronization problem
 *
 *
 */

#include <error.h>
#include <unistd.h>
#include <stdlib.h>
#include <stdio.h>
#include <pthread.h>

/* INT_MAX / 2 */
#define NBITER 1000000000

_Atomic int counter = 0;

void *start_routine(void *arg) {
  int i;

  for (i = 0; i < NBITER; i++) {
    counter ++;
  }
  pthread_exit(NULL);
}

int main (int argc, char *argv[]) {
  int rc;
  pthread_t thread1, thread2;
  
  rc = pthread_create(&thread1, NULL, start_routine, NULL);
  if (rc)
    error(EXIT_FAILURE, rc, "pthread_create");

  rc = pthread_create(&thread2, NULL, start_routine, NULL);
  if (rc)
    error(EXIT_FAILURE, rc, "pthread_create");

  rc = pthread_join(thread1, NULL);
  if (rc)
    error(EXIT_FAILURE, rc, "pthread_join");
  rc = pthread_join(thread2, NULL);
  if (rc)
    error(EXIT_FAILURE, rc, "pthread_join");

  if (counter != 2 * NBITER)
    printf("BOOM! counter = %d\n", counter);
  else
    printf("OK counter = %d\n", counter);

  exit(EXIT_SUCCESS);
}

Here, the result is correct and the program is much faster than when using a mutex:


Concurrent programming

Introduction


Inter-process synchronization


Pipes

You have already handled pipes without necessarily realizing it: in bash, the sequence of commands linked by pipes is done via anonymous pipes created by the bash process.

So when we run cmd1 | cmd2 | cmd3, bash creates 2 anonymous pipes and 3 processes, then redirects (thanks to the dup2 system call, see Lecture #11) standard input and output of processes to the different tubes.


Shared memory

We will see later (during lecture 11 on I/O) another use of mmap.


Semaphore


Intra-process synchronization


Mutex


Inter-process mutex

It is possible to synchronize threads from several processes with a pthread_mutex_t if it is in a shared memory area. For this, it is necessary to position the PTHREAD_PROCESS_SHARED attribute of the mutex with the function int pthread_mutexattr_setpshared(pthread_mutexattr_t *attr, int pshared);


Monitors

pthread_mutex_lock(&l);
  while(!condition) {
    pthread_cond_wait(&c, &l);
  }
  process_data();
pthread_mutex_unlock(&l);
pthread_mutex_lock(&l);
  produce_data();
  pthread_cond_signal(&c);
pthread_mutex_unlock(&l);

Here are the prototypes of the functions associated with the conditions:

The mutex ensures that between testing for the condition ( while (! condition)) and wait (pthread_cond_wait()), no thread performs the condition.

Inter-process monitors

To synchronize multiple processes with a monitor, it is necessary to set the following attributes:


Barrier

Once all the threads have reached the barrier, they are all unblocked and pthread_barrier_wait returns 0 except for one thread which returns PTHREAD_BARRIER_SERIAL_THREAD.


Inter-process barrier

To synchronize threads from multiple processes with a barrier, it is necessary to set the attribute PTHREAD_PROCESS_SHARED with int pthread_barrierattr_setpshared(pthread_barrierattr_t *attr, int pshared);


Read-Write lock


Classic synchronization patterns

In the literature, these problems are usually solved by using semaphores. This is because these problems have been theorized in the 1960s and 1970s by Dijkstra based on semaphores. In addition, semaphores have the advantage of being able to be used for inter-process synchronizations or intra-process.

However, modern operating systems implement many synchronization primitives which are much more efficient than semaphores. In the next slides, we will therefore rely on these mechanisms rather than semaphores.


Mutual exclusion synchronization pattern

   Prog1
mutex_lock(m)
 x=read (account) 
 x = x + 10
 write (account=x)
mutex_unlock(m)
   Prog2            
mutex_lock(m)   
 x=read (account)
 x = x - 100        
 write(account=x)
mutex_unlock(m)

Intra-process implementation

In a multi-threaded process, we just need to use a mutex of type pthread_mutex_t.

Inter-process implementation

To implement a mutual exclusion between several processes, several solutions exist


Cohort synchronization pattern


Producer / Consumer synchronization pattern

Implementation of a Producer / Consumer pattern

Producer:                        Consumer:
repeat                           repeat
...                               ...

mutex_lock(available_spots);         mutex_lock(ready_info);
  while(available_spots<=0)            while(ready_info<=0)
    cond_wait(available_spots);          cond_wait(ready_info);
  reserve_slot();                      extract(info)
mutex_unlock(available_spots);       mutex_unlock(ready_info); 

calcul(info)                      mutex_lock(available_spots);
                                    free_slot();
mutex_lock(ready_info);             cond_signal(available_spots)
  push(info);                     mutex_unlock(available_spots);
  cond_signal(ready_info);
mutex_unlock(ready_info);         ...
...                               endRepeat
endRepeat

Inter-process Producer / Consumer

It is of course possible to implement a producer / consumer scheme between processes using conditions and mutexes. Another simpler solution is to use a pipe: since writing in a pipe being atomic, the deposit of a data boils down to writing into the pipe, and reading from the pipe extracts the data.


Reader / Writer pattern


Implementation of a Reader / Writer synchronization pattern

Implementation with a mutex

It is possible to implement the reader / writer synchronization pattern using a mutex instead of rwlock: read and write operations are protected by a mutex. However, this implementation does not not allow multiple readers to work in parallel.

Implementation with a monitor

The implementation of the monitor-based reader / writer is more complex. It mainly requires: * an integer readers which counts the number of threads reading * a boolean writing which indicates that a thread is writing * a cond condition to notify changes to these variables * a mutex mutex to protect concurrent access

Here is an implementation of the reader / writer using a monitor:

#include <stdlib.h>
#include <unistd.h>
#include <stdio.h>
#include <pthread.h>

// This program simulates operations on a set of bank accounts
// Two kinds of operations are available:
// - read operation: compute the global balance (ie. the sum of all accounts)
// - write operation: transfer money from one account to another
//
// Here's an example of the program output:
//
// $ ./rw_threads_condition 
// Balance: 0 (expected: 0)
// 3982358 operation, including:
//         3581969 read operations (89.945932 % )
//         400389 write operations (10.054068 % )

#define N 200
int n_loops = 1000000;
int accounts[N];

int nb_read = 0;
int nb_write = 0;

pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;
pthread_cond_t cond = PTHREAD_COND_INITIALIZER;

int readers=0;
int writing=0;

/* read all the accounts */
int read_accounts() {
  pthread_mutex_lock(&mutex);
  while(writing)
    pthread_cond_wait(&cond, &mutex);
  readers++;
  pthread_mutex_unlock(&mutex);

  nb_read++;
  int sum = 0;
  for(int i=0; i<N; i++) {
    sum += accounts[i];
  }

  pthread_mutex_lock(&mutex);
  readers--;
  if(!readers) {
    pthread_cond_signal(&cond);
  }
  pthread_mutex_unlock(&mutex);
  return sum;
}

/* transfer amount units from account src to account dest */
void transfer(int src, int dest, int amount) {
  pthread_mutex_lock(&mutex);
  while(writing || readers)
    pthread_cond_wait(&cond, &mutex);
  writing = 1;
  pthread_mutex_unlock(&mutex);

  nb_write++;
  accounts[dest] += amount;
  accounts[src] -= amount;


  pthread_mutex_lock(&mutex);
  writing=0;
  pthread_cond_signal(&cond);
  pthread_mutex_unlock(&mutex);
}

void* thread_function(void*arg) { 
  for(int i=0; i<n_loops; i++) {

    /* randomly perform an operation
     * threshold sets the proportion of read operation.
     * here, 90% of all the operations are read operation
     * and 10% are write operations
     */
    int threshold = 90;
    int x = rand()%100;
    if(x < threshold) {
      /* read */
      int balance = read_accounts();
      if(balance != 0) {
	fprintf(stderr, "Error : balance = %d !\n", balance);
	abort();
      }
    } else {
      /* write */
      int src = rand()%N;
      int dest = rand()%N;
      int amount = rand()%100;
      transfer(src, dest, amount);
    }
  }
  return NULL;
}

int main(int argc, char**argv) {
  for(int i = 0; i<N; i++) {
    accounts[i] = 0;
  }

  int nthreads=4;
  pthread_t tid[nthreads];

  for(int i=0; i<nthreads; i++) {
    pthread_create(&tid[i], NULL, thread_function, NULL);
  }

  for(int i=0; i<nthreads; i++) {
    pthread_join(tid[i], NULL);
  }

  int balance = read_accounts();
  printf("Balance: %d (expected: 0)\n", balance);

  int nb_op = nb_read+nb_write;
  printf("%d operation, including:\n",nb_op);
  printf("\t%d read operations (%f %% )\n", nb_read, 100.*nb_read/nb_op);
  printf("\t%d write operations (%f %% )\n", nb_write, 100.*nb_write/nb_op);

  return EXIT_SUCCESS;
}

Synchronization

Introduction

If you want to study further synchronization primitives, and to understand memory models, the blog post We Make a std::shared_mutex 10 Times Faster <https://www.codeproject.com/Articles/1183423/We-Make-a-std-shared-mutex-10-Times-Faster> discusses in details atomic operations, instruction reordering, C++ memory model and various synchronization primitives.


Atomic operations

Motivation

\(\rightarrow\) Problem if the variable is modified by a other thread simultaneously


Can’t we just use volatile ?

Here is an example of a program that may suffer from overly aggressive optimization by the compiler:

#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>


#if USE_VOLATILE
volatile int a = 0;
#else
int a = 0;
#endif

void* thread1(void*arg) {
  while(a == 0) ;
  printf("Hello\n");
  return NULL;
}

void* thread2(void*arg) {
  a = 1;
  return NULL;
}

int main(int argc, char**argv) {
  pthread_t t1, t2;
  pthread_create(&t1, NULL, thread1, NULL);
  pthread_create(&t2, NULL, thread2, NULL);

  pthread_join(t1, NULL);
  pthread_join(t2, NULL);
  return EXIT_SUCCESS;
}

When compiled with the optimization level -O0 (i.e. without any optimization), thread1 spins waiting, and when thread2 modifies the variable a, it unlocks thread1 which displays Hello:

$ gcc -o volatile volatile.c -Wall -pthread -O0
$ ./volatile 
Hello
$

When compiled with the optimization level -O1, the generated code no longer works:

$ gcc -o volatile volatile.c -Wall -pthread -O1
$ ./volatile 
[waits indefinitely]
^C
$

Analyzing the code generated by the compiler reveals the problem:

$ gcc -o volatile volatile.c -Wall -pthread -O2
$ gdb ./volatile 
[...]
(gdb) disassemble thread1
Dump of assembler code for function thread1:
   0x0000000000000756 <+0>:  auipc  a5,0x2
   0x000000000000075a <+4>:  lw a5,-1778(a5) # 0x2064 <a>
   0x000000000000075e <+8>:  bnez   a5,0x762 <thread1+12>
   0x0000000000000760 <+10>:    j   0x760 <thread1+10>
   0x0000000000000762 <+12>:    add sp,sp,-16
   0x0000000000000764 <+14>:    auipc   a0,0x0
   0x0000000000000768 <+18>:    add a0,a0,36 # 0x788
   0x000000000000076c <+22>:    sd  ra,8(sp)
   0x000000000000076e <+24>:    jal 0x620 <puts@plt>
   0x0000000000000772 <+28>:    ld  ra,8(sp)
   0x0000000000000774 <+30>:    li  a0,0
   0x0000000000000776 <+32>:    add sp,sp,16
   0x0000000000000778 <+34>:    ret
nd of assembler dump.
$

We see here that at the address 0x760, the program jumps to the address 0x760. So it jumps in place indefinitely.

This is explained by the fact that the variable a is not volatile. The compiler therefore thinks it can optimize access to this variable: since the thread1 function only accesses the variable in read-mode, the program loads the variable in a register (here, the a5 register, see the instruction 0x75a), then consults the registry. When thread2 modifies the variable a, the modification is therefore not perceived by thread1!

Declaring the variable as volatile forces the compiler to read the variable each time:

$ gcc -o volatile volatile.c -Wall -pthread -O2 -DUSE_VOLATILE=1
$ gdb volatile
(gdb) disassemble thread1 
Dump of assembler code for function thread1:
   0x0000000000000756 <+0>:  add    sp,sp,-16
   0x0000000000000758 <+2>:  sd ra,8(sp)
   0x000000000000075a <+4>:  auipc  a4,0x2
   0x000000000000075e <+8>:  add    a4,a4,-1782 # 0x2064 <a>
   0x0000000000000762 <+12>:    lw  a5,0(a4)
   0x0000000000000764 <+14>:    beqz    a5,0x762 <thread1+12>
   0x0000000000000766 <+16>:    auipc   a0,0x0
   0x000000000000076a <+20>:    add a0,a0,34 # 0x788
   0x000000000000076e <+24>:    jal 0x620 <puts@plt>
   0x0000000000000772 <+28>:    ld  ra,8(sp)
   0x0000000000000774 <+30>:    li  a0,0
   0x0000000000000776 <+32>:    add sp,sp,16
   0x0000000000000778 <+34>:    ret
End of assembler dump. 

Here, the loop while (a == 0) is translated to the lines from 0x762 to 0x764. At each loop iteration, the value of a is loaded, then tested.


Atomic operations


Test and set

_Bool atomic_flag_test_and_set(volatile atomic_flag* obj);

Performs atomically:

int atomic_flag_test_and_set(int* flag) {
  int old = *flag;
  *flag = 1;
  return old;
}

Implementing a lock:

void lock(int* lock) {
  while(atomic_flag_test_and_set(lock) == 1) ;
}

Here is an example of a program using a test_and_set based lock:

#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>
#include <stdatomic.h>

#define NITER 1000000
#define NTHREADS 4

volatile int lock=0;

int x = 0;
#ifdef NOT_THREAD_SAFE

/* thread-unsafe version */
void do_lock() {
  while(lock) ;
  lock = 1;
}

void do_unlock() {
  lock = 0;
}

#else

/* thread-safe version */
void do_lock() {
  while(atomic_flag_test_and_set(&lock)) ;
}

void do_unlock() {
  lock = 0;
}

#endif	/* NOT_THREAD_SAFE */

void* thread_function(void* arg) {
  for(int i=0; i<NITER; i++) {
    do_lock();
    x++;
    do_unlock();
  }
  return NULL;
}

int main(int argc, char**argv) {
  pthread_t tids[NTHREADS];
  int ret;
  for(int i = 0; i<NTHREADS; i++) {
    ret = pthread_create(&tids[i], NULL, thread_function, NULL);
    assert(ret == 0);
  }
  for(int i = 0; i<NTHREADS; i++) {
    ret = pthread_join(tids[i], NULL);
    assert(ret == 0);
  }

  printf("x = %d\n", x);
  return EXIT_SUCCESS;
}

Compare And Swap (CAS)

_Bool atomic_compare_exchange_strong(volatile A* obj, C* expected, C desired);
bool CAS(int* obj, int* expected, int desired) {
  if(*obj != *expected) {
    *expected = *obj;
    return false;
  } else {
    *obj = desired;
    return true;
  }
}

Here is an example of a program handling a lock-free list thanks to compare_and_swap:


#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>
#include <stdatomic.h>

#define NITER 1000000
#define NTHREADS 4

struct node {
  int value;
  struct node* next;
};

struct node *stack = NULL;

#ifdef NOT_THREAD_SAFE

/* thread-unsafe version */
void push(int value) {
  struct node* n = malloc(sizeof(struct node));
  n->value = value;
  n->next = stack;
  stack = n;
}

int pop() {
  struct node* n = stack;
  int value = 0;
  if(n) {
    value = n->value;
    stack = n->next;
    free(n);
  }
  return value;
}

#else

/* thread-safe version */
void push(int value) {
  struct node* n = malloc(sizeof(struct node));
  n->value = value;
  n->next = stack;

  int done = 0;
  do {
    done = atomic_compare_exchange_strong(&stack, &n->next, n);
  } while(!done);
}

int pop() {
  int value = 0;
  struct node* old_head = NULL;
  struct node* new_head = NULL;
  int done = 0;

  do {
    /* Warning: this function still suffers a race condition (search for
     * "ABA problem" for more information).
     * Fixing this would be too complicated for this simple example.
     */
    old_head = stack;
    if(old_head)
      new_head = old_head->next;
    done = atomic_compare_exchange_strong(&stack, &old_head, new_head);
  } while (!done);

  if(old_head) {
    value = old_head->value;
    free(old_head);
  }
  return value;
}

#endif	/* NOT_THREAD_SAFE */


_Atomic int sum = 0;
void* thread_function(void* arg) {
  for(int i=0; i<NITER; i++) {
    push(1);
  }

  int value;
  while((value=pop()) != 0) {
    sum+=value;
  }

  return NULL;
}

int main(int argc, char**argv) {
  pthread_t tids[NTHREADS];
  for(int i = 0; i<NTHREADS; i++) {
    pthread_create(&tids[i], NULL, thread_function, NULL);
  }
  for(int i = 0; i<NTHREADS; i++) {
    pthread_join(tids[i], NULL);
  }
  printf("sum = %d\n", sum);
  return EXIT_SUCCESS;
}

Fetch and Add

int fetch_and_add(int* obj, int value) {
  int old = *obj;
  *obj = old+value;
  return old;
}

Here is an example of a program using fetch_and_add to atomically increment a variable:


#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>
#include <stdatomic.h>

#define NITER 1000000
#define NTHREADS 4

volatile int x = 0;

#ifdef NOT_THREAD_SAFE

/* thread-unsafe version */
void inc(volatile int * obj) {
  *obj = (*obj)+1;
}

#else

/* thread-safe version */
void inc(volatile int * obj) {
  atomic_fetch_add(obj, 1);
}

#endif 	/* NOT_THREAD_SAFE */

void* thread_function(void* arg) {
  for(int i=0; i<NITER; i++) {
    inc(&x);
  }
  return NULL;
}

int main(int argc, char**argv) {
  pthread_t tids[NTHREADS];
  for(int i = 0; i<NTHREADS; i++) {
    pthread_create(&tids[i], NULL, thread_function, NULL);
  }
  for(int i = 0; i<NTHREADS; i++) {
    pthread_join(tids[i], NULL);
  }

  printf("x = %d\n", x);
  return EXIT_SUCCESS;
}

Memory Fence (Barrière mémoire)


Synchronization primitives


Busy-waiting synchronization

int pthread_spin_lock(pthread_spinlock_t *lock);
int pthread_spin_unlock(pthread_spinlock_t *lock);

It is also possible to implement a spinlock using an atomic operation:

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <pthread.h>
#include <stdatomic.h>
#include <assert.h>

#define NITER 1000000
#define NTHREADS 4

struct lock {
  /* if flag=0, the lock is available
   * if flag=1, the lock is taken
   */
  volatile int flag;
};
typedef struct lock lock_t;


void lock(lock_t *l) {
  /* try to set flag to 1.
   * if the flag is already 1, loop and try again
   */
  while(atomic_flag_test_and_set(&l->flag)) ;
}

void unlock(lock_t *l) {
  l->flag = 0;
}

void lock_init(lock_t *l) {
  l->flag = 0;
}


lock_t l;
int x;

void* thread_function(void*arg){
  for(int i=0; i<NITER; i++) {
    lock(&l);
    x++;
    unlock(&l);
  }
  return NULL;
}

int main(int argc, char**argv) {
  lock_init(&l);

  pthread_t tids[NTHREADS];
  int ret;
  for(int i = 0; i<NTHREADS; i++) {
    ret = pthread_create(&tids[i], NULL, thread_function, NULL);
    assert(ret == 0);
  }
  for(int i = 0; i<NTHREADS; i++) {
    ret = pthread_join(tids[i], NULL);
    assert(ret == 0);
  }

  printf("x = %d\n", x);
  printf("expected: %d\n", NTHREADS*NITER);
  return EXIT_SUCCESS;

}

Futex


Implementing a mutex using a futex

Here is an example of a program implementing a mutex using futex:

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <pthread.h>
#include <stdatomic.h>
#include <linux/futex.h>
#include <sys/time.h>
#include <sys/syscall.h>
#include <errno.h>
#include <assert.h>

#define NITER 1000000
#define NTHREADS 4

struct lock {
  int flag;
};
typedef struct lock lock_t;


static int futex(int *uaddr, int futex_op, int val,
		 const struct timespec *timeout, int *uaddr2, int val3) {
  return syscall(SYS_futex, uaddr, futex_op, val,
		 timeout, uaddr2, val3);
}

void lock(lock_t *l) {  
  while (1) {
    /* Is the futex available? */
    int expected = 1;
    if (atomic_compare_exchange_strong(&l->flag, &expected, 0))
      return;      /* Yes */

    /* Futex is not available; wait */
    int s = futex(&l->flag, FUTEX_WAIT, 0, NULL, NULL, 0);
    if (s == -1 && errno != EAGAIN) {
      perror("futex_wait failed");
      abort();
    }
  }
}

void unlock(lock_t *l) {
  int expected = 0;
  atomic_compare_exchange_strong(&l->flag, &expected, 1);
  int s = futex(&l->flag, FUTEX_WAKE, 1, NULL, NULL, 0);
  if (s  == -1) {
    perror("futex_wake failed");
    abort();
  }
}

void lock_init(lock_t *l) {
  l->flag = 1;
}


lock_t l;
int x;

void* thread_function(void*arg){
  for(int i=0; i<NITER; i++) {
    //    printf("%d\n", i);
    lock(&l);
    x++;
    unlock(&l);
  }
  return NULL;
}

int main(int argc, char**argv) {
  lock_init(&l);

  pthread_t tids[NTHREADS];
  int ret;
  for(int i = 0; i<NTHREADS; i++) {
    ret = pthread_create(&tids[i], NULL, thread_function, NULL);
    assert(ret == 0);
  }
  for(int i = 0; i<NTHREADS; i++) {
    ret = pthread_join(tids[i], NULL);
    assert(ret == 0);
  }

  printf("x = %d\n", x);
  printf("expected: %d\n", NTHREADS*NITER);
  return EXIT_SUCCESS;

}

Implementing a monitor using a futex

struct cond {
  int cpt;
};

void cond_wait(cond_t *c, pthread_mutex_t *m) {
  int value = atomic_load(&c->value);
  pthread_mutex_unlock(m);
  futex(&c->value, FUTEX_WAIT, value);
  pthread_mutex_lock(m);
}

void cond_signal(cond_t *c) {
  atomic_fetch_add(&c->value, 1);
  futex(&c->value, FUTEX_WAKE, 0);
}

Here is an example of a program implementing a condition using futex:

#include <stdlib.h>
#include <unistd.h>
#include <stdio.h>
#include <pthread.h>
#include <sys/syscall.h>
#include <linux/futex.h>
#include <stdatomic.h>
#include <assert.h>

#define N 10

int n_loops = 20;

struct cond {
  int cpt;
};
typedef struct cond cond_t;

static int futex(int *uaddr, int futex_op, int val) {
  return syscall(SYS_futex, uaddr, futex_op, val, NULL, uaddr, 0);
}

void cond_init(cond_t *c) {
  c->cpt = 0;
}

void cond_wait(cond_t *c, pthread_mutex_t *m) {
  int cpt = atomic_load(&c->cpt);
  pthread_mutex_unlock(m);
  futex(&c->cpt, FUTEX_WAIT, cpt);
  pthread_mutex_lock(m);
}

void cond_signal(cond_t *c) {
  atomic_fetch_add(&c->cpt, 1);
  futex(&c->cpt, FUTEX_WAKE, 0);
}



struct monitor{
  int value;
  pthread_mutex_t mutex;
  cond_t cond;
};

int infos[N];
int i_depot, i_extrait;
int nb_produits = 0;
struct monitor places_dispo;
struct monitor info_prete;


void* function_prod(void*arg) {
  static _Atomic int nb_threads=0;
  int my_rank = nb_threads++;
  
  for(int i=0; i<n_loops; i++) {
    int cur_indice;
    int product_id;
    usleep(100);
    pthread_mutex_lock(&places_dispo.mutex);
    while(places_dispo.value == 0) {
      cond_wait(&places_dispo.cond, &places_dispo.mutex);
    }
    places_dispo.value--;
    cur_indice = i_depot++;
    i_depot = i_depot % N;

    product_id = nb_produits++;
    pthread_mutex_unlock(&places_dispo.mutex);

    usleep(500000);
    printf("P%d produit %d dans %d\n", my_rank, product_id, cur_indice);

    pthread_mutex_lock(&info_prete.mutex);
    infos[cur_indice] = product_id;
    info_prete.value ++;
    cond_signal(&info_prete.cond);
    pthread_mutex_unlock(&info_prete.mutex);
  }
  return NULL;
}


void* function_cons(void*arg) {
  static _Atomic int nb_threads=0;
  int my_rank = nb_threads++;
  
  for(int i=0; i<n_loops; i++) {
    int cur_indice;
    int product_id;
    usleep(100);
    pthread_mutex_lock(&info_prete.mutex);
    while(info_prete.value == 0) {
      cond_wait(&info_prete.cond, &info_prete.mutex);
    }
    info_prete.value--;
    product_id = infos[i_extrait];
    cur_indice = i_extrait;
    i_extrait = (i_extrait+1) % N;
    pthread_mutex_unlock(&info_prete.mutex);

    usleep(100000);
    printf("C%d consomme %d depuis %d\n", my_rank, product_id, cur_indice);

    pthread_mutex_lock(&places_dispo.mutex);
    places_dispo.value ++;
    cond_signal(&places_dispo.cond);
    pthread_mutex_unlock(&places_dispo.mutex);
  }
  return NULL;
}

void init_monitor(struct monitor *m, int value) {
  m->value = value;
  pthread_mutex_init(&m->mutex, NULL);
  cond_init(&m->cond);
}

int main(int argc, char**argv) {
  init_monitor(&places_dispo, N);
  init_monitor(&info_prete, 0);
  i_depot = 0;
  i_extrait = 0;

  
  int nthreads_prod=2;
  int nthreads_cons=2;
  pthread_t tid_prod[nthreads_prod];
  pthread_t tid_cons[nthreads_cons];
  int ret;
  
  for(int i=0; i<nthreads_prod; i++) {
    ret = pthread_create(&tid_prod[i], NULL, function_prod, NULL);
    assert(ret == 0);
  }
  for(int i=0; i<nthreads_cons; i++) {
    ret = pthread_create(&tid_cons[i], NULL, function_cons, NULL);
    assert(ret == 0);
  }

  for(int i=0; i<nthreads_prod; i++) {
    ret = pthread_join(tid_prod[i], NULL);
    assert(ret == 0);
  }
  for(int i=0; i<nthreads_cons; i++) {
    ret = pthread_join(tid_cons[i], NULL);
    assert(ret == 0);
  }

  return EXIT_SUCCESS;
}

Using synchronization


Deadlock


Lock granularity


Scalability of a parallel system

The notion of scalability is discussed in more detail in the module CSC5001 High Performance Systems.

Bibliography


System calls

Operating systems


Operating systems (2/2)

The operating system is responsible for operating various hardware. It, therefore, includes drivers capable of interacting with a particular material. The different drivers for the same type of peripheral offer the same interface, which allows the upper layers of the OS to use the hardware interchangeably.

The transition from user space to kernel space is done via a system call (syscall). The kernel processes the request for the application and returns a positive or zero integer on success, and -1 on failure.

From the application point of view, system calls are exposed as functions (defined in libc) in charge of executing the system call.


Testing the return value of system calls and functions

Testimony of a former ASR student: “Without insistence from [the CSC4508 teachers], it would not have jumped out to us so quickly that the problems (in the robotics championship) came from a lack of errors handling on a code that had not been carefully proofread”.

How to check the return value of a function and handle errors?

The macro void assert (scalar expression) tests the expression passed in parameter and, if false, displays a message error and terminates the program (with the abort () function):

  struct stat buf;
  int rc = stat(file, &buf);
  assert(rc>=0);
  // -> in case of an error, prints:
  //   appli: appli.c:12: main: Assertion `rc>=0' failed.
  //   Abandon

However, the macro should be used with caution because it is disabled when the program is compiled in optimized mode (with gcc -O3 for example).

So it is better to test the return code, display a message describing the error, and possibly terminate the process.

struct stat buf;
int rc = stat(file, &buf);
if(rc < 0) {
  fprintf(stderr, "Error\n");
  exit(EXIT_FAILURE); // or abort();
}

Displaying the cause of an error

The errno.h file lists standard errors. The manual of each system call (see man 2 function), and of each function (man 3 function) indicates, in the ERRORS section, the different error codes that may be returned.

The error message associated with a value of errno can be obtained with strerror () or perror ():

struct stat buf;
int rc = stat(file, &buf);
if(rc < 0) {
  fprintf(stderr, "Error while accessing file '%s': %s\n", file, strerror());
  // -> message "Error while accessing file 'plop': No such file or directory"
  exit(EXIT_FAILURE);
}

or

struct stat buf;
int rc = stat(file, &buf);
if(rc < 0) {
  perror("Error while accessing file");
  // -> message: "Error while accessing file: No such file or directory"
  exit(EXIT_FAILURE);
}

Generic error handling

It is possible to define a macro displaying an error message and indicating where the error occurred. For example:

#define FATAL(errnum, ...) do {                               \
    fprintf(stderr, "Error in %s:%d:\n", __FILE__, __LINE__); \
    fprintf(stderr, __VA_ARGS__);                             \
    fprintf(stderr, ": %s\n", strerror(errnum));              \
    abort();                                                  \
  } while(0)

int main(int argc, char**argv) {
  char *file = argv[1];
  struct stat buf;
  int rc = stat(file, &buf);
  if(rc < 0) {
    FATAL(errno, "Cannot access file '%s'", file);
  }
  return EXIT_SUCCESS;
}
// affiche:
//  Error in fatal.c:21:
//  Cannot access file 'plop': No such file or directory
//  Abandon

Debugger

When a program calls the abort () function in order to terminate the process, a core dump file (that describes the process when the error occured) can be generated in order to debug the program with gdb.

To activate the generation of a core dump, run the command ulimit -c unlimited. Therefore, the function abort () generates a core dump which can be supplied to gdb:

$ ./fatal  plop
Error in fatal.c:21:
Cannot access file 'plop': No such file or directory
Abandon (core dumped)

$ gdb ./fatal core 
GNU gdb (Debian 8.1-4+b1) 8.1
[...]
Reading symbols from ./fatal...(no debugging symbols found)...done.
[New LWP 11589]
Core was generated by `./fatal plop'.
Program terminated with signal SIGABRT, Aborted.
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
50      ../sysdeps/unix/sysv/linux/raise.c: Aucun fichier ou dossier de ce type.
(gdb) bt
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
#1  0x00007ffff7dfb535 in __GI_abort () at abort.c:79
#2  0x0000555555555232 in main (argc=2, argv=0x7fffffffdcd8) at fatal.c:21

On Linux distribution running systemd, the core dumps are managed by coredumpctl:


Stack frames


Content of a stack frame

Function call convention

Depending on the CPU architecture (and sometimes the compiler), the way of making a function call may vary.

x86 32 bits

On 32-bit x86 architectures, parameters are placed on the stack so that the first argument is located at address ebp + 8, the second at address ebp + 12 (if the first argument is stored on 4 bytes), etc.

The return address (i.e. the address of the instruction to run after function) is stored on the stack at the address ebp+4.

Stack frame on 32-bit x86 architectures

x86 64 bits

On 64-bit x86 architectures, the parameters are passed via the rdi, rsi, rdx, rcx, r8 and r9 registers. If there are more than 6 parameters, the next parameters are placed on the stack.

Stack frame on 64-bit x86 architectures

Arm

On Arm architectures, parameters are passed via registers (x0 to x7 on Arm 64 bits). The return address is also stored in a register.

Stack frame on 64-bit Arm architectures

RISC-V

On RISC-V architectures, parameters are passed via registers (a0 to a7) like Arm. If there are more parameters, or their values do not fit in 64 bits registers, they are placed on the stack. The return address is also stored in a register. The address of the previous stack frame is not kept, the compiler issues an instruction to increment the stack pointer back to the previous stack frame. It means the preamble and epilogue of compiled functions are slightly more complicated~.


Buffer overflow

Here is an example of buffer overflow:

#include <stdio.h>
#include <stdlib.h>

int main(int argc, char**argv) {

  int N = 4;
  char tab[N];
  int a = 17;

  for(int i=0; i<=N ; i++) {
    tab[i] = 'a'+i;
  }

  printf("tab = {%c, %c, %c, %c}\n", tab[0], tab[1], tab[2], tab[3]);
  printf("a = %d\n", a);
  return 0;
}

Example

Here, the bug comes from the loop in charge of filling the array which iterates too many times (because of <=. After the first 4 iterations, here is the memory status:

During the fifth iteration, the modification of tab [4] may modify one byte of the variable a:

The variable a is therefore no longer equal to 17, but 69 (or 0x45).

Security vulnerabilities

Buffer overflow bugs are potentially serious for the security of a system, because depending on an input (e.g. a string entered by the user), the bug may modify the behavior of the application (without necessarily crashing the program). In our example, if the variable a matches the username, the bug could allow attackers to pretend to be someone else (for example, an administrator)!

Buffer overflows are among the most common security vulnerabilities. To be convinced of this, just look for the vulnerability announcements that mention “buffer overflow” (around 780 faults in 2017)


Stack overflow

Example

Here is an example of stack overflow:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

void foo(char* str) {
  char new_str[16];
  strcpy(new_str, str);  
  printf("new_str = %s\n", new_str);
}

int main(int argc, char**argv) {

  foo(argv[1]);
  printf("Back in main()\n");
  return 0;
}

Here, the foo function does not check that new_str is large enough to hold str. So if str is too long, strcpy overflows and may overwrite the return address of foo.

Here is an example of execution leading to an stack overflow:

  $ gdb ./stack_overflow
  (gdb) r coucouAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
The program being debugged has been started already.
Start it from the beginning? (y or n) y
Starting program: stack_overflow coucouAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
new_str = coucouAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA

Program received signal SIGSEGV, Segmentation fault.
0x000055555555518e in foo (str=0x7fffffffe03e "coucou", 'A' <repeats 83 times>) at stack_overflow.c:9
9       }
(gdb) bt
#0  0x000055555555518e in foo (str=0x7fffffffe03e "coucou", 'A' <repeats 83 times>) at stack_overflow.c:9
#1  0x4141414141414141 in ?? ()
#2  0x4141414141414141 in ?? ()
#3  0x4141414141414141 in ?? ()
#4  0x4141414141414141 in ?? ()
#5  0x4141414141414141 in ?? ()
#6  0x4141414141414141 in ?? ()
#7  0x4141414141414141 in ?? ()
#8  0x4141414141414141 in ?? ()
#9  0x0000555555550041 in ?? ()
#10 0x0000000000000000 in ?? ()
(gdb) 

Here, we observe that when exiting the foo function, the program tries to execute the instruction located at the address 0x4141414141414141 (0x41 is the hexadecimal value of 'A'), which generates an error.

We could exploit the bug by inserting in argv [1] the address of the function void bar (int a, int b) as well as its parameters [@aleph1996smashing].


How to prevent buffer / stack overflow?

Hardware privilege levels

The implementation of the two operating modes is dependent on the processor architecture.

x86

On x86, there are four privilege levels called protection rings. Today, only two are used:

Two intermediate ones were used in a model where device drivers would run separately from the kernel.

Privilege levels for x86. (c) Hertzsprung at English Wikipedia.

RISC-V

Depending on the implemented platform, RISC-V uses up to three levels (sometimes called privilege modes).

From most privileged to less privileged:


User/system interface


User/system interface


User/system interface

Depending on the type of processor, the way of making a system call may vary. The way to pass the parameters is a convention which can vary from one OS to another. For example, for Linux:

x86_32

x86_64

ARM 64 bits

RISC-V

Bibliography


Interrupts and communication

Communication buses

Communication buses


The memory bus


DMA: Direct Memory Access


MMIO: Memory-Mapped IO


The input / output bus


The interrupt bus - principle

Interrupts

Receiving an interrupt: simple routing


Receiving an interrupt: example

  1. A block device on IRQ line 0X14 signals a data block is available
  2. The PLIC reads the configured priority of IRQ 0x14: 0x2
  3. The PLIC signals all processors with priority threshold \(<\) 0x2
  4. All signaled processors compete to serve the interrupt

Interrupt routing in NUMA architectures

In Non Uniform Memory Access architectures (NUMA), a device is linked to only one NUMA node. On RISC-V architectures, this means a device is linked to only one PLIC, as there is one PLIC per NUMA node. So only a processor from this NUMA node can serve interrupts from this device.

Interrupt routing on x86 architecture

On x86, interrupt routing goes through two tables configured by the kernel:

  1. Routing table: associates an IRQ with an IDT number
  2. IDT table (interrupt descriptor table): associates an IDT number to a interrupt handler

Two tables allow more flexibility than a single table which associates an IRQ number directly with a manager. This is different from RISC-V architecture where there can only be one interrupt handler, that must check the kind of the interrupt to serve it (e.g., a device interrupt, a timer interrupt, etc.).

Example of routing:

This is with only one processor; on multicore x86 systems:


Receiving an interrupt: simple routing (continued)


MSI: Message Signaling Interrupt for advanced interrupt management

MSIs and PLIC in RISC-V architectures

The PLIC (usually an APLIC, Advanced PLIC) remains used by devices that do not (need to) support MSI. When the platform supports MSIs, the APLIC converts wired (i.e., non message signaled) interrupts into MSIs. This is configured by the OS, as if they were direct MSIs from external devices.

MSIs on x86

On x86 systems, MSIs work roughly the same:


Inter-core communication

IPIs on x86 architectures


Other interruptions: system calls and exceptions

On x86 systems, the IDT table is used for every possible interruption:

The IDT table is therefore the table that contains all of the entry points to the kernel:


Time management: two sources


Virtual memory

Introduction

Regarding the principle of inclusion, in an Intel architecture, the L1 cache (Level 1) is included in L2 cache (Level 2), which is itself included in RAM, which is included in the swap (disk).

Here are the typical access times to data located in the different types of memory on a “classic” machine (Intel Core i5 Skylake processor) in 2017 :

The following table shows the cost difference between the types of memory (and the evolution over the years):

Year 2008 2009 2010 2014 2019 2023
Hard disk drive, 7200 tr/mn (in €/GiB) 0,50 0,32 0,10 0,04 0.027 0.023
SSD disk (in €/GiB) 0,50 0.17 0.09
USB key (in €/GiB) 1,64 0.62 0.27 0.06
NVMe (in €/GiB) 0.21 0.10
RAM (in €/GiB) 37,00 21,85 8.75 7.23 2.75

Paging


Overview


Status of memory pages

In Linux, page frames are 4KB in size (defined size by the constants PAGE_SIZE and PAGE_SHIFT in the file page.h).


Logical (or virtual) address

On 64-bit Intel (x86_64) or ARM 64 bits architectures (ARMv8), the addresses are stored on 64 bits (i.e. size (void *) is 8 bytes), but only 48 bits are usable for virtual addresses.

On RISC-V architectures, a system can choose from four virtual address sizes, including 48 bits as other architectures, and also 57 bits.

Huge pages

Some applications use large amounts of data (sometimes several GiB) that must be placed in a large number of 4 KiB memory pages. In order to limit the number of pages to handle, some architectures (especially x86_64 and RISC-V) allow the use of larger memory pages (typically 2~MiB and 1~GiB, but also bigger) which are called huge pages in the Linux kernel (sometimes megapages and gigapages).


Page table


Implementation of a page table
uint64_t cur = %satp3;            // cur = root table physical address
for(int i=0; i<3; i++)
  cur = ((uint64_t*)cur)[n[i]]; // physical memory access, next entry
return cur + offset;            // add the offset


Translation Lookaside Buffer (TLB)

Intel architectures have Translation Look-aside Buffers (TLB) with 32, 64, or even 256 entries. TLB are sometimes called address translation cache.


User point of view


Memory space of a process


Memory mapping

$ cat /proc/self/maps
5572f3023000-5572f3025000 r--p 00000000 08:01 21495815   /bin/cat
5572f3025000-5572f302a000 r-xp 00002000 08:01 21495815   /bin/cat
5572f302e000-5572f302f000 rw-p 0000a000 08:01 21495815   /bin/cat
5572f4266000-5572f4287000 rw-p 00000000 00:00 0          [heap]
7f33305b4000-7f3330899000 r--p 00000000 08:01 22283564   /usr/lib/locale/locale-archive
7f3330899000-7f33308bb000 r--p 00000000 08:01 29885233   /lib/x86_64-linux-gnu/libc-2.28.so
7f33308bb000-7f3330a03000 r-xp 00022000 08:01 29885233   /lib/x86_64-linux-gnu/libc-2.28.so
[...]
7f3330ab9000-7f3330aba000 rw-p 00000000 00:00 0 
7ffe4190f000-7ffe41930000 rw-p 00000000 00:00 0          [stack]
7ffe419ca000-7ffe419cd000 r--p 00000000 00:00 0          [vvar]
7ffe419cd000-7ffe419cf000 r-xp 00000000 00:00 0          [vdso]


Memory allocation

However:

The following program illustrates how setting a pointer to NULL allows to crash immediatly and how using a debugger allows to quickly find the origin of the error.

/**********************/
/* resetToNULL.c */
/**********************/

/* This program illustrates the value of assigning a variable to NULL
   which contains a pointer to an unallocated buffer.
   Indeed, it makes a segmentation fault, which makes it possible to
   identify that we have executed an illegal operation. Using a
   debugger allows to understand the problem.
*/
/*                                                                    */
/* 1) Compile the program with the option -g                          */
/*    cc -g -o resetToNULL resetToNULL.c                              */
/* 2) ./resetToNULL                                                   */
/*    ==> Segmentation fault                                          */
/* 3) ulimit -c unlimited                                             */
/* 4) ./resetToNULL                                                   */
/*    ==> Segmentation fault (core dumped)                            */
/* 5) ddd ./resetToNULL core                                          */

#include <stdlib.h>
#include <assert.h>

void h(char *p){
  *p = 'a';
}

void g(char *p){
  h(p);
}

void f(char *p){
  g(p);
}

int main(){
  char *p = NULL;

  f(p);

  p = malloc(1);
  assert(p != NULL);

  f(p);

  free(p);
  p = NULL;

  f(p);

  return EXIT_SUCCESS;
}

Memory alignment

Memory alignment applies to variables as well as to members of data structures.

The following program illustrates how alignement affects the size of a data structure:

#include <stdio.h>
#include <stdlib.h>
#include <stddef.h>

struct plop {
  int a;
  char b;
  int c;
};

struct plop_packed {
  int a;
  char b;
  int c;
} __attribute__((packed));


int main(void) {

  struct plop p1;
  struct plop_packed p2;
  printf("struct plop -- size: %lu bytes, address: %p\n",
	 sizeof(struct plop), &p1);
  printf("\t.a -- size: %lu bytes, address: %p, offset: %lu\n",
	 sizeof(p1.a), &p1.a, offsetof(struct plop, a));
  printf("\t.b -- size: %lu bytes, address: %p, offset: %lu\n",
	 sizeof(p1.b), &p1.b, offsetof(struct plop, b));
  printf("\t.c -- size: %lu bytes, address: %p, offset: %lu\n",
	 sizeof(p1.c), &p1.c, offsetof(struct plop, c));
  printf("\n");

  printf("struct plop_packed -- size: %lu bytes, address: %p\n",
	 sizeof(struct plop_packed), &p2);
  printf("\t.a -- size: %lu bytes, address: %p, offset: %lu\n",
	 sizeof(p2.a), &p2.a, offsetof(struct plop_packed, a));
  printf("\t.b -- size: %lu bytes, address: %p, offset: %lu\n",
	 sizeof(p2.b), &p2.b, offsetof(struct plop_packed, b));
  printf("\t.c -- size: %lu bytes, address: %p, offset: %lu\n",
	 sizeof(p2.c), &p2.c, offsetof(struct plop_packed, c));
  printf("\n");
  return 0;
}
$ ./memory_alignment
struct plop -- size: 12 bytes, address: 0x7ffc05ad8184
        .a -- size: 4 bytes, address: 0x7ffc05ad8184, offset: 0
        .b -- size: 1 bytes, address: 0x7ffc05ad8188, offset: 4
        .c -- size: 4 bytes, address: 0x7ffc05ad818c, offset: 8

struct plop_packed -- size: 9 bytes, address: 0x7ffc05ad817b
        .a -- size: 4 bytes, address: 0x7ffc05ad817b, offset: 0
        .b -- size: 1 bytes, address: 0x7ffc05ad817f, offset: 4
        .c -- size: 4 bytes, address: 0x7ffc05ad8180, offset: 5

The libc point of view


Memory allocation strategies


Non-Uniform Memory Access

\(\rightarrow\) Non-Uniform Memory Access \(\rightarrow\) On which memory bank to allocate data?


First touch allocation strategy

  double *array = malloc(sizeof(double)*N);

  for(int i=0; i<N; i++) {
    array[i] = something(i);
  }

  #pragma omp parallel for
  for(int i=0; i<N; i++) {
    double value = array[i];
    /* ... */
  }


Interleaved allocation strategy

  double *array =
    numa_alloc_interleaved(sizeof(double)*N);

  for(int i=0; i<N; i++) {
    array[i] = something(i);
  }

  #pragma omp parallel for
  for(int i=0; i<N; i++) {
    double value = array[i];
    /* ... */
  }

It is also possible to use set_mempolicy in order to choose an allocation strategy for future memory allocations.


mbind

  double *array = malloc(sizeof(double)*N);
  mbind(&array[0], N/4*sizeof(double),
	MPOL_BIND, &nodemask, maxnode,
	MPOL_MF_MOVE);

  #pragma omp parallel for
  for(int i=0; i<N; i++) {
    double value = array[i];
    /* ... */
  }


Architecture

Introduction

In fact, the compiler generally manages to generate a binary which exploits all the capacities of the processor. But the compiler sometimes fails and generates non-optimized code. We must therefore be able to detect the problem, and be able to write code that the compiler can optimize.


Moore’s Law


Evolution of processors performance

Source

Sequential processor

The number of steps required to execute an instruction depends on the processor type (Pentium 4: 31 steps, Intel Haswell: 14-19 steps, ARM9: 5 steps, etc.)


Instruction pipeline

At each stage, several circuits are used

\(\rightarrow\) One instruction is executed at each cycle

Execution of instructions on a processor with pipeline


Micro architecture of a pipeline

Micro-architecture of a pipeline

Superscalar processors

\(\implies\) several instructions executed simultaneously!

Micro-architecture of a superscalar processor


Superscalar processors throughput


Dependence between instructions

Limitations of the superscalar:

a = b * c;
d = a + 1;


Branching

    cmp a, 7    ; a > 7 ?
    ble L1
    mov c, b    ; b = c
    br L2
L1: mov d, b    ; b = d
L2: ...

\(\implies\) waste of time

The cost of a wrong choice when loading a branch depends on pipeline depth: the longer the pipeline, the longer it takes to empty it (and therefore wait before executing an instruction). For this reason (among others), the depth of the pipeline in a processor is limited.


Branch prediction

0x12 loop:
         ...
0x50     inc eax
0x54     cmpl eax, 10000
0x5A     jl loop
0x5C end_loop:
         ...

The branch prediction algorithms implemented in modern processors are very advanced and reach a efficiency greater than 98 % (on the SPEC89 benchmark suite).

To know the number of good / bad predictions, we can analyze the hardware counters of the processor. With the PAPI library http://icl.cs.utk.edu/projects/papi/, the PAPI_BR_PRC and PAPI_BR_MSP counters give the number of conditional jumps correctly and incorrectly predicted.

Linux perf also allows collects this information (among others). For example:

$ perf stat -e branches,branch-misses ./branch_prediction  0
is random is not set
100000000 iterations in 1178.597000 ms
result=199999996

 Performance counter stats for './branch_prediction 0':

      2447232697      branches
         6826229      branch-misses            #    0,28% of all branches

       1,179914189 seconds time elapsed

       1,179784000 seconds user
       0,000000000 seconds sys

Vector instructions

for(i=0; i<size; i++) {
   C[i] = A[i] * B[i];
}
for(i=0; i<size; i+= 8) {
   *pC = _mm_mul_ps(*pA, *pB);
   pA++; pB++; pC++;
}

Vector instructions were democratized at the end of the years 1990 with the MMX (Intel) and 3DNow! (AMD) instruction sets that allow to work on 64 bits (for example to process 2 32-bit operations at once). Since then, each generation of x86 processors brings new extension to the instruction set: SSE2, SSSE3 (128 bit), SSE4, AVX, AVX2 (256 bit), AVX512 (512 bit). The other types of processors also provide vector instructions sets (eg NEON [128 bits], Scalable Vector Extension [SVE] on ARM), or the Vector Extension of RISC-V.

Vector instruction sets are specific to certain processors. The /proc/cpuinfo file contains (among others) the instructions sets that are available on the processor of a machine. For example, on an Intel Core i7:

$ cat  /proc/cpuinfo 
processor   : 0
vendor_id   : GenuineIntel
cpu family  : 6
model       : 69
model name  : Intel(R) Core(TM) i7-4600U CPU @ 2.10GHz
stepping    : 1
microcode   : 0x1d
cpu MHz     : 1484.683
cache size  : 4096 KB
physical id : 0
siblings    : 4
core id     : 0
cpu cores   : 2
apicid      : 0
initial apicid  : 0
fpu     : yes
fpu_exception   : yes
cpuid level : 13
wp      : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov
 pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx
 pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl
 xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64
 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid
 sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx
 f16c rdrand lahf_lm abm ida arat epb pln pts dtherm tpr_shadow vnmi
 flexpriority ept vpid fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms
 invpcid xsaveopt
bugs        :
bogomips    : 5387.82
clflush size    : 64
cache_alignment : 64
address sizes   : 39 bits physical, 48 bits virtual
power management:
[...]

The flags field contains the list of all the capabilities of the processor, especially the available instructions sets: mmx, sse, sse2, ssse3, sse4_1, sse4_2, avx2.

Vector instruction can be used directly in assembler or by exploiting the intrinsics provided by compilers. However, because of the number of available instruction sets and since each new processor generation provides new instructions sets, it is recommended to leave the compiler optimize the code, for example using the -O3 option.


Parallel Processing


Hyperthreading / SMT

SMT is an inexpensive way to increase the performance of a processor: by duplicating the “small” circuits (ALU, registers, etc.) and by pooling the “big” circuits (FPU, prediction of branches, caches), we can execute several threads simultaneously. The additional cost in terms of manufacturing is light and the gain in performance can be significant.

Since the dispatcher schedules the instructions of several threads, a branch miss-prediction becomes less serious since while the pipeline of the thread is emptied, another thread can be scheduled.

The performance gain when multiple threads are running is not systematic since some circuits remain shared (by example, the FPU).


Multi-core processors

\(\rightarrow\) Duplicate all the circuits

It is of course possible to combine multi-core with SMT. Most semiconductor foundries produce multi-core SMT processors: Intel Core i7 (4 cores x 2 threads), SPARC T3 Niagara-3 (16 cores x 8 threads), IBM POWER 7 (8 cores x 4 threads).


Symmetric Multi-Processing (SMP)


NUMA architectures

\(\rightarrow\) Non-Uniform Memory Architecture

The first NUMA machines (in the 1990s) were simply sets of machines linked by a proprietary network responsible for managing memory transfers. Since 2003, some motherboards allow to plug several Opteron processors (AMD) connected with a HyperTransport link. Intel subsequently developed a similar technology (Quick Path Interconnect, QPI) to connect its Nehalem processors (released in 2007).


Memory hierarchy


Memory wall

Until the 1990s, performance was limited by the performance of the processor. From the software point of view, developers had to minimize the number of instructions to be executed in order to achieve the best performance.

As the performance of processors increases, the bottleneck is now the memory. On the software side, we therefore seek to minimize the number of costly memory accesses. This pressure on memory is exacerbated by the development of multi-core processors.

For example, an Intel Core i7 processor can generate up to 2 memory access per clock cycle. A 18-core processor with hyper-threading (ie 36 threads) running at 3.1 Ghz can therefore generate \(2 \times 36 \times 3.1 \times 10 ^ 9 = 223.2\) billion memory references per second. If we consider access to 64-bit data, this represents 1662 GiB/s (1.623 TiB/s). In addition to these data accesses, the memory access to the instructions (up to 128 bits per instruction) also have to be taken into account. We thus arrive to a 3325 GiB/s (therefore 3.248 TiB/s !) maximum flow.

For comparison, in 2023 a DDR5 RAM DIMM has a maximum throughput of around 70 GiB/s. It is therefore necessary to set up mechanisms to prevent the processor from spending all its time waiting for memory.


Cache memory

To visualize the memory hierarchy of a machine, you can use the lstopo tool provided by the hwloc project.

Source: https://gist.github.com/jboner/2841832

Memory Management Unit (MMU)


Fully-associative caches

\(\rightarrow\) Mainly used for small caches (ex: TLB)

The size of a cache line depends on the processor (usually between 32 and 128 bytes). You can find this information in /proc/cpuinfo:

$ cat /proc/cpuinfo  |grep cache_alignment
cache_alignment : 64

Direct-mapped caches

\(\rightarrow\) Direct access to the cache line


Set-associative caches

\(\rightarrow\) K-way associative cache (in French: Cache associatif K-voies)

Nowadays, caches (L1, L2 and L3) are generally associative to 4 (ARM Cortex A9 for example), 8 (Intel Sandy Bridge), or even 16 (AMD Opteron Magny-Cours) ways.


Cache consistency

To detail this course a little more, we recommend this page web: Modern microprocessors – A 90 minutes guide! http://www.lighterra.com/papers/modernmicroprocessors/.

For (many) more details, read the books [@bryant] and [@patterson2013computer] which describe in detail the architecture of computers. If you are looking for specific details, read [@patterson2011computer].

Bibliography


Input/Output


In this lecture, we mainly talk about files, as this is the easiest example of I/O to manipulable. However, note that the content of the first 3 sections apply to I/O other than files (eg sockets).

Reminder on files:

On Unix, the commands hexdump -C filename, bless filename or xxd filename show the exact content of a file. Use them to

  1. compare the contents of helloWorldUnix.c and helloWorldWindows.c

  2. see that the file default_names_fichierIssuDuTP10DuModuleCSC4103.txt is not quite a text file (and, see also how are the accented characters stored in a file)

The Linux system and the C library provide sequential and direct access modes. For an indexed sequential access mode, other libraries are required (Unix NDBM, GDBM, Oracle Berkeley DB, …).


Buffered / non-buffered IO

\(\dag\) To be exact, an “unbuffered” I/O generates a system call. The OS can then decide to cache the data or no.


I/O primitives


File open / close

About the O_SYNC option in open:

Note that we can also create a file using the creat primitive:


Reading on a file descriptor

In the case where the read function is used on a descriptor other than a file (e.g. a pipe, or a socket), the fact that the number of bytes read may not equal count may have other meanings:


Writing on a file descriptor

Writing to disk is atomic: if two processes \(P_1\) and \(P_2\) simultaneously write to the same file in the same location, when the two processes have finished their writing, we will find:

Note that when the file is opened with the option O_APPEND, if \(P_1\) and \(P_2\) write simultaneously (at the end of the file, because of O_APPEND), when the two processes will have finished their writing, we will find at the end of file:

No writing is therefore lost! Attention, this concurrent write at the end of file is not equivalent to two processes simultaneously performing the following operations:

lseek(fd,0,SEEK_END); /* move the cursor to the end of file */
write(fd,data,taille);

In fact, in the latter case, one of the written data may by overwritten by the other.

The copy.c file on the next page illustrates the use of open, read, write and close.

/************/
/* copy.c */
/************/
#include <stdlib.h>
#include <unistd.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <string.h>
#include <stdio.h>

#define USAGE "USAGE: copy src dest\n"
#define WRITE_ERROR "write error (no space left on device ?)\n"

int source, dest;
int buf;
int nb_read, nb_written;

int main(int argc, char *argv[]) {
  if (argc != 3) {
    write(STDERR_FILENO, USAGE, strlen(USAGE));
    return EXIT_FAILURE;
  }
  source = open(argv[1], O_RDONLY);
  if (source < 0) {
    perror(argv[1]);
    return EXIT_FAILURE;
  }
  dest = open(argv[2],
              O_WRONLY|O_CREAT|O_TRUNC,
              S_IRWXU|S_IRWXG|S_IRWXO);
  if (dest < 0) {
    perror(argv[2]);
    return EXIT_FAILURE;
  }
  while ((nb_read = read(source, (void*)&buf, sizeof(buf))) > 0) {
    nb_written = write(dest, (void*)&buf, nb_read);
    if (nb_written <= 0) {
      if (nb_written == 0) {
        write(STDERR_FILENO, WRITE_ERROR, strlen(WRITE_ERROR));
      }
      else {
        perror("write");
      }
      return EXIT_FAILURE;
    }
  }
  if (nb_read < 0) {
    perror("read");
    return EXIT_FAILURE;
  }
  if (close(source) < 0) {
    perror(argv[1]);
    return EXIT_FAILURE;
  }
  if (close(dest) < 0) {
    perror(argv[2]);
    return EXIT_FAILURE;
  }
  return EXIT_SUCCESS;
}

This operation of copying the contents of one file to another descriptor is an operation frequently performed in web servers. Indeed, these servers must in particular send the content of files to client who have requested them. This is why the linux system offers the sendfile primitive (ssize_t sendfile (int out_fd, int in_fd, off_t * offset, size_t count)). It reads count bytes of in_fd and write them to out_fd (which must match an socket). sendfile is more more efficient than the combination read / write.

The fallocate function is the Linux specific version of the portable function posix_fallocate.


File descriptor duplication


I/O and concurrence


Locking a file

struct flock {
  short l_type;
  short l_whence;
  off_t l_start;
  off_t l_len;
};

int fcntl(int fd, F_SETLK, struct flock*lock);

The exclusive-lock.c file illustrates exclusive file locking:

/***********/
/* exclusive_lock.c */
/***********/
#include <stdlib.h>
#include <unistd.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <stdio.h>

int main(){
  int fd;
  struct flock lock;

  fd = open("/tmp/ficTest",O_RDWR|O_CREAT, S_IRWXU|S_IRWXG|S_IRWXO);
  if (fd < 0) {
    perror("open");
    exit(EXIT_FAILURE);
  }

  /* Exclusive lock on the 15th byte */
  lock.l_type = F_WRLCK;
  lock.l_whence = SEEK_SET;
  lock.l_start = 15;
  lock.l_len = 1;

  /* Because of the F_SETLKW parameter, we get stuck on the fcntl if */
  /* the lock cannot be acquired                                   */
  printf("attempt to acquire an exclusive lock by process %d...\n",
	 getpid());
  if (fcntl(fd, F_SETLKW, &lock) < 0){
    perror("Acquiring lock");
    exit(EXIT_FAILURE);
  }
  printf("... Exclusive lock acquired by process %d\n", getpid());

  /* Here we could do the processing that needed to be protected */
  /* by the lock                                                 */
  sleep(10);

  /* Release the lock */
  printf("Releasing the lock by process %d...\n", getpid());
  lock.l_type = F_UNLCK;
  lock.l_whence = SEEK_SET;
  lock.l_start = 15;
  lock.l_len = 1;
  if (fcntl(fd, F_SETLK, &lock) < 0){
    perror("Releasing lock");
    exit(EXIT_FAILURE);
  }
  printf("...OK\n");

  return EXIT_SUCCESS;
}

The shared-lock.c file illustrates the shared locking:

/*****************/
/* shared_lock.c */
/*****************/
#include <stdlib.h>
#include <unistd.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <stdio.h>

int main(){
  int fd;
  struct flock lock;

  fd = open("/tmp/ficTest",O_RDWR|O_CREAT, S_IRWXU|S_IRWXG|S_IRWXO);
  if (fd < 0) {
    perror("open");
    exit(EXIT_FAILURE);
  }

  /* Shared lock on the 15th byte */
  lock.l_type = F_RDLCK;
  lock.l_whence = SEEK_SET;
  lock.l_start = 15;
  lock.l_len = 1;

  /* Because of the F_SETLKW parameter, we get stuck on the fcntl if */
  /* the lock cannot be acquired                                   */
  printf("attempt to acquire a shared lock by process %d...\n",
	 getpid());
  if (fcntl(fd, F_SETLKW, &lock) < 0){
    perror("Acquiring lock");
    exit(EXIT_FAILURE);
  }
  printf("... shared lock acquired by process %d\n", getpid());

  /* Here we could do the processing that needed to be protected */
  /* by the lock                                                 */
  sleep(10);

  /* Release the lock */
  printf("Releasing the lock by process %d...\n", getpid());
  lock.l_type = F_UNLCK;
  lock.l_whence = SEEK_SET;
  lock.l_start = 15;
  lock.l_len = 1;
  if (fcntl(fd, F_SETLK, &lock) < 0){
    perror("Releasing lock");
    exit(EXIT_FAILURE);
  }
  printf("...OK\n");

  return EXIT_SUCCESS;
}

Offset manipulation


Improving the I / O performance


Giving advices to the kernel

Since January 2011, we know that this function is used in Firefox to reduce startup time by 40 % to 50 % by loading more efficiently GUI libraries xul.dll and mozjs.dll (more information here <https://bugzilla.mozilla.org/show_bug.cgi?id=627591>).


Asynchronous I/O

int aio_read(struct aiocb *aiocbp);
int aio_write(struct aiocb *aiocbp);
int aio_suspend(const struct aiocb * const aiocb_list[],
                int nitems,
                const struct timespec *timeout);
int aio_error(const struct aiocb *aiocbp);

For more information on asynchronous I/O, refer to the documentation (man 7 aio).

The current implementation of AIO Posix is provided in user-land by libc and can cause scalability issues. Another solution is to use the Asynchronous I/O interface provided by the Linux kernel (see the system calls io_submit, io_setup, etc.), or the libaio library which provides an overlay to Linux system calls.


mmap

void *mmap(void *addr, 
           size_t length,
           int prot,
           int flags,
           int fd,
           off_t offset);
int munmap(void *addr, size_t length);

To ensure that the memory accesses have been passed on to the disk, you can use the msync function.


File systems

Device and device driver


Device and device driver


Devices in UNIX


2 types of peripherals


Block devices in xv6


Principle of the virtio_disk_rw algorithm

xv6 is written to run on a virtual machine, i.e., on a special environment where devices are indeed virtualized. One interface designed to perform best with those virtual devices is the virtio interface. While the virtio protocol is different from the one used by real, physical block devices (e.g., IDE or SATA), in both cases, DMA and interruptions are used.


The I / O cache


Principle of an I/O cache


The xv6 buffer cache


How the buffer cache works: buffer management (1/3)


How the buffer cache works: read buffer (2/3)


How the buffer cache works: write buffer (3/3)


The log


Operation versus writing to disk


Consistency issues


Bad solutions


First idea: transactions


Second idea: log


Third idea: parallel log


Log structure

\(\rightarrow\) The system can therefore manage up to 3 copies of a block


Log algorithm principle


Using the log

begin_op();
b = bread(...);
// Modify data of b
...
log_write(b2);
...
end_op();

Implementation in xv6 (1/3)


Implementation in xv6 (2/3)

The log controls block writes and releases through log_write() and end_op(). System calls that implement access to blocks never use bwrite() and brelse() directly. Instead, the log keeps track of blocks that must be written to disk: they are called dirty blocks, because their content cached in the buffer cache is different from their content in the filesystem on the disk.


Implementation in xv6 (3/3)


Partitions and file systems


File system


Principle of a file system


Partitions


Disk image


UFS/xv6 file system


Overall file system structure


Dinode


Data blocks of a file


Adding a block to a file


Directories


From path to inode

cur = 1
For i in  [0 .. n]
    Look for the association [inum, name] in the data blocks of
        the cur dinode such that name is ei
    cur = inum

File creation and deletion


xv6 I/O stack


Inode


Main functions of inodes (1/3)


Main functions of inodes (2/3)


Main functions of inodes (3/3)


Open files


File descriptors


What you must remember

Bibliography