2024
During the sessions, you will develop an OS
Based on the xv6 OS
On the computer architecture RISC-V
Development of new OS mechanisms
sprint sessions:
a0: register where the return value is stored (the x86-64 equivalent is rax)
Execution flow != Resources
In a multi-threaded process, each thread has a context (registers + stack). The rest of the memory (code, data, etc.) and resources (open files, etc.) are shared between threads.
The stacks of the different threads are located in memory so that they can grow. However, if a thread's stack grows too much, it might overflow onto the stack of another thread. To avoid this problem, the size of the stack is limited (the command ulimit -s gives the maximum stack size). This size limit can be changed from the command line (for example ulimit -s 32768), or from a program (using the setrlimit function).
int pthread_create(pthread_t *thread, const pthread_attr_t *attr, void *(*start_routine) (void *), void *arg);
thread (out): identifier of the created thread
attr (in): attributes of the thread to be created
start_routine (in): function to be executed once the thread is created
arg (in): parameter to pass to the function
We present here the Pthread API (POSIX threads), which is the most widely used in C. The C11 standard defines another interface for manipulating threads; however, there are only a few implementations of this interface. The de facto standard therefore remains Pthread.
Unlike the creation of processes which generates a hierarchy (ie. each process has a parent process), there is no hierarchy between threads.
int pthread_exit(void* retval);
retval (in): return value of the thread
int pthread_join(pthread_t tid, void **retval);
tid (in): thread whose termination is awaited
retval (out): return value of the thread
Technically, all the memory space is shared between the threads. It is therefore possible to share all the variables, including local variables.
fread: the result depends on the position of the stream cursor
strtok: another example of a non-reentrant function is the char *strtok(char * str, char * delim) function. This function extracts substrings from a string.
For example, the following code displays the different directories of the PATH variable:
#include <stdlib.h>
#include <stdio.h>
#include <string.h>

void extract_path() {
  char* string = getenv("PATH");
  printf("Parsing '%s'\n", string);

  for(char* token = strtok(string, ":") ;
      token ;
      token = strtok(NULL, ":") ){
    printf("\t %s\n", token);
  }
}

int main(int argc, char**argv) {
  extract_path();
  return 0;
}
Here is an example of result obtained with this program:
Parsing '/usr/local/bin:/usr/bin:/bin:/usr/local/games:/usr/games'
/usr/local/bin
/usr/bin
/bin
/usr/local/games
/usr/games
The strtok function is not reentrant because it is based on a previous state (a pointer to the last character tested in the string). Thus, in this example, the processing applied to each token cannot itself use strtok. For example:
#include <stdlib.h>
#include <stdio.h>
#include <string.h>

void extract_path() {
  char* string = getenv("PATH");
  printf("Parsing '%s'\n", string);
  // string should contain a list of directories separated with :
  // eg. /usr/local/bin:/usr/bin:/bin:/usr/local/games:/usr/games

  // Extract the directories
  // eg. /usr/local/bin, /usr/bin, /bin, /usr/local/games, /usr/games
  for(char* token = strtok(string, ":") ;
      token ;
      token = strtok(NULL, ":") ){
    // token contains a directory (eg. /usr/local/bin)
    printf("\t %s contains: ", token);

    // Extract the subdirectories
    // eg. usr, local, bin
    for(char* word = strtok(token, "/ ") ;
        word ;
        word = strtok(NULL, "/") ){
      printf("%s ", word);
    }
    printf("\n");
  }
}

int main(int argc, char**argv) {
  extract_path();
  return 0;
}
Will result in:
Parsing '/usr/local/bin:/usr/bin:/bin:/usr/local/games:/usr/games'
/usr/local/bin contains: usr local bin
Here the first token (/usr/local/bin) is split into words (usr, local, bin) by successive calls to strtok which modify the previous state of strtok, which prevents subsequent calls to token = strtok(NULL, ":") from iterating over the string.
It is possible to make a non-reentrant function reentrant by adding a parameter corresponding to the state of the function. For example, the reentrant version of char* strtok(char *str, const char *delim); is char* strtok_r(char *str, const char *delim, char **saveptr);
Thus, the previous program can be corrected:
#include <stdlib.h>
#include <stdio.h>
#include <string.h>

void extract_path() {
  char* string = getenv("PATH");
  char* saveptr = NULL;
  printf("Parsing '%s'\n", string);

  for(char* token = strtok_r(string, ":", &saveptr) ;
      token ;
      token = strtok_r(NULL, ":", &saveptr) ){
    printf("\t %s contains: ", token);

    char* saveptr_word = NULL;
    for(char* word = strtok_r(token, "/ ", &saveptr_word) ;
        word ;
        word = strtok_r(NULL, "/", &saveptr_word) ){
      printf("%s ", word);
    }
    printf("\n");
  }
}

int main(int argc, char**argv) {
  extract_path();
  return 0;
}
Which will result in:
Parsing '/usr/local/bin:/usr/bin:/bin:/usr/local/games:/usr/games'
/usr/local/bin contains: usr local bin
/usr/bin contains: usr bin
/bin contains: bin
/usr/local/games contains: usr local games
/usr/games contains: usr games
errno
_Thread_local int variable = 0;
Before the C11 standard, thread-local storage was supported by some compilers using compiler-dependent keywords:
__thread int variable = 0;
__declspec(thread) int variable = 0;
The Pthread API also provides thread-specific data through pthread_key:
int pthread_key_create(pthread_key_t *key, void (*destructor)(void*));
int pthread_key_delete(pthread_key_t key);
void *pthread_getspecific(pthread_key_t key);
int pthread_setspecific(pthread_key_t key, const void *value);
int pthread_once(pthread_once_t *once_control, void (*init_routine) (void));
x++ is not atomic (consisting of load, update, store)
swap(a, b){ tmp=a; a=b; b=tmp; } is not atomic either
The following program illustrates the problem of simultaneous access to shared variables. Here, two threads each increment the same variable 1 000 000 000 times:
/*
 * compteurBOOM.c
 *
 * Synchronization problem
 */
#include <error.h>
#include <unistd.h>
#include <stdlib.h>
#include <stdio.h>
#include <pthread.h>

/* INT_MAX / 2 */
#define NBITER 1000000000

int counter = 0;

void *start_routine(void *arg) {
  int i;
  for (i = 0; i < NBITER; i++) {
    /* OOPS: WRONG ! Access to an unprotected shared variable */
    counter++;
  }
  pthread_exit(NULL);
}

int main (int argc, char *argv[]) {
  int rc;
  pthread_t thread1, thread2;

  rc = pthread_create(&thread1, NULL, start_routine, NULL);
  if (rc)
    error(EXIT_FAILURE, rc, "pthread_create");

  rc = pthread_create(&thread2, NULL, start_routine, NULL);
  if (rc)
    error(EXIT_FAILURE, rc, "pthread_create");

  rc = pthread_join(thread1, NULL);
  if (rc)
    error(EXIT_FAILURE, rc, "pthread_join");
  rc = pthread_join(thread2, NULL);
  if (rc)
    error(EXIT_FAILURE, rc, "pthread_join");

  if (counter != 2 * NBITER)
    printf("BOOM! counter = %d\n", counter);
  else
    printf("OK counter = %d\n", counter);

  exit(EXIT_SUCCESS);
}
While the counter should be 2 * 1 000 000 000 = 2 000 000 000, running this program gives another result, for example:
$ ./compteurBOOM
BOOM! compteur = 1076588402
Type: pthread_mutex_t
Initialization:
pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;
int pthread_mutex_init(pthread_mutex_t *m, const pthread_mutexattr_t *attr);
Usage:
int pthread_mutex_lock(pthread_mutex_t *mutex);
int pthread_mutex_trylock(pthread_mutex_t *mutex);
int pthread_mutex_unlock(pthread_mutex_t *mutex);
Termination:
int pthread_mutex_destroy(pthread_mutex_t *mutex);
Using a mutex, we can correct the BOOM counter program by ensuring that the counter increments are done in mutual exclusion:
/*
 * compteurBOOM.c
 *
 * Synchronization problem
 */
#include <error.h>
#include <unistd.h>
#include <stdlib.h>
#include <stdio.h>
#include <pthread.h>

/* INT_MAX / 2 */
#define NBITER 1000000000

int counter = 0;
pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;

void *start_routine(void *arg) {
  int i;
  for (i = 0; i < NBITER; i++) {
    pthread_mutex_lock(&mutex);
    counter++;
    pthread_mutex_unlock(&mutex);
  }
  pthread_exit(NULL);
}

int main (int argc, char *argv[]) {
  int rc;
  pthread_t thread1, thread2;

  rc = pthread_create(&thread1, NULL, start_routine, NULL);
  if (rc)
    error(EXIT_FAILURE, rc, "pthread_create");

  rc = pthread_create(&thread2, NULL, start_routine, NULL);
  if (rc)
    error(EXIT_FAILURE, rc, "pthread_create");

  rc = pthread_join(thread1, NULL);
  if (rc)
    error(EXIT_FAILURE, rc, "pthread_join");
  rc = pthread_join(thread2, NULL);
  if (rc)
    error(EXIT_FAILURE, rc, "pthread_join");

  if (counter != 2 * NBITER)
    printf("BOOM! counter = %d\n", counter);
  else
    printf("OK counter = %d\n", counter);

  exit(EXIT_SUCCESS);
}
While the result is correct, the use of a mutex significantly slows down the program (144s with mutex, against 4.1s without mutex).
Operations executed atomically
C11 defines a set of functions that perform atomic operations, for example:
C atomic_fetch_add(volatile A *object, M operand);
_Bool atomic_flag_test_and_set(volatile atomic_flag *object);
C11 also defines atomic types:
_Atomic int var;
or
_Atomic(int) var;
We can fix the compteurBOOM program by using atomic operations. To do this, all we have to do is declare the counter as _Atomic int. The counter increment then uses the atomic operation atomic_fetch_add.
/*
 * compteurBOOM.c
 *
 * Synchronization problem
 */
#include <error.h>
#include <unistd.h>
#include <stdlib.h>
#include <stdio.h>
#include <pthread.h>

/* INT_MAX / 2 */
#define NBITER 1000000000

_Atomic int counter = 0;

void *start_routine(void *arg) {
  int i;
  for (i = 0; i < NBITER; i++) {
    counter++;
  }
  pthread_exit(NULL);
}

int main (int argc, char *argv[]) {
  int rc;
  pthread_t thread1, thread2;

  rc = pthread_create(&thread1, NULL, start_routine, NULL);
  if (rc)
    error(EXIT_FAILURE, rc, "pthread_create");

  rc = pthread_create(&thread2, NULL, start_routine, NULL);
  if (rc)
    error(EXIT_FAILURE, rc, "pthread_create");

  rc = pthread_join(thread1, NULL);
  if (rc)
    error(EXIT_FAILURE, rc, "pthread_join");
  rc = pthread_join(thread2, NULL);
  if (rc)
    error(EXIT_FAILURE, rc, "pthread_join");

  if (counter != 2 * NBITER)
    printf("BOOM! counter = %d\n", counter);
  else
    printf("OK counter = %d\n", counter);

  exit(EXIT_SUCCESS);
}
Here, the result is correct and the program is much faster than when using a mutex:
Content of this lecture
int pipe(int pipefd[2]);
pipefd[0] for reading, pipefd[1] for writing
int mkfifo(const char *pathname, mode_t mode);
lseek is impossible
You have already handled pipes without necessarily realizing it: in bash, a sequence of commands linked by pipes communicates via anonymous pipes created by the bash process. So when we run cmd1 | cmd2 | cmd3, bash creates 2 anonymous pipes and 3 processes, then redirects (thanks to the dup2 system call, see Lecture #11) the standard input and output of the processes to the different pipes.
int shm_open(const char *name, int oflag, mode_t mode);
name is a key of the form /key
int ftruncate(int fd, off_t length);
void *mmap(void *addr, size_t length, int prot, int flags, int fd, off_t offset);
flags must contain MAP_SHARED
We will see later (during Lecture 11 on I/O) another use of mmap.
sem_t *sem_open(const char *name, int oflag, mode_t mode, unsigned int value);
name is a key of the form /key
int sem_init(sem_t *sem, int pshared, unsigned int value);
if pshared != 0, the semaphore can be used by several processes (using a shared memory segment)
int sem_wait(sem_t *sem);
int sem_trywait(sem_t *sem);
int sem_timedwait(sem_t *sem, const struct timespec *abs_timeout);
int sem_post(sem_t *sem);
pthread_mutex_t
pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;
int pthread_mutex_init(pthread_mutex_t *m, const pthread_mutexattr_t *attr);
int pthread_mutex_lock(pthread_mutex_t *mutex);
int pthread_mutex_trylock(pthread_mutex_t *mutex);
int pthread_mutex_unlock(pthread_mutex_t *mutex);
int pthread_mutex_destroy(pthread_mutex_t *mutex);
It is possible to synchronize threads from several processes with a pthread_mutex_t if it is in a shared memory area. For this, it is necessary to set the PTHREAD_PROCESS_SHARED attribute of the mutex with the function int pthread_mutexattr_setpshared(pthread_mutexattr_t *attr, int pshared);
Waiting side:
pthread_mutex_lock(&l);
while(!condition) {
  pthread_cond_wait(&c, &l);
}
process_data();
pthread_mutex_unlock(&l);

Signaling side:
pthread_mutex_lock(&l);
produce_data();
pthread_cond_signal(&c);
pthread_mutex_unlock(&l);
Here are the prototypes of the functions associated with the conditions:
int pthread_cond_init(pthread_cond_t *cond, const pthread_condattr_t *attr);
int pthread_cond_destroy(pthread_cond_t *cond);
pthread_cond_t cond = PTHREAD_COND_INITIALIZER;
int pthread_cond_wait(pthread_cond_t *cond, pthread_mutex_t *mutex);
int pthread_cond_timedwait(pthread_cond_t *cond, pthread_mutex_t *mutex, const struct timespec *abstime);
int pthread_cond_signal(pthread_cond_t *cond);
int pthread_cond_broadcast(pthread_cond_t *cond);
The mutex ensures that between the test of the condition (while(!condition)) and the wait (pthread_cond_wait()), no other thread can make the condition true and signal it (the signal would otherwise be lost).
To synchronize multiple processes with a monitor, it is necessary to set the following attributes:
the PTHREAD_PROCESS_SHARED attribute of the mutex (using int pthread_mutexattr_setpshared(pthread_mutexattr_t *attr, int pshared));
the PTHREAD_PROCESS_SHARED attribute of the condition (using int pthread_condattr_setpshared(pthread_condattr_t *attr, int pshared)).
int pthread_barrier_init(pthread_barrier_t *barrier, const pthread_barrierattr_t *restrict attr, unsigned count);
int pthread_barrier_wait(pthread_barrier_t *barrier);
pthread_barrier_wait blocks the calling threads until count threads have reached the barrier; the count threads are then released.
Once all the threads have reached the barrier, they are all unblocked
and pthread_barrier_wait
returns 0 except for one thread
which returns PTHREAD_BARRIER_SERIAL_THREAD
.
To synchronize threads from multiple processes with a barrier, it is
necessary to set the attribute PTHREAD_PROCESS_SHARED
with
int pthread_barrierattr_setpshared(pthread_barrierattr_t *attr, int pshared);
pthread_rwlock_t
int pthread_rwlock_rdlock(pthread_rwlock_t* lock)
int pthread_rwlock_wrlock(pthread_rwlock_t* lock)
int pthread_rwlock_unlock(pthread_rwlock_t* lock)
In the literature, these problems are usually solved by using semaphores. This is because these problems were theorized in the 1960s and 1970s by Dijkstra in terms of semaphores. In addition, semaphores have the advantage of being usable for both inter-process and intra-process synchronization.
However, modern operating systems implement many synchronization primitives which are much more efficient than semaphores. In the next slides, we will therefore rely on these mechanisms rather than semaphores.
m initialized
mutex_lock(m) at the start of the critical section
mutex_unlock(m) at the end of the critical section
Prog1                             Prog2
mutex_lock(m)                     mutex_lock(m)
x = read(account)                 x = read(account)
x = x + 10                        x = x - 100
write(account = x)                write(account = x)
mutex_unlock(m)                   mutex_unlock(m)
In a multi-threaded process, we just need to use a mutex of type pthread_mutex_t.
To implement mutual exclusion between several processes, several solutions exist:
using a pthread_mutex_t in a memory segment shared between processes. For this, it is necessary to set the PTHREAD_PROCESS_SHARED attribute of the mutex (using pthread_mutexattr_setpshared);
using a semaphore initialized to 1. The entry into the critical section is protected by sem_wait, and we call sem_post when leaving the critical section.
a counter cpt initialized to N, and a monitor m to protect the counter

Prog Vehicule
...                                 ...
mutex_lock(m);                      mutex_lock(m);
while(cpt == 0){ cond_wait(m); }    cpt++;
cpt--;                              cond_signal(m);
mutex_unlock(m);                    mutex_unlock(m);
a buffer of N blocks
Produc : produces info0
Produc : produces info1
Conso : consumes info0
Produc : produces info2
available_spots monitor initialized to N
ready_info monitor initialized to 0
Producer: Consumer:
repeat repeat
... ...
mutex_lock(available_spots); mutex_lock(ready_info);
while(available_spots<=0) while(ready_info<=0)
cond_wait(available_spots); cond_wait(ready_info);
reserve_slot(); extract(info)
mutex_unlock(available_spots); mutex_unlock(ready_info);
compute(info) mutex_lock(available_spots);
free_slot();
mutex_lock(ready_info); cond_signal(available_spots)
push(info); mutex_unlock(available_spots);
cond_signal(ready_info);
mutex_unlock(ready_info); ...
... endRepeat
endRepeat
It is of course possible to implement a producer / consumer scheme between processes using conditions and mutexes. Another, simpler solution is to use a pipe: since writes to a pipe are atomic, depositing data boils down to writing into the pipe, and reading from the pipe extracts the data.
pthread_rwlock_t
int pthread_rwlock_rdlock(pthread_rwlock_t* lock) to protect read operations
int pthread_rwlock_wrlock(pthread_rwlock_t* lock) to protect write operations
int pthread_rwlock_unlock(pthread_rwlock_t* lock) to release the lock
It is possible to implement the reader / writer synchronization pattern using a mutex instead of a rwlock: read and write operations are protected by a mutex. However, this implementation does not allow multiple readers to work in parallel.
The implementation of the monitor-based reader / writer is more complex. It mainly requires:
an integer readers which counts the number of threads reading
a boolean writing which indicates that a thread is writing
a cond condition to notify changes to these variables
a mutex mutex to protect concurrent accesses
Here is an implementation of the reader / writer using a monitor:
#include <stdlib.h>
#include <unistd.h>
#include <stdio.h>
#include <pthread.h>

// This program simulates operations on a set of bank accounts
// Two kinds of operations are available:
// - read operation: compute the global balance (ie. the sum of all accounts)
// - write operation: transfer money from one account to another
//
// Here's an example of the program output:
//
// $ ./rw_threads_condition
// Balance: 0 (expected: 0)
// 3982358 operation, including:
//      3581969 read operations (89.945932 % )
//      400389 write operations (10.054068 % )

#define N 200
int n_loops = 1000000;
int accounts[N];
int nb_read = 0;
int nb_write = 0;

pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;
pthread_cond_t cond = PTHREAD_COND_INITIALIZER;
int readers=0;
int writing=0;

/* read all the accounts */
int read_accounts() {
  pthread_mutex_lock(&mutex);
  while(writing)
    pthread_cond_wait(&cond, &mutex);
  readers++;
  pthread_mutex_unlock(&mutex);

  nb_read++;
  int sum = 0;
  for(int i=0; i<N; i++) {
    sum += accounts[i];
  }

  pthread_mutex_lock(&mutex);
  readers--;
  if(!readers) {
    pthread_cond_signal(&cond);
  }
  pthread_mutex_unlock(&mutex);
  return sum;
}

/* transfer amount units from account src to account dest */
void transfer(int src, int dest, int amount) {
  pthread_mutex_lock(&mutex);
  while(writing || readers)
    pthread_cond_wait(&cond, &mutex);
  writing = 1;
  pthread_mutex_unlock(&mutex);

  nb_write++;
  accounts[dest] += amount;
  accounts[src] -= amount;

  pthread_mutex_lock(&mutex);
  writing = 0;
  pthread_cond_signal(&cond);
  pthread_mutex_unlock(&mutex);
}

void* thread_function(void*arg) {
  for(int i=0; i<n_loops; i++) {
    /* randomly perform an operation
     * threshold sets the proportion of read operations:
     * here, 90% of all the operations are read operations
     * and 10% are write operations
     */
    int threshold = 90;
    int x = rand()%100;
    if(x < threshold) {
      /* read */
      int balance = read_accounts();
      if(balance != 0) {
        fprintf(stderr, "Error : balance = %d !\n", balance);
        abort();
      }
    } else {
      /* write */
      int src = rand()%N;
      int dest = rand()%N;
      int amount = rand()%100;
      transfer(src, dest, amount);
    }
  }
  return NULL;
}

int main(int argc, char**argv) {
  for(int i = 0; i<N; i++) {
    accounts[i] = 0;
  }

  int nthreads=4;
  pthread_t tid[nthreads];

  for(int i=0; i<nthreads; i++) {
    pthread_create(&tid[i], NULL, thread_function, NULL);
  }
  for(int i=0; i<nthreads; i++) {
    pthread_join(tid[i], NULL);
  }

  int balance = read_accounts();
  printf("Balance: %d (expected: 0)\n", balance);

  int nb_op = nb_read+nb_write;
  printf("%d operation, including:\n",nb_op);
  printf("\t%d read operations (%f %% )\n", nb_read, 100.*nb_read/nb_op);
  printf("\t%d write operations (%f %% )\n", nb_write, 100.*nb_write/nb_op);

  return EXIT_SUCCESS;
}
If you want to study further synchronization primitives, and to understand memory models, the blog post We Make a std::shared_mutex 10 Times Faster <https://www.codeproject.com/Articles/1183423/We-Make-a-std-shared-mutex-10-Times-Faster> discusses in details atomic operations, instruction reordering, C++ memory model and various synchronization primitives.
By default, an instruction modifying a variable is non-atomic.
Example: x++ gives:
register = load(x)
register++
x = store(register)
→ Problem if the variable is modified by another thread simultaneously
What about volatile? volatile does not ensure atomicity.
Here is an example of a program that may suffer from overly aggressive optimization by the compiler:
#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>

#if USE_VOLATILE
volatile int a = 0;
#else
int a = 0;
#endif

void* thread1(void*arg) {
  while(a == 0) ;
  printf("Hello\n");
  return NULL;
}

void* thread2(void*arg) {
  a = 1;
  return NULL;
}

int main(int argc, char**argv) {
  pthread_t t1, t2;
  pthread_create(&t1, NULL, thread1, NULL);
  pthread_create(&t2, NULL, thread2, NULL);

  pthread_join(t1, NULL);
  pthread_join(t2, NULL);
  return EXIT_SUCCESS;
}
When compiled with the optimization level -O0 (i.e. without any optimization), thread1 spins waiting, and when thread2 modifies the variable a, it unblocks thread1 which displays Hello:
$ gcc -o volatile volatile.c -Wall -pthread -O0
$ ./volatile
Hello
$
When compiled with the optimization level -O1, the generated code no longer works:
$ gcc -o volatile volatile.c -Wall -pthread -O1
$ ./volatile
[waits indefinitely]
^C
$
Analyzing the code generated by the compiler reveals the problem:
$ gcc -o volatile volatile.c -Wall -pthread -O2
$ gdb ./volatile
[...]
(gdb) disassemble thread1
Dump of assembler code for function thread1:
0x0000000000000756 <+0>: auipc a5,0x2
0x000000000000075a <+4>: lw a5,-1778(a5) # 0x2064 <a>
0x000000000000075e <+8>: bnez a5,0x762 <thread1+12>
0x0000000000000760 <+10>: j 0x760 <thread1+10>
0x0000000000000762 <+12>: add sp,sp,-16
0x0000000000000764 <+14>: auipc a0,0x0
0x0000000000000768 <+18>: add a0,a0,36 # 0x788
0x000000000000076c <+22>: sd ra,8(sp)
0x000000000000076e <+24>: jal 0x620 <puts@plt>
0x0000000000000772 <+28>: ld ra,8(sp)
0x0000000000000774 <+30>: li a0,0
0x0000000000000776 <+32>: add sp,sp,16
0x0000000000000778 <+34>: ret
End of assembler dump.
$
We see here that at the address 0x760, the program jumps to the address 0x760. So it jumps in place indefinitely.
This is explained by the fact that the variable a is not volatile. The compiler therefore thinks it can optimize access to this variable: since the thread1 function only accesses the variable in read mode, the program loads the variable into a register (here, the a5 register, see the instruction at 0x75a), then only consults that register. When thread2 modifies the variable a, the modification is therefore not perceived by thread1!
Declaring the variable as volatile forces the compiler to read the variable from memory each time:
$ gcc -o volatile volatile.c -Wall -pthread -O2 -DUSE_VOLATILE=1
$ gdb volatile
(gdb) disassemble thread1
Dump of assembler code for function thread1:
0x0000000000000756 <+0>: add sp,sp,-16
0x0000000000000758 <+2>: sd ra,8(sp)
0x000000000000075a <+4>: auipc a4,0x2
0x000000000000075e <+8>: add a4,a4,-1782 # 0x2064 <a>
0x0000000000000762 <+12>: lw a5,0(a4)
0x0000000000000764 <+14>: beqz a5,0x762 <thread1+12>
0x0000000000000766 <+16>: auipc a0,0x0
0x000000000000076a <+20>: add a0,a0,34 # 0x788
0x000000000000076e <+24>: jal 0x620 <puts@plt>
0x0000000000000772 <+28>: ld ra,8(sp)
0x0000000000000774 <+30>: li a0,0
0x0000000000000776 <+32>: add sp,sp,16
0x0000000000000778 <+34>: ret
End of assembler dump.
Here, the loop while (a == 0) is translated to the lines from 0x762 to 0x764. At each loop iteration, the value of a is loaded, then tested.
C11 provides a set of atomic operations, including
atomic_flag_test_and_set
atomic_compare_exchange_strong
atomic_fetch_add
atomic_thread_fence
_Bool atomic_flag_test_and_set(volatile atomic_flag* obj);
Performs atomically:
int atomic_flag_test_and_set(int* flag) {
int old = *flag;
*flag = 1;
return old;
}
Implementing a lock:
void lock(int* lock) {
while(atomic_flag_test_and_set(lock) == 1) ;
}
Here is an example of a program using a test_and_set based lock:
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>
#include <stdatomic.h>

#define NITER 1000000
#define NTHREADS 4

volatile int lock=0;
int x = 0;

#ifdef NOT_THREAD_SAFE
/* thread-unsafe version */
void do_lock() {
  while(lock) ;
  lock = 1;
}
void do_unlock() {
  lock = 0;
}
#else
/* thread-safe version */
void do_lock() {
  while(atomic_flag_test_and_set(&lock)) ;
}
void do_unlock() {
  lock = 0;
}
#endif /* NOT_THREAD_SAFE */

void* thread_function(void* arg) {
  for(int i=0; i<NITER; i++) {
    do_lock();
    x++;
    do_unlock();
  }
  return NULL;
}

int main(int argc, char**argv) {
  pthread_t tids[NTHREADS];
  int ret;
  for(int i = 0; i<NTHREADS; i++) {
    ret = pthread_create(&tids[i], NULL, thread_function, NULL);
    assert(ret == 0);
  }
  for(int i = 0; i<NTHREADS; i++) {
    ret = pthread_join(tids[i], NULL);
    assert(ret == 0);
  }
  printf("x = %d\n", x);
  return EXIT_SUCCESS;
}
_Bool atomic_compare_exchange_strong(volatile A* obj, C* expected, C desired);
compares *obj and *expected
if equal, copies desired into *obj and returns true
else, copies the value of *obj into *expected and returns false
Performs atomically:
bool CAS(int* obj, int* expected, int desired) {
if(*obj != *expected) {
*expected = *obj;
return false;
} else {
*obj = desired;
return true;
}
}
Here is an example of a program handling a lock-free list thanks to compare_and_swap:
#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>
#include <stdatomic.h>

#define NITER 1000000
#define NTHREADS 4

struct node {
  int value;
  struct node* next;
};
struct node *stack = NULL;

#ifdef NOT_THREAD_SAFE
/* thread-unsafe version */
void push(int value) {
  struct node* n = malloc(sizeof(struct node));
  n->value = value;
  n->next = stack;
  stack = n;
}

int pop() {
  struct node* n = stack;
  int value = 0;
  if(n) {
    value = n->value;
    stack = n->next;
    free(n);
  }
  return value;
}
#else
/* thread-safe version */
void push(int value) {
  struct node* n = malloc(sizeof(struct node));
  n->value = value;
  n->next = stack;

  int done = 0;
  do {
    done = atomic_compare_exchange_strong(&stack, &n->next, n);
  } while(!done);
}

int pop() {
  int value = 0;
  struct node* old_head = NULL;
  struct node* new_head = NULL;
  int done = 0;
  do {
    /* Warning: this function still suffers a race condition (search for
     * "ABA problem" for more information).
     * Fixing this would be too complicated for this simple example.
     */
    old_head = stack;
    if(old_head)
      new_head = old_head->next;
    done = atomic_compare_exchange_strong(&stack, &old_head, new_head);
  } while (!done);
  if(old_head) {
    value = old_head->value;
    free(old_head);
  }
  return value;
}
#endif /* NOT_THREAD_SAFE */

_Atomic int sum = 0;

void* thread_function(void* arg) {
  for(int i=0; i<NITER; i++) {
    push(1);
  }
  int value;
  while((value=pop()) != 0) {
    sum += value;
  }
  return NULL;
}

int main(int argc, char**argv) {
  pthread_t tids[NTHREADS];
  for(int i = 0; i<NTHREADS; i++) {
    pthread_create(&tids[i], NULL, thread_function, NULL);
  }
  for(int i = 0; i<NTHREADS; i++) {
    pthread_join(tids[i], NULL);
  }
  printf("sum = %d\n", sum);
  return EXIT_SUCCESS;
}
C atomic_fetch_add( volatile A* obj, M arg );
replaces *obj with *obj + arg
returns the old value of *obj
Performs atomically:
int fetch_and_add(int* obj, int value) {
  int old = *obj;
  *obj = old+value;
  return old;
}
Here is an example of a program using fetch_and_add to atomically increment a variable:
#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>
#include <stdatomic.h>

#define NITER 1000000
#define NTHREADS 4

volatile int x = 0;

#ifdef NOT_THREAD_SAFE
/* thread-unsafe version */
void inc(volatile int * obj) {
  *obj = (*obj)+1;
}
#else
/* thread-safe version */
void inc(volatile int * obj) {
  atomic_fetch_add(obj, 1);
}
#endif /* NOT_THREAD_SAFE */

void* thread_function(void* arg) {
  for(int i=0; i<NITER; i++) {
    inc(&x);
  }
  return NULL;
}

int main(int argc, char**argv) {
  pthread_t tids[NTHREADS];
  for(int i = 0; i<NTHREADS; i++) {
    pthread_create(&tids[i], NULL, thread_function, NULL);
  }
  for(int i = 0; i<NTHREADS; i++) {
    pthread_join(tids[i], NULL);
  }
  printf("x = %d\n", x);
  return EXIT_SUCCESS;
}
C atomic_thread_fence( memory_order order );
Properties to consider when choosing a synchronization primitive
int pthread_spin_lock(pthread_spinlock_t *lock);
int pthread_spin_unlock(pthread_spinlock_t *lock);
Benefits: simple to implement (with test_and_set)
Disadvantages: waiting monopolizes the CPU (busy waiting)
It is also possible to implement a spinlock using an atomic operation:
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <pthread.h>
#include <stdatomic.h>
#include <assert.h>

#define NITER 1000000
#define NTHREADS 4

struct lock {
  /* if flag=0, the lock is available
   * if flag=1, the lock is taken
   */
  volatile int flag;
};
typedef struct lock lock_t;

void lock(lock_t *l) {
  /* try to set flag to 1.
   * if the flag is already 1, loop and try again
   */
  while(atomic_flag_test_and_set(&l->flag)) ;
}

void unlock(lock_t *l) {
  l->flag = 0;
}

void lock_init(lock_t *l) {
  l->flag = 0;
}

lock_t l;
int x;

void* thread_function(void*arg){
  for(int i=0; i<NITER; i++) {
    lock(&l);
    x++;
    unlock(&l);
  }
  return NULL;
}

int main(int argc, char**argv) {
  lock_init(&l);

  pthread_t tids[NTHREADS];
  int ret;
  for(int i = 0; i<NTHREADS; i++) {
    ret = pthread_create(&tids[i], NULL, thread_function, NULL);
    assert(ret == 0);
  }
  for(int i = 0; i<NTHREADS; i++) {
    ret = pthread_join(tids[i], NULL);
    assert(ret == 0);
  }
  printf("x = %d\n", x);
  printf("expected: %d\n", NTHREADS*NITER);
  return EXIT_SUCCESS;
}
System call allowing to build synchronization mechanisms in userland
Allows waiting without monopolizing the CPU
A futex is made up of an integer in user space and a waiting queue in the kernel.
Available operations (among others):
WAIT(int *addr, int value): while(*addr == value) { sleep(); }
WAKE(int *addr, int value, int num): *addr = value, then wake up num threads waiting on addr
mutex: an integer with two possible values: 1 (unlocked) or 0 (locked)
mutex_lock(m):
if m is 0, call FUTEX_WAIT
mutex_unlock(m):
call FUTEX_WAKE to wake up a thread from the waiting list
Here is an example of a program implementing a mutex using futex:
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <pthread.h>
#include <stdatomic.h>
#include <linux/futex.h>
#include <sys/time.h>
#include <sys/syscall.h>
#include <errno.h>
#include <assert.h>

#define NITER 1000000
#define NTHREADS 4

struct lock {
  int flag;
};
typedef struct lock lock_t;

static int futex(int *uaddr, int futex_op, int val,
                 const struct timespec *timeout, int *uaddr2, int val3) {
  return syscall(SYS_futex, uaddr, futex_op, val,
                 timeout, uaddr2, val3);
}

void lock(lock_t *l) {
  while (1) {
    /* Is the futex available? */
    int expected = 1;
    if (atomic_compare_exchange_strong(&l->flag, &expected, 0))
      return; /* Yes */

    /* Futex is not available; wait */
    int s = futex(&l->flag, FUTEX_WAIT, 0, NULL, NULL, 0);
    if (s == -1 && errno != EAGAIN) {
      perror("futex_wait failed");
      abort();
    }
  }
}

void unlock(lock_t *l) {
  int expected = 0;
  atomic_compare_exchange_strong(&l->flag, &expected, 1);
  int s = futex(&l->flag, FUTEX_WAKE, 1, NULL, NULL, 0);
  if (s == -1) {
    perror("futex_wake failed");
    abort();
  }
}

void lock_init(lock_t *l) {
  l->flag = 1;
}

lock_t l;
int x;

void* thread_function(void*arg){
  for(int i=0; i<NITER; i++) {
    //    printf("%d\n", i);
    lock(&l);
    x++;
    unlock(&l);
  }
  return NULL;
}

int main(int argc, char**argv) {
  lock_init(&l);

  pthread_t tids[NTHREADS];
  int ret;
  for(int i = 0; i<NTHREADS; i++) {
    ret = pthread_create(&tids[i], NULL, thread_function, NULL);
    assert(ret == 0);
  }
  for(int i = 0; i<NTHREADS; i++) {
    ret = pthread_join(tids[i], NULL);
    assert(ret == 0);
  }
  printf("x = %d\n", x);
  printf("expected: %d\n", NTHREADS*NITER);
  return EXIT_SUCCESS;
}
struct cond {
  int cpt;
};

void cond_wait(cond_t *c, pthread_mutex_t *m) {
  int value = atomic_load(&c->cpt);
  pthread_mutex_unlock(m);
  futex(&c->cpt, FUTEX_WAIT, value);
  pthread_mutex_lock(m);
}

void cond_signal(cond_t *c) {
  atomic_fetch_add(&c->cpt, 1);
  futex(&c->cpt, FUTEX_WAKE, 1);
}
Here is an example of a program implementing a condition using futex:
#include <stdlib.h>
#include <unistd.h>
#include <stdio.h>
#include <pthread.h>
#include <sys/syscall.h>
#include <linux/futex.h>
#include <stdatomic.h>
#include <assert.h>

#define N 10
int n_loops = 20;

struct cond {
  int cpt;
};
typedef struct cond cond_t;

static int futex(int *uaddr, int futex_op, int val) {
  return syscall(SYS_futex, uaddr, futex_op, val, NULL, uaddr, 0);
}

void cond_init(cond_t *c) {
  c->cpt = 0;
}

void cond_wait(cond_t *c, pthread_mutex_t *m) {
  int cpt = atomic_load(&c->cpt);
  pthread_mutex_unlock(m);
  futex(&c->cpt, FUTEX_WAIT, cpt);
  pthread_mutex_lock(m);
}

void cond_signal(cond_t *c) {
  atomic_fetch_add(&c->cpt, 1);
  futex(&c->cpt, FUTEX_WAKE, 1); /* wake up one waiting thread */
}
struct monitor {
  int value;
  pthread_mutex_t mutex;
  cond_t cond;
};

int infos[N];
int i_depot, i_extrait;
int nb_produits = 0;
struct monitor places_dispo;
struct monitor info_prete;

void* function_prod(void* arg) {
  static _Atomic int nb_threads = 0;
  int my_rank = nb_threads++;
  for(int i=0; i<n_loops; i++) {
    int cur_indice;
    int product_id;
    usleep(100);
    pthread_mutex_lock(&places_dispo.mutex);
    while(places_dispo.value == 0) {
      cond_wait(&places_dispo.cond, &places_dispo.mutex);
    }
    places_dispo.value--;
    cur_indice = i_depot++;
    i_depot = i_depot % N;
    product_id = nb_produits++;
    pthread_mutex_unlock(&places_dispo.mutex);

    usleep(500000);
    printf("P%d produces %d in %d\n", my_rank, product_id, cur_indice);

    pthread_mutex_lock(&info_prete.mutex);
    infos[cur_indice] = product_id;
    info_prete.value++;
    cond_signal(&info_prete.cond);
    pthread_mutex_unlock(&info_prete.mutex);
  }
  return NULL;
}

void* function_cons(void* arg) {
  static _Atomic int nb_threads = 0;
  int my_rank = nb_threads++;
  for(int i=0; i<n_loops; i++) {
    int cur_indice;
    int product_id;
    usleep(100);
    pthread_mutex_lock(&info_prete.mutex);
    while(info_prete.value == 0) {
      cond_wait(&info_prete.cond, &info_prete.mutex);
    }
    info_prete.value--;
    product_id = infos[i_extrait];
    cur_indice = i_extrait;
    i_extrait = (i_extrait+1) % N;
    pthread_mutex_unlock(&info_prete.mutex);

    usleep(100000);
    printf("C%d consumes %d from %d\n", my_rank, product_id, cur_indice);

    pthread_mutex_lock(&places_dispo.mutex);
    places_dispo.value++;
    cond_signal(&places_dispo.cond);
    pthread_mutex_unlock(&places_dispo.mutex);
  }
  return NULL;
}

void init_monitor(struct monitor *m, int value) {
  m->value = value;
  pthread_mutex_init(&m->mutex, NULL);
  cond_init(&m->cond);
}

int main(int argc, char** argv) {
  init_monitor(&places_dispo, N);
  init_monitor(&info_prete, 0);
  i_depot = 0;
  i_extrait = 0;

  int nthreads_prod = 2;
  int nthreads_cons = 2;
  pthread_t tid_prod[nthreads_prod];
  pthread_t tid_cons[nthreads_cons];
  int ret;
  for(int i=0; i<nthreads_prod; i++) {
    ret = pthread_create(&tid_prod[i], NULL, function_prod, NULL);
    assert(ret == 0);
  }
  for(int i=0; i<nthreads_cons; i++) {
    ret = pthread_create(&tid_cons[i], NULL, function_cons, NULL);
    assert(ret == 0);
  }
  for(int i=0; i<nthreads_prod; i++) {
    ret = pthread_join(tid_prod[i], NULL);
    assert(ret == 0);
  }
  for(int i=0; i<nthreads_cons; i++) {
    ret = pthread_join(tid_cons[i], NULL);
    assert(ret == 0);
  }
  return EXIT_SUCCESS;
}
pthread_mutex_timedlock
Coarse grain locking
Fine grain locking
Each lock protects a small portion of the program
Advantage: possibility of using various resources in parallel
Disadvantages:
The notion of scalability is discussed in more detail in the module CSC5001 High Performance Systems.
ls, cp, X, gnome, etc.

The operating system is responsible for driving various hardware devices. It therefore includes drivers capable of interacting with a particular piece of hardware. The different drivers for the same type of peripheral offer the same interface, which allows the upper layers of the OS to use the hardware interchangeably.
The transition from user space to kernel space is done via a system
call (syscall). The kernel processes the request for the
application and returns a positive or zero integer on success, and
-1
on failure.
From the application point of view, system calls are exposed as
functions (defined in libc
) in charge of executing the
system call.
You must always test the return value of a system call and deal with errors
Prevent the propagation of errors (the discovery of the error can take place much later)
see the fail-fast approach presented in CSC4102
errno
: external variable indicating the cause of the
last error
The ERRORS
section in a function manual describes
the possible causes of error.
Testimony of a former ASR student: “Without insistence from [the CSC4508 teachers], it would not have jumped out to us so quickly that the problems (in the robotics championship) came from a lack of error handling in a code that had not been carefully proofread”.
The macro void assert(scalar expression)
tests the
expression passed as a parameter and, if false, displays an error message
and terminates the program (with the abort()
function):
struct stat buf;
int rc = stat(file, &buf);
assert(rc>=0);
// -> in case of an error, prints:
// appli: appli.c:12: main: Assertion `rc>=0' failed.
// Aborted
However, the macro should be used with caution because it is disabled
when the program is compiled with the NDEBUG macro defined (with
gcc -DNDEBUG
for example, which optimized builds often pass).
So it is better to test the return code, display a message describing the error, and possibly terminate the process.
struct stat buf;
int rc = stat(file, &buf);
if(rc < 0) {
  fprintf(stderr, "Error\n");
  exit(EXIT_FAILURE); // or abort();
}
The errno.h
file lists standard errors. The manual of
each system call (see man 2 function
), and of each function
(man 3 function
) indicates, in the ERRORS
section, the different error codes that may be returned.
The error message associated with a value of errno
can
be obtained with strerror()
or perror()
:
struct stat buf;
int rc = stat(file, &buf);
if(rc < 0) {
  fprintf(stderr, "Error while accessing file '%s': %s\n", file, strerror(errno));
  // -> message "Error while accessing file 'plop': No such file or directory"
  exit(EXIT_FAILURE);
}
or
struct stat buf;
int rc = stat(file, &buf);
if(rc < 0) {
  perror("Error while accessing file");
  // -> message: "Error while accessing file: No such file or directory"
  exit(EXIT_FAILURE);
}
It is possible to define a macro displaying an error message and indicating where the error occurred. For example:
#define FATAL(errnum, ...) do { \
fprintf(stderr, "Error in %s:%d:\n", __FILE__, __LINE__); \
fprintf(stderr, __VA_ARGS__); \
fprintf(stderr, ": %s\n", strerror(errnum)); \
abort(); \
} while(0)
int main(int argc, char**argv) {
  char *file = argv[1];
  struct stat buf;
  int rc = stat(file, &buf);
  if(rc < 0) {
    FATAL(errno, "Cannot access file '%s'", file);
  }
  return EXIT_SUCCESS;
}
// prints:
// Error in fatal.c:21:
// Cannot access file 'plop': No such file or directory
// Aborted
When a program calls the abort()
function in order to
terminate the process, a core dump
file (that describes the
state of the process when the error occurred) can be generated in order to debug the
program with gdb
.
To activate the generation of a core dump, run the command
ulimit -c unlimited
. Then, the function
abort()
generates a core dump which can be
supplied to gdb
:
$ ./fatal plop
Error in fatal.c:21:
Cannot access file 'plop': No such file or directory
Abandon (core dumped)
$ gdb ./fatal core
GNU gdb (Debian 8.1-4+b1) 8.1
[...]
Reading symbols from ./fatal...(no debugging symbols found)...done.
[New LWP 11589]
Core was generated by `./fatal plop'.
Program terminated with signal SIGABRT, Aborted.
#0 __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
50 ../sysdeps/unix/sysv/linux/raise.c: Aucun fichier ou dossier de ce type.
(gdb) bt
#0 __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
#1 0x00007ffff7dfb535 in __GI_abort () at abort.c:79
#2 0x0000555555555232 in main (argc=2, argv=0x7fffffffdcd8) at fatal.c:21
On Linux distributions running systemd, the core dumps are managed by
coredumpctl
:
coredumpctl list
prints the list of all the available
core dumpscoredumpctl info
display basic information (such as the
command line that invoked the program, or the backtrace of the threads
when the crash occurred) about the last core dump.
coredumpctl debug
invokes a debugger (eg.
gdb
) on the last core dumpsp
register)rbp
registersp
to make space to save registers, and for
local variablesra
ra
sp
back to its previous valuera
Depending on the CPU architecture (and sometimes the compiler), the way of making a function call may vary.
x86
32 bits

On 32-bit x86
architectures, parameters are placed on
the stack so that the first argument is located at address
ebp + 8
, the second at address ebp + 12
(if
the first argument is stored on 4 bytes), etc.
The return address (i.e. the address of the instruction to run after
function) is stored on the stack at the address ebp+4
.
x86
64 bits

On 64-bit x86
architectures, the parameters are passed
via the rdi
, rsi
, rdx
,
rcx
, r8
and r9
registers. If
there are more than 6 parameters, the next parameters are placed on the
stack.
Arm
On Arm
architectures, parameters are passed via
registers (x0
to x7
on Arm
64
bits). The return address is also stored in a register.
RISC-V
On RISC-V
architectures, parameters are passed via
registers (a0
to a7
) like Arm
. If
there are more parameters, or their values do not fit in 64-bit
registers, they are placed on the stack. The return address is also
stored in a register. The address of the previous stack frame is not
kept; the compiler issues an instruction to increment the stack pointer
back to the previous stack frame. This means the preamble and epilogue of
compiled functions are slightly more complicated.
Here is an example of buffer overflow:
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char**argv) {
  int N = 4;
  char tab[N];
  int a = 17;
  for(int i=0; i<=N ; i++) {
    tab[i] = 'a'+i;
  }
  printf("tab = {%c, %c, %c, %c}\n", tab[0], tab[1], tab[2], tab[3]);
  printf("a = %d\n", a);
  return 0;
}
Here, the bug comes from the loop in charge of filling the array,
which iterates one time too many (because of <=). After the
first 4 iterations, here is the memory status:
During the fifth iteration, the modification of tab [4]
may modify one byte of the variable a
:
The variable a
is therefore no longer equal to
17
, but 69
(or 0x45
).
Buffer overflow bugs are potentially serious for the
security of a system, because depending on an input (e.g. a string
entered by the user), the bug may modify the behavior of the application
(without necessarily crashing the program). In our example, if
the variable a
matches the username, the bug could allow
attackers to pretend to be someone else (for example, an
administrator)!
Buffer overflows are among the most common security vulnerabilities. To be convinced of this, just look for the vulnerability announcements that mention “buffer overflow” (around 780 faults in 2017)
Here is an example of stack overflow:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

void foo(char* str) {
  char new_str[16];
  strcpy(new_str, str);
  printf("new_str = %s\n", new_str);
}

int main(int argc, char**argv) {
  foo(argv[1]);
  printf("Back in main()\n");
  return 0;
}
Here, the foo
function does not check that
new_str
is large enough to hold str
. So if
str
is too long, strcpy
overflows and may
overwrite the return address of foo
.
Here is an example of execution leading to a stack overflow:
$ gdb ./stack_overflow
(gdb) r coucouAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
The program being debugged has been started already.
Start it from the beginning? (y or n) y
Starting program: stack_overflow coucouAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
new_str = coucouAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
Program received signal SIGSEGV, Segmentation fault.
0x000055555555518e in foo (str=0x7fffffffe03e "coucou", 'A' <repeats 83 times>) at stack_overflow.c:9
9 }
(gdb) bt
#0 0x000055555555518e in foo (str=0x7fffffffe03e "coucou", 'A' <repeats 83 times>) at stack_overflow.c:9
#1 0x4141414141414141 in ?? ()
#2 0x4141414141414141 in ?? ()
#3 0x4141414141414141 in ?? ()
#4 0x4141414141414141 in ?? ()
#5 0x4141414141414141 in ?? ()
#6 0x4141414141414141 in ?? ()
#7 0x4141414141414141 in ?? ()
#8 0x4141414141414141 in ?? ()
#9 0x0000555555550041 in ?? ()
#10 0x0000000000000000 in ?? ()
(gdb)
Here, we observe that when exiting the foo
function, the
program tries to execute the instruction located at the address
0x4141414141414141
(0x41
is the hexadecimal
value of 'A'
), which generates an error.
We could exploit the bug by inserting in argv [1]
the
address of the function void bar (int a, int b)
as well as
its parameters [@aleph1996smashing].
Avoid functions that do not check buffer bounds (strcpy,
gets, …)

Prefer their size-bounded counterparts (strncpy,
fgets, …)

Use the -fstack-protector-all
option in gcc

The implementation of the two operating modes is dependent on the processor architecture.
x86
On x86
, there are four privilege levels called
protection rings. Today, only two are used:
Two intermediate ones were used in a model where device drivers would run separately from the kernel.
RISC-V
Depending on the implemented platform, RISC-V
uses up to
three levels (sometimes called privilege modes).
From most privileged to less privileged:
M
): the level at which the firmware
runsS
) : the kernel levelU
) : the level for normal user
applicationssyscall
function
to handle ecall
ecall
instructionecall
handler

Depending on the type of processor, the way of making a system call may vary. The way to pass the parameters is a convention which can vary from one OS to another. For example, for Linux:
x86_32
ebx
ecx
, edx
, esi
, edi
,
and ebp
registers;eax
register;INT 0x80
;eax
register.x86_64
rdi
, rsi
, rdx
, r10
,
r8
, and r9
registers;rax
register;syscall
instruction ;rax
register.x0
to x5
registers;x8
register;svc 0
instruction;x0
register.RISC-V
The parameters of the system call are stored in the
a0
to a5
registers;
The system call number is loaded in the a7
register;
Switching to kernel mode is done with the ecall
instruction
ecall
is the generic instruction for a privilege level
to call into the immediately-lower privilege level, i.e., from
U
to S
and from S
to
M
;The return value of the system call is stored in the
a0
register.
Devices use the memory bus for reads/writes
The DMA controller manages the transfer between peripherals or memory
\(\rightarrow\) The processor can execute instructions during an I/O
Processors use memory bus to access devices
Device memory is mapped in memory
Request / response protocol, special instructions
in
/out
stvec
0x14
signals a data block is
available0x14
:
0x2
0x2
In Non Uniform Memory Access architectures (NUMA), a device is linked
to only one NUMA node. On RISC-V
architectures, this means
a device is linked to only one PLIC, as there is one PLIC per NUMA node.
So only a processor from this NUMA node can serve interrupts from this
device.
x86
architectureOn x86
, interrupt routing goes through two tables
configured by the kernel:
Two tables allow more flexibility than a single table which
associates an IRQ number directly with a manager. This is different from
RISC-V
architecture where there can only be one interrupt
handler, that must check the kind of the interrupt to serve it (e.g., a
device interrupt, a timer interrupt, etc.).
Example of routing:
A device sends an IRQ (for example
0x14
)
The routing table associates IRQ14
with IDT47
The IDT table indicates that IDT47
is managed by the function handle_disk_interrupt
This is with only one processor; on multicore x86
systems:
XAPIC protocol on pentium (x2APIC since Intel Core processors)
Each core has a number called APIC number (Advanced Programmable Interrupt Controller)
Each core handles interrupts via its LAPIC (local APIC)
An IOAPIC routes an interrupt to a given LAPIC
SIE
(Supervisor Interrupts
Enable) in register SSTATUS
RISC-V
architecturesThe PLIC (usually an APLIC, Advanced PLIC) remains used by devices that do not (need to) support MSI. When the platform supports MSIs, the APLIC converts wired (i.e., non message signaled) interrupts into MSIs. This is configured by the OS, as if they were direct MSIs from external devices.
x86
On x86
systems, MSIs work roughly the same:
x86
architecturesThe interrupt handler (the function addressed by register
stvec
) is also called when system calls
and exceptions occur
ecall
, which triggers an interrupt of this
type0x5
stvec
points to the unique entrypoint
into the kernel:
On x86
systems, the IDT table is used for every possible
interruption:
int 0x64
simply generates the interrupt IDT 0x64
IDT 0x00
, an
access illicit memory (SIGSEGV
) the interrupt
IDT 0x0e
etc.The IDT table is therefore the table that contains all of the entry points to the kernel:
Jiffies: global time source to update the date
Tick: core-local time source used for scheduling
A process needs to be present in main memory to run
Central memory divided into two parts
Memory management concerns the process space
Memory capacities are increasing, but so are the requirements \(\rightarrow\) Need for multiple memory levels
Principle of inclusion to limit updates between different levels
Regarding the principle of inclusion, in an Intel architecture, the L1 cache (Level 1) is included in L2 cache (Level 2), which is itself included in RAM, which is included in the swap (disk).
Here are the typical access times to data located in the different types of memory on a “classic” machine (Intel Core i5 Skylake processor) in 2017 :
The following table shows the cost difference between the types of memory (and the evolution over the years):
| Year | 2008 | 2009 | 2010 | 2014 | 2019 | 2023 |
|---|---|---|---|---|---|---|
| Hard disk drive, 7200 rpm (in €/GiB) | 0.50 | 0.32 | 0.10 | 0.04 | 0.027 | 0.023 |
| SSD disk (in €/GiB) | – | – | – | 0.50 | 0.17 | 0.09 |
| USB key (in €/GiB) | – | – | 1.64 | 0.62 | 0.27 | 0.06 |
| NVMe (in €/GiB) | – | – | – | – | 0.21 | 0.10 |
| RAM (in €/GiB) | – | 37.00 | 21.85 | 8.75 | 7.23 | 2.75 |
The memory pages of a process can be
\(\rightarrow\) each process has a contiguous memory space to store its data
The paging mechanism
In Linux, page frames are 4 KiB in size (size defined by the constants
PAGE_SIZE
and PAGE_SHIFT
in the file
page.h
).
An address on k bits is split into:

a page number on p bits

an offset on d = (k - p) bits
On 64-bit Intel (x86_64
) or ARM 64 bits architectures
(ARMv8), the addresses are stored on 64 bits
(i.e. sizeof(void *)
is 8 bytes), but only 48 bits are
usable for virtual addresses.
On RISC-V
architectures, a system can choose from four
virtual address sizes, including 48 bits as other architectures, and
also 57 bits.
Some applications use large amounts of data (sometimes several GiB)
that must be placed in a large number of 4 KiB memory pages. In order to
limit the number of pages to handle, some architectures (especially
x86_64
and RISC-V
) allow the use of larger
memory pages (typically 2 MiB and 1 GiB, but also bigger) which are
called huge pages in the Linux kernel (sometimes
megapages and gigapages).
The correspondence between logical address and physical address is done with a page table that contains
x86_64
or RISC-V
, a page table =
4-levels tree
satp
register (cr3
on x86
architectures)n[0..3]
) + 1
offset, then translated using:

uint64_t cur = %satp;              // cur = root table physical address
for(int i=0; i<4; i++)
  cur = ((uint64_t*)cur)[n[i]];    // physical memory access, next entry
return cur + offset;               // add the offset
Intel architectures have Translation Look-aside Buffers (TLB) with 32, 64, or even 256 entries. TLB are sometimes called address translation cache.
.text
,
.data
, etc.)open
mmap
) with the appropriate permissions/proc/<pid>/maps
$ cat /proc/self/maps
5572f3023000-5572f3025000 r--p 00000000 08:01 21495815 /bin/cat
5572f3025000-5572f302a000 r-xp 00002000 08:01 21495815 /bin/cat
5572f302e000-5572f302f000 rw-p 0000a000 08:01 21495815 /bin/cat
5572f4266000-5572f4287000 rw-p 00000000 00:00 0 [heap]
7f33305b4000-7f3330899000 r--p 00000000 08:01 22283564 /usr/lib/locale/locale-archive
7f3330899000-7f33308bb000 r--p 00000000 08:01 29885233 /lib/x86_64-linux-gnu/libc-2.28.so
7f33308bb000-7f3330a03000 r-xp 00022000 08:01 29885233 /lib/x86_64-linux-gnu/libc-2.28.so
[...]
7f3330ab9000-7f3330aba000 rw-p 00000000 00:00 0
7ffe4190f000-7ffe41930000 rw-p 00000000 00:00 0 [stack]
7ffe419ca000-7ffe419cd000 r--p 00000000 00:00 0 [vvar]
7ffe419cd000-7ffe419cf000 r-xp 00000000 00:00 0 [vdso]
void* malloc(size_t size)
size bytes
void* realloc(void* ptr, size_t size)
malloc
void* calloc(size_t nmemb, size_t size)
malloc
, but memory is initialized to 0void *aligned_alloc( size_t alignment, size_t size )
malloc
. The returned address is a multiple of
alignment
void free(void* ptr)
All these functions are implemented in the standard C library (which in some cases make system calls).
The malloc(3)
algorithm is very efficient. It is
therefore generally not necessary to try to optimize it.
However:
calloc (3)
(it is more efficient than a
malloc(3)
followed by memset(3)
).mallopt
allows to fine tune the behavior
of malloc(3)
__malloc_hook
,
__realloc_hook
and __free_hook
. Be careful,
these mechanisms can lead to reentrancy problems (and they have been removed from recent versions of glibc).
, it is strongly advised
to set the pointer to NULL
. This allows the program to
crash immediately if, by mistake, we access this (now nonexistent) buffer
again using this pointer.

The following program illustrates how setting a pointer to
NULL
allows the program to crash immediately and how using a debugger
allows to quickly find the origin of the error.
/**********************/
/* resetToNULL.c */
/**********************/
/* This program illustrates the value of assigning a variable to NULL
which contains a pointer to an unallocated buffer.
Indeed, it makes a segmentation fault, which makes it possible to
identify that we have executed an illegal operation. Using a
debugger allows to understand the problem.
*/
/* */
/* 1) Compile the program with the option -g */
/* cc -g -o resetToNULL resetToNULL.c */
/* 2) ./resetToNULL */
/* ==> Segmentation fault */
/* 3) ulimit -c unlimited */
/* 4) ./resetToNULL */
/* ==> Segmentation fault (core dumped) */
/* 5) ddd ./resetToNULL core */
#include <stdlib.h>
#include <assert.h>
void h(char *p){
  *p = 'a';
}

void g(char *p){
  h(p);
}

void f(char *p){
  g(p);
}

int main(){
  char *p = NULL;
  f(p);

  p = malloc(1);
  assert(p != NULL);

  f(p);

  free(p);
  p = NULL;

  f(p);

  return EXIT_SUCCESS;
}
Memory alignment depends on the type of data
char
(1-byte), short
(2-bytes),
int
(4-bytes), …A data structure may be larger than its content
A data structure can be packed with
__attribute__((packed))
Memory alignment applies to variables as well as to members of data structures.
The following program illustrates how alignement affects the size of a data structure:
#include <stdio.h>
#include <stdlib.h>
#include <stddef.h>
struct plop {
int a;
char b;
int c;
};
struct plop_packed {
int a;
char b;
int c;
} __attribute__((packed));
int main(void) {
  struct plop p1;
  struct plop_packed p2;

  printf("struct plop -- size: %lu bytes, address: %p\n",
         sizeof(struct plop), &p1);
  printf("\t.a -- size: %lu bytes, address: %p, offset: %lu\n",
         sizeof(p1.a), &p1.a, offsetof(struct plop, a));
  printf("\t.b -- size: %lu bytes, address: %p, offset: %lu\n",
         sizeof(p1.b), &p1.b, offsetof(struct plop, b));
  printf("\t.c -- size: %lu bytes, address: %p, offset: %lu\n",
         sizeof(p1.c), &p1.c, offsetof(struct plop, c));
  printf("\n");

  printf("struct plop_packed -- size: %lu bytes, address: %p\n",
         sizeof(struct plop_packed), &p2);
  printf("\t.a -- size: %lu bytes, address: %p, offset: %lu\n",
         sizeof(p2.a), &p2.a, offsetof(struct plop_packed, a));
  printf("\t.b -- size: %lu bytes, address: %p, offset: %lu\n",
         sizeof(p2.b), &p2.b, offsetof(struct plop_packed, b));
  printf("\t.c -- size: %lu bytes, address: %p, offset: %lu\n",
         sizeof(p2.c), &p2.c, offsetof(struct plop_packed, c));
  printf("\n");
  return 0;
}
$ ./memory_alignment
struct plop -- size: 12 bytes, address: 0x7ffc05ad8184
.a -- size: 4 bytes, address: 0x7ffc05ad8184, offset: 0
.b -- size: 1 bytes, address: 0x7ffc05ad8188, offset: 4
.c -- size: 4 bytes, address: 0x7ffc05ad818c, offset: 8
struct plop_packed -- size: 9 bytes, address: 0x7ffc05ad817b
.a -- size: 4 bytes, address: 0x7ffc05ad817b, offset: 0
.b -- size: 1 bytes, address: 0x7ffc05ad817f, offset: 4
.c -- size: 4 bytes, address: 0x7ffc05ad8180, offset: 5
void *sbrk(intptr_t increment)
increment
bytesvoid *mmap(void *addr, size_t length, int prot, int flags, int fd, off_t offset)
flags
contains MAP_ANON
, does not map
any file, but allocates an area filled with 0s\(\rightarrow\) Non-Uniform Memory Access \(\rightarrow\) On which memory bank to allocate data?
double *array = malloc(sizeof(double)*N);
for(int i=0; i<N; i++) {
  array[i] = something(i);
}
#pragma omp parallel for
for(int i=0; i<N; i++) {
double value = array[i];
/* ... */
}
void *numa_alloc_interleaved(size_t size)
double *array = numa_alloc_interleaved(sizeof(double)*N);

for(int i=0; i<N; i++) {
  array[i] = something(i);
}
#pragma omp parallel for
for(int i=0; i<N; i++) {
double value = array[i];
/* ... */
}
It is also possible to use set_mempolicy
in order to
choose an allocation strategy for future memory allocations.
mbind
long mbind(void *addr, unsigned long len, int mode, const unsigned long *nodemask, unsigned long maxnode, unsigned flags)
double *array = malloc(sizeof(double)*N);
mbind(&array[0], N/4*sizeof(double),
      MPOL_BIND, &nodemask, maxnode,
      MPOL_MF_MOVE);
#pragma omp parallel for
for(int i=0; i<N; i++) {
double value = array[i];
/* ... */
}
In fact, the compiler generally manages to generate a binary which exploits all the capacities of the processor. But the compiler sometimes fails and generates non-optimized code. We must therefore be able to detect the problem, and be able to write code that the compiler can optimize.
1965 - 2005
\(\implies\) Increased processor performance
Since 2005
The number of steps required to execute an instruction depends on the processor type (Pentium 4: 31 steps, Intel Haswell: 14-19 steps, ARM9: 5 steps, etc.)
At each stage, several circuits are used
\(\rightarrow\) One instruction is executed at each cycle
\(\implies\) several instructions executed simultaneously!
Limitations of the superscalar:
There should be no dependency between statements executed simultaneously.
Example of non-parallelizable instructions
a = b * c;
d = a + 1;
cmp a, 7 ; a > 7 ?
ble L1
mov c, b ; b = c
br L2
L1: mov d, b ; b = d
L2: ...
\(\implies\) waste of time
The cost of a wrong choice when loading a branch depends on pipeline depth: the longer the pipeline, the longer it takes to empty it (and therefore wait before executing an instruction). For this reason (among others), the depth of the pipeline in a processor is limited.
0x12 loop:
...
0x50 inc eax
0x54 cmpl eax, 10000
0x5A jl loop
0x5C end_loop:
...
The branch prediction algorithms implemented in modern processors are very advanced and reach a efficiency greater than 98 % (on the SPEC89 benchmark suite).
To know the number of good / bad predictions, we can analyze the
hardware counters of the processor. With the PAPI library http://icl.cs.utk.edu/projects/papi/, the
PAPI_BR_PRC
and PAPI_BR_MSP
counters give the
number of conditional jumps correctly and incorrectly predicted.
Linux perf
also allows collecting this information (among
others). For example:
$ perf stat -e branches,branch-misses ./branch_prediction 0
is random is not set
100000000 iterations in 1178.597000 ms
result=199999996
Performance counter stats for './branch_prediction 0':
2447232697 branches
6826229 branch-misses # 0,28% of all branches
1,179914189 seconds time elapsed
1,179784000 seconds user
0,000000000 seconds sys
for(i=0; i<size; i++) {
  C[i] = A[i] * B[i];
}
Example: image processing, scientific computing
Using vector instructions (MMX, SSE, AVX, …)
Instructions specific to a processor type
Process the same operation on multiple data at once
for(i=0; i<size; i+= 8) {
*pC = _mm_mul_ps(*pA, *pB);
  pA++; pB++; pC++;
}
Vector instructions became widespread at the end of the 1990s with the MMX (Intel) and 3DNow! (AMD) instruction sets, which work on 64 bits (for example to process two 32-bit operations at once). Since then, each generation of x86 processors has brought new extensions to the instruction set: SSE2, SSSE3 (128-bit), SSE4, AVX, AVX2 (256-bit), AVX-512 (512-bit). The other types of processors also provide vector instruction sets (e.g. NEON [128-bit] and the Scalable Vector Extension [SVE] on ARM, or the Vector extension of RISC-V).
Vector instruction sets are specific to certain processors. The
/proc/cpuinfo
file contains (among others) the instructions
sets that are available on the processor of a machine. For example, on
an Intel Core i7:
$ cat /proc/cpuinfo
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 69
model name : Intel(R) Core(TM) i7-4600U CPU @ 2.10GHz
stepping : 1
microcode : 0x1d
cpu MHz : 1484.683
cache size : 4096 KB
physical id : 0
siblings : 4
core id : 0
cpu cores : 2
apicid : 0
initial apicid : 0
fpu : yes
fpu_exception : yes
cpuid level : 13
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov
pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx
pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl
xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64
monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid
sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx
f16c rdrand lahf_lm abm ida arat epb pln pts dtherm tpr_shadow vnmi
flexpriority ept vpid fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms
invpcid xsaveopt
bugs :
bogomips : 5387.82
clflush size : 64
cache_alignment : 64
address sizes : 39 bits physical, 48 bits virtual
power management:
[...]
The flags
field contains the list of all the
capabilities of the processor, especially the available
instructions sets:
mmx, sse, sse2, ssse3, sse4_1, sse4_2, avx2
.
Vector instruction can be used directly in assembler or by exploiting
the intrinsics provided by compilers. However, because of the
number of available instruction sets and since each new processor
generation provides new instruction sets, it is recommended to let
the compiler optimize the code, for example using the -O3
option.
Problem with superscalar / vector processors:
Simultaneous Multi-Threading (SMT, or Hyperthreading)
SMT is an inexpensive way to increase the performance of a processor: by duplicating the “small” circuits (ALU, registers, etc.) and by pooling the “big” circuits (FPU, prediction of branches, caches), we can execute several threads simultaneously. The additional cost in terms of manufacturing is light and the gain in performance can be significant.
Since the dispatcher schedules the instructions of several threads, a branch misprediction becomes less serious: while the pipeline of the mispredicting thread is emptied, another thread can be scheduled.
The performance gain when multiple threads are running is not systematic since some circuits remain shared (by example, the FPU).
Limited scalability of SMT
dispatcher is shared
FPU is shared
\(\rightarrow\) Duplicate all the circuits
It is of course possible to combine multi-core with SMT. Most semiconductor foundries produce multi-core SMT processors: Intel Core i7 (4 cores x 2 threads), SPARC T3 Niagara-3 (16 cores x 8 threads), IBM POWER 7 (8 cores x 4 threads).
\(\rightarrow\) Non-Uniform Memory Architecture
The first NUMA machines (in the 1990s) were simply sets of machines linked by a proprietary network responsible for managing memory transfers. Since 2003, some motherboards allow to plug several Opteron processors (AMD) connected with a HyperTransport link. Intel subsequently developed a similar technology (Quick Path Interconnect, QPI) to connect its Nehalem processors (released in 2007).
Until the 1990s, performance was limited by the performance of the processor. From the software point of view, developers had to minimize the number of instructions to be executed in order to achieve the best performance.
As the performance of processors increases, the bottleneck is now the memory. On the software side, we therefore seek to minimize the number of costly memory accesses. This pressure on memory is exacerbated by the development of multi-core processors.
For example, an Intel Core i7 processor can generate up to 2 memory accesses per clock cycle. An 18-core processor with hyper-threading (i.e. 36 threads) running at 3.1 GHz can therefore generate \(2 \times 36 \times 3.1 \times 10 ^ 9 = 223.2\) billion memory references per second. If we consider accesses to 64-bit data, this represents 1662 GiB/s (1.623 TiB/s). In addition to these data accesses, the memory accesses to the instructions (up to 128 bits per instruction) also have to be taken into account. We thus arrive at a maximum flow of 3325 GiB/s (3.248 TiB/s!).
For comparison, in 2023 a DDR5 RAM DIMM has a maximum throughput of around 70 GiB/s. It is therefore necessary to set up mechanisms to prevent the processor from spending all its time waiting for memory.
To visualize the memory hierarchy of a machine, you can use the
lstopo
tool provided by the hwloc
project.
\(\rightarrow\) Mainly used for small caches (ex: TLB)
The size of a cache line depends on the processor (usually between 32
and 128 bytes). You can find this information in
/proc/cpuinfo
:
$ cat /proc/cpuinfo |grep cache_alignment
cache_alignment : 64
\(\rightarrow\) Direct access to the cache line
Warning: risk of collision
example: with 16-byte lines and 256 sets, the address 0x12345678 is split into tag 0x12345, index 0x67, offset 0x8; the address 0xbff72678 has the same index 0x67, so the two addresses map to the same cache line
\(\rightarrow\) K-way associative cache (in French: Cache associatif K-voies)
Nowadays, caches (L1, L2 and L3) are generally associative to 4 (ARM Cortex A9 for example), 8 (Intel Sandy Bridge), or even 16 (AMD Opteron Magny-Cours) ways.
What if 2 threads access the same cache line?
Concurrent read: replication in local caches
Concurrent write: need to invalidate data in other caches
Cache snooping: the cache sends a message that invalidates the other caches
To go into more detail than this course, we recommend this web page: Modern Microprocessors – A 90-Minute Guide! http://www.lighterra.com/papers/modernmicroprocessors/.
For (many) more details, read the books [@bryant] and [@patterson2013computer] which describe in detail the architecture of computers. If you are looking for specific details, read [@patterson2011computer].
In this lecture, we mainly talk about files, as they are the easiest type of I/O to manipulate. However, note that the content of the first 3 sections applies to I/O other than files (e.g. sockets).
A file is a series of contiguous bytes stored in a medium (for example, a disk) under a name (the “name of the file”).
We distinguish several types of files:
text files, in which the end of a line is marked by the character of ASCII code 10 on Unix, while on Windows it is marked by the character of ASCII code 13 followed by the character of ASCII code 10;
binary files, for all the other contents.
On Unix, the commands hexdump -C filename, bless filename or xxd filename show the exact content of a file. Use them to compare the contents of helloWorldUnix.c and helloWorldWindows.c, and to see that the file default_names_fichierIssuDuTP10DuModuleCSC4103.txt is not quite a text file (and, also, to see how accented characters are stored in a file).
When you “open” a file, the operating system provides a notion of current position (sometimes called offset in the rest of this course) for reading or writing.
This current position determines which byte in the file will be read/written during the next I/O operation.
This offset advances each time a read or write operation is performed.
The operating system provides the user with primitives to explicitly change this position (without reading or writing bytes).
The “end of a file” corresponds to the location just after the last byte of the file. When a program reaches the end of file, it cannot read any more bytes. On the other hand, the program can still write bytes (depending on the mode in which the file was opened).
There are 3 ways to access a file:
The Linux system and the C library provide sequential and direct access modes. For an indexed sequential access mode, other libraries are required (Unix NDBM, GDBM, Oracle Berkeley DB, …).
Buffered I/O
\(\rightarrow\) a buffered I/O \(\neq\) an operation on the disk
fopen, fread, fscanf, fwrite, fprintf, etc.
these functions manipulate a FILE* stream
Unbuffered I/O
open, read, write, etc.
these functions manipulate an int file descriptor
\(\dag\) To be exact, an “unbuffered” I/O generates a system call. The OS can then decide to cache the data or not.
int open(const char *path, int flags, mode_t mode)
returns f_id
flags can take one of the following values:
O_RDONLY: read only
O_WRONLY: write only
O_RDWR: read and write
Additional flags:
O_APPEND: append data (write at the end of the file)
O_TRUNC: truncate (empty) the file when opening it
O_CREAT: create the file if it does not exist. The permissions are \((mode\;\&\;\sim umask)\)
O_SYNC: open the file in synchronous write mode
O_NONBLOCK (or O_NDELAY): open and subsequent operations performed on the descriptor will be non-blocking
int close(int desc)
About the O_SYNC
option in open
:
To improve performance, by default, during a write operation, the operating system does not physically write the bytes on disk (they are stored in a kernel cache, waiting to be written to disk)
Therefore, in the event of a sudden stop of the machine (example: power outage):
Solutions to synchronize the file data in memory with the disk:
use the O_SYNC option when opening the file;
use the int fsync(int fd) primitive.
Note that we can also create a file using the creat primitive:
int creat(const char *path, mode_t mode): return value = f_id
equivalent to open(path, O_WRONLY|O_CREAT|O_TRUNC, mode).
.ssize_t read(int fd, void *buf, size_t count)
returns the number of bytes successfully read
When read returns, the buf zone contains the read data.
In the case of a file, the number of bytes read may not be equal to count (for example, when fewer than count bytes remain before the end of the file).
In the case where the read function is used on a descriptor other than a file (e.g. a pipe, or a socket), the fact that the number of bytes read may not equal count may have other meanings: for example, read returns only the bytes currently available, and a return value of 0 means that the other end has been closed.
ssize_t write(int fd, const void *buf, size_t count)
returns the number of bytes written
In the case of a file, the return value (without error) of the write operation means that:
the bytes have been copied into a kernel cache (they will be written to the disk later) if O_SYNC was not specified when opening the file;
the bytes have been physically written to the disk if O_SYNC was specified.
In the case of a file, a number of bytes written different from count means an error (e.g. no space left on device).
Writing to disk is atomic: if two processes \(P_1\) and \(P_2\) simultaneously write to the same file in the same location, when the two processes have finished their writing, we will find:
Note that when the file is opened with the option
O_APPEND
, if \(P_1\) and
\(P_2\) write simultaneously (at the
end of the file, because of O_APPEND
), when the two
processes will have finished their writing, we will find at the end of
file:
No write is therefore lost! Be careful: this concurrent write at the end of the file is not equivalent to two processes simultaneously performing the following operations:
lseek(fd, 0, SEEK_END); /* move the cursor to the end of file */
write(fd, data, size);
In the latter case, one of the written data may be overwritten by the other.
The copy.c file below illustrates the use of open, read, write and close.
/************/
/*  copy.c  */
/************/
#include <stdlib.h>
#include <unistd.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <string.h>
#include <stdio.h>

#define USAGE "USAGE: copy src dest\n"
#define WRITE_ERROR "write error (no space left on device ?)\n"

int source, dest;
int buf;
int nb_read, nb_written;

int main(int argc, char *argv[]) {
  if (argc != 3) {
    write(STDERR_FILENO, USAGE, strlen(USAGE));
    return EXIT_FAILURE;
  }
  source = open(argv[1], O_RDONLY);
  if (source < 0) {
    perror(argv[1]);
    return EXIT_FAILURE;
  }
  dest = open(argv[2],
              O_WRONLY|O_CREAT|O_TRUNC,
              S_IRWXU|S_IRWXG|S_IRWXO);
  if (dest < 0) {
    perror(argv[2]);
    return EXIT_FAILURE;
  }
  while ((nb_read = read(source, (void*)&buf, sizeof(buf))) > 0) {
    nb_written = write(dest, (void*)&buf, nb_read);
    if (nb_written <= 0) {
      if (nb_written == 0) {
        write(STDERR_FILENO, WRITE_ERROR, strlen(WRITE_ERROR));
      } else {
        perror("write");
      }
      return EXIT_FAILURE;
    }
  }
  if (nb_read < 0) {
    perror("read");
    return EXIT_FAILURE;
  }
  if (close(source) < 0) {
    perror(argv[1]);
    return EXIT_FAILURE;
  }
  if (close(dest) < 0) {
    perror(argv[2]);
    return EXIT_FAILURE;
  }
  return EXIT_SUCCESS;
}
This operation, copying the contents of one file to another descriptor, is frequently performed by web servers: such servers must in particular send the content of files to the clients that requested them. This is why the Linux system offers the sendfile primitive (ssize_t sendfile(int out_fd, int in_fd, off_t *offset, size_t count)). It reads count bytes from in_fd and writes them to out_fd (which must correspond to a socket). sendfile is more efficient than the combination read/write.
The fallocate
function is the Linux specific version of
the portable function posix_fallocate
.
int dup(int old_fd)
returns a new descriptor new_fd which is a synonym of old_fd
int dup2(int old_fd, int new_fd)
forces the descriptor new_fd to become a synonym of the old_fd descriptor. If the descriptor new_fd is not available, the system first closes it (close(new_fd))
struct flock {
short l_type;
short l_whence;
off_t l_start;
off_t l_len;
};
int fcntl(int fd, F_SETLK, struct flock*lock);
Locks are attached to an inode. So locking a file affects all file descriptors (and therefore all open files) corresponding to this inode
A lock is the property of a process: this process is the only one authorized to modify or remove it
Locks have a scope of \([integer1: integer2]\) or \([integer: \infty]\)
Locks have a type:
F_RDLCK
: allows concurrent read accessF_WRLCK
: exclusive access
The exclusive_lock.c file illustrates exclusive file locking:
/***********/
/* exclusive_lock.c */
/***********/
#include <stdlib.h>
#include <unistd.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <stdio.h>

int main(){
  int fd;
  struct flock lock;

  fd = open("/tmp/ficTest", O_RDWR|O_CREAT, S_IRWXU|S_IRWXG|S_IRWXO);
  if (fd < 0) {
    perror("open");
    exit(EXIT_FAILURE);
  }

  /* Exclusive lock on the 15th byte */
  lock.l_type = F_WRLCK;
  lock.l_whence = SEEK_SET;
  lock.l_start = 15;
  lock.l_len = 1;

  /* Because of the F_SETLKW parameter, we get stuck on the fcntl if */
  /* the lock cannot be acquired */
  printf("attempt to acquire an exclusive lock by process %d...\n",
         getpid());
  if (fcntl(fd, F_SETLKW, &lock) < 0){
    perror("Acquiring lock");
    exit(EXIT_FAILURE);
  }
  printf("... Exclusive lock acquired by process %d\n", getpid());

  /* Here we could do the processing that needed to be protected */
  /* by the lock */
  sleep(10);

  /* Release the lock */
  printf("Releasing the lock by process %d...\n", getpid());
  lock.l_type = F_UNLCK;
  lock.l_whence = SEEK_SET;
  lock.l_start = 15;
  lock.l_len = 1;
  if (fcntl(fd, F_SETLK, &lock) < 0){
    perror("Releasing lock");
    exit(EXIT_FAILURE);
  }
  printf("...OK\n");

  return EXIT_SUCCESS;
}
The shared_lock.c file illustrates shared file locking:
/*****************/
/* shared_lock.c */
/*****************/
#include <stdlib.h>
#include <unistd.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <stdio.h>

int main(){
  int fd;
  struct flock lock;

  fd = open("/tmp/ficTest", O_RDWR|O_CREAT, S_IRWXU|S_IRWXG|S_IRWXO);
  if (fd < 0) {
    perror("open");
    exit(EXIT_FAILURE);
  }

  /* Shared lock on the 15th byte */
  lock.l_type = F_RDLCK;
  lock.l_whence = SEEK_SET;
  lock.l_start = 15;
  lock.l_len = 1;

  /* Because of the F_SETLKW parameter, we get stuck on the fcntl if */
  /* the lock cannot be acquired */
  printf("attempt to acquire a shared lock by process %d...\n",
         getpid());
  if (fcntl(fd, F_SETLKW, &lock) < 0){
    perror("Acquiring lock");
    exit(EXIT_FAILURE);
  }
  printf("... shared lock acquired by process %d\n", getpid());

  /* Here we could do the processing that needed to be protected */
  /* by the lock */
  sleep(10);

  /* Release the lock */
  printf("Releasing the lock by process %d...\n", getpid());
  lock.l_type = F_UNLCK;
  lock.l_whence = SEEK_SET;
  lock.l_start = 15;
  lock.l_len = 1;
  if (fcntl(fd, F_SETLK, &lock) < 0){
    perror("Releasing lock");
    exit(EXIT_FAILURE);
  }
  printf("...OK\n");

  return EXIT_SUCCESS;
}
If we run exclusive-lock first, a second exclusive-lock or a shared-lock must wait before being able to lock.
If we run shared-lock first, another shared-lock can set the (shared) lock. On the other hand, an exclusive-lock must wait to be able to lock.
Note that exclusive_lock may suffer starvation:
To prevent this starvation, we must add a mutual exclusion.
off_t lseek(int fd, off_t offset, int whence)
returns the new offset
allows handling the offset of the file
Warning! Race condition if several threads manipulate the same file
Solutions:
use pread or pwrite instead of lseek + read or lseek + write
int posix_fadvise(int fd, off_t offset, off_t len, int advice)
advice can be POSIX_FADV_SEQUENTIAL, POSIX_FADV_RANDOM, POSIX_FADV_WILLNEED, etc.
Since January 2011, we know that this function is used in Firefox to reduce startup time by 40% to 50% by loading the GUI libraries xul.dll and mozjs.dll more efficiently (more information here: <https://bugzilla.mozilla.org/show_bug.cgi?id=627591>).
int aio_read(struct aiocb *aiocbp);
int aio_write(struct aiocb *aiocbp);
int aio_suspend(const struct aiocb * const aiocb_list[],
int nitems,
const struct timespec *timeout);
int aio_error(const struct aiocb *aiocbp);
For more information on asynchronous I/O, refer to the documentation
(man 7 aio
).
The current implementation of POSIX AIO is provided in user land by the libc and can cause scalability issues. Another solution is to use the asynchronous I/O interface provided by the Linux kernel (see the io_submit, io_setup, etc. system calls), or the libaio library, which provides an overlay to the Linux system calls.
void *mmap(void *addr,
size_t length,
int prot,
int flags,
int fd,
off_t offset);
int munmap(void *addr, size_t length);
To ensure that the memory accesses have been passed on to the disk,
you can use the msync
function.
Device = hardware component other than CPU and memory
Device driver = software allowing access to a device
A device is identified by a number called dev
Most significant bits (major): driver number
Least significant bits (minor): device number
The kernel contains a table which associates a driver number with the driver (access function + status)
“character” devices
\(\rightarrow\) blocks the CPU during the I/O operation
“block” devices
\(\rightarrow\) does not block the CPU during the I/O operation
A single block device driver in xv6
virtio_disk_rw() in virtio.c
virtio_disk_rw() takes two parameters:
a boolean, write, telling whether it is a read or a write
a buf structure (buf.h)
buf.dev/blockno: access block blockno of disk dev
buf.data: the data read or written
if write == 0, at the output of virtio_disk_rw, data = the data read
if write == 1, at the input of virtio_disk_rw, data = the data to write
virtio_disk_rw
algorithmvirtio_disk_rw
mainly performs the following actions:
Setup the DMA data transfer:
Sleep the process with the sleep
function (see
lecture #4)
\(\rightarrow\) switch to another ready process
when the transfer is complete, the disk raises an interrupt, handled by the virtio_disk_intr function
virtio_disk_intr calls wakeup to wake up the sleeping process
xv6 is written to run on a virtual machine
, i.e., on a
special environment where devices are indeed virtualized. One interface
designed to perform best with those virtual devices is the
virtio interface. While the virtio
protocol is different from the one used by real, physical block devices
(e.g., IDE
or SATA
), in both cases, DMA and
interrupts are used.
Disk access is very slow compared to memory access
I/O cache improves the performance of block type devices
The system manages a set of buffers in memory
To read a block (read operation)
To modify a block (write operation)
buffer cache = the xv6 I/O cache (bio.c)
the cache is a set of buf structures
a buf structure is associated with a block of a disk
a buf can be valid if its block’s data has been read, invalid otherwise
a buf has a reference counter to avoid eviction while still in use
the buf structures form a circular doubly linked list, whose head is the most recently used block
the head is the most recently used blockstruct buf* bget(uint dev, uint blkno)
: return a
locked buffer associated with (dev, blkno
)
dev,blkno
)
dev, blkno
)struct buf* bread(uint dev, uint blkno)
bget()
to find a buffer for this
blockvirtio_disk_rw()
void bwrite(struct buf* b)
virtio_disk_rw()
to write the buffer data to the
diskvoid brelse(struct buf* b)
b
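Typical use of this interface by xv6 filesystem code looks like the following sketch (xv6 kernel pseudocode, shown for illustration; it does not compile outside the xv6 tree):

```c
// Read-modify-write of one disk block through the buffer cache (xv6 kernel).
struct buf *b = bread(dev, blkno); // locked buffer, data guaranteed valid
// ... examine or modify b->data ...
bwrite(b);                         // propagate the modification to the disk
brelse(b);                         // unlock + decrement the reference counter
```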
A write operation of a process often requires several block writes
File creation requires:
Adding data to a file requires:
Deleting a file requires:
…
The system can crash anytime
\(\rightarrow\) Inconsistency if it stops in the middle of an operation
Operations must be propagated in the order in which they were performed
\(\rightarrow\) Inconsistency if propagation in random order
No cache when writing (directly propagate write operations)
Recovery in the case of a crash
Recovering a file system is slow
examples: FAT32 on Windows or ext2 on Linux
Recovering is not always possible
\(\rightarrow\) a crash makes the filesystem unusable!
A transaction is a set of writes that is
Principle of implementation
To ensure that the writes are propagated in the order in which they were executed, the pending zone is structured as a log
Problems: Multiple processes may perform transactions in parallel
\(\rightarrow\) How do you know which ones are validated?
Classic solution
The system technically manages two logs
One in memory called memory log
One on disk called disk log
\(\rightarrow\) The system can therefore manage up to 3 copies of a block
when a block n is modified in a transaction, the system adds n to the list of modified blocks
in the memory log
in the memory logThree functions in the log management interface
(log.c
)
begin_op()
: start a transactionend_op()
: validate a transactionlog_write(struct buf* b)
: add b
to the
transactionTo perform a logged operation, instead of calling directly
bwrite ()
, we have to execute:
();
begin_op= bread(...);
b // Modify data of b
...
(b2);
log_write...
(); end_op
void begin_op(): start a transaction
increments the number of outstanding operations (log.outstanding)
void end_op(): complete a transaction
decrements the number of operations in progress, and if it reaches 0:
writes the modified blocks to the disk log (write_log())
writes the list of modified blocks to the disk log header (write_head())
propagates the writes to the filesystem (install_trans())
clears the list of modified blocks in the disk log header (write_head())
void log_write(struct buf* b)
adds b to the list of blocks in the memory log
keeps a reference on b to prevent it from leaving the buffer cache
prevent it from leaving the buffer cacheThe log controls block writes and releases through
log_write()
and end_op()
. System calls that
implement access to blocks never use bwrite()
and
brelse()
directly. Instead, the log keeps track of blocks
that must be written to disk: they are called dirty blocks,
because their content cached in the buffer cache is different from their
content in the filesystem on the disk.
as long as a block has been passed to log_write(), the log keeps a reference on the buffers of dirty blocks to prevent their eviction, until it calls brelse() in end_op()
end_op() commits transactions by writing logged dirty blocks to the disk log, and then to the filesystem, using bwrite()
after a crash, the recovery code calls install_trans(), which propagates the writes from the disk log to the file system
File system: defines the structure for storing files (often for a block type device)
File = consistent set of data that can be read or written
Filesystem = associate names and files
Example : /etc/passwd
→
root:*:0:0:System Administrator...
Usually a special symbol is used as a separator for directories
/
in UNIX systems, \(\backslash\) in Windows systemsA disk is often made up of several partitions
Typical structure of a disk
First block: partition table
Blocks 2 to x: kernel loader
Blocks x to y: partition 1
Blocks y to z: partition 2
etc…
A file can itself contain the data of a complete disk
xv6.img is the disk image used with the qemu emulator to start xv6
Five large contiguous zones (in fs.h
)
A file on disk consists of:
metadata called a dinode (fixed size, see
fs.h
)
data blocks
A dinode directly lists the numbers of the first 12 blocks
dinode.addrs [0]
block contains bytes 0 to 511 of
the filedinode.addrs [i]
block contains the bytes i * 512
to i * 512 + 511The indirection block contains the following block numbers
ind [0]
block contains bytes 12 * 512 to 12 * 512 +
Note: since a block is 512 bytes and a block number is coded on 4 bytes, a file has a maximum size of \(12 + 512/4 = 140\) blocks.
To add a new block to a dinode dino (function bmap() in fs.c)
allocate a free block (function balloc() in fs.c)
add the number of this block to dino
A directory is a file of type
T_DIR
Contains an array associating names and numbers of dinodes
Inode 1 is necessarily a directory: it is the root directory of the filesystem
Note: dinode.nlink
gives the number of times a
dinode is referenced from a directory
\(\implies\) file deleted when
nlink
equals to 0.
To resolve the path /e0/../en (see namex() in fs.c)
cur = 1
for i in [0 .. n]
look for the association [inum, name] in the data blocks of the dinode cur such that name is ei
cur = inum
To create the file f in the directory d (function create() in sysfile.c)
allocate a new dinode (function ialloc() in fs.c)
add the association [inum, f] to d
To delete the file f from the directory d (function sys_unlink() in sysfile.c)
remove the association [inum, f] from d
decrement nlink of f, and if nlink equals 0
free the data blocks of f
free the dinode of f (setting its type to 0)
(setting its type to 0)inode = memory cache of a dinode
Enter the cache at open()
Can be evicted from cache from close()
Contains the fields of the dinode
+ fields to know which dinode the inode corresponds to
+ fields required when the dinode is used
Inode table = table which contains the inodes
struct inode* iget(int dev, int inum)
Corresponds to open()
: returns an inode associated
with [dev, inum]
Increments the inode usage counter (non-evictable)
Does not lock the inode and does not read the inode from disk (an optimization that avoids reading from the disk when creating a file)
inode.valid
indicates whether the inode has been read
from diskvoid ilock(struct inode* ip)
void iunlock(struct inode* ip)
void itrunc(struct inode* ip)
void iupdate(struct inode* ip)
void iput(struct inode* ip)
corresponds to close()
decrements the inode usage counter
if the counter drops to 0, the inode can be evicted from the cache, and
if nlink is 0 (the inode is no longer referenced by a directory), the file is deleted (its data blocks are freed and its dinode is released)
Note: if you delete a file from a directory (unlink()) while the file is still open in a process, the inode is not deleted: it will be at the last close(), when the reference counter drops to 0.
Multiple processes can open the same file
A file structure opened by open ()
contains:
Each process has an ofile
table of open files
d
is an index in this tableproc[i].ofile[d]
points to an open fileproc[i].ofile[d].ip
points to inodeGood to know
fork()
, the parent and the child share the
open filesproc[parent].ofile[d] == proc[child].ofile[d]
A device driver is just a function (virtio_disk_rw()
for example)
Reads and writes are logged
The kernel has an I/O cache
A file system separates
A file descriptor is an index in the ofile table
proc->ofile[i]
is an open file that references an
inode