Gaël Thomas
Device = hardware component other than CPU and memory
Device driver = software allowing access to a device
A device is identified by a number called dev
Most significant bits (major): driver number
Least significant bits (minor): device number
The kernel contains a table which associates a driver number with the driver (access function + status)
“character” devices
→ blocks the CPU during the I/O operation
“block” devices
→ does not block the CPU during the I / O operation
A single block device driver in xv6
virtio_disk_rw() in virtio.cvirtio_disk_rw() takes two parameters:
a boolean, write, to tell if it is a read or a
write
a buf (buf.h) structure
buf.dev/blockno: access to block blockno
from disk devbuf.data: data read or written
write == 0, the output of
virtio_disk_rw, data = data readwrite == 1, the input of
virtio_disk_rw, data = data to write
virtio_disk_rw algorithmvirtio_disk_rw mainly performs the following actions:
Setup the DMA data transfer:
Sleep the process with the sleep function (see
lecture #4)
→ switch to another ready process
virtio_disk_intr
functionvirtio_disk_intr calls wakeup to wake up
the sleeping processDisk access is very slow compared to memory access
I/O cache improves the performance of block type devices
The system manages a set of buffers in memory
To read a block (read operation)
To modify a block (write operation)
buffer cache = xv6 I/O cache (bio.c)
buf structuresbuf structure is associated with a block of a disk
buf can be valid if its block’s data has been read,
invalid otherwisebuf has a reference counter to avoid eviction
while still in use
buf structures form a circular double linked list,
the head is the most recently used blockstruct buf* bget(uint dev, uint blkno): return a
locked buffer associated with (dev, blkno)
dev,blkno)
dev, blkno)struct buf* bread(uint dev, uint blkno)
bget() to find a buffer for this
blockvirtio_disk_rw()void bwrite(struct buf* b)
virtio_disk_rw() to write the buffer data to the
diskvoid brelse(struct buf* b)
bA write operation of a process often requires several block writes
File creation requires:
Adding data to a file requires:
Deleting a file requires:
…
The system can crash anytime
→ Inconsistency if it stops in the middle of an operation
Operations must be propagated in the order in which they were performed
→ Inconsistency if propagation in random order
No cache when writing (directly propagate write operations)
Recovery in the case of a crash
Recovering a file system is slow
examples: FAT32 on Windows or ext2 on Linux
Recovering is not always possible
→ a crash makes the filesystem unusable!
A transaction is a set of writes that is
Principle of implementation
To ensure that the entries are propagated in order in which they were executed, the pending zone is structured like a log
Problems: Multiple processes may perform transactions in parallel
→ How do you know which ones are validated?
Classic solution
The system technically manages two logs
One in memory called memory log
One on disk called disk log
→ The system can therefore manage up to 3 copies of a block
n
n to the list of modified blocks
in the memory logThree functions in the log management interface
(log.c)
begin_op() : start a transactionend_op() : validate a transactionlog_write(struct buf* b) : add b to the
transactionTo perform a logged operation, instead of calling directly
bwrite (), we have to execute:
void begin_op() : start a transaction
log.outstanding)void end_op() : complete a transaction
Decrement the number of operations in progress, and if equal to 0:
write_log())write_head())install_trans())write_head())void log_write(struct buf* b)
b in the logb to
prevent it from leaving the buffer cacheAfter a crash, call install_trans() which propagates
the writes from disk log to file system
File system: defines the structure for storing files (often for a block type device)
File = consistent set of data that can be read or written
Filesystem = associate names and files
Example : /etc/passwd →
root:*:0:0:System Administrator...
Usually a special symbol is used as a separator for directories
/ in UNIX systems, ∖
in Windows systemsA disk is often made up of several partitions
Typical structure of a disk
First block: partition table
Blocks 2 to x: kernel loader
Blocks x to y: partition 1
Blocks y to z: partition 2
etc…
A file itself can contain the data of a complete disc
xv6.img is the disk image used with the
qemu emulator to start xv6Five large contiguous zones (in fs.h)
A file on disk consists of:
metadata called a dinode (fixed size, see
fs.h)
data blocks
A dinode directly lists the numbers of the first 12 blocks
dinode.addrs [0] block contains bytes 0 to 511 of
the filedinode.addrs [i] block contains the bytes i * 512
to i * 512 + 511The indirection block contains the following block numbers
ind [0] block contains bytes 12 * 512 to 12 * 512 +
511Note: since a block is 512 bytes and a block number is coded out of 4 characters, a file has a maximum size of 12 + 512/4 blocks.
To add a new block to a dinode dino (function
bmap () in fs.h)
balloc() in
fs.h)dinoA directory is a file of type
T_DIR
Contains an array associating names and numbers of dinodes
Inode 1 is necessarily a directory: it is the root directory of the filesystem
Note: dinode.nlink gives the number of times a
dinode is referenced from a directory
⟹ file deleted when
nlink equals to 0.
/e0/../en (see
namex() in fs.c)cur = 1
For i in [0 .. n]
Look for the association [inum, name] in the data blocks of
the cur dinode such that name is ei
cur = inum
To create the file f in the
d directory (function create() in
sysfile.c)
ialloc () in fs.h)[inum, f] to dTo delete the file f from the
d directory (sys_unlink() function in
sysfile.c)
f in
dnlink from f and if
nlink equals 0ff (setting its type to 0)inode = memory cache of a dinode
Enter the cache at open()
Can be evicted from cache from close()
Contains the fields of the dinode
+ fields to know which dinode the inode corresponds to
+ fields required when the dinode is used
Inode table = table which contains the inodes
struct inode* iget(int dev, int inum)
Corresponds to open(): returns an inode associated
with [dev, inum]
Increments the inode usage counter (non-evictable)
Do not lock the inode and do not read the inode from disk (optimization to avoid disc playback when creates a file)
inode.valid indicates whether the inode has been read
from diskvoid ilock(struct inode* ip)
void iunlock(struct inode* ip)
void itrunc(struct inode* ip)
void iupdate(struct inode* ip)
void iput(struct inode* ip)
Corresponds to close ()
Decreases the inode usage counter
If cpt drops to 0, the inode can be evicted from the cache and
If nlink is 0 (the inode is no longer referenced by a directory)
Note: if you delete a file from a directory
(unlink()) while the file is still in use (open) by a
process, the inode is not deleted: it will be when last
close() when the reference counter drops to 0.
Multiple processes can open the same file
A file structure opened by open () contains:
Each process has an ofile table of open files
d is an index in this tableproc[i].ofile[d] points to an open fileproc[i].ofile[d].ip points to inodeGood to know
fork(), the parent and the child share the
open filesproc[parent].ofile[d] == proc[child].ofile[d]A device driver is just a function (virtio_disk_rw()
for example)
Reads and writes are logged
The kernel has an I/O cache
A file system separates
A file descriptor is an index in the ofile table
proc->ofile[i] is an open file that references an
inode