Gaël Thomas
Device = hardware component other than CPU and memory
Device driver = software allowing access to a device
A device is identified by a number called dev
Most significant bits (major): driver number
Least significant bits (minor): device number
The kernel contains a table which associates a driver number with the driver (access function + status)
“character” devices
→ blocks the CPU during the I/O operation
“block” devices
→ does not block the CPU during the I / O operation
A single block device driver in xv6
virtio_disk_rw()
in virtio.c
virtio_disk_rw()
takes two parameters:
a boolean, write
, to tell if it is a read or a
write
a buf
(buf.h
) structure
buf.dev/blockno
: access to block blockno
from disk dev
buf.data
: data read or written
write == 0
, the output of
virtio_disk_rw
, data
= data readwrite == 1
, the input of
virtio_disk_rw
, data
= data to write
virtio_disk_rw
algorithmvirtio_disk_rw
mainly performs the following actions:
Setup the DMA data transfer:
Sleep the process with the sleep
function (see
lecture #4)
→ switch to another ready process
virtio_disk_intr
functionvirtio_disk_intr
calls wakeup
to wake up
the sleeping processDisk access is very slow compared to memory access
I/O cache improves the performance of block type devices
The system manages a set of buffers in memory
To read a block (read operation)
To modify a block (write operation)
buffer cache = xv6 I/O cache (bio.c
)
buf
structuresbuf
structure is associated with a block of a disk
buf
can be valid if its block’s data has been read,
invalid otherwisebuf
has a reference counter to avoid eviction
while still in use
buf
structures form a circular double linked list,
the head is the most recently used blockstruct buf* bget(uint dev, uint blkno)
: return a
locked buffer associated with (dev, blkno
)
dev,blkno
)
dev, blkno
)struct buf* bread(uint dev, uint blkno)
bget()
to find a buffer for this
blockvirtio_disk_rw()
void bwrite(struct buf* b)
virtio_disk_rw()
to write the buffer data to the
diskvoid brelse(struct buf* b)
b
A write operation of a process often requires several block writes
File creation requires:
Adding data to a file requires:
Deleting a file requires:
…
The system can crash anytime
→ Inconsistency if it stops in the middle of an operation
Operations must be propagated in the order in which they were performed
→ Inconsistency if propagation in random order
No cache when writing (directly propagate write operations)
Recovery in the case of a crash
Recovering a file system is slow
examples: FAT32 on Windows or ext2 on Linux
Recovering is not always possible
→ a crash makes the filesystem unusable!
A transaction is a set of writes that is
Principle of implementation
To ensure that the entries are propagated in order in which they were executed, the pending zone is structured like a log
Problems: Multiple processes may perform transactions in parallel
→ How do you know which ones are validated?
Classic solution
The system technically manages two logs
One in memory called memory log
One on disk called disk log
→ The system can therefore manage up to 3 copies of a block
n
n
to the list of modified blocks
in the memory logThree functions in the log management interface
(log.c
)
begin_op()
: start a transactionend_op()
: validate a transactionlog_write(struct buf* b)
: add b
to the
transactionTo perform a logged operation, instead of calling directly
bwrite ()
, we have to execute:
void begin_op()
: start a transaction
log.outstanding
)void end_op()
: complete a transaction
Decrement the number of operations in progress, and if equal to 0:
write_log()
)write_head()
)install_trans()
)write_head()
)void log_write(struct buf* b)
b
in the logb
to
prevent it from leaving the buffer cacheAfter a crash, call install_trans()
which propagates
the writes from disk log to file system
File system: defines the structure for storing files (often for a block type device)
File = consistent set of data that can be read or written
Filesystem = associate names and files
Example : /etc/passwd
→
root:*:0:0:System Administrator...
Usually a special symbol is used as a separator for directories
/
in UNIX systems, ∖
in Windows systemsA disk is often made up of several partitions
Typical structure of a disk
First block: partition table
Blocks 2 to x: kernel loader
Blocks x to y: partition 1
Blocks y to z: partition 2
etc…
A file itself can contain the data of a complete disc
xv6.img
is the disk image used with the
qemu emulator to start xv6
Five large contiguous zones (in fs.h
)
A file on disk consists of:
metadata called a dinode (fixed size, see
fs.h
)
data blocks
A dinode directly lists the numbers of the first 12 blocks
dinode.addrs [0]
block contains bytes 0 to 511 of
the filedinode.addrs [i]
block contains the bytes i * 512
to i * 512 + 511The indirection block contains the following block numbers
ind [0]
block contains bytes 12 * 512 to 12 * 512 +
511Note: since a block is 512 bytes and a block number is coded out of 4 characters, a file has a maximum size of 12 + 512/4 blocks.
To add a new block to a dinode dino
(function
bmap ()
in fs.h
)
balloc()
in
fs.h
)dino
A directory is a file of type
T_DIR
Contains an array associating names and numbers of dinodes
Inode 1 is necessarily a directory: it is the root directory of the filesystem
Note: dinode.nlink
gives the number of times a
dinode is referenced from a directory
⟹ file deleted when
nlink
equals to 0.
/e0/../en
(see
namex()
in fs.c
)cur = 1
For i in [0 .. n]
Look for the association [inum, name] in the data blocks of
the cur dinode such that name is ei
cur = inum
To create the file f
in the
d
directory (function create()
in
sysfile.c
)
ialloc ()
in fs.h
)[inum, f]
to d
To delete the file f
from the
d
directory (sys_unlink()
function in
sysfile.c
)
f
in
d
nlink
from f
and if
nlink
equals 0f
f
(setting its type to 0)inode = memory cache of a dinode
Enter the cache at open()
Can be evicted from cache from close()
Contains the fields of the dinode
+ fields to know which dinode the inode corresponds to
+ fields required when the dinode is used
Inode table = table which contains the inodes
struct inode* iget(int dev, int inum)
Corresponds to open()
: returns an inode associated
with [dev, inum]
Increments the inode usage counter (non-evictable)
Do not lock the inode and do not read the inode from disk (optimization to avoid disc playback when creates a file)
inode.valid
indicates whether the inode has been read
from diskvoid ilock(struct inode* ip)
void iunlock(struct inode* ip)
void itrunc(struct inode* ip)
void iupdate(struct inode* ip)
void iput(struct inode* ip)
Corresponds to close ()
Decreases the inode usage counter
If cpt drops to 0, the inode can be evicted from the cache and
If nlink is 0 (the inode is no longer referenced by a directory)
Note: if you delete a file from a directory
(unlink()
) while the file is still in use (open) by a
process, the inode is not deleted: it will be when last
close()
when the reference counter drops to 0.
Multiple processes can open the same file
A file structure opened by open ()
contains:
Each process has an ofile
table of open files
d
is an index in this tableproc[i].ofile[d]
points to an open fileproc[i].ofile[d].ip
points to inodeGood to know
fork()
, the parent and the child share the
open filesproc[parent].ofile[d] == proc[child].ofile[d]
A device driver is just a function (virtio_disk_rw()
for example)
Reads and writes are logged
The kernel has an I/O cache
A file system separates
A file descriptor is an index in the ofile table
proc->ofile[i]
is an open file that references an
inode