Device = hardware component other than CPU and memory
Device driver = software allowing access to a device
A device is identified by a number called dev
Most significant bits (major): driver number
Least significant bits (minor): device number
The kernel contains a table which associates a driver number with the driver (access function + status)
“character” devices
\(\rightarrow\) blocks the CPU during the I/O operation
“block” devices
\(\rightarrow\) does not block the CPU during the I / O operation
A single block device driver in xv6
virtio_disk_rw()
in virtio.c
virtio_disk_rw()
takes two parameters:
a boolean, write
, to tell if it is a read or a
write
a buf
(buf.h
) structure
buf.dev/blockno
: access to block blockno
from disk dev
buf.data
: data read or written
write == 0
, the output of
virtio_disk_rw
, data
= data readwrite == 1
, the input of
virtio_disk_rw
, data
= data to write
virtio_disk_rw
algorithmvirtio_disk_rw
mainly performs the following actions:
Setup the DMA data transfer:
Sleep the process with the sleep
function (see
lecture #4)
\(\rightarrow\) switch to another ready process
virtio_disk_intr
functionvirtio_disk_intr
calls wakeup
to wake up
the sleeping processxv6 is written to run on a virtual machine
, i.e., on a
special environment where devices are indeed virtualized. One interface
designed to perform best with those virtual devices is the
virtio interface. While the virtio
protocol is different from the one used by real, physical block devices
(e.g., IDE
or SATA
), in both cases, DMA and
interruptions are used.
Disk access is very slow compared to memory access
I/O cache improves the performance of block type devices
The system manages a set of buffers in memory
To read a block (read operation)
To modify a block (write operation)
buffer cache = xv6 I/O cache (bio.c
)
buf
structuresbuf
structure is associated with a block of a disk
buf
can be valid if its block’s data has been read,
invalid otherwisebuf
has a reference counter to avoid eviction
while still in use
buf
structures form a circular double linked list,
the head is the most recently used blockstruct buf* bget(uint dev, uint blkno)
: return a
locked buffer associated with (dev, blkno
)
dev,blkno
)
dev, blkno
)struct buf* bread(uint dev, uint blkno)
bget()
to find a buffer for this
blockvirtio_disk_rw()
void bwrite(struct buf* b)
virtio_disk_rw()
to write the buffer data to the
diskvoid brelse(struct buf* b)
b
A write operation of a process often requires several block writes
File creation requires:
Adding data to a file requires:
Deleting a file requires:
…
The system can crash anytime
\(\rightarrow\) Inconsistency if it stops in the middle of an operation
Operations must be propagated in the order in which they were performed
\(\rightarrow\) Inconsistency if propagation in random order
No cache when writing (directly propagate write operations)
Recovery in the case of a crash
Recovering a file system is slow
examples: FAT32 on Windows or ext2 on Linux
Recovering is not always possible
\(\rightarrow\) a crash makes the filesystem unusable!
A transaction is a set of writes that is
Principle of implementation
To ensure that the entries are propagated in order in which they were executed, the pending zone is structured like a log
Problems: Multiple processes may perform transactions in parallel
\(\rightarrow\) How do you know which ones are validated?
Classic solution
The system technically manages two logs
One in memory called memory log
One on disk called disk log
\(\rightarrow\) The system can therefore manage up to 3 copies of a block
n
n
to the list of modified blocks
in the memory logThree functions in the log management interface
(log.c
)
begin_op()
: start a transactionend_op()
: validate a transactionlog_write(struct buf* b)
: add b
to the
transactionTo perform a logged operation, instead of calling directly
bwrite ()
, we have to execute:
();
begin_op= bread(...);
b // Modify data of b
...
(b2);
log_write...
(); end_op
void begin_op()
: start a transaction
log.outstanding
)void end_op()
: complete a transaction
Decrement the number of operations in progress, and if equal to 0:
write_log()
)write_head()
)install_trans()
)write_head()
)void log_write(struct buf* b)
b
in the logb
to
prevent it from leaving the buffer cacheThe log controls block writes and releases through
log_write()
and end_op()
. System calls that
implement access to blocks never use bwrite()
and
brelse()
directly. Instead, the log keeps track of blocks
that must be written to disk: they are called dirty blocks,
because their content cached in the buffer cache is different from their
content in the filesystem on the disk.
log_write()
, the log keeps a reference on the
buffers of *dirty blocks to prevent their eviction until it calls
brelse()
in end_op()
end_op()
commits transactions by writing logged dirty
blocks to the disk log, and then to the filesystem, using
bwrite()
After a crash, call install_trans()
which propagates
the writes from disk log to file system
File system: defines the structure for storing files (often for a block type device)
File = consistent set of data that can be read or written
Filesystem = associate names and files
Example : /etc/passwd
→
root:*:0:0:System Administrator...
Usually a special symbol is used as a separator for directories
/
in UNIX systems, \(\backslash\) in Windows systemsA disk is often made up of several partitions
Typical structure of a disk
First block: partition table
Blocks 2 to x: kernel loader
Blocks x to y: partition 1
Blocks y to z: partition 2
etc…
A file itself can contain the data of a complete disc
xv6.img
is the disk image used with the
qemu emulator to start xv6
Five large contiguous zones (in fs.h
)
A file on disk consists of:
metadata called a dinode (fixed size, see
fs.h
)
data blocks
A dinode directly lists the numbers of the first 12 blocks
dinode.addrs [0]
block contains bytes 0 to 511 of
the filedinode.addrs [i]
block contains the bytes i * 512
to i * 512 + 511The indirection block contains the following block numbers
ind [0]
block contains bytes 12 * 512 to 12 * 512 +
511Note: since a block is 512 bytes and a block number is coded out of 4 characters, a file has a maximum size of 12 + 512/4 blocks.
To add a new block to a dinode dino
(function
bmap ()
in fs.h
)
balloc()
in
fs.h
)dino
A directory is a file of type
T_DIR
Contains an array associating names and numbers of dinodes
Inode 1 is necessarily a directory: it is the root directory of the filesystem
Note: dinode.nlink
gives the number of times a
dinode is referenced from a directory
\(\implies\) file deleted when
nlink
equals to 0.
/e0/../en
(see
namex()
in fs.c
)cur = 1
For i in [0 .. n]
Look for the association [inum, name] in the data blocks of
the cur dinode such that name is ei
cur = inum
To create the file f
in the
d
directory (function create()
in
sysfile.c
)
ialloc ()
in fs.h
)[inum, f]
to d
To delete the file f
from the
d
directory (sys_unlink()
function in
sysfile.c
)
f
in
d
nlink
from f
and if
nlink
equals 0f
f
(setting its type to 0)inode = memory cache of a dinode
Enter the cache at open()
Can be evicted from cache from close()
Contains the fields of the dinode
+ fields to know which dinode the inode corresponds to
+ fields required when the dinode is used
Inode table = table which contains the inodes
struct inode* iget(int dev, int inum)
Corresponds to open()
: returns an inode associated
with [dev, inum]
Increments the inode usage counter (non-evictable)
Do not lock the inode and do not read the inode from disk (optimization to avoid disc playback when creates a file)
inode.valid
indicates whether the inode has been read
from diskvoid ilock(struct inode* ip)
void iunlock(struct inode* ip)
void itrunc(struct inode* ip)
void iupdate(struct inode* ip)
void iput(struct inode* ip)
Corresponds to close ()
Decreases the inode usage counter
If cpt drops to 0, the inode can be evicted from the cache and
If nlink is 0 (the inode is no longer referenced by a directory)
Note: if you delete a file from a directory
(unlink()
) while the file is still in use (open) by a
process, the inode is not deleted: it will be when last
close()
when the reference counter drops to 0.
Multiple processes can open the same file
A file structure opened by open ()
contains:
Each process has an ofile
table of open files
d
is an index in this tableproc[i].ofile[d]
points to an open fileproc[i].ofile[d].ip
points to inodeGood to know
fork()
, the parent and the child share the
open filesproc[parent].ofile[d] == proc[child].ofile[d]
A device driver is just a function (virtio_disk_rw()
for example)
Reads and writes are logged
The kernel has an I/O cache
A file system separates
A file descriptor is an index in the ofile table
proc->ofile[i]
is an open file that references an
inode