Day 3, 22 Oct 2002
More than you ever wanted to know about filesystems
Theodore Ts'o  tytso@mit.edu
*number* is a slide number.

*1* What is a file system? File system types: disk-based, network-based. The future of file systems.

*2* A file system organizes streams of data (files, inodes). We generally assume capability for random access, but some file systems (such as those stored on tape) may lack this feature. The only thing you can do with tape is start at the beginning and read forward.

*3* Not all file systems are hierarchical. Anyone remember CP/M? It had 15 "user areas" ... IBM's 360 OS also lacked hierarchical structure.

*4* The goal of a file system is to minimize overhead. An ideal file system should have the speed of the raw device, yet provide additional services such as filenames and other metadata. It should also be optimized for arbitrary user/application access patterns. This is a relatively new requirement -- a good example is directories. Nowadays most directories use some kind of BTree structure, so the number of files in the directory is not important. Unix's original designers wanted to keep the file system code simple, so they used a simpler data structure, which led to performance degradation if a lot of files were kept in a single directory. They were phone guys, so one of the example uses they were thinking about was telephone directory lookups. They'd have 26 root directories, one for each letter of the alphabet. Under each of them would be 26 more, et cetera. To look up a phone number, you'd find the directory for the first letter of the name, then under that the directory for the second letter, following down the tree through the spelling of the name until you came to the file containing the phone number. So no directory would contain an excessive number of files, and a BTree architecture was not required. Today, instead of breaking files up into small directories, we have programs like news or qmail which put 100,000 queue files into a single directory -- and it's still expected to go fast. Daniel Phillips' HTree patches are designed to make exactly this case, huge directories, go fast.

*6* You can classify disk-based file systems into several major types:
  * FAT (File Allocation Table) based -- MSDOS, Windows
  * Inode based -- UNIX, OS/2, everyone else
  * Log structured -- still experimental

*7* FAT-based systems are very simple and also very awful. The fundamental metadata structure in a FAT-based file system is the File Allocation Table, which has one entry per cluster: free, reserved/bad, or in use (pointing to the next cluster of the file, or marking the last cluster). Random access is hard because each file is a singly linked list. Defragmentation is important because (1) you don't have to seek as much to read a file, and (2) it's easier for the OS to cache the cluster chains it has already read, which gets you something closer to true random access on a file. The original FAT16 used 2 bytes (16 bits) for each FAT entry, so the maximum number of clusters was 65,536. As disks got bigger, cluster size also got bigger: on a 4 GB disk, each cluster comes out to 64K. This means that a 1K file takes 64K of physical disk space, and a 65K file takes 128K of disk space. This is called internal fragmentation and is a Bad Thing -- an average file can waste half the disk space it's allegedly using.
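To make the "each file is a singly linked list" point concrete, here is a minimal sketch -- not taken from any real driver -- of how a FAT16 implementation has to chase the chain just to find the cluster holding a given file offset. The end-of-chain marker is standard FAT16; the function and its arguments are invented for the example.

    /* Minimal sketch: follow a FAT16 cluster chain to find the cluster
     * holding byte `offset` of a file.  `fat` is assumed to be the whole
     * File Allocation Table read into memory. */
    #include <stdint.h>

    #define FAT16_EOC 0xFFF8u   /* entries >= this mark a file's last cluster */

    uint16_t cluster_for_offset(const uint16_t *fat, uint16_t first_cluster,
                                uint32_t offset, uint32_t cluster_size)
    {
        uint16_t cluster = first_cluster;
        uint32_t hops = offset / cluster_size;

        while (hops-- > 0) {
            uint16_t next = fat[cluster];
            if (next >= FAT16_EOC)      /* chain ended before the offset */
                return 0;
            cluster = next;             /* one pointer-chase per cluster */
        }
        return cluster;
    }

Every seek into the middle of a big file repeats this walk from the start of the chain, and if the clusters are scattered across the disk each hop can mean a head seek -- which is why defragmentation and caching the chain matter so much.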
*9* The "V" in VFAT stands for virtual, as in the 32-bit virtual/protected-mode FAT driver. Remember that Windows 98 is actually a 32-bit OS grafted onto a 16-bit program loader (MSDOS).

When VFAT was implemented under Linux they made it into a distinct filesystem type, but arguably it should have been an argument flag to mount, as in "mount -t msdos -o long-filenames /dev/...". The long filename support in Windows is a terrifying kludge. Long filenames did not change the 8.3 filename convention of MSDOS's FAT; instead, a complicated algorithm collapses each long filename into an 8.3 filename for MSDOS. On a VFAT system, the long name is stored 13 characters at a time in extra FAT directory entries flagged with an attribute combination (they look like volume labels) that old MSDOS code skips over. Microsoft also has some magic way of trying to make sure that those extra entries do not get re-used if an older FAT implementation writes to a disk containing new-style long names.

*9* This is shocking: NTFS, an inode-based file system, wasn't mainstream on Windows until Windows 2000 and XP. Because of the ubiquity of Windows, everyone assumes that you must have a defragmenter, because the dumbness of the FAT file system requires one. So people assume that Unix is unsophisticated because it lacks such a tool -- when in fact it simply doesn't need one.

*10* Inode-based file systems

*11* An inode-based file system separates the filename from the rest of the file metadata: the directory maps the filename to an inode number, which is the key used to look the file up in the inode table. This allows hard links, a thing completely un-doable in FAT-based systems. In ext2, an inode entry takes 128 bytes, so conventional practice is to vastly over-allocate inodes and assume you'll never run out. Sometimes the inode table does run out, on smaller disks or if you have a huge number of 0-length files. Some modern file systems have extensible inode tables, but most do not. It turns out that an extensible inode table forces you to pay a price in reliability: once it's extensible, you can no longer be exactly sure of its location on the disk, and if you lose the inode table, your file system is essentially toast.

The Linux JFS is essentially the OS/2 JFS. The AIX JFS is not the same as the Linux JFS: the AIX JFS was contaminated with intellectual property from (some company), so IBM brought in a bunch of folks who'd worked on the OS/2 JFS to create the Linux one. It's actually a good thing, because OS/2 is still used in a lot of places, and this at least provides them with a migration path away from an operating system which is essentially no longer supported.

In an inode-based file system, you must divide the inode table (and the free space) into groups in order to spread out where blocks get allocated across the disk. Otherwise, since you always start searching for free blocks from the start of the disk, the file system is very prone to fragmentation. If you have 200 block groups in the file system, then you have 200 places from which you can search forward to find free blocks. There's a real danger of having file blocks interleaved on a FAT-based system, because there's no concept of groups. Because of groups, disk usage on ext2 file systems tends to come in stripes, rather than one solid block of used space with all the free space at the end.

The slow-growth problem, e.g. mail spool files: most files are created in one go -- think about untarring an archive, or compiling to a .o file; these files are generally created quickly, in sequential order. But mail files grow by appending, often at long intervals, so they can get very fragmented. You can of course defragment mail files just by copying them to a different file and then copying them back; you can speed mutt up noticeably by this process. There's an ioctl to tell you the physical location of each block of a file. Getting that onto the screen is 20 lines or so of C code which nobody has actually packaged up into a program; those of us who need it just whip it up ad hoc.
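Something like those 20 lines, as a sketch: this assumes Linux's FIBMAP and FIGETBSZ ioctls (FIBMAP traditionally requires root) and simply prints the physical block number behind each logical block of the file named on the command line.

    /* Sketch of a block-map dumper using the FIBMAP ioctl.  Error handling
     * is minimal; a physical block of 0 means a hole in a sparse file. */
    #include <stdio.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/ioctl.h>
    #include <sys/stat.h>
    #include <linux/fs.h>           /* FIBMAP, FIGETBSZ */

    int main(int argc, char **argv)
    {
        if (argc != 2) {
            fprintf(stderr, "usage: %s file\n", argv[0]);
            return 1;
        }
        int fd = open(argv[1], O_RDONLY);
        if (fd < 0) { perror(argv[1]); return 1; }

        int blocksize;
        struct stat st;
        if (ioctl(fd, FIGETBSZ, &blocksize) < 0 || fstat(fd, &st) < 0) {
            perror("FIGETBSZ/fstat");
            return 1;
        }

        long nblocks = (st.st_size + blocksize - 1) / blocksize;
        for (long i = 0; i < nblocks; i++) {
            int block = i;                        /* in: logical block   */
            if (ioctl(fd, FIBMAP, &block) < 0) {  /* out: physical block */
                perror("FIBMAP");
                return 1;
            }
            printf("logical %ld -> physical %d\n", i, block);
        }
        return 0;
    }

If consecutive logical blocks map to wildly non-consecutive physical blocks, the file is fragmented; a freshly re-copied mail spool should come out mostly contiguous.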
The fsck program also gives a hint about file fragmentation. ext2 has preallocation to help with this problem, but it was disabled in ext3.

*12* ext2 file system layout. BG = block group. The size of a block group is the number of bits in a single block, so you can use one block as the allocation bitmap for the entire group of blocks. So with 1K blocks, each ext2 block group covers 8 MB of disk space; with 4K blocks, one block covers 128 MB of disk space -- reasonable for 4 GB or smaller disks. The block group descriptors are the bottom row in this picture. This is a fairly simplified version of the ext2 structure. For one thing, if every block group contains a copy of the superblock and the group descriptors, the area taken up by those copies grows as order n^2 -- with a very small constant, which is why we did not see the problem for a long time. Modern ext2 filesystems use sparse superblocks, so that there are a sufficient number of spare copies of the superblock without burdening every block group with one.

*13* ext2 inode structure: the metadata. One optimization here is that there is space in the inode itself for 12 "direct block" pointers, so the block pointers for a file of 12 or fewer blocks are contained entirely in the inode. Beyond 12 blocks, the inode points to an indirect block, which in turn contains pointers to 256 more blocks (with 1K blocks). More file data is reached through a double indirect block, as in the bottom part of the diagram. Four MB or larger files use a triple indirect block, which takes the maximum file size up to 4 GB or so. Max file size for ext2? ext2 has a blocks field which is 32 bits, in units of 512-byte sectors, so in reality you will run into the maximum size of the file system first -- somewhere around 16 terabytes. Even on one-terabyte or petabyte disk arrays, it's not obvious that you should have one big file system for the whole device.

*14* ext2 directory layout. An ext2 directory is a linked list of names and inode numbers.
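Here is a sketch of what "a linked list of names and inode numbers" means in practice. The struct mirrors ext2's on-disk directory entry, but it and the lookup function are simplified for illustration, not the real kernel code.

    /* Each directory block is a chain of variable-length records: an inode
     * number, a record length that tells you where the next entry starts,
     * and the (not NUL-terminated) name. */
    #include <stdint.h>
    #include <string.h>

    struct ext2_dirent {            /* simplified on-disk layout */
        uint32_t inode;             /* inode number, 0 = unused entry */
        uint16_t rec_len;           /* bytes from here to the next entry */
        uint8_t  name_len;          /* length of the name */
        uint8_t  file_type;
        char     name[];
    };

    /* Return the inode number for `want` in one directory block, or 0. */
    uint32_t lookup_in_block(const char *block, size_t blocksize, const char *want)
    {
        size_t off = 0;
        while (off + 8 <= blocksize) {
            const struct ext2_dirent *de = (const void *)(block + off);
            if (de->rec_len < 8)                 /* corrupt entry, stop */
                break;
            if (de->inode != 0 &&
                de->name_len == strlen(want) &&
                memcmp(de->name, want, de->name_len) == 0)
                return de->inode;
            off += de->rec_len;
        }
        return 0;
    }

Every lookup is a linear scan of these records across the directory's blocks, which is exactly the behavior the HTree index mentioned earlier was written to avoid.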
*15* Other inode-based file systems. XFS has a BTree of free extents, so it can shoehorn a new file into the first contiguous place where it will fit. Most Unix file systems defer writing data for 30 seconds or so in order to coalesce the data into one big write; use fsync() to force a file's data out to stable storage immediately. User programs generally don't know whether a file is sparse or not. GNU cp(1) will check for blocks of all 0s and seek over them instead of writing, so it can reproduce sparse files; since there is no "sparse file" flag to consult, the new copy may end up taking less disk space than the old one. People thought fsck got faster after I added a spinner and progress bar to it. I got bug reports until I designed the progress bar to work with terminals down to 9600 baud.

*15* Reiserfs. Developed by Hans Reiser. I will try to be fair here, since Hans thinks that we are in some kind of competition; I took these slides from one of Hans's presentations. Hans believes that there should be no distinction between a file system and a database -- e.g. inetd.conf should not be a file full of strings but a directory containing a directory for each row, then a file for each individual field, so that each row directory of the "inetd.conf" directory tree would contain a file called "service_name", another called "sock_type", a third called "port", et cetera. In this way, you would not have to do any of that hinky string processing in order to get a piece of data out of the file system. One innovation of reiserfs is that small files are stored entirely inside the directory btree, so there are no data blocks for small files and no internal fragmentation problem. One problem with this approach is that system calls are not cheap: to read the inetd.conf data, you'd have to open, read, and close a large number of files, each requiring system calls. One way Hans has proposed to get around this is to have the kernel execute byte codes, so that you could make it do several system calls in a row without returning -- but then of course you're back to parsing the results. Another problem is that the file system consistency checker is a little weak, with corruption of files written several hours before an unclean shutdown a possibility. This is a Bad Thing because people expect that the files they were working on right when the machine failed may be corrupted, but are upset to find that files they hadn't touched for hours before the shutdown also have problems. Hans has said in some emails that if you really want stability in your file system, just use ext2. From the horse's mouth... A fast symlink has the symbolic link information stored in the inode itself, rather than in the contents of a file on disk. A typical Unix file system has so few files of less than 60 bytes that optimizing for this case is probably not worth it. In ext2, we worried more about large-file performance than small-file performance. Given the modern trend toward mp3s, multimedia, and other large-file uses, I think this was a sound decision.

*16* JFS -- "Not a bad file system". No compatibility with the AIX JFS (sad). The file system format of the AIX JFS is public now, but no clean-room open source implementation appears to have been attempted yet. The JFS tries to be good for everything.

*17* Journaling. The idea: you always want to make sure that the image on disk is either consistent or can be made consistent by replaying the journal. To create a file you must (1) modify a block in the inode table, (2) allocate the block, and (3) write the data. ext3 will prevent any of these changes from hitting the disk until a copy of each newly modified block has been written to the journal. Only after the journal is updated is the disk itself modified. The journal is a sequence of block numbers with the new block contents, plus control blocks. The journal is always written sequentially, and at the end you wrap around to the beginning -- you never seek inside it. At the wrap-around point, you make sure that the journal has been fully written out. After a crash, you just keep copying the journaled blocks to the right blocks on the disk until you run out of valid consistency check points. The consistency check blocks written to the journal are crafted to make it difficult for random mush to accidentally look like one, with checksums for each block which must check out. One implication of this scheme is that metadata gets written twice. Metadata is written (to the journal) first, so the on-disk image can always be made consistent; data cannot be written to disk until that metadata write finishes. You can use a different device to hold your journal for more speed, even a battery-backed RAM store if one is available. The journal is always written sequentially with no seeks, so writes to it are generally very fast.
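A toy illustration of the ordering rule just described, written with ordinary POSIX calls. The single-block "transaction", the file descriptors, and the function name are all made up for the example; the real ext3/jbd code batches many blocks per transaction and adds descriptor and commit records around them.

    /* Write-ahead rule: the new copy of a block must be safely in the
     * (sequential) journal before the block is overwritten in place. */
    #include <sys/types.h>
    #include <unistd.h>

    #define BLOCKSIZE 1024

    int journaled_write(int journal_fd, int disk_fd, off_t blocknr,
                        const char buf[BLOCKSIZE])
    {
        /* 1. Append the block number and the new block image to the journal. */
        if (write(journal_fd, &blocknr, sizeof(blocknr)) != (ssize_t)sizeof(blocknr))
            return -1;
        if (write(journal_fd, buf, BLOCKSIZE) != BLOCKSIZE)
            return -1;

        /* 2. Commit: force the journal to stable storage.  Until this
         *    returns, the in-place copy must not be touched. */
        if (fsync(journal_fd) < 0)
            return -1;

        /* 3. Now the in-place update is safe: if we crash from here on,
         *    recovery just replays the journal copy over this block. */
        if (pwrite(disk_fd, buf, BLOCKSIZE, blocknr * BLOCKSIZE) != BLOCKSIZE)
            return -1;
        return 0;
    }

Steps 1 and 2 are purely sequential, which is why the journal is fast and why putting it on a separate or battery-backed device helps; step 3 is where the "metadata gets written twice" cost shows up.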
You can also request that data be written to the journal -- this is handy for NFS, for example, which won't acknowledge a write until the data is on stable storage; a write to the journal counts for this. XFS and JFS do logical block journaling: instead of writing an entire block to the journal, they store opcodes in the journal, which is much more compact -- e.g. an opcode might say "set bit 422 in the allocation table to 1".
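To show the difference in what actually lands on disk, here is a sketch of the two kinds of journal record. Both structs are invented for illustration; they are not the real jbd or XFS/JFS log formats.

    /* Physical journaling (the ext3 approach): the record carries the
     * complete new image of each modified block. */
    struct physical_record {
        unsigned long blocknr;        /* where the block lives on disk */
        char          newimage[1024]; /* full copy of the modified block */
    };

    /* Logical journaling (the XFS/JFS approach): the record carries only a
     * description of the change, e.g. "set bit 422 in the allocation table
     * to 1".  Much more compact, but replaying it correctly depends on the
     * on-disk block still containing sane "before" contents. */
    struct logical_record {
        unsigned short opcode;        /* e.g. a hypothetical OP_SET_BITMAP_BIT */
        unsigned long  blocknr;       /* which allocation-table block */
        unsigned int   bitnr;         /* which bit to set */
    };

That dependence on the existing contents of the block is exactly what the next point about PC power failures is about.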
The ext3 approach may be better because PC hardware is generally crap. When the power to a PC fails, the big capacitors in the power supply start discharging, and the power falls off in a slope rather than off a cliff. Different components want different amounts of power. Memory is the most sensitive to power problems, so as the power fails it begins to feed the OS garbage. Disk drives still have those relatively big platters spinning, so they're not as sensitive to power problems; they fail much later in the death of the PC, as it were. So DMA may write garbage to the disk drives as the power on the PC is going away. ext3 does physical block journaling, so on replay it will simply overwrite the last few blocks written to disk. But a logical journaling system may not have the original data which went into those blocks, only the few changes which were supposed to be made to them most recently. You could argue that this is not really a file system problem but a hardware problem, and you'd be right -- but we basically have to live with it. Commercial-grade hardware has a power-fail interrupt built in to tell the OS that something bad is about to happen, often a matter of milliseconds in advance. XFS has code to scram all pending DMA operations to prevent this problem, but alas PCs lack the requisite hardware to take advantage of it.

On creating and writing a lot of blocks, ext3's journaling overhead is negligible. But netbench (which tries to simulate a Samba server) creates a whole bunch of files, then deletes them, over and over. A filesystem properly tuned for the netbench benchmark should never even light the drive light: it should just keep creating files, then observing that they're immediately deleted, never hitting the drive. ext3's netbench numbers are disastrous because the benchmark is so metadata-intensive.

The main reason for ext3 is to eliminate the fsck cost on boot after an improper shutdown. You cannot disable journaling on ext3 now, but if we add new features to ext3 we will include a nojournal option. Proposed extensions to ext3 were published in a paper at the USENIX conference in Monterey. One commonly requested feature is a dynamically extensible file system which old kernels can still read. Online resize will be appearing soon, possibly before the end of the year; the existence of logical volume managers makes this feature a must-have. XFS and JFS have this feature today. Neither one can _shrink_ a file system. The ext3 implementation will be able to shrink, but only off-line. The Veritas file system might be the only one which allows online shrinking. On a priority list of 0-10, online expansion is maybe an 8, but online shrinkage is about a 3 -- after all, disk space is cheap. Growing the inode table is the really hard part of doing online file system growth; without having to do that it's really easy, but kind of pointless.

*17* XFS -- not CXFS (which they are trying to sell). XFS and JFS work better on big SMP machines. Developers of ext3 include me, Stephen Tweedie, Andreas Gruenbacher, Daniel Phillips, and Andrew Morton.

*18* ext3 is currently in 2.4 and 2.6.

*19* Log structured file systems -- none are available for Linux right now. BSD has 'lfs'. All the file systems discussed so far are update-in-place file systems.

*20* The log cleaner. You get older snapshots of the file system for free in a log structured file system. One problem with log structured file systems is that writes are continuous but reads are scattered all over the disk; directories are also scattered all over the disk, so there's no locality whatsoever. The thinking was that with enough memory you could cache everything; this turned out not to be true in the early '80s. These days it might not be such a disaster, because memory is so much cheaper. And what happens when the circular log fills up? The log cleaner is basically a giant garbage collector.

*21* LFS performance issues. LFS read performance.

*23* Other non-update-in-place file systems. DRBD for replication of file systems across a network.

*24* Other disk-based file system issues.

*25* Compression. The big problem with compression is how to do random access. Think about gzip: it works on a stream, so if you need to modify something in the middle of a gzipped file, you cannot do it. The solution is to compress in blocks, uncompressing and recompressing each block in order to make random-access changes.
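A sketch of that compress-in-blocks idea, using zlib (link with -lz). The 4K "filesystem block", the buffer handling, and the function name are assumptions of the example, not any particular filesystem's on-disk format.

    /* To change bytes in the middle of a compressed block: decompress just
     * that block, patch it in memory, and recompress it. */
    #include <string.h>
    #include <zlib.h>

    #define FS_BLOCK 4096

    /* Patch `len` bytes at `offset` inside one compressed block held in
     * `cblock` (compressed length `clen`).  Returns the new compressed
     * length, or 0 on error.  The caller must have room in `cblock` for
     * up to compressBound(FS_BLOCK) bytes. */
    unsigned long patch_compressed_block(unsigned char *cblock, unsigned long clen,
                                         unsigned long offset,
                                         const unsigned char *patch, unsigned long len)
    {
        unsigned char plain[FS_BLOCK];
        uLongf plainlen = sizeof(plain);

        if (uncompress(plain, &plainlen, cblock, clen) != Z_OK)
            return 0;
        if (offset + len > plainlen)
            return 0;

        memcpy(plain + offset, patch, len);       /* the random-access write */

        unsigned char tmp[FS_BLOCK + 64];         /* >= compressBound(4096)  */
        uLongf newclen = sizeof(tmp);
        if (compress2(tmp, &newclen, plain, plainlen, Z_DEFAULT_COMPRESSION) != Z_OK)
            return 0;

        memcpy(cblock, tmp, newclen);
        return newclen;
    }

Note that the block can come out larger after the change, which is why block-compressing filesystems have to leave slack space or be prepared to relocate blocks.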
*26* Flash memory / portable Linux. JFFS does not use standard blocking techniques, because flash memory is expensive and typically has only a limited number of write cycles available before it wears out. This means that your smart cards have only a few hundred thousand photographs in them before they're no good any more. A 160-byte file in JFFS occupies exactly 160 bytes of data memory.

*27* Multiple stream support. People like this because if you have a GUI it's nice to move stuff around as a complete bundle. But an executable in Mac OS X is a directory, with different files inside it for the different components. In Mac OS 9 they learned the hard way that multiple streams were a Bad Idea, because when you tried to use ftp on a Mac OS executable, you typically got only part of it -- ftp assumes only one stream per file. This led to all kinds of hideous ftp kludges which essentially base64-encoded the file so you could get all of it, then base64-decoded it at the other end. I believe that multiple streams in a single file is a really bad idea. There are far too many things in the world which assume that a file is only one stream -- tar(1), ftp(1), et cetera. That said, the Gnome people are pushing hard for this, and we may eventually see support for it in ext3.

*28* Logical volume managers are not really a file system issue. IBM's EVMS allows read/write snapshots, so you can actually fork a file system. The md layer does RAID 1: if the system crashes, the md layer must re-mirror the entire disk, because on a crash it does not know which writes to disk 0 did not make it to disk 1. So for an hour or so after a system crash your system is a lot slower, because half the disk bandwidth is taken up with the copy to disk 1. A new version of EVMS will fix this with commits. lvm1 was removed from 2.5 because it was not stable; EVMS is still trying to get merged in.

*29* Network-based file systems

*30* NFS -- the lingua franca. First release was 1985-1986. Very feature-poor, but for a lot of purposes it is good enough. Caching is hard for network-based file systems, because if someone else modifies a file I've cached, I now have stale data. So you need some kind of callback feature, so that the server can notify everyone who holds the file that their copy is no longer good and they should re-fetch it if they need it. The NFS v4 spec is an RFC which tries to address some of these issues. So far the University of Michigan implementation is ahead of everyone else's, but nobody has an actual working implementation of the spec done yet.

*31* SMB/CIFS. The creator of Samba says that "CIFS" is three lies in one: Common -- it ain't common; Internet -- you can't use it over anything but a local area network; File System -- no, it's a really kludgy protocol.

*32* AFS and Coda. Digital/IBM, at MIT. Does all the complicated cache/callback calls you need. Replicates volumes over multiple file servers, and the client reads from the closest one; the replicas are read-only. You can actually move a file between servers, and the client will adapt transparently. (Aside: SMART -- the IDE disk health monitoring system.) Sysadmins who use AFS _never_ want to go back to NFS. Transarc developed AFS as a proprietary product, then took it open source; now there's OpenAFS, and the commercial version is being sunsetted. If you have 3-4 NFS file servers, look at AFS. Coda is a research file system, sort of AFS on steroids. It's currently moribund.

*33* GFS -- originally open source; the now closed-source software comes from Sistina. A shared-disk file system coordinated by locks. GFS limits the number of computers talking to one disk to 8-16 machines. Used in some clustering environments. Largely moribund.

*34* SFS. SFS is to NFS as ssh is to telnet. For 100 or so users, Kerberos is far too much overhead, so ssh is used instead -- it does not scale well, but it requires no central server or administrative overhead; updating passwords, though, is painful if you have a very large number of users. SFS tries to do for NFS what ssh did for telnet. There's no central administrator. It's all done in user space, by encapsulating the server's key into the filename, so there are no kernel patches: you mount NFS over a loopback server, which passes the traffic over a port, adds security, and sends it out over the network. A clever hack.

*35* File systems specialize for specific niches over time. For general-purpose file systems, competition is really useful because it keeps us honest -- many features come out quickly across a lot of general-purpose file systems.