Day 3, 22 Oct 2002
More than you ever wanted to know about filesystems
Theodore Ts'o  tytso@mit.edu
*number* is a slide number.

*1* What is a file system? File system types: disk-based, network-based. The future of file systems.

*2* A file system organizes streams of data (files, inodes). We generally assume capability for random access, but some file systems (such as those stored on tape) may lack this feature. The only thing you can do with tape is start at the beginning and read forward.

*3* Not all file systems are hierarchical. Anyone remember CP/M? It had 15 "user areas" ... IBM's 360 OS also lacked hierarchical structure.

*4* The goal of a file system is to minimize overhead. An ideal file system should have the speed of the raw device, yet provide additional services such as filenames and other metadata. It should also be optimized for arbitrary user/application access patterns. This is a relatively new requirement -- a good example is directories. Nowadays most directories use some kind of BTree structure, so the number of files in the directory is not important. Unix's original designers wanted to keep the file system code simple, so they used a simpler data structure, which led to performance degradation if a lot of files were kept in a single directory. They were phone guys, so one of the example uses they were thinking about was telephone directory lookups. They'd have 26 root directories, one for each letter of the alphabet. Under each of them would be 26 more, et cetera. To look up a phone number, you'd find the directory for the first letter of the name, then under that the directory for the second letter, following down the tree through the spelling of the name until you came to the file containing the phone number. So no directory would contain an excessive number of files, and a BTree architecture was not required. Today, instead of breaking files up into small directories, we have programs like news or qmail which put 100,000 queue files into a single directory -- and it's still expected to go fast. Daniel Phillips' HTree patches are designed to make exactly this case, huge directories, go fast.

*6* You can classify disk-based file systems into several major types:
  * FAT (File Allocation Table) based -- MSDOS, Windows
  * Inode based -- UNIX, OS/2, everyone else
  * Log structured -- still experimental

*7* FAT-based systems are very simple and also very awful. The fundamental metadata structure in a FAT-based file system is the File Allocation Table, which has one entry per cluster: free, reserved/bad, or in use (pointing to the next cluster of the file, or marking the last cluster). Random access is hard because each file is a singly linked list. Defragmentation is important because (1) you don't have to seek as much to read a file, and (2) it's easier for the OS to cache the cluster chains it has already read, which gets you something closer to true random access on a file. The original FAT16 used 2 bytes (16 bits) for each FAT entry, so the maximum number of clusters was 65,536. As disks got bigger, cluster size also got bigger: on a 4 GB disk, each cluster comes out to 64K. This means that a 1K file takes 64K of physical disk space, and a 65K file takes 128K of disk space. This is called internal fragmentation and is a Bad Thing -- an average file can waste half the disk space it's allegedly using.
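To make the "each file is a singly linked list" point concrete, here is a minimal sketch -- not taken from any real driver -- of how a FAT16 implementation has to chase the chain just to find the cluster holding a given file offset. The end-of-chain marker is standard FAT16; the function and its arguments are invented for the example.

    /* Minimal sketch: follow a FAT16 cluster chain to find the cluster
     * holding byte `offset` of a file.  `fat` is assumed to be the whole
     * File Allocation Table read into memory. */
    #include <stdint.h>

    #define FAT16_EOC 0xFFF8u   /* entries >= this mark a file's last cluster */

    uint16_t cluster_for_offset(const uint16_t *fat, uint16_t first_cluster,
                                uint32_t offset, uint32_t cluster_size)
    {
        uint16_t cluster = first_cluster;
        uint32_t hops = offset / cluster_size;

        while (hops-- > 0) {
            uint16_t next = fat[cluster];
            if (next >= FAT16_EOC)      /* chain ended before the offset */
                return 0;
            cluster = next;             /* one pointer-chase per cluster */
        }
        return cluster;
    }

Every seek into the middle of a big file repeats this walk from the start of the chain, and if the clusters are scattered across the disk each hop can mean a head seek -- which is why defragmentation and caching the chain matter so much.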
*9* The "V" in VFAT stands for virtual, as in the 32-bit virtual/protected-mode FAT driver. Remember that Windows 98 is actually a 32-bit OS grafted onto a 16-bit program loader (MSDOS).

When VFAT was implemented under Linux they made it into a distinct filesystem type, but arguably it should have been an argument flag to mount, as in "mount -t msdos -o long-filenames /dev/...". The long filename support in Windows is a terrifying kludge. Long filenames did not change the 8.3 filename convention of MSDOS's FAT; instead, a complicated algorithm collapses each long filename into an 8.3 filename for MSDOS. On a VFAT system, the long name is stored 13 characters at a time in extra FAT directory entries flagged with an attribute combination (they look like volume labels) that old MSDOS code skips over. Microsoft also has some magic way of trying to make sure that those extra entries do not get re-used if an older FAT implementation writes to a disk containing new-style long names.

*9* This is shocking: NTFS, an inode-based file system, wasn't mainstream on Windows until Windows 2000 and XP. Because of the ubiquity of Windows, everyone assumes that you must have a defragmenter, because the dumbness of the FAT file system requires one. So people assume that Unix is unsophisticated because it lacks such a tool -- when in fact it simply doesn't need one.

*10* Inode-based file systems

*11* An inode-based file system separates the filename from the rest of the file metadata: the directory maps the filename to an inode number, which is the key used to look the file up in the inode table. This allows hard links, a thing completely un-doable in FAT-based systems. In ext2, an inode entry takes 128 bytes, so conventional practice is to vastly over-allocate inodes and assume you'll never run out. Sometimes the inode table does run out, on smaller disks or if you have a huge number of 0-length files. Some modern file systems have extensible inode tables, but most do not. It turns out that an extensible inode table forces you to pay a price in reliability: once it's extensible, you can no longer be exactly sure of its location on the disk, and if you lose the inode table, your file system is essentially toast.

The Linux JFS is essentially the OS/2 JFS. The AIX JFS is not the same as the Linux JFS: the AIX JFS was contaminated with intellectual property from (some company), so IBM brought in a bunch of folks who'd worked on the OS/2 JFS to create the Linux one. It's actually a good thing, because OS/2 is still used in a lot of places, and this at least provides them with a migration path away from an operating system which is essentially no longer supported.

In an inode-based file system, you must divide the inode table (and the free space) into groups in order to spread out where blocks get allocated across the disk. Otherwise, since you always start searching for free blocks from the start of the disk, the file system is very prone to fragmentation. If you have 200 block groups in the file system, then you have 200 places from which you can search forward to find free blocks. There's a real danger of having file blocks interleaved on a FAT-based system, because there's no concept of groups. Because of groups, disk usage on ext2 file systems tends to come in stripes, rather than one solid block of used space with all the free space at the end.

The slow-growth problem, e.g. mail spool files: most files are created in one go -- think about untarring an archive, or compiling to a .o file; these files are generally created quickly, in sequential order. But mail files grow by appending, often at long intervals, so they can get very fragmented. You can of course defragment mail files just by copying them to a different file and then copying them back; you can speed mutt up noticeably by this process. There's an ioctl to tell you the physical location of each block of a file. Getting that onto the screen is 20 lines or so of C code which nobody has actually packaged up into a program; those of us who need it just whip it up ad hoc.
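Something like those 20 lines, as a sketch: this assumes Linux's FIBMAP and FIGETBSZ ioctls (FIBMAP traditionally requires root) and simply prints the physical block number behind each logical block of the file named on the command line.

    /* Sketch of a block-map dumper using the FIBMAP ioctl.  Error handling
     * is minimal; a physical block of 0 means a hole in a sparse file. */
    #include <stdio.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/ioctl.h>
    #include <sys/stat.h>
    #include <linux/fs.h>           /* FIBMAP, FIGETBSZ */

    int main(int argc, char **argv)
    {
        if (argc != 2) {
            fprintf(stderr, "usage: %s file\n", argv[0]);
            return 1;
        }
        int fd = open(argv[1], O_RDONLY);
        if (fd < 0) { perror(argv[1]); return 1; }

        int blocksize;
        struct stat st;
        if (ioctl(fd, FIGETBSZ, &blocksize) < 0 || fstat(fd, &st) < 0) {
            perror("FIGETBSZ/fstat");
            return 1;
        }

        long nblocks = (st.st_size + blocksize - 1) / blocksize;
        for (long i = 0; i < nblocks; i++) {
            int block = i;                        /* in: logical block   */
            if (ioctl(fd, FIBMAP, &block) < 0) {  /* out: physical block */
                perror("FIBMAP");
                return 1;
            }
            printf("logical %ld -> physical %d\n", i, block);
        }
        return 0;
    }

If consecutive logical blocks map to wildly non-consecutive physical blocks, the file is fragmented; a freshly re-copied mail spool should come out mostly contiguous.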
The fsck program also gives a hint about file fragmentation. ext2 has preallocation to help with this problem, but it was disabled in ext3.

*12* ext2 file system layout. BG = block group. The size of a block group is the number of bits in a single block, so you can use one block as the allocation bitmap for the entire group of blocks. So with 1K blocks, each ext2 block group covers 8 MB of disk space; with 4K blocks, one block covers 128 MB of disk space -- reasonable for 4 GB or smaller disks. The block group descriptors are the bottom row in this picture. This is a fairly simplified version of the ext2 structure. For one thing, if every block group contains a copy of the superblock and the group descriptors, the area taken up by those copies grows as order n^2 -- with a very small constant, which is why we did not see the problem for a long time. Modern ext2 filesystems use sparse superblocks, so that there are a sufficient number of spare copies of the superblock without burdening every block group with one.

*13* ext2 inode structure: the metadata. One optimization here is that there is space in the inode itself for 12 "direct block" pointers, so the block pointers for a file of 12 or fewer blocks are contained entirely in the inode. Beyond 12 blocks, the inode points to an indirect block, which in turn contains pointers to 256 more blocks (with 1K blocks). More file data is reached through a double indirect block, as in the bottom part of the diagram. Four MB or larger files use a triple indirect block, which takes the maximum file size up to 4 GB or so. Max file size for ext2? ext2 has a blocks field which is 32 bits, in units of 512-byte sectors, so in reality you will run into the maximum size of the file system first -- somewhere around 16 terabytes. Even on one-terabyte or petabyte disk arrays, it's not obvious that you should have one big file system for the whole device.

*14* ext2 directory layout. An ext2 directory is a linked list of names and inode numbers.
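Here is a sketch of what "a linked list of names and inode numbers" means in practice. The struct mirrors ext2's on-disk directory entry, but it and the lookup function are simplified for illustration, not the real kernel code.

    /* Each directory block is a chain of variable-length records: an inode
     * number, a record length that tells you where the next entry starts,
     * and the (not NUL-terminated) name. */
    #include <stdint.h>
    #include <string.h>

    struct ext2_dirent {            /* simplified on-disk layout */
        uint32_t inode;             /* inode number, 0 = unused entry */
        uint16_t rec_len;           /* bytes from here to the next entry */
        uint8_t  name_len;          /* length of the name */
        uint8_t  file_type;
        char     name[];
    };

    /* Return the inode number for `want` in one directory block, or 0. */
    uint32_t lookup_in_block(const char *block, size_t blocksize, const char *want)
    {
        size_t off = 0;
        while (off + 8 <= blocksize) {
            const struct ext2_dirent *de = (const void *)(block + off);
            if (de->rec_len < 8)                 /* corrupt entry, stop */
                break;
            if (de->inode != 0 &&
                de->name_len == strlen(want) &&
                memcmp(de->name, want, de->name_len) == 0)
                return de->inode;
            off += de->rec_len;
        }
        return 0;
    }

Every lookup is a linear scan of these records across the directory's blocks, which is exactly the behavior the HTree index mentioned earlier was written to avoid.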
*15* Other inode-based file systems. XFS has a BTree of free extents, so it can shoehorn a new file into the first contiguous place where it will fit. Most Unix file systems defer writing data for 30 seconds or so in order to coalesce the data into one big write; use fsync() to force a file's data out to stable storage immediately. User programs generally don't know whether a file is sparse or not. GNU cp(1) will check for blocks of all 0s and seek over them instead of writing, so it can reproduce sparse files; since there is no "sparse file" flag to consult, the new copy may end up taking less disk space than the old one. People thought fsck got faster after I added a spinner and progress bar to it. I got bug reports until I designed the progress bar to work with terminals down to 9600 baud.

*15* Reiserfs. Developed by Hans Reiser. I will try to be fair here, since Hans thinks that we are in some kind of competition; I took these slides from one of Hans's presentations. Hans believes that there should be no distinction between a file system and a database -- e.g. inetd.conf should not be a file full of strings but a directory containing a directory for each row, then a file for each individual field, so that each row directory of the "inetd.conf" directory tree would contain a file called "service_name", another called "sock_type", a third called "port", et cetera. In this way, you would not have to do any of that hinky string processing in order to get a piece of data out of the file system. One innovation of reiserfs is that small files are stored entirely inside the directory btree, so there are no data blocks for small files and no internal fragmentation problem. One problem with this approach is that system calls are not cheap: to read the inetd.conf data, you'd have to open, read, and close a large number of files, each requiring system calls. One way Hans has proposed to get around this is to have the kernel execute byte codes, so that you could make it do several system calls in a row without returning -- but then of course you're back to parsing the results. Another problem is that the file system consistency checker is a little weak, with corruption of files written several hours before an unclean shutdown a possibility. This is a Bad Thing because people expect that the files they were working on right when the machine failed may be corrupted, but are upset to find that files they hadn't touched for hours before the shutdown also have problems. Hans has said in some emails that if you really want stability in your file system, just use ext2. From the horse's mouth... A fast symlink has the symbolic link information stored in the inode itself, rather than in the contents of a file on disk. A typical Unix file system has so few files of less than 60 bytes that optimizing for this case is probably not worth it. In ext2, we worried more about large-file performance than small-file performance. Given the modern trend toward mp3s, multimedia, and other large-file uses, I think this was a sound decision.

*16* JFS -- "Not a bad file system". No compatibility with the AIX JFS (sad). The file system format of the AIX JFS is public now, but no clean-room open source implementation appears to have been attempted yet. The JFS tries to be good for everything.

*17* Journaling. The idea: you always want to make sure that the image on disk is either consistent or can be made consistent by replaying the journal. To create a file you must (1) modify a block in the inode table, (2) allocate the block, and (3) write the data. ext3 will prevent any of these changes from hitting the disk until a copy of each newly modified block has been written to the journal. Only after the journal is updated is the disk itself modified. The journal is a sequence of block numbers with the new block contents, plus control blocks. The journal is always written sequentially, and at the end you wrap around to the beginning -- you never seek inside it. At the wrap-around point, you make sure that the journal has been fully written out. After a crash, you just keep copying the journaled blocks to the right blocks on the disk until you run out of valid consistency check points. The consistency check blocks written to the journal are crafted to make it difficult for random mush to accidentally look like one, with checksums for each block which must check out. One implication of this scheme is that metadata gets written twice. Metadata is written (to the journal) first, so the on-disk image can always be made consistent; data cannot be written to disk until that metadata write finishes. You can use a different device to hold your journal for more speed, even a battery-backed RAM store if one is available. The journal is always written sequentially with no seeks, so writes to it are generally very fast.
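A toy illustration of the ordering rule just described, written with ordinary POSIX calls. The single-block "transaction", the file descriptors, and the function name are all made up for the example; the real ext3/jbd code batches many blocks per transaction and adds descriptor and commit records around them.

    /* Write-ahead rule: the new copy of a block must be safely in the
     * (sequential) journal before the block is overwritten in place. */
    #include <sys/types.h>
    #include <unistd.h>

    #define BLOCKSIZE 1024

    int journaled_write(int journal_fd, int disk_fd, off_t blocknr,
                        const char buf[BLOCKSIZE])
    {
        /* 1. Append the block number and the new block image to the journal. */
        if (write(journal_fd, &blocknr, sizeof(blocknr)) != (ssize_t)sizeof(blocknr))
            return -1;
        if (write(journal_fd, buf, BLOCKSIZE) != BLOCKSIZE)
            return -1;

        /* 2. Commit: force the journal to stable storage.  Until this
         *    returns, the in-place copy must not be touched. */
        if (fsync(journal_fd) < 0)
            return -1;

        /* 3. Now the in-place update is safe: if we crash from here on,
         *    recovery just replays the journal copy over this block. */
        if (pwrite(disk_fd, buf, BLOCKSIZE, blocknr * BLOCKSIZE) != BLOCKSIZE)
            return -1;
        return 0;
    }

Steps 1 and 2 are purely sequential, which is why the journal is fast and why putting it on a separate or battery-backed device helps; step 3 is where the "metadata gets written twice" cost shows up.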
You can also request that data be written to the journal -- this is handy for NFS, for example, which won't acknowledge a write until the data is on stable storage; a write to the journal counts for this. XFS and JFS do logical block journaling: instead of writing an entire block to the journal, they store opcodes in the journal, which is much more compact -- e.g. an opcode might say "set bit 422 in the allocation table to 1".
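To show the difference in what actually lands on disk, here is a sketch of the two kinds of journal record. Both structs are invented for illustration; they are not the real jbd or XFS/JFS log formats.

    /* Physical journaling (the ext3 approach): the record carries the
     * complete new image of each modified block. */
    struct physical_record {
        unsigned long blocknr;        /* where the block lives on disk */
        char          newimage[1024]; /* full copy of the modified block */
    };

    /* Logical journaling (the XFS/JFS approach): the record carries only a
     * description of the change, e.g. "set bit 422 in the allocation table
     * to 1".  Much more compact, but replaying it correctly depends on the
     * on-disk block still containing sane "before" contents. */
    struct logical_record {
        unsigned short opcode;        /* e.g. a hypothetical OP_SET_BITMAP_BIT */
        unsigned long  blocknr;       /* which allocation-table block */
        unsigned int   bitnr;         /* which bit to set */
    };

That dependence on the existing contents of the block is exactly what the next point about PC power failures is about.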
The ext3 approach may be better because PC hardware is generally crap. When the power to a PC fails, the big capacitors in the power supply start discharging, and the power falls off in a slope rather than off a cliff. Different components want different amounts of power. Memory is the most sensitive to power problems, so as the power fails it begins to feed the OS garbage. Disk drives still have those relatively big platters spinning, so they're not as sensitive to power problems; they fail much later in the death of the PC, as it were. So DMA may write garbage to the disk drives as the power on the PC is going away. ext3 does physical block journaling, so on replay it will simply overwrite the last few blocks written to disk. But a logical journaling system may not have the original data which went into those blocks, only the few changes which were supposed to be made to them most recently. You could argue that this is not really a file system problem but a hardware problem, and you'd be right -- but we basically have to live with it. Commercial-grade hardware has a power-fail interrupt built in to tell the OS that something bad is about to happen, often a matter of milliseconds in advance. XFS has code to scram all pending DMA operations to prevent this problem, but alas PCs lack the requisite hardware to take advantage of it.

On creating and writing a lot of blocks, ext3's journaling overhead is negligible. But netbench (which tries to simulate a Samba server) creates a whole bunch of files, then deletes them, over and over. A filesystem properly tuned for the netbench benchmark should never even light the drive light: it should just keep creating files, then observing that they're immediately deleted, never hitting the drive. ext3's netbench numbers are disastrous because the benchmark is so metadata-intensive.

The main reason for ext3 is to eliminate the fsck cost on boot after an improper shutdown. You cannot disable journaling on ext3 now, but if we add new features to ext3 we will include a nojournal option. Proposed extensions to ext3 were published in a paper at the USENIX conference in Monterey. One commonly requested feature is a dynamically extensible file system which old kernels can still read. Online resize will be appearing soon, possibly before the end of the year; the existence of logical volume managers makes this feature a must-have. XFS and JFS have this feature today. Neither one can _shrink_ a file system. The ext3 implementation will be able to shrink, but only off-line. The Veritas file system might be the only one which allows online shrinking. On a priority list of 0-10, online expansion is maybe an 8, but online shrinkage is about a 3 -- after all, disk space is cheap. Growing the inode table is the really hard part of doing online file system growth; without having to do that it's really easy, but kind of pointless.

*17* XFS -- not CXFS (which they are trying to sell). XFS and JFS work better on big SMP machines. Developers of ext3 include me, Stephen Tweedie, Andreas Gruenbacher, Daniel Phillips, and Andrew Morton.

*18* ext3 is currently in 2.4 and 2.6.

*19* Log structured file systems -- none are available for Linux right now. BSD has 'lfs'. All the file systems discussed so far are update-in-place file systems.

*20* The log cleaner. You get older snapshots of the file system for free in a log structured file system. One problem with log structured file systems is that writes are continuous but reads are scattered all over the disk; directories are also scattered all over the disk, so there's no locality whatsoever. The thinking was that with enough memory you could cache everything; this turned out not to be true in the early '80s. These days it might not be such a disaster, because memory is so much cheaper. And what happens when the circular log fills up? The log cleaner is basically a giant garbage collector.

*21* LFS performance issues. LFS read performance.

*23* Other non-update-in-place file systems. DRBD for replication of file systems across a network.

*24* Other disk-based file system issues.

*25* Compression. The big problem with compression is how to do random access. Think about gzip: it works on a stream, so if you need to modify something in the middle of a gzipped file, you cannot do it. The solution is to compress in blocks, uncompressing and recompressing each block in order to make random-access changes.
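A sketch of that compress-in-blocks idea, using zlib (link with -lz). The 4K "filesystem block", the buffer handling, and the function name are assumptions of the example, not any particular filesystem's on-disk format.

    /* To change bytes in the middle of a compressed block: decompress just
     * that block, patch it in memory, and recompress it. */
    #include <string.h>
    #include <zlib.h>

    #define FS_BLOCK 4096

    /* Patch `len` bytes at `offset` inside one compressed block held in
     * `cblock` (compressed length `clen`).  Returns the new compressed
     * length, or 0 on error.  The caller must have room in `cblock` for
     * up to compressBound(FS_BLOCK) bytes. */
    unsigned long patch_compressed_block(unsigned char *cblock, unsigned long clen,
                                         unsigned long offset,
                                         const unsigned char *patch, unsigned long len)
    {
        unsigned char plain[FS_BLOCK];
        uLongf plainlen = sizeof(plain);

        if (uncompress(plain, &plainlen, cblock, clen) != Z_OK)
            return 0;
        if (offset + len > plainlen)
            return 0;

        memcpy(plain + offset, patch, len);       /* the random-access write */

        unsigned char tmp[FS_BLOCK + 64];         /* >= compressBound(4096)  */
        uLongf newclen = sizeof(tmp);
        if (compress2(tmp, &newclen, plain, plainlen, Z_DEFAULT_COMPRESSION) != Z_OK)
            return 0;

        memcpy(cblock, tmp, newclen);
        return newclen;
    }

Note that the block can come out larger after the change, which is why block-compressing filesystems have to leave slack space or be prepared to relocate blocks.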
*26* Flash memory / portable Linux. JFFS does not use standard blocking techniques, because flash memory is expensive and typically has only a limited number of write cycles available before it wears out. This means that your smart cards have only a few hundred thousand photographs in them before they're no good any more. A 160-byte file in JFFS occupies exactly 160 bytes of data memory.

*27* Multiple stream support. People like this because if you have a GUI it's nice to move stuff around as a complete bundle. But an executable in Mac OS X is a directory, with different files inside it for the different components. In Mac OS 9 they learned the hard way that multiple streams were a Bad Idea, because when you tried to use ftp on a Mac OS executable, you typically got only part of it -- ftp assumes only one stream per file. This led to all kinds of hideous ftp kludges which essentially base64-encoded the file so you could get all of it, then base64-decoded it at the other end. I believe that multiple streams in a single file is a really bad idea. There are far too many things in the world which assume that a file is only one stream -- tar(1), ftp(1), et cetera. That said, the Gnome people are pushing hard for this, and we may eventually see support for it in ext3.

*28* Logical volume managers are not really a file system issue. IBM's EVMS allows read/write snapshots, so you can actually fork a file system. The md layer does RAID 1: if the system crashes, the md layer must re-mirror the entire disk, because on a crash it does not know which writes to disk 0 did not make it to disk 1. So for an hour or so after a system crash your system is a lot slower, because half the disk bandwidth is taken up with the copy to disk 1. A new version of EVMS will fix this with commits. lvm1 was removed from 2.5 because it was not stable; EVMS is still trying to get merged in.

*29* Network-based file systems

*30* NFS -- the lingua franca. First release was 1985-1986. Very feature-poor, but for a lot of purposes it is good enough. Caching is hard for network-based file systems, because if someone else modifies a file I've cached, I now have stale data. So you need some kind of callback feature, so that the server can notify everyone who holds the file that their copy is no longer good and they should re-fetch it if they need it. The NFS v4 spec is an RFC which tries to address some of these issues. So far the University of Michigan implementation is ahead of everyone else's, but nobody has an actual working implementation of the spec done yet.

*31* SMB/CIFS. The creator of Samba says that "CIFS" is three lies in one: Common -- it ain't common; Internet -- you can't use it over anything but a local area network; File System -- no, it's a really kludgy protocol.

*32* AFS and Coda. Digital/IBM, at MIT. Does all the complicated cache/callback calls you need. Replicates volumes over multiple file servers, and the client reads from the closest one; the replicas are read-only. You can actually move a file between servers, and the client will adapt transparently. (Aside: SMART -- the IDE disk health monitoring system.) Sysadmins who use AFS _never_ want to go back to NFS. Transarc developed AFS as a proprietary product, then took it open source; now there's OpenAFS, and the commercial version is being sunsetted. If you have 3-4 NFS file servers, look at AFS. Coda is a research file system, sort of AFS on steroids. It's currently moribund.

*33* GFS -- originally open source; the now closed-source software comes from Sistina. A shared-disk file system coordinated by locks. GFS limits the number of computers talking to one disk to 8-16 machines. Used in some clustering environments. Largely moribund.

*34* SFS. SFS is to NFS as ssh is to telnet. For 100 or so users, Kerberos is far too much overhead, so ssh is used instead -- it does not scale well, but it requires no central server or administrative overhead; updating passwords, though, is painful if you have a very large number of users. SFS tries to do for NFS what ssh did for telnet. There's no central administrator. It's all done in user space, by encapsulating the server's key into the filename, so there are no kernel patches: you mount NFS over a loopback server, which passes the traffic over a port, adds security, and sends it out over the network. A clever hack.

*35* File systems specialize for specific niches over time. For general-purpose file systems, competition is really useful because it keeps us honest -- many features come out quickly across a lot of general-purpose file systems.