April 20, 2014

An In-Depth Look at Reiserfs - page 4

Included in the Linux kernel

  • January 22, 2001
  • By Scott Courtney
Now that you understand the need for journaled filesystems, we can take a look at one particular implementation, called Reiserfs. Originally designed by Hans Reiser, Reiserfs carries the analogy between databases and filesystems to its logical conclusion. In essence, Reiserfs treats the entire disk partition as if it were a single database table. Directories, files, and file metadata are organized in an efficient data structure called a "balanced tree" (specifically, a variant of the B+ tree). This differs somewhat from the way traditional filesystems operate, but it offers large speed improvements for many applications, especially those that use lots of small files.
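Why is a balanced tree efficient? Every path from the root to a leaf has the same length, and each node points to many children, so the number of nodes a lookup must read grows only logarithmically with the number of items. The short Python sketch below estimates that depth. The function name is ours, and the fanout of 100 children per node is a hypothetical round number chosen for illustration; the real fanout in Reiserfs depends on its node and key sizes.

```python
def tree_depth(n_items: int, fanout: int) -> int:
    """Return the number of levels a balanced tree needs so that
    fanout ** depth >= n_items, i.e. roughly how many nodes one lookup reads."""
    if n_items < 1 or fanout < 2:
        raise ValueError("need n_items >= 1 and fanout >= 2")
    depth, capacity = 1, fanout
    # Each extra level multiplies the number of reachable items by fanout.
    while capacity < n_items:
        capacity *= fanout
        depth += 1
    return depth


if __name__ == "__main__":
    # Hypothetical fanout of 100 children per node.
    for n in (1_000, 1_000_000, 1_000_000_000):
        print(f"{n:>13,} items: {tree_depth(n, fanout=100)} node reads")
```

Under these assumptions, a million items are reachable in three node reads, whereas a linear scan of an unsorted directory could have to examine up to a million entries.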

Reading and writing large files, such as CD-ROM images, is often limited by the speed of the disk hardware or the I/O channel, but access to small files such as shell scripts is often limited by the efficiency of the filesystem design. The reason is that opening a file requires the system first to locate it, which means reading directories from the disk. The system must also examine the file's security metadata to see whether the user has permission to access it, which means additional disk reads. For a small file, the system can literally spend more time deciding whether to allow the access and locating the data on the drive than it spends actually reading the data from the file itself.
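A back-of-the-envelope model makes the point concrete. The Python sketch below charges each random disk access a fixed seek cost and adds the time needed to transfer the file's bytes. The drive figures (9 ms per seek, 20 MB/s transfer) and the assumption of three seeks per open (directory, metadata, data) are hypothetical values for illustration, not measurements, and the function names are ours.

```python
# Hypothetical drive parameters, chosen only for illustration.
SEEK_MS = 9.0                    # assumed average seek plus rotational delay per access
TRANSFER_BYTES_PER_MS = 20_000   # assumed 20 MB/s sustained transfer rate
SEEKS_PER_OPEN = 3               # assumed: one each for directory, metadata, and data


def open_and_read_ms(nbytes: int, seeks: int = SEEKS_PER_OPEN) -> float:
    """Estimated milliseconds to locate a file and read nbytes from it."""
    return seeks * SEEK_MS + nbytes / TRANSFER_BYTES_PER_MS


def overhead_share(nbytes: int, seeks: int = SEEKS_PER_OPEN) -> float:
    """Fraction of the total time spent positioning the heads rather than reading data."""
    return seeks * SEEK_MS / open_and_read_ms(nbytes, seeks)


if __name__ == "__main__":
    for label, size in (("50-byte script", 50), ("650 MB CD-ROM image", 650_000_000)):
        print(f"{label}: {open_and_read_ms(size):.1f} ms, "
              f"{overhead_share(size):.2%} spent locating the data")
```

With these assumed numbers, the tiny script spends well over 99 percent of its time on head positioning, while the CD-ROM image spends under 1 percent.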

Reiserfs uses its balanced tree to streamline finding files and retrieving their security (and other) metadata. For very small files, the entire file's data can be stored physically next to the file's metadata, so that both can be retrieved together with little or no movement of the disk heads. If an application needs to open many small files rapidly, this approach significantly improves performance.
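Metadata and data end up together because of the way tree items are keyed and ordered. The Python sketch below uses a simplified key of (directory id, object id, offset), loosely modeled on the idea behind Reiserfs keys but not its actual on-disk format: offset 0 holds the file's metadata and higher offsets hold its bytes. A sorted list searched with the standard `bisect` module stands in for the tree, so it shows only the ordering, not the balancing. The class and method names are hypothetical. Because tuples sort field by field, every item belonging to one file is contiguous, and a single range scan returns the metadata and the body together.

```python
import bisect
from typing import Any

# Hypothetical simplified key: (directory_id, object_id, offset).
# Offset 0 holds the file's metadata; offsets >= 1 hold file data.
Key = tuple[int, int, int]


class SortedItemStore:
    """Items kept in key order; a sorted list stands in for the balanced tree."""

    def __init__(self) -> None:
        self._keys: list[Key] = []
        self._items: dict[Key, Any] = {}

    def insert(self, key: Key, item: Any) -> None:
        if key not in self._items:
            bisect.insort(self._keys, key)  # keep keys sorted
        self._items[key] = item

    def items_for_object(self, dir_id: int, obj_id: int) -> list[tuple[Key, Any]]:
        """Return every item of one object with a single contiguous range scan."""
        lo = bisect.bisect_left(self._keys, (dir_id, obj_id, 0))
        hi = bisect.bisect_left(self._keys, (dir_id, obj_id + 1, 0))
        return [(k, self._items[k]) for k in self._keys[lo:hi]]


if __name__ == "__main__":
    store = SortedItemStore()
    store.insert((1, 42, 1), b"#!/bin/sh\necho hello\n")   # tiny script body
    store.insert((1, 7, 0), {"mode": 0o644, "size": 1200})  # another file's metadata
    store.insert((1, 42, 0), {"mode": 0o755, "size": 21})   # the script's metadata
    for key, item in store.items_for_object(1, 42):
        print(key, item)
```

Even though the script's body was inserted before its metadata, the lookup returns them side by side, metadata first, because the ordering comes from the keys rather than from insertion order.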

Another feature of Reiserfs is that the balanced tree stores not just metadata but also file data. In a traditional filesystem such as ext2, disk space is allocated in fixed-size blocks, typically 1024, 2048, or 4096 bytes. If a file's size is not an exact multiple of the block size, space is wasted. For example, suppose the block size is 1024 bytes and you need to store a file that is 8195 bytes long. Eight blocks hold 8192 bytes, so almost all of the file fits into eight blocks, but the remaining three bytes need a ninth block of their own, which is mostly empty! The wasted space is 1021 bytes, almost one whole block out of nine, or about 11 percent. Now imagine a file 1025 bytes long. It almost, but not quite, fits into one block, so it requires two. The wasted space is nearly 50 percent. The worst case is a very tiny file, such as a trivial (but useful) one-line shell script. Such a file may be only 50 bytes or so and fits easily into one block, but if the block is 1024 bytes, then about 95 percent of its allocated space is wasted. As you can see, the wasted space, as a percentage, shrinks as files get larger.
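The arithmetic in these examples is easy to reproduce. The small Python function below computes how many bytes of a file's last block go unused, and what fraction of the allocated space that represents, under the whole-block allocation just described. The function name is ours, for illustration only.

```python
def slack(file_size: int, block_size: int) -> tuple[int, float]:
    """Return (wasted_bytes, wasted_fraction_of_allocation) when a file of
    file_size bytes is stored in whole blocks of block_size bytes."""
    if file_size < 1 or block_size < 1:
        raise ValueError("sizes must be positive")
    blocks = -(-file_size // block_size)  # ceiling division
    allocated = blocks * block_size
    wasted = allocated - file_size
    return wasted, wasted / allocated


if __name__ == "__main__":
    for size in (8195, 1025, 50):
        wasted, frac = slack(size, block_size=1024)
        print(f"{size:>5} bytes: {wasted:>4} bytes wasted ({frac:.0%})")
```

With a 1024-byte block, this reproduces the figures in the text: about 11, 50, and 95 percent wasted for the 8195-, 1025-, and 50-byte files.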

Reiserfs does not force every file to occupy whole blocks. Instead, it relies on the tree structure to keep track of exact byte counts, so small files, and the leftover "tails" of larger files, can be packed together into shared tree nodes. On small files, this can save a lot of storage space. Furthermore, since small files are packed close together, the system can often open and read many of them with a single physical access to the drive. This further improves performance by eliminating time-consuming head seeks.
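To get a feel for how much packing can save, the Python sketch below counts blocks two ways for the same set of files: whole-block allocation, as in the previous example, and an idealized scheme in which each file's full blocks are stored normally but the partial tails are packed end to end into shared blocks. The packing uses a simple next-fit rule and ignores the per-item headers a real tree node carries, so treat it as an optimistic illustration rather than a model of the actual Reiserfs allocator. The function names and file sizes are hypothetical.

```python
def blocks_whole(sizes: list[int], block: int) -> int:
    """Blocks used when every file is rounded up to whole blocks."""
    return sum(-(-s // block) for s in sizes)


def blocks_tail_packed(sizes: list[int], block: int) -> int:
    """Blocks used when full blocks stay per-file but leftover tails share
    blocks (next-fit packing; per-item header overhead is ignored)."""
    full = sum(s // block for s in sizes)
    tails = [s % block for s in sizes if s % block]
    shared, used = 0, 0
    for tail in tails:
        if shared == 0 or used + tail > block:
            shared += 1   # start a new shared block
            used = tail
        else:
            used += tail  # tail fits in the current shared block
    return full + shared


if __name__ == "__main__":
    files = [50, 1025, 8195, 300]  # hypothetical file sizes in bytes
    print("whole blocks:", blocks_whole(files, 1024))
    print("tail packed: ", blocks_tail_packed(files, 1024))
```

For these four hypothetical files, packing the tails needs 10 blocks instead of 13, because all four tails (50, 1, 3, and 300 bytes) fit into one shared block. The more tiny files a partition holds, the larger the gap becomes.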

Some applications benefit more than others from this type of optimization. Imagine a busy web site with a directory holding hundreds of tiny PNG or GIF files used as page icons. This situation is tailor-made for something like Reiserfs. Likewise, a web site with thousands of HTML files, each just a few kilobytes in size, is an excellent candidate. On the other hand, a disk partition that stores ISO 9660 CD-ROM images, each hundreds of megabytes in size, will see little performance gain from Reiserfs. As with so many other things in computing, the best performance comes from matching the right tool to the job at hand. (Note that I'm not saying Reiserfs is slower than ext2 on large files, only that in some cases there won't be much difference.)
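Before choosing a filesystem for a partition, it can help to measure how many of its files are actually small. The Python sketch below walks a directory tree with the standard `os.walk` and reports the share of regular files under a size threshold. The function name is ours, and the 4096-byte default threshold is an arbitrary illustrative cutoff, not a figure taken from Reiserfs itself.

```python
import os
import stat


def small_file_share(root: str, threshold: int = 4096) -> float:
    """Fraction of regular files under root whose size is below threshold bytes."""
    total = small = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                st = os.lstat(path)  # lstat: don't follow symlinks
            except OSError:
                continue  # file vanished or is unreadable; skip it
            if not stat.S_ISREG(st.st_mode):
                continue  # ignore symlinks, devices, sockets, and so on
            total += 1
            if st.st_size < threshold:
                small += 1
    return small / total if total else 0.0


if __name__ == "__main__":
    import sys
    root = sys.argv[1] if len(sys.argv) > 1 else "."
    print(f"{small_file_share(root):.0%} of files are smaller than 4096 bytes")
```

A document root full of icons and HTML pages would score high on this measure, while a directory of CD-ROM images would score close to zero.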

On top of everything else, Reiserfs is a true journaled filesystem, like XFS, ext3, and IBM's JFS. Each of these systems implements journaling differently, but the effect is the same: excellent reliability and extremely fast recovery after an abrupt shutdown or crash. Instead of scanning the entire partition after a crash, the system only needs to replay or discard the recent transactions recorded in the journal. On my system, filesystems that took several minutes to check under ext2 take only a second or two under Reiserfs. This difference is typical of any journaled filesystem compared with a traditional one.
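The core idea behind journaling can be shown with a toy write-ahead log. The Python sketch below is a generic illustration, not the Reiserfs journal format: each transaction's updates are first appended to a journal, then a commit record is written, and only then are the main structures updated. After a simulated crash, recovery replays only committed transactions and ignores incomplete ones. The class and method names are hypothetical. Because recovery reads just the journal rather than every structure on the partition, it can finish very quickly.

```python
from dataclasses import dataclass, field


@dataclass
class ToyJournaledStore:
    """In-memory stand-in for a disk plus a write-ahead journal (hypothetical)."""
    disk: dict = field(default_factory=dict)
    journal: list = field(default_factory=list)
    next_txn: int = 0

    def write(self, updates: dict, crash_before_commit: bool = False,
              crash_before_apply: bool = False) -> None:
        txn = self.next_txn
        self.next_txn += 1
        # 1. Record the intended changes in the journal first.
        for key, value in updates.items():
            self.journal.append(("update", txn, key, value))
        if crash_before_commit:
            return  # simulated crash: no commit record was written
        # 2. The commit record marks the transaction as complete.
        self.journal.append(("commit", txn))
        if crash_before_apply:
            return  # simulated crash: committed, but main structures not updated
        # 3. Apply the changes to the main "disk" structures.
        self.disk.update(updates)

    def recover(self) -> int:
        """Replay committed transactions from the journal; return how many."""
        committed = {rec[1] for rec in self.journal if rec[0] == "commit"}
        for rec in self.journal:
            if rec[0] == "update" and rec[1] in committed:
                _, _, key, value = rec
                self.disk[key] = value
        self.journal.clear()  # journal no longer needed once replayed
        return len(committed)


if __name__ == "__main__":
    store = ToyJournaledStore()
    store.write({"inode 12": "size=50"})
    store.write({"inode 13": "size=8195"}, crash_before_apply=True)
    store.write({"inode 14": "size=1025"}, crash_before_commit=True)
    replayed = store.recover()
    print(replayed, store.disk)
```

After recovery, the committed-but-unapplied update to inode 13 is restored, while the half-written update to inode 14 is discarded, leaving the store in a consistent state. Replaying an already-applied transaction is harmless here because it simply writes the same values again.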
