An In-Depth Look at Reiserfs

Scott Courtney
Monday, January 22, 2001 08:42:21 AM
Now that you understand the need for journaled filesystems,
we can take a look at one particular type, called Reiserfs.
Originally designed by Hans Reiser, Reiserfs carries the analogy
between databases and filesystems to its logical conclusion.
In essence, Reiserfs treats the entire disk partition as if it
were a single database table. Directories, files, and file
metadata are organized in an efficient data structure called
a "balanced tree." This differs from the way traditional
filesystems operate, and it offers large speed improvements
for many applications, especially those that use many
small files.
Reading and writing of large files,
such as CD-ROM images, is often limited by the speed of the
disk hardware or the I/O channel, but access to small files
such as shell scripts is often limited by the efficiency of
the filesystem design. The reason for this is that opening
a file requires the system first to locate the file, and
that means reading directories off the disk. Furthermore,
the system needs to examine the security metadata to see
if the user has permission to access the file, and that
means additional disk reads. The system can literally spend
more time deciding whether to allow the access, and then
locating the data on the drive, than it does actually reading
such a small amount of information from the file itself.
Reiserfs uses its balanced trees to streamline the process
of finding the files and retrieving their security (and other)
metadata. For extremely small files, the entire file's data
can actually be stored physically near the file's metadata,
so that both can be retrieved together with little or no
movement of the disk seek mechanism. If an application needs
to open many small files rapidly, this approach significantly
improves performance.
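To make the idea concrete, here is a toy model in Python. It is only a sketch under loud assumptions: a sorted list searched with binary search stands in for the balanced tree, and the class name ToyTree, the path-based keys, and the metadata fields are all hypothetical. None of this reflects Reiserfs's real keys or on-disk layout. The point it illustrates is that one lookup can return both the metadata and the contents of a tiny file.

```python
import bisect


class ToyTree:
    """Sorted key/value store standing in for a balanced tree (illustration only)."""

    def __init__(self):
        self._keys = []   # sorted file paths (hypothetical key format)
        self._items = []  # (metadata, data) pairs, parallel to _keys

    def insert(self, path, metadata, data=b""):
        # Find the sorted position for this key with binary search.
        i = bisect.bisect_left(self._keys, path)
        if i < len(self._keys) and self._keys[i] == path:
            self._items[i] = (metadata, data)  # overwrite an existing entry
        else:
            self._keys.insert(i, path)
            self._items.insert(i, (metadata, data))

    def lookup(self, path):
        # A single search yields the metadata and the small file's data together.
        i = bisect.bisect_left(self._keys, path)
        if i < len(self._keys) and self._keys[i] == path:
            return self._items[i]
        raise FileNotFoundError(path)


tree = ToyTree()
tree.insert("/usr/local/bin/hello.sh", {"mode": 0o755, "owner": "root"}, b"echo hello\n")
metadata, data = tree.lookup("/usr/local/bin/hello.sh")
print(oct(metadata["mode"]), data)
```

In a real filesystem the saving comes from fewer disk reads rather than fewer Python calls, but the shape is the same: the permission check and the data read are served by the same search.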
Another feature of Reiserfs is that the balanced tree stores
not just metadata, but also the file data itself. In a
traditional filesystem such as ext2, space on the disk is
allocated in fixed-size blocks, typically 1024 to 4096 bytes
each, and sometimes larger. If a file's size happens to
be anything other than an exact multiple of the block
size, space will be wasted. For example, suppose the block
size is 1024 bytes but you need to store a file that is
8195 bytes long. Eight blocks hold 8192 bytes, so almost all
of the file will fit into eight blocks. The remaining three
bytes need a ninth block of their own, which is mostly empty! The
wasted space is almost one whole block out of nine,
or about 11 percent. Now imagine a file 1025 bytes
long. It almost, but not quite, fits into one block, but
requires two. The wasted space is nearly 50 percent.
The worst case is a very tiny file, such as a trivial
(but useful) one-line shell script. Such a file may be
only 50 bytes or so, and it fits into a single block. But
if the block is 1024 bytes, the file wastes about 95 percent
of its allocated space. As you can see, the wasted space (as
a percentage) shrinks as files grow larger.
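The arithmetic in these examples is easy to check with a few lines of Python. This is just a sketch; the function name wasted_space is made up for illustration, and the 1024-byte default mirrors the block size assumed in the examples above.

```python
def wasted_space(file_size, block_size=1024):
    """Return (blocks, wasted_bytes, wasted_fraction) under whole-block allocation."""
    if file_size <= 0:
        raise ValueError("file_size must be positive")
    blocks = -(-file_size // block_size)  # integer ceiling division
    allocated = blocks * block_size
    wasted = allocated - file_size
    return blocks, wasted, wasted / allocated


# The three file sizes discussed in the text.
for size in (8195, 1025, 50):
    blocks, wasted, fraction = wasted_space(size)
    print(f"{size:>5} bytes -> {blocks} block(s), {wasted} bytes wasted ({fraction:.0%})")
```

Running it reports roughly 11, 50, and 95 percent waste for the three files, matching the figures above.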
Reiserfs doesn't use a traditional block approach to
allocating space, instead relying on the tree structure
to keep track of exact byte counts. On small files, this
can save a lot of storage space. Furthermore, since small
files are packed closer together, the system can often
read many of them with just one physical access to the
drive. This further improves performance by eliminating
time-consuming head seek operations.
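A rough comparison shows how much space byte-exact packing could save on a pile of small files. The sketch below is deliberately idealized: it ignores metadata and tree overhead entirely, the function names are hypothetical, and the one hundred 300-byte icon files are invented sample data, not measurements from a real Reiserfs partition.

```python
def blocks_one_file_per_block(sizes, block_size=1024):
    """Blocks consumed when every file gets its own whole blocks."""
    return sum(-(-size // block_size) for size in sizes)


def blocks_packed(sizes, block_size=1024):
    """Blocks consumed if file data is packed end to end (idealized, no overhead)."""
    return -(-sum(sizes) // block_size)


icons = [300] * 100  # hypothetical: one hundred 300-byte icon files
print("whole-block allocation:", blocks_one_file_per_block(icons), "blocks")
print("byte-exact packing:    ", blocks_packed(icons), "blocks")
```

With these made-up numbers the packed layout needs 30 blocks instead of 100. Fewer blocks also means fewer separate places on the disk to visit, which is where the seek savings come from.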
Some applications benefit more than others from this type
of optimization. Imagine a directory with hundreds of tiny
PNG or GIF files used as web page icons, on a busy site.
This situation is tailor-made for something like Reiserfs.
Likewise, a web site with thousands of HTML files, each
just a few kilobytes in size, is an excellent candidate.
On the other hand, a disk partition that stores ISO 9660
CD-ROM images, each hundreds of megabytes in size, will see
little performance gain from Reiserfs. As with so many
other things in the world of computing, the best performance
is gained by matching the right tool with the job at hand.
(Note that I'm not saying Reiserfs is slower than ext2 on
large files -- only that there won't be much difference in
some cases.)
On top of everything else, Reiserfs is a true journaled
filesystem like XFS, ext3, and IBM's JFS. Each of these
systems implements the journaling feature in a different
way, but the effect is the same: extremely good reliability,
and extremely fast recovery after an abrupt shutdown or
crash. On my system, I have found that filesystems that
took several minutes to check using ext2 take only a second
or two under Reiserfs. This difference is typical of any
journaled filesystem versus a traditional filesystem.