Modern Distributed Filesystems For Linux: An Introduction
What Are Distributed Filesystems?
The ability to share disks, directories, and files over a network is one of the most significant advances in modern computing. It reduces local disk space requirements and makes it easy for users to collaborate without ending up with hundreds of versions of the same files. Personal computers running Microsoft Windows and Apple's Mac OS and Mac OS X have built-in support for sharing disks and directories with other systems of the same type. Linux and Unix systems traditionally use NFS, the Network File System, to do the same sort of thing. NFS is the best-known network file-sharing mechanism for Unix, Linux, and related operating systems because it is included in most Unix-like operating system distributions and is simple to configure. NFS is supported in the Linux kernel, and NFS-related utilities are provided with every Linux distribution.
However, a number of more modern mechanisms for sharing files and directories over networks are available for today's Linux systems. Each of these can provide significant administrative and usability advantages for sites running Linux. OpenAFS (http://www.openafs.org) is an Open Source release of AFS, a distributed filesystem that has been in commercial use for over a decade. Support for network-oriented filesystems such as InterMezzo (http://www.inter-mezzo.org) and Coda (http://coda.cs.cmu.edu) is already integrated into later 2.4 Linux kernels. New, web-based file-sharing mechanisms such as WebDAV (http://www.webdav.org) are easily integrated into existing Web-oriented environments, and can be mounted as though they were filesystems. The expanding dependence on networking as a basic tenet of computing today will only help popularize these newer, more powerful file-sharing mechanisms.
This article provides an overview of the benefits of distributed filesystems, discusses the most significant administrative issues in deploying and using distributed filesystems, and introduces the most interesting new distributed filesystems available for Linux today. Subsequent articles in this series will provide hands-on guidance for installing, configuring, and experimenting with some of the more interesting and useful of these networked filesharing mechanisms.
Introduction to Distributed Filesystems
Today's preferred term for networked filesystems is "distributed filesystems." This term reflects the fact that many of these filesystems do much more than simply export data over a network. The storage media associated with these filesystems do not need to be located on a single system, but can literally be distributed across many different computers. Distributed filesystems such as OpenAFS and Coda include their own volume management mechanisms that simplify shared storage management. They also support replication, which is the ability to make copies of exported volumes and store those copies on other file servers. If one file server becomes unavailable, the data stored in its volumes can still be accessed from the available replicas of those volumes.
A primary difference between the types of directory and disk sharing offered by personal computer operating systems (Windows and the original Mac OS) and Unix-like multi-user operating systems such as Linux and Mac OS X is how these operating systems use and organize disks and partitions. Windows and the original Mac OS export disk partitions as separate drive letters or drives, and remote systems that want to access shared devices must mount them individually. When the highest level of organization possible in a filesystem is a disk partition, as it is in Windows filesystems, client workstations that want to access this data over the network map local drive letters to these shared partitions. Shared drive mappings are usually set in Windows user and group profiles in an attempt to standardize them. Unfortunately, the letters to which these shared partitions are mapped are not guaranteed to be the same across multiple computer systems because of the way that drive letters are assigned. Local disks and partitions always take precedence, and any system with a large number of local devices may require different drive mappings than less complex workstations.
In contrast, the Unix filesystem is a hierarchical filesystem to which additional storage is added by mounting it on an existing directory in the filesystem. This effectively enables you to add storage to any existing filesystem. If you mount a new partition on a directory that is being exported as part of a distributed filesystem, that new storage instantly and transparently becomes available to clients of that distributed filesystem. Distributed filesystems such as OpenAFS and Coda that provide volume management services take this a step further by enabling you to mount volumes from different file servers into a central directory hierarchy that is supported by the filesystem. OpenAFS uses a central directory called /afs, while Coda uses /coda. These directory hierarchies are visible to all clients of these distributed filesystems, and look exactly the same from any client workstation. This enables users to access their data files in exactly the same way from any client computer. If the machine on your desk fails, you can just use another--your files are still intact and safe on the file server. Distributed filesystems that provide the same data to many different computer systems enable users to use the type of desktop machine that best suits their needs while still having access to a centralized filesystem. Macintosh users can take advantage of the superior graphics tools available under the Mac OS while transparently saving their files to centralized file servers. Windows users can have access to a robust wide-area filesystem while still being able to play Minesweeper. Distributed filesystems are especially attractive when trying to coordinate work between groups located in different cities, states, or even countries. Shared data is always available over the network, regardless of your location.
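The idea of mounting volumes from different file servers into one directory hierarchy can be illustrated with a small sketch. This is not how OpenAFS or Coda are implemented; it is a minimal model, assuming a hypothetical mount table in which the server names, volume names, and paths under /afs are invented for illustration. It shows how a client-side path might be resolved to the deepest volume mounted above it.

```python
from pathlib import PurePosixPath

# Hypothetical mount table: mount point -> (file server, volume name).
# All server and volume names here are invented for illustration.
MOUNTS = {
    "/afs": ("fs1.example.com", "root.afs"),
    "/afs/example.com/home": ("fs2.example.com", "home"),
    "/afs/example.com/home/alice": ("fs3.example.com", "user.alice"),
}


def resolve(path, mounts=MOUNTS):
    """Return (server, volume, path relative to the volume) for a path."""
    target = PurePosixPath(path)
    # Walk from the full path up towards "/" and stop at the first
    # directory that is a mount point: that is the deepest mount.
    for candidate in [target, *target.parents]:
        entry = mounts.get(str(candidate))
        if entry is not None:
            server, volume = entry
            return server, volume, str(target.relative_to(candidate))
    raise LookupError(f"no volume mounted above {path}")
```

Because every client consults the same table, a path such as /afs/example.com/home/alice/notes.txt resolves to the same server and volume from any workstation, which is the property the article describes.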
Administrative Issues in Distributed Filesystems
Using a distributed filesystem introduces new commands and new concerns for system administrators, but also simplifies many standard administrative tasks. Distributed computing environments typically enable users to log in on any workstation within an administrative domain. This requires that the login, or authentication, mechanism also be network-aware. In distributed filesystem environments, password and group files located on individual machines must be secondary to networked authentication mechanisms. A network-aware authentication mechanism, such as Kerberos or NIS, gives users the flexibility to use any workstation, while standard machine-specific authentication mechanisms must still exist so that administrators can log in on individual machines to repair them or perform administrative tasks.
Storing shared data on centralized file servers rather than on individual desktop systems simplifies administrative tasks such as backing up and restoring files and directories. It also centralizes standard storage administration tasks such as monitoring filesystem use, and introduces new possibilities for storage management, such as load balancing. Distributed filesystems such as OpenAFS and Coda provide built-in logical volume management systems that enable administrators to move heavily-used volumes to more powerful or lightly-used machines. If the distributed filesystem supports replication, copies of heavily-used volumes can be distributed across multiple file servers for use by different clients. This can reduce network use and lighten the load on specific servers. By using logical volumes rather than disk-specific physical volumes, distributed filesystems can also make it easy to add storage to your computing environment while your systems are running, without requiring downtime.
Using a distributed filesystem also makes it easier to share access to software, though you have to make sure that your software licenses permit you to install software into a distributed filesystem. Like the print servers that were part of the original motivation for client/server computing, distributed filesystems also simplify sharing access to specialized hardware: you can connect over the network to the system that hosts the hardware and still see all of your files and data.
Using a centralized distributed filesystem can provide significant cost and performance benefits for client systems. Distributed filesystems substantially reduce hardware costs by minimizing the type and amount of storage that is required on any desktop or laptop workstation. Using a distributed filesystem as the repository for user data usually means faster client restart times, because much of the data is no longer stored locally and therefore does not have to be checked for filesystem consistency after restarting a client. Combining a distributed filesystem for user data with a journaling filesystem for all or most of the filesystems that remain local to client systems can provide additional improvements in system restart times.
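The replication behavior discussed in this section, where clients can keep reading a volume from another file server when one server is down, might be sketched as follows. This is a toy in-memory model under stated assumptions: the FileServer class, the ServerUnavailable exception, and the read_with_failover function are hypothetical names invented here, not part of any OpenAFS or Coda API.

```python
class ServerUnavailable(Exception):
    """Raised by a hypothetical file server that cannot be reached."""


class FileServer:
    # Toy in-memory stand-in for a file server (an assumption for this sketch).
    def __init__(self, name, volumes, online=True):
        self.name = name
        self.volumes = volumes      # volume name -> {path: contents}
        self.online = online

    def read(self, volume, path):
        if not self.online:
            raise ServerUnavailable(self.name)
        return self.volumes[volume][path]


def read_with_failover(replicas, volume, path):
    """Try each server holding a replica of the volume until one answers."""
    failed = []
    for server in replicas:
        try:
            return server.name, server.read(volume, path)
        except ServerUnavailable:
            failed.append(server.name)   # remember it and try the next replica
    raise ServerUnavailable(f"no replica of {volume} reachable: {failed}")
```

The order of the replicas list decides which server is tried first, so a real deployment could order it per client to spread load across file servers. That ordering policy is left out of this sketch.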
Support for Disconnected Operation
Introducing a distributed filesystem increases computer systems' dependence on the network. This dependency on file and data storage that people can only access over a network raises some interesting issues for laptop and mobile users who need access to their data even when they are not directly connected to the network. This is known as disconnected operation, because the system needs to be able to function even when resources that it typically expects to use (such as user data) are not available in the standard fashion. Even a system like Windows provides integrated GUI and desktop features for marking files that you want to work with when you're not connected to the network, and for synchronizing those files when you reconnect.
The Coda and InterMezzo distributed filesystems that are currently available for Linux provide integrated support for offline operation, and work is also being done to provide this capability for NFS filesystems. Support for both the Coda and InterMezzo filesystems is already integrated into the mainline Linux kernel source. InterMezzo support has only been available in the kernel since version 2.4.5 or so, while Coda was integrated into the 2.4 kernel source from the beginning. Coda is a distributed filesystem with its origin in AFS (the parent of OpenAFS), and has been under development at Carnegie Mellon University since 1987. InterMezzo is a relatively new distributed filesystem with a focus on high availability, flexible replication of directories, disconnected operation, and a persistent cache. InterMezzo was inspired by CMU's Coda, but is not based on the Coda source code. The initial creator of InterMezzo, Peter Braam, was the head of the Coda project at CMU for several years before moving on to InterMezzo and other advanced computing projects. Examples of using these filesystems will be provided in subsequent articles in this series.
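The general pattern behind disconnected operation, working from a local cache while offline and synchronizing on reconnect, can be sketched in a few lines. This is an illustrative model only, not the mechanism used by Coda or InterMezzo: the DisconnectedCache class is a hypothetical name, a plain dictionary stands in for the file server, and conflict detection is deliberately omitted.

```python
class DisconnectedCache:
    """Toy client cache: serves reads locally and logs writes made offline."""

    def __init__(self, server_files):
        self.server = server_files          # dict standing in for the file server
        self.cache = dict(server_files)     # local copies made while connected
        self.connected = True
        self.pending = []                   # ordered log of offline writes

    def read(self, path):
        return self.cache[path]

    def write(self, path, data):
        self.cache[path] = data
        if self.connected:
            self.server[path] = data        # write through immediately
        else:
            self.pending.append((path, data))

    def reconnect(self):
        """Replay logged writes in order, then clear the log."""
        self.connected = True
        replayed = len(self.pending)
        for path, data in self.pending:
            self.server[path] = data
        self.pending.clear()
        return replayed
```

A real implementation also has to decide what happens when the server copy changed while the client was offline. This sketch simply lets the replayed write win, which is the simplest policy and rarely the right one.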
Extending Filesystems Across the Web
Before distributed filesystems, filesharing across a network was limited to simple file transfers using protocols such as the File Transfer Protocol (FTP). The World Wide Web has largely done away with the need for standalone file transfer commands by integrating FTP support into most browsers. The ability to easily transfer files over the Web has also led to substantial research in using the Web and its underlying HyperText Transfer Protocol (HTTP) as the basis for distributed filesharing. The best-known result of this work is WebDAV, which stands for Web Distributed Authoring and Versioning. WebDAV is a set of extensions to the HTTP protocol that provides a collaborative environment for users to access, manage, and edit files that are stored on Web servers.
WebDAV support is built into popular Web servers such as the Apache Web server, where it relies on the authentication mechanisms supported by the server, ranging from simple .htaccess files to integrated NIS, LDAP, or even Windows authentication. Using WebDAV to access and update files over the Web is built into operating systems such as Mac OS X, is supported in recent versions of Microsoft's Internet Explorer, and is also available for Linux using applications such as the GNOME project's Nautilus file manager. Although WebDAV is not a filesystem in the traditional sense, you can even mount WebDAV shares as filesystems on Linux systems by using the davfs loadable kernel module. WebDAV provides standard distributed computing features such as file locking and creating, renaming, copying, and deleting files, and also supports advanced features such as file metadata, which is searchable information about the files on a WebDAV server, such as their title, subject, creator, and so on.
In the near future, WebDAV will include integrated support for revision control, which will make it easier for multiple users to collaborate on shared files by tracking changes, their authors, and other aspects of shared document maintenance. These versioning capabilities are provided by the DeltaV protocol, which is actively under development by the DeltaV Working Group (http://www.webdav.org/deltav) of the Internet Engineering Task Force (IETF). Some projects, such as Subversion (http://subversion.tigris.org), a WebDAV and DeltaV-based replacement for the standard Unix/Linux Concurrent Versions System (CVS), are already available in Alpha form. Subversion provides versioning and a database-based file repository with a C-language API that simulates a versioned filesystem that is easily accessed over the Web.
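Because WebDAV is a set of extensions to HTTP, a client retrieves file metadata by sending an ordinary HTTP request that uses an additional method, PROPFIND, and reading back an XML "Multi-Status" response. The sketch below builds such a request with Python's standard library and parses a response body. The build_propfind and parse_multistatus helper names are assumptions made for this example, any server URL passed in is up to the caller, and the parser is simplified: it ignores the per-property status codes that a complete client would check.

```python
import urllib.request
import xml.etree.ElementTree as ET

# Request body asking for two standard WebDAV properties.
PROPFIND_BODY = (
    '<?xml version="1.0" encoding="utf-8"?>'
    '<D:propfind xmlns:D="DAV:">'
    "<D:prop><D:displayname/><D:getcontentlength/></D:prop>"
    "</D:propfind>"
)


def build_propfind(url, depth="1"):
    """Build (but do not send) a PROPFIND request for the given URL."""
    return urllib.request.Request(
        url,
        data=PROPFIND_BODY.encode("utf-8"),
        headers={"Depth": depth, "Content-Type": "application/xml"},
        method="PROPFIND",
    )


def parse_multistatus(xml_text):
    """Return {href: {property name: text}} from a Multi-Status body."""
    ns = {"D": "DAV:"}
    result = {}
    for response in ET.fromstring(xml_text).findall("D:response", ns):
        href = response.findtext("D:href", namespaces=ns)
        props = {}
        for prop in response.findall("D:propstat/D:prop/*", ns):
            # Tags look like "{DAV:}displayname"; keep only the local name.
            props[prop.tag.split("}", 1)[1]] = prop.text
        result[href] = props
    return result
```

To talk to a real server, the request object would be passed to urllib.request.urlopen, usually with authentication configured first, and the response body handed to parse_multistatus. A Depth header of "1" asks for the collection and its immediate children.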
Wrapping Up
IS and IT managers responsible for enterprise computing services may already be using a distributed filesystem such as NFS, or filesystem adapters such as Samba, Netatalk, or the Novell-related NCP tools, to unify their Linux and microcomputer network environments. As we'll see in subsequent articles in this series, newer, more powerful distributed filesystems such as OpenAFS, Coda, InterMezzo, and WebDAV can easily be integrated into existing Linux computing environments. Distributed filesystems such as InterMezzo, Coda, and WebDAV can provide increased flexibility in your computing environment, expedite filesharing, reduce per-system costs, and simplify administrative tasks. Modern distributed filesystems are bringing new life to Sun Microsystems' slogan of "The network is the computer" by extending computer filesystems across the modern computer network.
Bill von Hagen has written for Linux Magazine, Maximum Linux, Linux Format, Mac Home, Mac Tech, and various Linux and Macintosh-related online publications. He is the author of books on SGML, Linux Filesystems, and Red Hat Linux, and is the co-author of a book on Mac OS X. He is the Content Manager for TimeSys Corporation.