Rescuing Linux Systems--Generic and Distribution-Specific Safety Nets

By: Bill von Hagen
Monday, July 8, 2002 11:31:59 AM EST
URL: http://www.linuxplanet.com/linuxplanet/reports/4294/1/

Sending Out an SOS

The time comes when every Linux system administrator experiences a system failure. Hardware failures are usually resolved quickly enough by replacing a deceased motherboard, power supply, or controller, but component failure can have other side effects, especially in disk subsystems where errant or incomplete writes may corrupt boot information and filesystems. The true twilight zone for system administrators occurs when an otherwise useful system is unbootable due to disk corruption or accidental system misconfiguration--your data is just a few inches away, but is inaccessible for one reason or another.

The easiest solution to this sort of problem is a bootable disk known as a "rescue disk," located on removable media such as a floppy disk or CD. These are designed to help you boot failed systems, resolve or work around common problems, and quickly restore your system to self-sufficiency.

Linux rescue disks generally fall into two distinct classes, each with its own advantages and disadvantages. The first class consists of rescue disks that are provided with or produced by a specific Linux distribution and are therefore targeted toward correcting problems encountered on a machine running that distribution. These distribution-specific rescue disks may be floppies created during the installation process or boot options available from the distribution's installation CD. In either case, such distribution-specific rescue disks reflect the boot loader, filesystems, and tools used by that distribution.

The second class of rescue disks are distribution-independent, single-floppy or single-CD rescue disks that are designed to help you recover any Linux system, regardless of the distribution on which it is based. The fact that these types of rescue disks are independent of any given distribution makes them a flexible solution that you can use to repair and recover many different kinds of Linux systems. At the same time, distribution-independent rescue disks may not be able to help you if the machine you are trying to repair uses filesystems or depends on custom software that is not supported outside of a specific distribution.

Both of these types of rescue disks are like insurance policies--you hope that you don't have to use them, but you'll be glad that you have them if you need them. This article discusses the kinds of problems that typically require the use of a rescue disk, highlights the rescue mechanisms provided with various Linux distributions, and concludes by comparing and contrasting some of the more popular and powerful distribution-independent rescue disks that are freely available on the Web today.

Problems That May Require Rescue Disks

Like any operating system, Linux comes with sets of utilities that are designed to automatically repair common problems when the system verifies its state as part of the boot process. Ignoring hardware failure, which usually has a simple solution, the most common problems that prevent a Linux system from booting successfully are problems with the boot process itself and self-test problems that prevent the system either from running its self-tests or from restoring itself to a consistent state.

Some classic examples of things that can induce boot problems are missing or damaged disk blocks (such as the master boot record--MBR), missing files required by the boot loader (LILO or GRUB), bad or incorrectly updated boot loader configuration information, or a bad or missing kernel.
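As a quick, non-destructive check for the first of these problems, the sketch below tests whether a disk's first sector still ends with the 0x55 0xAA boot signature, and saves a copy of that sector for safekeeping. The function names are my own, and any device name you pass in is an assumption about your hardware. A valid signature only shows that the sector is not blank or overwritten with text; it does not prove that the boot loader inside it works.

```shell
# Succeed if bytes 510-511 of the given disk or image are 0x55 0xAA,
# the signature that a PC BIOS expects at the end of the boot sector.
mbr_signature_ok() {
    sig=$(dd if="$1" bs=1 skip=510 count=2 2>/dev/null | od -An -tx1 | tr -d ' \n')
    [ "$sig" = "55aa" ]
}

# Copy the first 512-byte sector (boot code plus partition table) to a file.
save_mbr() {
    dd if="$1" of="$2" bs=512 count=1 2>/dev/null
}
```

From a rescue shell you might run "save_mbr /dev/hda /mnt/floppy/mbr.bin" and then "mbr_signature_ok /dev/hda && echo signature present", where /dev/hda and the floppy mount point are only examples.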

Assuming that the kernel and boot loader files themselves are present and consistent, errors in the /boot or / filesystems can prevent the root or /boot filesystems from being correctly identified, located, or mounted. As part of the boot process for most Linux systems, the system startup scripts verify the state of a flag in the filesystem header that identifies whether the filesystem was unmounted cleanly the last time that the system shut down, and run fsck on any filesystem that was not. Most types of filesystem corruption can be handled during the boot process, but corruption that affects the system boot scripts or the fsck utility itself can leave you with a system that "almost" boots--which is a million virtual miles from one that boots successfully.
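To make that flag concrete, here is a minimal sketch that reads it directly from an ext2 filesystem. It assumes the standard ext2 on-disk layout (superblock at byte 1024, the 0xEF53 magic number at superblock offset 56, and the state field at offset 58). The function names are hypothetical, and e2fsck remains the real judge of a filesystem's health.

```shell
# Print the little-endian 16-bit value stored at a byte offset.
# $1 = image or device, $2 = absolute byte offset
read_le16() {
    bytes=$(dd if="$1" bs=1 skip="$2" count=2 2>/dev/null | od -An -tu1)
    set -- $bytes
    [ $# -eq 2 ] || return 1
    echo $(( $1 + $2 * 256 ))
}

# Print "clean", "needs-fsck", "not-ext2", or "unreadable" for $1.
ext2_state() {
    magic=$(read_le16 "$1" 1080) || { echo "unreadable"; return 1; }
    [ "$magic" -eq 61267 ] || { echo "not-ext2"; return 1; }   # 61267 = 0xEF53
    state=$(read_le16 "$1" 1082) || { echo "unreadable"; return 1; }
    # Bit 0 set = unmounted cleanly; bit 1 set = errors were detected.
    if [ $(( state & 1 )) -eq 1 ] && [ $(( state & 2 )) -eq 0 ]; then
        echo "clean"
    else
        echo "needs-fsck"
    fi
}
```

Calling "ext2_state /dev/hda1" (an assumed device name) on an unmounted partition reports what the startup scripts would find. On ext3 the journal carries its own recovery flag, which this sketch does not read.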

As mentioned in the introduction to this article, different types of rescue disks have different capabilities. At the low end of the rescue spectrum, some of these simply provide a boot block and kernel that lets you mount an existing and consistent root filesystem. At the high end of the spectrum are rescue disks that provide full-blown sets of replacement tools that enable you to repair almost every type of disk corruption.

Common Rescue/Recovery Scenarios

The number of ways in which a computer system can break is essentially infinite. Luckily, the number of common "dead system" scenarios that you can actually recover from is relatively small, and falls into several general classes. The following are my favorites, and some tips and tricks for recovering from them:

  • Rescue disks were made for the situation where your system won't boot because the root filesystem is corrupted and you can't even boot to the point where you can access the fsck utility on the system itself. In this case, it's fairly easy to boot from a rescue disk and then use the version of fsck that it provides to repair the corrupted filesystem. You can use standard fsck tricks such as using an alternate superblock if the filesystem's primary superblock is damaged. If you actually lost files, you can then either copy them to removable media on another system and then reinstall the copies on your original system, or reinstall them from your original disks if you have access to them.
  • If filesystem misconfiguration is the problem, you can boot from the rescue disk and then repair the filesystem configuration file (/etc/fstab) or use utilities such as tune2fs, debugfs, and so on to correct misconfiguration in the filesystem header.
  • If you are having bootloader problems, you can boot from a rescue disk, correct the boot loader configuration files (if necessary), and then reinstall some or all of the bootloader (if necessary), including running LILO if that's your boot loader.
  • If you lost the kernel on your system, you can boot from a rescue disk, mount the root partition from the actual Linux system and then rebuild the kernel. This generally involves first using the chroot command to change the system's notion of the root filesystem so that you can then rebuild or simply reinstall the correct kernel in the correct place. If you're lucky, you can then reboot from your original system, and voila!
  • In the worst case, you may find that your filesystems are so damaged that it is easier to reinstall your system in its entirety. In this case, you can boot from the rescue disks and then use backup utilities to back up files to supported removable media or over the network, if that's supported by the rescue disk that you're using.
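The alternate-superblock trick in the first scenario depends on knowing where the backup copies live. The sketch below computes the usual locations, assuming that the filesystem was created with mke2fs defaults: eight times the block size in blocks per group, and the sparse_super feature, which keeps backups only in group 1 and in groups that are powers of 3, 5, or 7. The function names are my own. If the filesystem was created with other options the numbers will differ, and "mke2fs -n" run against the device with the original options lists the real ones without writing anything.

```shell
# Succeed if $1 is a power of $2 (1 counts as the zeroth power).
is_power_of() {
    n=$1
    while [ $(( n % $2 )) -eq 0 ]; do
        n=$(( n / $2 ))
    done
    [ "$n" -eq 1 ]
}

# Print candidate backup superblock numbers, one per line.
# $1 = block size in bytes (1024, 2048, or 4096)
# $2 = number of block groups to consider
backup_superblocks() {
    bs=$1
    groups=$2
    per_group=$(( bs * 8 ))      # one block-bitmap block describes one group
    if [ "$bs" -eq 1024 ]; then first=1; else first=0; fi
    g=1
    while [ "$g" -lt "$groups" ]; do
        if is_power_of "$g" 3 || is_power_of "$g" 5 || is_power_of "$g" 7; then
            echo $(( first + g * per_group ))
        fi
        g=$(( g + 1 ))
    done
}
```

For a filesystem with 1 KB blocks, "backup_superblocks 1024 10" prints 8193, 24577, 40961, 57345, and 73729, and you would then try something like "e2fsck -b 8193 /dev/hda1", where /dev/hda1 stands in for the damaged partition.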

The next few sections discuss a variety of different Linux rescue mechanisms, the rescue mechanisms that are provided with many common Linux distributions, and a variety of distribution-independent rescue disks that are designed to provide the tools needed to get almost any Linux system up and running, regardless of the vendor who provided the distribution on which it is based.

Distribution-Specific Rescue Floppies

Most Linux distributions enable you to create a rescue floppy as part of the installation process. These rescue floppies are primarily designed to help you recover from simple boot configuration or boot loader problems, such as forgetting to update your boot loader configuration files after building a new kernel, forgetting to run LILO after such an update (if you're using LILO as your boot loader), and so on.

The primary drawback of the rescue disks created by most Linux distributions is that they don't provide tools to help you recover from more serious problems. For example, the boot configuration file on rescue disks created with the "mkbootdisk" script that is provided with Red Hat Linux contains an entry for the location of the root filesystem on the system where it was created. You can therefore only use this type of rescue disk "out of the box" to boot systems that have the same partitioning scheme and use the same general type of hardware as the system on which it was created.

When using floppy rescue disks such as Red Hat's, you can usually specify the "root=/dev/whatever" option at the rescue disk's LILO prompt if the partitioning scheme is different on the system that you are trying to rescue, replacing "whatever" with the name of the correct root partition for that system. However, since this type of rescue disk contains a kernel image taken from the system on which it was created, it may not support the hardware in your other systems. For example, a rescue disk created on a system without SCSI support won't help you rescue a SCSI-based system unless the kernel on the rescue disk has SCSI support compiled in, since most rescue floppies do not include loadable kernel modules. This isn't surprising--after all, they have to fit on a floppy.

Floppy-based rescue disks that are simply designed to help you boot your system usually do not include any utilities to enable you to repair a more seriously damaged system. To continue with the Red Hat example, rescue disks created with the "mkbootdisk" script only contain a boot block, kernel, and associated configuration files. They depend on being able to locate and mount your system's root filesystem in order to find the tools that you may need to completely "rescue" that system. If your system's root file system is corrupted or otherwise damaged, you may not be able to access those tools.

Most of today's Linux distributions include a "rescue" mode on their boot CD that enables you to boot a generic kernel and provides access to critical tools such as fsck and the utilities used to create and write boot configuration information. The next section explores the rescue capabilities of the boot CDs for a variety of Linux distributions.

Booting Popular Linux Distributions in Rescue Mode

Though most businesses that depend on Linux tend to run the same distribution on all their machines, there are still good reasons for sticking with vendor-supported distributions even if they're not your favorite Linux. Hardware vendors such as Dell (who actively supports Linux) often provide distributions with enhanced or customized drivers that reflect the hardware that they ship, and may not be able to help you unless you're running their Linux. Similarly, developers may have installed their favorite Linux on their desktops, but still expect you to be able to save them when they've accidentally dd'ed their .bashrc over their hard drive's boot block.

Running multiple Linux distributions in the same machine room or across the desktops in your enterprise sometimes means that you may not be able to quickly put your hands on "the right" set of Linux CDs when a problem occurs--you may have stored them off-site or may simply be drowning in a sea of installation CDs, and a crisis is no time to lament your filing system. Luckily, even if your damaged Linux system runs a different Linux distribution, you can often use the boot CD from one distribution to rescue another, depending on the type of hardware you're using and the filesystems and boot loaders supported by each.

The following sections provide an overview of the rescue modes provided by most of the popular Linux distributions and the tools that each provides to help you get your damaged Linux system up and running again.

Debian Woody

Booting from a Debian 3.0 (Woody) CD created using the jigdo utility provides two rescue mechanisms. You can either boot from the CD and begin an installation that you can escape from in order to repair an existing system, or specify an existing root partition using the root=/dev/whatever option. In the former case, after selecting the language and language variant that you want to use, you can use the arrow keys to select repair/recovery-related commands such as "View the Partition Table", "Execute a Shell", "Make System Bootable" (after root mounted), "Make a Boot Floppy" (after root and swap mounted), and so on. Selecting "Execute a Shell" gives you access to fsck for ext2 and ext3 filesystems, the fdisk partitioning utility, the mkfs utility for ext2 filesystems, and the mkswap utility to recreate corrupted swap partitions.

Mandrake 8.2

Mandrake's traditional user friendliness carries over into the rescue mode provided by their distribution CDs. After booting from the first CD, press F1 for a list of advanced options, type "rescue" at the boot prompt, and press <CR>. After booting, Mandrake's rescue mode displays a menu of choices: "Re-install Boot Loader", "Restore Windows Boot Loader", "Mount your partitions under /mnt", "Go to console", "Reboot", and "Doc: what's addressed by this Rescue?". "Go to console" is the most useful of these choices across a variety of Linux systems, because it simply gives you a root prompt and access to the tools that you can use to repair your system.

Mandrake 8.2's rescue mode provides versions of fsck for the ext2, ext3, JFS, ReiserFS, and XFS filesystems. For true calamities, it also provides the fdisk and sfdisk partitioning utilities, as well as mkfs for ext2 filesystems. If your bootloader is damaged, Mandrake's rescue mode also provides grub-install.

Red Hat 7.3

Red Hat's rescue mode is quite powerful. After booting from CD #1, press F4 for information and type "linux rescue" at the LILO prompt to boot the rescue kernel. You then select the language and keyboard that you want to use, and the rescue mode tries to locate an existing Red Hat root directory and mount it as /mnt/sysimage. You can then press <CR> to log in and get a shell.
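Once the damaged system's root directory is mounted, a typical next step is to run the boot loader installer against it with chroot. The sketch below only prints the command it would suggest rather than running it, and refuses if it cannot find a LILO configuration file under the mount point. The function name is hypothetical, and the /sbin/lilo path is an assumption about where the damaged system keeps LILO.

```shell
# Print the command for reinstalling LILO on a mounted, damaged system.
# $1 = mount point of that system's root (defaults to /mnt/sysimage)
plan_lilo_reinstall() {
    sysroot=${1:-/mnt/sysimage}
    if [ ! -f "$sysroot/etc/lilo.conf" ]; then
        echo "no etc/lilo.conf under $sysroot; is the root filesystem mounted?" >&2
        return 1
    fi
    # chroot makes $sysroot the root directory for the command that follows,
    # so LILO reads the damaged system's own /etc/lilo.conf and /boot.
    echo "chroot $sysroot /sbin/lilo -v"
}
```

Review the printed command, then run it as root from the rescue shell. Printing first is a deliberate choice here: rewriting a boot block is not something to do by accident.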

Red Hat's rescue mode only includes versions of fsck for ext2, ext3, and ReiserFS filesystems. If a partition is beyond recovery through fsck, Red Hat's rescue mode includes the fdisk and sfdisk partitioning tools, and the commands to create ext2, ext3, RAID, ReiserFS, swap, and even VFAT filesystems. It includes LILO and provides access to grub-install if it has been able to find and mount existing partitions that include this utility.

Slackware 8.1

As one of the longest-lived Linux distributions, Slackware's lack of focus on graphical bells and whistles makes it a good candidate for rescuing any damaged Linux system. After booting from the first install CD, you can either press <CR> to load the standard kernel or press F3 to see a list of bootable kernels with built-in support for specific device subsystems, filesystems, and so on, including SCSI, the XFS filesystem, the JFS filesystem, RAID, USB, and so on.

Once you've booted from one of these kernels, Slackware provides convenient commands for probing and initializing subsystems such as the network, PCMCIA, and so on. It only includes versions of fsck for the ext2 and ext3 filesystems, since recovery of JFS and XFS filesystems should be handled as part of the mount command under a kernel that supports these filesystems. In the catastrophe department, Slackware includes the fdisk and cfdisk partitioning utilities, versions of mkfs for ext2, ext3, JFS, ReiserFS, and XFS filesystems, and also includes the commands to create RAID devices and the physical volumes used by LVM. Slackware's rescue mode does not include LILO or grub-install.

SuSE 8.0

As you might expect from the world's largest, most-frequently-updated Linux distribution, SuSE install CDs provide a rescue mode that gives you more tools than you can shake an operator at. To boot SuSE 8.0 in rescue mode, boot from CD #1 or the DVD, use the arrow keys to select "Rescue System" from the initial menu and press <CR>. The rescue kernel boots, prompts you to select a keyboard map, and gives you a root prompt.

SuSE 8.0's rescue mode provides versions of fsck for the ext2, ext3, JFS, ReiserFS, and XFS filesystems, as well as additional tools for repairing ReiserFS and XFS filesystems, LVM (logical volume management), and RAID support. For worst-case partition damage scenarios, it provides fdisk and versions of mkfs for all of these types of filesystems. It also includes LILO if you need to rewrite the boot block on your primary disk.

Distribution-Independent Rescue Disks

The previous section highlighted the rescue modes provided by various Linux distributions from their primary installation CDs. The ultimate recovery disk for a Linux distribution is, of course, the boot CD from which you installed that distribution or a rescue disk produced when installing that particular system. However, if you run many different Linux distributions or even multiple versions of the same Linux distribution, keeping track of all of those rescue floppies and installation CDs can be tricky at best.

An easy solution to this sort of problem is to use one of the many distribution-independent rescue disks that are available on the Web. Some of these began life as small Linux distributions themselves, while others grew out of popular Linux distributions such as Red Hat. The common thread in all of these rescue disks is that they are tailored toward providing the tools you need to repair and recover other systems--they are not designed to be Linux distributions, but Linux sysadmin toolkits. Their focus is on relatively small size and providing the tools you need to get other Linux systems up and running again.

The next sections summarize the most popular and powerful rescue disks that I've found and used to repair different types of systems in the past. Some of these are floppy-based, which has some obvious advantages--even if you don't have access to a working Linux system, you can quickly create boot disks from online floppy images using the RAWRITE.EXE program. Others are provided as ISO images, which have the obvious advantage of being larger and therefore (hopefully) providing a wider selection of tools. You can also usually create bootable CDs from a DOS or Windows system depending on the CD burning software that you have, and whether you have a CD burner in the first place.
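If you do have a working Linux system or rescue shell at hand, dd can stand in for RAWRITE.EXE. The following is a minimal sketch, not a polished tool: the function name is hypothetical, the target is whatever device or file you pass in (the first floppy drive is commonly /dev/fd0, but check yours), and the read-back comparison assumes that the image is a whole number of 512-byte sectors, as floppy images are.

```shell
# Write an image to a target (a device such as a floppy drive, or a file)
# and read it back to confirm that every byte arrived intact.
# $1 = image file, $2 = target device or file
write_image() {
    image=$1
    target=$2
    size=$(wc -c < "$image")
    if [ $(( size % 512 )) -ne 0 ]; then
        echo "image size is not a whole number of 512-byte sectors" >&2
        return 1
    fi
    dd if="$image" of="$target" bs=512 conv=notrunc 2>/dev/null || return 1
    sync
    # Compare exactly as many sectors as the image holds against the image.
    dd if="$target" bs=512 count=$(( size / 512 )) 2>/dev/null | cmp -s - "$image"
}
```

The verification pass matters more with floppies than with most media, since a single bad sector is enough to leave you with a rescue disk that fails when you need it.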

Tom's Root Boot

Tom's Root Boot (available at http://www.toms.net/rb and often known as "tomsrtbt") is a single floppy rescue solution whose slogan is "The most GNU/Linux on one floppy disk". This is actually quite true, as Tom's Root Boot provides an incredible assortment of Linux and recovery-related tools and hardware support options on a single floppy. Tom's Root Boot floppies are created by using dd to copy a 1.722-MB floppy image (double-sided, 82 tracks, 21 sectors/track) to a standard, formatted floppy.
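The arithmetic behind that oversized format is easy to check. The sketch below multiplies out the geometry, assuming the usual 512-byte floppy sector, and compares it with the ordinary 80-track, 18-sector format. The function name is my own.

```shell
# Capacity in bytes of a floppy format: tracks x sectors/track x sides x 512.
floppy_bytes() {
    echo $(( $1 * $2 * $3 * 512 ))
}

standard=$(floppy_bytes 80 18 2)    # the ordinary "1.44 MB" format
tomsrtbt=$(floppy_bytes 82 21 2)    # the geometry quoted above
echo "standard: $standard bytes ($(( standard / 1024 )) KB)"
echo "tomsrtbt: $tomsrtbt bytes ($(( tomsrtbt / 1024 )) KB)"
```

This prints 1474560 bytes (1440 KB) for the standard format and 1763328 bytes (1722 KB) for the Tom's Root Boot format, so the "1.722 MB" figure is 1722 KB, about 20 percent more room than a normal floppy.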

Aside from the specialized disk format, one of the primary ways that Tom's Root Boot provides such a huge assortment of tools is by making heavy use of Erik Andersen's amazing BusyBox (http://www.busybox.net/). BusyBox is a single binary that provides the functionality of many standard Linux utilities based on the name by which it is invoked. For example, creating a hard link to the BusyBox executable named "rm" and executing the resulting "rm" command causes BusyBox to behave as the standard Unix/Linux file deletion command. Hard links are used rather than symbolic links in order to save space.
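The name-based dispatch that BusyBox relies on can be imitated in a few lines of shell. This is a toy illustration of the idea only, not BusyBox's actual code: the "multicall" script and its "hello" and "count" applets are invented for the example, and everything happens in a scratch directory that is removed at the end.

```shell
# Build a toy multi-call script plus two hard links in a scratch directory.
demo_dir=$(mktemp -d)
cat > "$demo_dir/multicall" <<'EOF'
#!/bin/sh
# Decide what to do from the name this file was invoked under.
case "$(basename "$0")" in
    hello) echo "hello from the shared file" ;;
    count) echo "$#" ;;
    *)     echo "unknown applet: $(basename "$0")" >&2; exit 1 ;;
esac
EOF
chmod +x "$demo_dir/multicall"
ln "$demo_dir/multicall" "$demo_dir/hello"   # hard link: same file, new name
ln "$demo_dir/multicall" "$demo_dir/count"

hello_out=$("$demo_dir/hello")        # takes the "hello" branch
count_out=$("$demo_dir/count" a b c)  # takes the "count" branch with 3 arguments
echo "$hello_out / $count_out"
rm -rf "$demo_dir"
```

The script prints "hello from the shared file / 3". All three names refer to one file on disk, which is exactly why the approach saves so much space on a floppy.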

Tom's Root Boot uses BusyBox to provide commands such as chgrp, chmod, chown, chroot, clear, cmp, egrep, ifconfig, init, insmod, mknod, mkswap, rm, route, sed, tail, telnet, and many more. In addition, Tom's Root Boot contains small versions of partition access and recovery tools such as debugfs, mount, mke2fs, and tune2fs. Tom's Root Boot includes drivers for popular SCSI, PCMCIA, parallel-port ZIP, and network adapters, making it easy to access a variety of devices and even get an existing system up on the network. If you are dealing with a system whose data you need to save before you attempt a repair, Tom's Root Boot includes tools such as cpio and tar (both actually links to the "pax" archive utility) to enable you to archive data to external parallel and SCSI devices in an emergency.

The RIP/RamFloppy Linux Rescue System

Kent Robotti provides two excellent rescue systems from his Web page at http://www.tux.org/pub/people/kent-robotti/looplinux/rip/index.html. This page provides floppy and CD-based versions of the RIP rescue system, and a similar floppy-based rescue system known as RamFloppy. All of these systems are designed to help you boot, repair, back up, and otherwise rescue existing systems.

All of these rescue disks take a similar approach to Tom's Root Boot by using a combination of CRUNCHbox and BusyBox to maximize the number of utilities that they can provide by minimizing the disk space that those utilities require. CRUNCHbox is a mechanism whereby multiple static executables are combined into a single utility that executes the specified utility based on the name by which it is invoked. The utilities that are crunched together on the RamFloppy rescue disk include BusyBox, providing double savings in some sense.

The RamFloppy and RIP floppy rescue systems support ext2, ext3, iso9660, ntfs, umsdos, ReiserFS, and VFAT filesystems, making it easy for you to mount and access these types of filesystems on the computer that you are rescuing. They provide the e2fsck command to correct filesystem corruption on ext2 and ext3 filesystems, the tune2fs command for repairing ext2 and ext3 filesystem attributes, and the badblocks utility to search for and identify bad blocks on an existing disk. For truly damaged disks, they include fdisk for basic partitioning, and mke2fs and mkreiserfs for creating new ext2, ext3, and ReiserFS filesystems.

The RIP bootable rescue CD provides all of the tools on the floppy systems, but adds support for the JFS and XFS journaling filesystems, as well as the fsck.jfs and xfs_repair utilities that can be used to repair corrupted filesystems of those types. The RIP CD includes easier-to-use filesystem partitioning utilities such as cfdisk and sfdisk, in addition to fdisk.

The RIP CD also includes utilities such as mkisofs and cdrecord that enable you to back up data from a damaged system onto CD. Beyond simply backing up files, the RIP rescue CD also includes the partimage utility that enables you to create image files of some or all of your existing partitions, which you can then also back up onto CD (dependent on their size, of course). This can save an incredible amount of time if you find that the easiest way to repair your damaged system is simply to reinstall it, but want a quick and easy way to recreate some of your existing filesystems. The image files produced by the partimage utility only contain portions of the filesystem that actually held data, making them potentially much smaller than the partitions that originally held the filesystems.

All of the RIP and RamFloppy rescue systems can be used to boot an existing Linux system by passing the "linux root=/dev/whatever" boot options to the boot loader. You can use other command-line options to boot into single-user mode or to specify an alternate to the standard "init" program. This last option enables you to boot directly into a utility such as /bin/sh if the init program or related startup scripts are damaged on the system that you are rescuing but you still want to boot from the hard disk's root partition.
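After booting this way, it can be reassuring to confirm which options the kernel actually received. On Linux the running kernel's command line is exposed in /proc/cmdline, and the hypothetical helper below pulls a single "name=value" option out of it, or out of a string supplied as a second argument. The device and program names in the usage note are examples only.

```shell
# Print the value of a name=value kernel option.
# $1 = option name, $2 = command line to search (defaults to /proc/cmdline)
cmdline_value() {
    line=${2:-$(cat /proc/cmdline)}
    for word in $line; do
        case "$word" in
            "$1"=*)
                echo "${word#*=}"   # strip everything up to the first "="
                return 0
                ;;
        esac
    done
    return 1
}
```

For example, "cmdline_value root 'linux root=/dev/hda3 init=/bin/sh'" prints /dev/hda3, and asking for "init" on the same string prints /bin/sh. The function returns a failure status if the option was not given.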

The RIP rescue CD and floppy, and the RamFloppy rescue floppy are free, powerful rescue disks that include support for an impressive number of filesystems, devices, and backup mechanisms. They do not provide network support unless you use them to boot from an undamaged root partition whose kernel and modules provide network support (and which provides network-related utilities). However, their focus on enabling you to repair filesystems of various types and create backups of existing data can be a tremendous asset when your only other alternative is to lose data or slowly copy smaller files to small external media such as floppy disks.

SuperRescue

H. Peter Anvin's SuperRescue rescue disk (http://freshmeat.net/projects/superrescue/?topic_id=866%2C861) advertises itself as "the most overfeatured rescue disk ever created", and lives up to the name. Based on Red Hat 7.2 (for the most part), SuperRescue essentially gives you a complete, single-CD version of Linux that contains everything from basic rescue-related utilities (fsck, tune2fs, fdisk, etc.), all the way to a version of the X Window system pre-configured for 1024x768 (but tunable with the Xconfigurator utility--also included, of course).

SuperRescue uses a special on-the-fly compression/decompression mechanism to provide 1.7 GB of binaries on a single bootable CD. Incredibly, after burning a SuperRescue CD from the downloadable ISO, booting from it, and starting multi-user mode, I was able to use the "startx" script to start not only the X Window system, but the entire KDE desktop. I would have been happy just to get twm, but all of KDE? Full-featured indeed.

Providing the X Window system is interesting, but the real focus of a rescue disk is the disk rescue, recovery, restoration, and backup utilities that it provides. SuperRescue is still reasonably complete in this respect, providing versions of fsck for the ext2, ext3, and ReiserFS filesystems. Given the size of SuperRescue and the number of tools that it provides, I was somewhat disappointed by the absence of tools and support for journaling filesystems such as JFS and XFS (even though these often require kernel patches to integrate into the Linux kernel). I would have traded a fair number of the X Window system utilities that SuperRescue includes for support for these filesystems, but that's my personal choice. SuperRescue does provide support for logical volume management and the physical volumes that underlie it. This makes SuperRescue quite attractive if you are trying to repair or recover a system that uses standard Linux logical volume management.

SuperRescue is also a great rescue disk to use if you want to get the Linux system you're repairing onto a network and integrated with networked filesystems. SuperRescue includes loadable kernel modules for the Coda and InterMezzo distributed filesystems as well as AppleTalk, IPX, and SMBFS if you want to mount remote Apple, Novell, or Windows filesystems and copy data to or from them. It has a limited number of loadable kernel modules for various PCMCIA Ethernet cards, but complaining about this is really looking a gift horse in the mouth.

SuperRescue's bountiful approach to rescuing damaged Linux systems excels in providing access to a wide variety of utilities that you can use to back up data from the damaged system that you are rescuing. SuperRescue includes dump, rdump, tar, cpio, and CD-related utilities such as mkisofs and cdrecord.
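With tar, a rescue-time backup can be as simple as the sketch below, which archives selected directories from the damaged system's mounted root and then lists the archive to confirm that it is readable. The function name and the paths in the usage note are assumptions for illustration; substitute wherever your rescue disk mounted the damaged system and wherever your backup medium lives.

```shell
# Archive selected paths from a mounted, damaged system and verify the result.
# $1 = mount point of the damaged root, $2 = archive to create,
# remaining arguments = paths relative to $1
backup_tree() {
    src=$1
    archive=$2
    shift 2
    # -C makes tar change into $src first, so stored names stay relative.
    tar -cf "$archive" -C "$src" "$@" || return 1
    # Listing the archive forces tar to read it back from start to end.
    tar -tf "$archive" > /dev/null
}
```

A call such as "backup_tree /mnt/sysimage /mnt/backup/rescue.tar etc home" would save /etc and /home from the damaged system. Storing relative names means the archive can later be unpacked under any directory, including a freshly reinstalled system.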

SuperRescue is an impressive piece of work. If you are more comfortable trying to rescue a Linux system using a graphical interface, SuperRescue is the only system-independent rescue disk that includes the X Window system. (There are various small Linux distributions that provide MicroX or Nano-X, but these are not specifically designed as rescue disks, and so are outside the scope of this article.)

Wrapping Up

As mentioned in the introduction to this article, few things are more frustrating than being inches away from data that you can't access because of disk, boot loader, or filesystem problems or corruption. Having bootable floppies or CDs handy gives you a toolkit that you can use to bring a damaged Linux system back to life or to at least back up some or all of its data before resorting to the classic reformat/reinstall paradigm that is so popular in the Windows world.

As discussed in this article, having the installation CDs for the Linux distribution you are using is the fastest and easiest way to recover a damaged Linux system. Nowadays, all of the popular Linux distributions either provide an explicit rescue mode or a shell that you can use to try to repair filesystem damage, replace or reinstall missing or damaged files and packages, or reinstall your favorite bootloader.

If you are running many different types of Linux systems or simply can't find the installation disks used to build a damaged Linux system, the system-independent rescue disks discussed in this article can repair most common filesystem and bootloader-related problems. This article highlighted my personal favorites--there are many others. You can find an extensive list of pre-made rescue disks at http://www.tldp.org/HOWTO/Bootdisk-HOWTO/premade.html, a section of the Linux Boot Disk HOWTO, which is itself an incredibly useful reference for resolving boot problems.

One of the greatest things about Linux is its community philosophy, which is much of the reason for the system-independent rescue disks discussed in this article. The greatest Unix recovery story I've ever read was in a heroic old Usenet posting, now available thanks to Google at
http://groups.google.com/groups?q=Mario+Wolczko+Alasdair&hl=en&lr=&ie=UTF-8&selm=telecom16.402.11%40massis.lcs.mit.edu&rnum=3. With a few Linux rescue disks in your sysadmin toolbox, you may be able to get the same results--your data or your system back--with much less wizardry.

Copyright Jupitermedia Corp. All Rights Reserved.