Using RAID in Linux

By: Alexander Prohorenko
Thursday, August 1, 2002 01:52:49 PM EST
URL: http://www.linuxplanet.com/linuxplanet/tutorials/4349/1/

The Mysteries of RAID

When you look at the installation documents for any of the popular Linux distributions, you will see only a few mentions of the term RAID, typically in passages such as "you will need RAID only if you are a very professional systems administrator and you already know what you are doing."

Even in the latest documentation of the latest Linux releases, this is likely the only thing you will see about RAID. This is one big reason why I think we should move past this barrier and demonstrate that RAID can be used by "normal" people.

RAID stands for "Redundant Array of Inexpensive Disks." This seems to be rather self-explanatory, except for that strange word "inexpensive." In reality, it usually just refers to common PC hard disks, either SCSI or IDE.

Some additional explanation is necessary, however. "Array" simply means multiple units. It is perhaps the most significant term in the acronym--for owners of just one hard drive, however huge, RAID is absolutely useless. Also, the word "redundant" is not an entirely accurate label. As you'll see, it is not as simple as that.

First, let's begin by describing what RAID can be.

Hardware vs. Software RAID

Hardware RAID means that you are the proud owner of some hardware device and you are applying RAID concepts to it.

There are a lot of such devices--starting from a simple controller card all the way up to a big strong-box with many cables and hard drives inside.

The difference between these devices is that the multi-disk ones take care of data organization on the hard drives, backup copies, "hot" replacement, and other internal RAID matters themselves, while asking the operating system of your PC for only one thing--the corresponding driver.

As a rule, all disks of such an array are visible to the OS as one or more "virtual" disk devices, nearly the same as other common drives. Linux supports a large number of hardware RAID devices, and almost every serious server vendor also tries to provide Linux support for its own favorite controllers.

We should mention cards from vendors such as AMI (MegaRAID, along with its Hewlett-Packard and Compaq clones), IBM (ServeRAID), and controllers by Mylex and DPT. These devices are highly recommended.

It should also be pointed out that a serious RAID device is not really SCSI or IDE at all as far as the system is concerned. Usually, once you have the proper controller driver, you can work with Linux as usual, because hardware RAID is completely separate from your system.

Of course, you can have some problems during the initial loading of your system--not all controllers can be automatically detected during installation, and not all of them have support integrated into the system kernel by default. Sometimes you have to rebuild the system kernel and even temporarily use a separate "normal" disk for that. But this typically happens only in rare and difficult cases.

A fiscal word of caution about the popular "IDE-RAID" cards should be put forth here. Sometimes such controllers are integrated right onto motherboards, but the price of this option is rather high.

Sometimes when the term "hardware" is applied to such devices, it is hardware only in the sense of the popular and cheap WinModem: a card that works only in combination with a software driver. In the case of the WinModem, this driver exists only for Microsoft software, and users of other OSes can't use it.

Nevertheless, since these "RAID controllers" are physically nothing more than multi-channel IDE controllers, nothing prevents us from using such devices in Linux to organize software RAID.

Software RAID can be organized with the help of the OS, and it doesn't need anything except additional CPU time for its support. But CPU time is the cheapest resource among all that we have. There is a myth that states hardware RAID is always better than software RAID. Hardware sellers like this myth very much, for reasons which you can well understand. We can also hear the same line of thinking from system administrators, though nobody knows why.

Nevertheless, this myth is basically cancelled out because hardware nowadays usually becomes obsolete before it actually wears out, and yesterday's beautiful RAID hardware device with its seemingly perfect properties will be just lost money tomorrow.

Also, RAID hardware devices are, as a rule, incompatible with each other, and once a controller or some specific disk breaks, it has to be replaced with exactly the same model from exactly the same vendor--otherwise you are in for a serious headache.

Beyond that, software RAID works with the partitions of a hard drive, not with whole disks, which provides more flexibility. In some cases, well-constructed and well-organized software RAID works much faster than hardware RAID, and at less cost.

With these factors in mind, let's learn more about software RAID. When we refer to "RAID" in the next sections, we will actually be talking about "software RAID." Also, we should stipulate that for all of the techniques described in this article, you should have a rather new Linux package, with a kernel version equal to or higher than the latest 2.2/2.4 releases, and with a raidtools version higher than 0.90. In earlier versions, it's simply not a trivial task and there would be a lot of manual work. One last thing: in everything described below we are assuming that you have Red Hat Linux 6.2 or later, or something from a Red Hat-based distro.

What are We Fighting For?

And, by the way, just exactly what do we want from RAID?

In general, two things: high speed or high reliability. And, if we can, both of them.

We can achieve high speed in some variants of RAID by performing input/output operations in parallel on several different disk devices. We can increase reliability because several kinds of RAID keep track of additional information that helps to restore data after a system crash.

For example, assume we need a "fast" RAID system. First, it should be noted that RAID can parallelize data streams for physical devices only, so partitions in "fast" RAID systems need to be on different hard drives. If you are using IDE-RAID, be sure to remove all slave devices! Any such device will slow down data exchange for the other device on the cable, because in IDE it's impossible to maintain different data exchange rates with both devices on one cable.

For "reliable" RAID systems, you need to remember the above-mentioned IDE-RAID caveat too, though for another reason. Even if you have SCSI or some other type of device, don't place too many devices on one interface. For example, in the case of a 40 MB/s Ultra Wide (UW) interface, with hard drives that each sustain data streams of 10-12 MB/s, we shouldn't place more than 3-4 such disks on that cable.
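The arithmetic behind this rule of thumb can be sketched in a few lines of Python. The function name and the sample figures are illustrative assumptions, not measurements; both rates simply have to be given in the same unit.

```python
def max_disks_per_bus(bus_rate, disk_rate):
    """Return how many disks can stream at full speed before the bus saturates.

    Both rates must use the same unit; the function itself is a hypothetical
    helper for illustration only.
    """
    if bus_rate <= 0 or disk_rate <= 0:
        raise ValueError("throughput figures must be positive")
    return int(bus_rate // disk_rate)


# A bus rated at 40 units with disks sustaining 10 to 12 units each.
fastest_disks = max_disks_per_bus(40, 12)
slowest_disks = max_disks_per_bus(40, 10)
print(fastest_disks, slowest_disks)
```

Any disk beyond that count still works, but the disks then share the bus and none of them can stream at its full rate.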

Let's discuss "reliable" RAID some more, and just what that term means. You should never think that software RAID will protect you from all software problems and errors or will eliminate the necessity of backing up your system. Nothing could be further from the truth. Any RAID is a low-level function, and any data corruption done by the system will be invisible to the RAID functions and will be duplicated on the additional disks. The same holds true for any kind of disk error that cannot be detected by the controller.

You also shouldn't try to use RAID in place of an uninterruptible power supply (such as an APC unit)--once the electricity is off, some data exchange transactions on the disks could be in different stages of completion, and after the next reboot, the array will be out of sync. To minimize the probability of such trouble, some hardware RAIDs can be fitted with backup batteries.

So basically, here's what you should know: "reliable" RAID can help you to keep your data safe only in case of good disk hardware error detection, which depends on the "level" of RAID--something we will discuss in the next section.

Counting... 0, 1, 4, 5!

Linux supports these RAID levels very well. It also supports their combinations, for example the popular 0+1, which is sometimes called RAID 10. There is also Linear-mode, also known as "partitioned volumes."

A lot of literature and documentation exists about the different RAID levels, so we will go over them only briefly.

  • Linear-mode. Some would argue that this is not true RAID, because there is no "redundancy." In this mode the crash of any drive will destroy the whole array. It is used to unify small partitions on several disks into one big partition. Sometimes it can improve performance: for example, when we have a lot of small files, the ext2 file system tries to spread them all over the partition, so the files end up on different disks. A good choice for news servers and other servers with similar functionality.
  • RAID 0. Again, nothing impressive about reliability. It's the classic "fast" RAID--speed, speed, and more speed, especially when reading or writing big data streams. Of course, this is done at the expense of reliability. It's similar to Linear-mode, but the partitions are divided into so-called stripes or chunks, which are placed uniformly across the devices included in the array. The best choice for big database files, multimedia, streaming video, and so on. With medium file sizes and medium-size chunks, we will see almost no effect.
  • RAID 1. Famous and very popular since the advent of Novell NetWare "mirrors." Two or more partitions of the same size hold information that is mirrored between them. This is the first of the "reliable" RAID levels, and its reliability comes from the redundancy of the stored information, which is about 100% * (N - 1), where N is the number of disks in the array. When one disk fails (or even a few of them, if N > 2), it's usually possible to keep working with the others. The Linux RAID drivers can balance the read load by distributing requests among the "mirrors." Sometimes this helps to increase the read speed by almost N times. During writing, RAID 1 usually slows down because the same information has to be copied to multiple disks at once. Sometimes write performance can be N times worse than with just one disk. RAID 10 in reality is just Linux using two or more RAID 0 arrays combined into a single RAID 1. This is possible because a RAID array can be treated just like a normal block device or hard disk. Everything you can do with a disk, you see, you can do with an array. So, RAID 10 appears to be a good option when you have a lot of disks and controllers.
  • RAID 4. It's much like RAID 5, but the reliability is worse. As such, it's not popular, since anyone leaning toward the reliable end of the RAID spectrum is just as likely to "bump up" to RAID 5. You can find additional information in the documentation files.
  • RAID 5. A compromise in many cases. It provides the necessary reliability in case of a disk crash, offers rather good read/write speed, and is very economical with disk space. Total disk space equals (N - 1) * S, where N is the number of disks and S is the size of one partition. Be sure not to use RAID 5 if the number of disks is less than three--such a configuration doesn't have any advantages over RAID 1. You'll just get "dummy" performance and problems during recovery. RAID 5 is built from chunks as in RAID 0, but some of these chunks are parity blocks, distributed uniformly over all of the partitions of the array. If a disk crashes, the array continues to work, although at reduced speed, and the information from the broken disk is restored from the contents of the parity blocks. By adding a spare disk, the information can be reconstructed and the array fully restored to working order. It goes without saying that there is a way to do all that "on the fly" without turning off the system. However, because it is necessary to reorganize all the parity blocks, it's impossible to change the size of an existing RAID 5 array without removing all data on it. It is possible to achieve more reliability--for example, by organizing RAID 5 over three three-disk RAID 5 arrays, which keeps working after the simultaneous crash of any three disks. And so on. Of course, reliability can always be improved by giving up additional disk space.
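To make the space formulas and the recovery idea more concrete, here is a minimal Python sketch. It assumes equal-sized partitions and assumes that the redundant chunk is a bytewise XOR of the data chunks (the usual RAID 5 parity scheme). The function names are hypothetical, and this is not how the kernel driver is written.

```python
from functools import reduce


def usable_capacity(level, n_disks, part_size):
    """Usable space for n_disks equal partitions of part_size each."""
    if level in ("linear", "0"):
        return n_disks * part_size        # no redundancy, all space is usable
    if level == "1":
        return part_size                  # every mirror holds the same data
    if level in ("4", "5"):
        if n_disks < 3:
            # The article advises against parity arrays with fewer than three disks.
            raise ValueError("use at least three disks for RAID 4/5")
        return (n_disks - 1) * part_size  # one partition's worth goes to parity
    raise ValueError("unknown level: %r" % (level,))


def xor_blocks(blocks):
    """XOR equal-length byte strings together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


data = [b"AAAA", b"BBBB", b"CCCC"]   # chunks on three data positions
parity = xor_blocks(data)            # chunk stored on the remaining disk

# Pretend the disk holding data[1] died: rebuild it from the survivors.
rebuilt = xor_blocks([data[0], data[2], parity])
print(rebuilt == data[1])
print(usable_capacity("5", 4, 100))
```

The same XOR that produces the parity chunk also recovers any single missing chunk, which is why the array can keep running with one disk gone but not with two.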

What Are We Keeping There?

This is a good question. First of all--what don't we need to keep in an array? There is no sense in keeping our swap there, especially in a RAID 0 or RAID 5 configuration. Linux can put its swap on ordinary disks and will handle the swap space better that way. For example, the /etc/fstab configuration can look like:


/dev/sda2       swap           swap    defaults,pri=1   0 0
/dev/sdb2       swap           swap    defaults,pri=1   0 0
/dev/sdc2       swap           swap    defaults,pri=1   0 0
/dev/sdd2       swap           swap    defaults,pri=1   0 0

which means that partitions /dev/sda2 through /dev/sdd2 are used as swap with equal priority, and the system will balance the load across them itself.

The only exception to this approach is RAID 1--in this case, mirroring the swap partitions can help your system stay up longer. If a disk crashes, the computer will continue to work with the swap space on the mirror.

Should we place the root file system on the array and/or try to boot from it? I don't know the proper answer, and it's a never-ending dispute among system administrators. From my point of view, there is no benefit in such a configuration, only possible harm if you find you cannot boot at all.

In any case, it's kind of a moot point, since it is currently not possible to boot from any RAID level except RAID 1. Therefore, if you want to keep file systems (for some reason) on any other level of array, you will need to create a special separate partition (/boot) for kernel loading.

Also, I don't think it's a good idea to keep /usr on RAID 0 or RAID 5, because if the array fails or breaks down, you can easily lose all the useful system tools, and without them you will have really big problems trying to restore your system integrity.

There are also the file systems /home, /opt, /var, /tmp, /usr/local and others to consider. When planning RAID, remember that UNIX file systems like /home, /opt, and /usr/local usually keep "slow-changing" data, and file systems like /var keep "fast-changing" data. As for /tmp, well, we don't need to take care of it at all after a system crash. So, I recommend RAID 5 as the best choice for /home, /opt, and /usr/local, and for /var it's preferable to apply RAID 0 or RAID 10. Remember, everything you decide about RAID configuration should come from your system targets and common sense.

Setting It Up

The easiest way to create a RAID array is to do it during the installation of any new Linux distribution from the graphical installer. In Red Hat, the utility named Disk Druid suits our needs. You can create RAID partitions as easily as simple partitions; then you can combine them into one array and set its level. That's all!

However, sometimes Disk Druid is too "clever" and suggests a placement of partitions on the disks that goes absolutely against what any system administrator would want.

If this has happened to you, you can easily create the partitions with the fdisk command (don't forget to give them the Linux RAID partition type, 0xfd). Later, you can combine them into larger arrays with Disk Druid.

If you don't want to reinstall a distro, this may be the best way to start working with RAID anyway. Though, as with everything else in Linux -- the best of RAID can be achieved by editing its configuration file.

So, with your favorite text editor, create the file /etc/raidtab and type something like this for RAID Linear-mode:


raiddev               /dev/md0       # raid device name
raid-level            linear         # linear mode
nr-raid-disks         2              # number of used disks
chunk-size            32             # has no effect in linear mode
persistent-superblock 1              
# list of partitions below and their placement
device                /dev/sdb6      # partition name
raid-disk             0              # disk number in array
device                /dev/sdc5      # ...  and so on
raid-disk             1
# ... 

To create this array, we just need to execute:

mkraid /dev/md0

After that, we can look at /proc/mdstat to make sure the array is working. The device can be started with this command:

raidstart /dev/md0

and stopped with this command:

raidstop /dev/md0

Easy, isn't it?

Once the array is created, the device /dev/md0 can be used for placement of system files as usual--like with any other disk of the system. After reboot, this device will be auto-connected, without any raidstart (or raidstop) needed. You don't need to fix initialization scripts, you don't need to touch absolutely anything!
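Checking /proc/mdstat by eye is fine for one machine, but the check can also be scripted. The sketch below parses a sample that is only modeled on typical output for an active array. The sample text, the function names, and the assumption that each array line reads "mdN : state personality members..." are all illustrative; the exact layout differs between kernel versions.

```python
import re

# Illustrative sample only; real output depends on the kernel version.
SAMPLE_MDSTAT = """\
Personalities : [linear] [raid0] [raid1] [raid5]
md0 : active raid1 sdc5[1] sdb6[0]
      1048512 blocks [2/2] [UU]
unused devices: <none>
"""


def parse_mdstat(text):
    """Map each active md device name to its state, personality and members."""
    arrays = {}
    for line in text.splitlines():
        match = re.match(r"^(md\d+)\s*:\s*(\S+)\s+(\S+)\s+(.*)$", line)
        if match:
            name, state, level, members = match.groups()
            arrays[name] = {
                "state": state,
                "level": level,
                # "sdc5[1]" becomes "sdc5": drop the slot number in brackets.
                "members": [m.split("[")[0] for m in members.split()],
            }
    return arrays


def read_mdstat(path="/proc/mdstat"):
    """Parse the live file; only meaningful on a Linux system with md support."""
    with open(path) as handle:
        return parse_mdstat(handle.read())


print(parse_mdstat(SAMPLE_MDSTAT))
```

On a real system, read_mdstat() would be called in place of the sample, and an array that is missing from the result is a reason to look at the file directly.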

For RAID 0, the file /etc/raidtab can look like:


raiddev               /dev/md0               # as above
raid-level            0
nr-raid-disks         2
persistent-superblock 1
chunk-size            4                      # here the size matters; see the comments below
# everything else is the same as in the example above
device                /dev/sdb6
raid-disk             0
device                /dev/sdc5
raid-disk             1

The chunk-size argument in this case means the stripe size in kilobytes. For best performance (at least in this configuration), the partitions should be about the same size. The default value is 4KB; however, a higher value--about 32KB--will give better performance. Traditionally the value was chosen to be close to the size of a disk cylinder. With modern hard drives and their caches the best value can vary, and sometimes it is closer to the cache size. Creating, starting, and stopping the array uses the same methods as those described above.
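To see what the chunk size actually controls, the following Python sketch maps a byte offset in the array to a member disk. It is a simplified model that assumes equal-sized members and plain round-robin placement of chunks; the function name is hypothetical and the real driver has to handle more cases.

```python
def locate_in_stripe(offset_bytes, chunk_kb, n_disks):
    """Return (disk index, byte offset on that disk) for an offset in the array."""
    chunk_bytes = chunk_kb * 1024
    chunk_number = offset_bytes // chunk_bytes   # which chunk of the array
    disk = chunk_number % n_disks                # chunks rotate over the disks
    stripe_row = chunk_number // n_disks         # completed rounds so far
    offset_on_disk = stripe_row * chunk_bytes + offset_bytes % chunk_bytes
    return disk, offset_on_disk


# Two disks with 32KB chunks: consecutive chunks alternate between the disks.
for offset in (0, 32 * 1024, 64 * 1024, 100 * 1024):
    print(offset, locate_in_stripe(offset, chunk_kb=32, n_disks=2))
```

A large sequential read touches both disks in turn, which is where the speed comes from, while a file smaller than one chunk sits on a single disk and gains nothing.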

/etc/raidtab for RAID 1 will read:


raiddev               /dev/md
raid-level            1
nr-raid-disks         2
nr-spare-disks        1             
chunk-size            4     # doesn't matter
persistent-superblock 1
# as usual
device                /dev/sdb6
raid-disk             0
device                /dev/sdc5
raid-disk             1
# description of drives of "hot reserve"
device                /dev/sdd5
spare-disk            0 

When we have "hot reserve" disks and one of the "mirror" disks fails, a background process starts reconstructing the information from the healthy disk in the array onto the "hot reserve" disk, which then takes the place of the broken disk.

Finally, for RAID 5, the /etc/raidtab file will read:


raiddev /dev/md0
raid-level            5
nr-raid-disks         3    
nr-spare-disks        1   
persistent-superblock 1
parity-algorithm      left-symmetric  # it should be this way
chunk-size            128             # "good" value for the beginning
#
device                /dev/sda3
raid-disk             0
device                /dev/sdb1
raid-disk             1
device                /dev/sdc1
raid-disk             2
device                /dev/sdd5  # reserve disk
spare-disk            0

The reserve disk works the same way as in RAID 1. Array performance depends on the chunk size, so in this case you should use a larger value than for RAID 0. 128-256KB usually gives good results.
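Since all of these raidtab files follow the same pattern, a stanza can also be generated rather than typed. The Python sketch below is only an illustration: render_raidtab is a hypothetical helper, it emits only directives that appear in the listings above, and its default chunk size of 32 is an arbitrary choice.

```python
def render_raidtab(md_device, level, members, spares=(), chunk_size=32,
                   parity_algorithm=None):
    """Build one /etc/raidtab stanza as text from lists of partitions."""
    lines = [
        "raiddev               %s" % md_device,
        "raid-level            %s" % level,
        "nr-raid-disks         %d" % len(members),
    ]
    if spares:
        lines.append("nr-spare-disks        %d" % len(spares))
    lines.append("persistent-superblock 1")
    if parity_algorithm:
        lines.append("parity-algorithm      %s" % parity_algorithm)
    lines.append("chunk-size            %d" % chunk_size)
    # Active members are numbered from 0 in the order given.
    for index, partition in enumerate(members):
        lines.append("device                %s" % partition)
        lines.append("raid-disk             %d" % index)
    # Spares get their own numbering, also starting from 0.
    for index, partition in enumerate(spares):
        lines.append("device                %s" % partition)
        lines.append("spare-disk            %d" % index)
    return "\n".join(lines) + "\n"


stanza = render_raidtab("/dev/md0", 5,
                        ["/dev/sda3", "/dev/sdb1", "/dev/sdc1"],
                        spares=["/dev/sdd5"], chunk_size=128,
                        parity_algorithm="left-symmetric")
print(stanza)
```

The call shown reproduces the RAID 5 example; review the generated text before saving it as /etc/raidtab, because mkraid will act on whatever partitions are listed.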

It is important to remember that when formatting the file system with the mke2fs command, you have a special argument, stride, which affects how records are placed on the disks. Usually, the best value of this argument is chunk size / block size; e.g., with chunk-size = 128 (KB) and block-size = 4096 bytes, stride = 32.

You specify it this way (here for the array /dev/md0 from the examples above):


mke2fs -b 4096 -R stride=32 /dev/md0

Use the stride option only for RAID levels 0, 4, or 5. For Linear-mode and RAID 1, it doesn't make any sense.
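The stride value is easy to get wrong by a factor of two, so here is a small Python sketch of the calculation, assuming that stride is the chunk size divided by the file system block size. The function name is hypothetical.

```python
def ext2_stride(chunk_kb, block_bytes=4096):
    """Number of file system blocks that fit in one RAID chunk."""
    chunk_bytes = chunk_kb * 1024
    if chunk_bytes % block_bytes:
        raise ValueError("chunk size should be a multiple of the block size")
    return chunk_bytes // block_bytes


print(ext2_stride(128))        # 128KB chunks with 4096-byte blocks
print(ext2_stride(256))        # 256KB chunks with 4096-byte blocks
print(ext2_stride(32, 1024))   # 32KB chunks with 1024-byte blocks
```

The result is what goes after stride= on the mke2fs command line, together with the matching -b block size.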

Recovering RAID, Hot Upgrades, and Some Final Cautions

Usually this is the best way to recover an array after a disk failure:

  1. Shut down the system;
  2. Replace the "dead" disk;
  3. Start the system again;
  4. Type

    raidhotadd /dev/mdX /dev/h(s)dY

    where X and Y are, respectively, the number of the md device and the partition on the "new" disk;
  5. Wait until the array is automatically reconstructed.

"Hot" upgrades refer to replacing a broken hard disk "on the fly", without stopping the server. This is a very useful ability, especially for servers where even a little downtime could mean big trouble. This ability is often supported by expensive hardware controllers, but nothing prevents us from using it in software RAID.

But if your RAID is IDE--forget about it, it's impossible. You can destroy a drive even through unstable power or by simply switching the machine on and off, because the interface itself has no protection against this. Beyond that, a rescan of the IDE devices is absolutely necessary, and usually this can be done only by the PC's BIOS during booting.

With SCSI drives, it's a bit harder; but with special cables, disks, cutoff points, and powerful controllers you can achieve a hot upgrade. But before doing anything, you should look through the hardware documentation from the vendor, and check with the support team for the device if the docs aren't clear.

Finally, here are some very definite don'ts when working with RAID arrays:

  • Don't repartition disks whose partitions belong to a working array. Stop the array first. Otherwise it will only get worse.
  • Don't run fsck on the individual partitions of an array. You can easily knock the array out of sync, with problems and consequences both small and large. When you need to run fsck, first try to repair the RAID with the ckraid utility and its --fix option, and only then run fsck /dev/mdX, which is safer and more efficient.

In general, RAID is not that scary once you look into it more deeply. Nevertheless, with all aspects of RAID, you always need to remember one simple precept: always back up your files!

Copyright Jupitermedia Corp. All Rights Reserved.