Tolerating Fault in an Intolerant World - page 2
The Limitations of Clustering
NEC's Express5800 Fault Tolerant Linux is the company's first offering built on this technology, and it makes a simple promise: with proper care and planning, you can't bring this machine down.
Big words, but here's a watered-down version of what's going on inside one of these machines.
The 5800/ft is really just a big metal box that contains two separate, modular systems, each of which can be released from the box with just the flip of a couple of latches. The two systems are connected to each other by ASICs at two levels. There are also two power supplies and two CPU modules, each of which contains a dual-SMP set of processors and all physical memory.
There is also redundancy among the PCI I/O modules, which house all of the system's I/O hardware. So bundled up in all of this are the USB, network, and SCSI adapters, as well as the PCI slot for external adapters. All of this hardware, besides being redundant, runs on what NEC calls "hardened" drivers--drivers that are tightened up with better error-handling to the point that they are very hard to break.
It is this redundancy between the two systems that makes the 5800/ft so highly available. The two sides are so tightly interconnected that if you physically remove an entire module from the 5800/ft box, there is zero downtime. The system maintains the exact same memory state on each side, so that not even the state of an in-flight transaction is lost. Given the huge uptime demands of some transactional systems, that zero loss of memory state may be one of the biggest draws of the system.
While the systems are running, a number of different diagnostics run continually on each "side" of the box. If one side gets out of sync with the other, the error-handling processes kick in, immediately determine the nature of the problem, and take the faulty side out of service while at the same time alerting the system administrator.
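To make the idea concrete, here is a minimal Python sketch of two mirrored sides with a comparison check after each write. It is an illustration only, not NEC's implementation: the names `Module`, `RedundantPair`, `self_test`, and the `faulty` flag are all hypothetical, and the self-test (checking the value against what was supposed to be written) stands in for whatever hardware diagnostics the real machine uses to decide which side is at fault.

```python
from dataclasses import dataclass, field


@dataclass
class Module:
    """One 'side' of the box (hypothetical model, not NEC's design)."""
    name: str
    memory: dict = field(default_factory=dict)
    online: bool = True
    faulty: bool = False  # test hook that simulates a hardware fault

    def write(self, key, value):
        # A faulty module silently stores the wrong value.
        self.memory[key] = None if self.faulty else value

    def self_test(self, key, expected):
        # Stand-in for real diagnostics: did the write land correctly?
        return self.memory.get(key) == expected


class RedundantPair:
    """Mirrors every write to both sides and isolates a side that diverges."""

    def __init__(self):
        self.sides = [Module("A"), Module("B")]
        self.alerts = []  # messages that would go to the administrator

    def write(self, key, value):
        active = [m for m in self.sides if m.online]
        for m in active:
            m.write(key, value)
        # Diagnostic: the sides are out of sync if their memory differs.
        if len({repr(m.memory) for m in active}) > 1:
            for m in active:
                if not m.self_test(key, value):
                    m.online = False
                    self.alerts.append(f"module {m.name} isolated")

    def read(self, key):
        # Any online side can answer, since online sides hold the same state.
        for m in self.sides:
            if m.online:
                return m.memory[key]
        raise RuntimeError("no module online")
```

In this sketch, a read keeps working after one side is isolated because the surviving side already holds the full memory state, which is the property the article attributes to the real system.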
To make all this work efficiently, NEC's developers went to work on the Linux 2.4.2 kernel to harden it significantly. A new PCI hot-plug (PHP) module was added, memory allocation and release functions were improved, and interrupt vectors were redirected from the earlier boot stages, just to list a few of the hardening features added.
With all of these redundancies and hardened software features added to these fault-tolerant systems, it is indeed hard to imagine how such a system could ever be totally brought down. If a module does fail, a company could have a new module inserted into the box within 72 hours. Depending on the level of support, turn-around time for a module could be even less--even to the point of having a spare module on-site.
Memory management in these systems is enhanced to the point that, when a new module is inserted, the memory of the original module can be copied over to the new module in less than one second.
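The article does not say how NEC performs that copy. One common way to copy memory from a system that keeps running is to copy every page once and then re-copy any page that was written in the meantime. The Python sketch below shows that general idea under stated assumptions: the function name `resync`, the list-of-pages memory model, and the `pending_writes` argument (which simulates the live system writing during the copy) are all hypothetical.

```python
def resync(source, pending_writes, page_count):
    """Copy every page to a new module, then re-copy pages dirtied meanwhile.

    source         -- list of page values on the surviving module
    pending_writes -- (page_index, value) pairs applied while copying
    page_count     -- number of pages to copy
    """
    target = [None] * page_count
    dirty = set()
    writes = iter(pending_writes)

    for page in range(page_count):
        target[page] = source[page]
        # Simulate the live system writing while the copy is in progress.
        write = next(writes, None)
        if write is not None:
            index, value = write
            source[index] = value
            dirty.add(index)

    # Apply any writes that arrived after the first pass finished.
    for index, value in writes:
        source[index] = value
        dirty.add(index)

    # Second pass: bring the pages that changed during the copy up to date.
    for index in dirty:
        target[index] = source[index]
    return target
```

For example, `resync([0, 1, 2, 3], [(0, 99), (3, 7)], 4)` returns a target equal to the updated source, even though page 0 was overwritten after it had already been copied.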