Tolerating Fault in an Intolerant World
Introducing Unstoppable Linux

Brian Proffitt
Monday, December 23, 2002 09:49:05 AM
NEC's Express5800 Fault Tolerant Linux is the first offering from the
company for this technology, which makes a simple promise: with proper
care and planning, you can't bring this machine down.
Big words, but here's a watered-down version of what's going on inside
one of these machines.
The 5800/ft is really just a big metal box that contains two separate,
modular systems that can be released from the box with just the flip
of a couple of latches. The two systems are connected to each other by
ASIC chips at two levels. There are also two power supplies and two CPU
modules that each contain a dual-SMP set of processors and all
physical memory.
There is also redundancy among the PCI I/O modules, which house all of
the system's I/O hardware. So bundled up in all of this are the USB,
network, and SCSI adapters, as well as the PCI slot for external
adapters. All of this hardware, besides being redundant, uses what NEC
calls "hardened" drivers to run--drivers that are tightened up with
better error-handling to the point that they are very hard to break.
It is this redundancy between the two systems that makes the machine
so highly available. The memory on the two sides is so tightly coupled that
if you physically remove an entire module from the 5800/ft box, there
is zero downtime. The system maintains the exact same memory state on
each system, so that not even the time-state of any one given
transaction is lost. Given the heavy uptime demands of some
transactional systems, that zero loss of memory state may be one of
the biggest draws of the system.
While each system is running, there are a number of different
diagnostics that are continually run on each "side" of the entire
box. If one side gets out of sync with the other, the error-handling
processes kick in, immediately determine the nature of the problem,
and pull the faulty system out of service while at the same time
alerting the system administrator.
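The detect-and-isolate behavior described above can be illustrated with
a small sketch. This is not NEC's implementation, which runs in hardware
and hardened drivers; it is a toy model assuming that both sides perform
the same operation, that a comparator checks their states after each
step, and that a self-test identifies the bad side. The names Module,
LockstepPair, self_test, and alerts are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Module:
    """One 'side' of the box (hypothetical model, not NEC's design)."""
    name: str
    state: int = 0
    healthy: bool = True  # stands in for the result of real diagnostics
    online: bool = True

    def step(self, value: int) -> None:
        # A faulty module corrupts its result; a healthy one does not.
        self.state += value if self.healthy else value + 1

    def self_test(self) -> bool:
        return self.healthy


@dataclass
class LockstepPair:
    a: Module
    b: Module
    alerts: List[str] = field(default_factory=list)

    def run(self, value: int) -> int:
        # Both online sides execute the same operation.
        active = [m for m in (self.a, self.b) if m.online]
        for m in active:
            m.step(value)
        # Compare states only while both sides are still in service.
        if len(active) == 2 and self.a.state != self.b.state:
            self._isolate_faulty()
        survivor = next(m for m in (self.a, self.b) if m.online)
        return survivor.state

    def _isolate_faulty(self) -> None:
        # Take the first side that fails its self-test out of service
        # and record an alert for the administrator.
        for m in (self.a, self.b):
            if not m.self_test():
                m.online = False
                self.alerts.append(f"{m.name} out of sync; removed from service")
                break
```

In this model, a fault on side B leaves side A running with its state
intact, which is the property the article describes: the caller of run()
never sees an interruption, only an alert in the log.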
To make all this work efficiently, NEC's developers went to work on
the Linux 2.4.2 kernel to harden it significantly. A new PHP (PCI Hot
Plug) module was added, memory allocation and release functions were
improved, and
interrupt vectors were redirected from the earlier boot stages, just
to list a few of the hardening features added.
With all of these redundancies and hardened software features added to
these fault-tolerant systems, it is indeed hard to imagine how such a
system could ever be totally brought down. If a module does fail, a
company could have a new module inserted into the box within 72
hours. Depending on the level of support, turn-around time for a
module could be even less--even to the point of having a spare module
on-site.
The memory management of one of these systems is so enhanced that if a
new module is inserted into the system, the memory of the original
module can be copied over to the new module in less than one second.
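As a rough illustration of that resynchronization step, the sketch below
copies a memory image page by page from the surviving module to a newly
inserted one and verifies the copy before the new module would be
brought into service. It is a simplification under stated assumptions:
the real machine does this in hardware while the system keeps running,
whereas this sketch assumes memory is not changing during the copy. The
function name resync and the 4096-byte page size are hypothetical.

```python
import hashlib


def resync(source: bytearray, page_size: int = 4096) -> bytearray:
    """Copy a memory image to a fresh module and verify it (toy model)."""
    target = bytearray(len(source))
    # Copy one page at a time, as a stand-in for a hardware block copy.
    for off in range(0, len(source), page_size):
        target[off:off + page_size] = source[off:off + page_size]
    # Only an exact match qualifies the new module for lockstep operation.
    if hashlib.sha256(source).digest() != hashlib.sha256(target).digest():
        raise RuntimeError("resync failed: memory images differ")
    return target
```

The verification step reflects the requirement stated earlier in the
article that both sides hold exactly the same memory state.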