StaQWare: High Availability for Cobalt RaQ3i Servers
Extending the Cobalt RaQ3i Server Clusters

Lisa Phifer
Tuesday, September 19, 2000 10:26:45 AM
Originally appearing in ISP Planet.
Anyone in the hosting business knows that high availability is absolutely
required. The real questions are how much you need, how to ensure it,
and how much to spend doing so. A plethora of alternatives exists,
from fault-tolerant servers to hardware load balancers to open source
software. Those with a bankroll can spend anywhere from $7K to $70K on
high-speed switching and traffic management boxes that transparently distribute
load and bypass failed nodes. Enterprising admins with time but no cash
can dig into heartbeat and takeover code being developed by the High-Availability
Linux Project.
Near the low end of this price/feature spectrum is StaQWare, a high-availability
solution for Cobalt RaQ3i server clusters. This $999 software add-on provides
continuous system monitoring, unattended failover, and real-time data
synchronization for a redundant pair of identical RaQ3i servers. Providers using
this high-density 1U Cobalt box to host customer web, file, mail, and
name services can employ StaQWare to ensure up to 99.9 percent availability.
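
To put a figure like 99.9 percent in perspective, it helps to convert an availability percentage into a yearly downtime budget. The short Python sketch below does that arithmetic; the function name and the 365-day year are illustrative assumptions, not anything defined by StaQWare.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years (an assumption)


def downtime_hours_per_year(availability_percent: float) -> float:
    """Return the hours of downtime per year allowed at a given availability."""
    unavailable_fraction = 1 - availability_percent / 100
    return HOURS_PER_YEAR * unavailable_fraction


# Compare a few common availability targets.
for pct in (99.0, 99.9, 99.99):
    print(f"{pct}% availability allows {downtime_hours_per_year(pct):.2f} hours down per year")
```

At 99.9 percent, the budget works out to roughly 8.76 hours of outage per year, or a little under 44 minutes per month.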
The High Availability Landscape
If cost were no object, our goal might be continuous availability -- non-stop
service. Instead, most settle for high availability (HA) -- measures taken
to promote high-percentage reliability, availability, and serviceability
despite system and network failures. Very high availability can be achieved
with fault-tolerant hardware -- expensive carrier-class boxes with redundant
CPUs, storage, power supplies, and NICs.
Alternatively, less expensive "non-fault-tolerant" servers can be clustered
to increase availability by using supervisory software to detect an outage
and initiate failover. Surviving cluster member(s) assume responsibility
for services previously provided by the failed member. Inevitably, there
is a brief service outage during failover -- the time it takes to detect
failure and complete takeover. TCP connections may or may not survive
this lapse. Address takeover or redirection techniques are generally used
to make failover transparent to clients.
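
The detect-then-take-over cycle described above can be sketched as a simple heartbeat timeout. This is a minimal illustration in Python, not the mechanism used by StaQWare or the High-Availability Linux Project; the class name, the 3-second timeout, and the `active` flag are all hypothetical. Timestamps are passed in explicitly so the logic is easy to follow and test.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class HeartbeatMonitor:
    """Runs on the standby node and watches for heartbeats from its peer."""
    timeout_s: float = 3.0             # assumed silence threshold, in seconds
    last_seen: Optional[float] = None  # time of the most recent heartbeat
    active: bool = False               # True once this node has taken over

    def record_heartbeat(self, now: float) -> None:
        """Note that the peer was alive at time `now`."""
        self.last_seen = now

    def peer_failed(self, now: float) -> bool:
        """The peer is presumed failed once it has been silent too long."""
        if self.last_seen is None:
            return False  # never heard from the peer, so do not take over yet
        return now - self.last_seen > self.timeout_s

    def check(self, now: float) -> bool:
        """Take over if the peer looks dead; return whether this node is active."""
        if not self.active and self.peer_failed(now):
            # A real cluster would claim the service address and start services here.
            self.active = True
        return self.active


monitor = HeartbeatMonitor(timeout_s=3.0)
monitor.record_heartbeat(now=0.0)
print(monitor.check(now=2.0))  # False: peer was heard from 2 seconds ago
print(monitor.check(now=3.5))  # True: peer silent for more than 3 seconds
```

The timeout is the tradeoff knob: a shorter value shrinks the outage window during failover but raises the risk of a false takeover when the peer is merely slow.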
Some high availability solutions use a hot-standby or takeover pair,
where one member is always active, the other always idle. Others provide
load sharing across active cluster members -- for example, assigning
primary responsibility for sites A-L to one server, M-Z to another. Some
solutions even provide geographic load balancing across locations, adding
another availability dimension -- the ability to survive a full-site outage
or WAN link failure.
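
The A-L / M-Z split mentioned above might look like the following Python sketch, in which each site has a primary server and a surviving member picks up the load when that primary is down. The server names, site names, and helper functions are hypothetical, chosen only to illustrate the idea.

```python
def primary_for(site: str) -> str:
    """Assign sites starting with A-L to server1, everything else to server2."""
    first = site.strip().lower()[:1]
    return "server1" if "a" <= first <= "l" else "server2"


def serving_node(site: str, alive: set) -> str:
    """Return the node that should serve `site`, given the set of live nodes."""
    primary = primary_for(site)
    if primary in alive:
        return primary
    if not alive:
        raise RuntimeError("no cluster members are available")
    # The primary is down: a surviving member assumes its responsibilities.
    return sorted(alive)[0]


both_up = {"server1", "server2"}
print(serving_node("acme.example", both_up))      # server1
print(serving_node("zeta.example", both_up))      # server2
print(serving_node("acme.example", {"server2"}))  # server2, after server1 fails
```

Unlike a hot-standby pair, both servers carry traffic in normal operation, so the survivor must have enough headroom to absorb its partner's sites during an outage.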
Many high availability systems serve up content -- files, static web
pages, dynamic data. Some solutions rely on shared storage; others use
synchronization to ensure transparent failover and recovery without loss
of data. For example, RAID level 1 mirrors data on redundant drives, sometimes
on separate servers (peer-to-peer RAID 1). Real-time synchronization may
be irrelevant or critical, depending upon the service. Failover from one
mail server to another isn't useful if the current SMTP queue, user accounts,
and mailboxes aren't accessible to the standby server. But static web
content, updated nightly, may not benefit from 24x7 synchronization.
Each HA solution makes a tradeoff between capital investment, cost of
operation, performance overhead, and achieved reliability, availability, and
serviceability (RAS). Deciding how much availability
is enough for your business and your customers is the first step. Understanding
what each alternative does -- and doesn't -- do is the next. To that end,
we took a hands-on look at StaQWare in our test lab.