StaQWare: High Availability for Cobalt RaQ3i Servers
Extending the Cobalt RaQ3i Server Clusters
Originally appearing in ISP Planet.
Anyone in the hosting business knows that high availability is absolutely required. The real questions: how much do you need, how do you ensure it, and how much should you spend doing so? A plethora of alternatives exist, from fault-tolerant servers to hardware load balancers to open source software. Those with a bankroll can spend anywhere from $7K to $70K on high-speed switching and traffic management boxes that transparently distribute load and bypass failed nodes. Enterprising admins with time but no cash can dig into heartbeat and takeover code being developed by the High-Availability Linux Project.
The High Availability Landscape
If cost were no object, our goal might be continuous availability -- non-stop service. Instead, most settle for high availability (HA) -- measures taken to promote high-percentage reliability, availability, and serviceability despite system and network failures. Very high availability can be achieved with fault-tolerant hardware -- expensive carrier-class boxes with redundant CPUs, storage, power supplies, and NICs.
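To put "high-percentage" availability in concrete terms, it helps to convert a target percentage into the downtime it allows. The sketch below is illustrative arithmetic only: the function name and the 365-day year are assumptions, and the percentages are not figures from any product discussed here. For example, a 99.9% target allows roughly 8.76 hours of downtime per year.

```python
HOURS_PER_YEAR = 365 * 24  # assumes a 365-day year (8760 hours)


def annual_downtime_hours(availability_percent: float) -> float:
    """Return the hours of downtime per year allowed by an availability target."""
    if not 0.0 <= availability_percent <= 100.0:
        raise ValueError("availability must be between 0 and 100 percent")
    unavailable_fraction = 1.0 - availability_percent / 100.0
    return unavailable_fraction * HOURS_PER_YEAR


if __name__ == "__main__":
    # Illustrative targets only, chosen to show how quickly the budget shrinks.
    for target in (99.0, 99.9, 99.99):
        hours = annual_downtime_hours(target)
        print(f"{target}% available: about {hours:.2f} hours down per year")
```

Each additional "nine" cuts the allowable downtime by a factor of ten, which is one way to frame the question of how much availability is enough.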
Alternatively, less expensive "non-fault-tolerant" servers can be clustered to increase availability by using supervisory software to detect an outage and initiate failover. Surviving cluster member(s) assume responsibility for services previously provided by the failed member. Inevitably, there is a brief service outage during failover -- the time it takes to detect failure and complete takeover. TCP connections may or may not survive this lapse. Address takeover or redirection techniques are generally used to make failover transparent to clients.
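The detect-then-take-over sequence can be sketched with a simple heartbeat timeout. This is a minimal illustration under stated assumptions, not the logic of StaQWare or of the High-Availability Linux Project: the class, the peer names, and the timeout values are all hypothetical. A peer is declared failed once it has been silent longer than the timeout, and the outage clients see is roughly the detection time plus the time needed to complete takeover.

```python
from dataclasses import dataclass, field


@dataclass
class HeartbeatMonitor:
    """Tracks the last heartbeat time seen from each cluster member."""

    timeout: float  # seconds of silence before a peer is declared failed (assumed)
    last_seen: dict = field(default_factory=dict)

    def record(self, peer: str, now: float) -> None:
        """Note that a heartbeat from `peer` arrived at time `now`."""
        self.last_seen[peer] = now

    def failed_peers(self, now: float) -> list:
        """Return peers whose last heartbeat is older than the timeout."""
        return sorted(
            peer for peer, seen in self.last_seen.items()
            if now - seen > self.timeout
        )


def estimated_outage(timeout: float, takeover_time: float) -> float:
    """Rough outage estimate: time to detect failure plus time to take over."""
    return timeout + takeover_time


if __name__ == "__main__":
    monitor = HeartbeatMonitor(timeout=3.0)
    monitor.record("raq-a", now=0.0)
    monitor.record("raq-b", now=0.0)
    monitor.record("raq-a", now=4.0)  # raq-b has gone silent
    print(monitor.failed_peers(now=4.0))
    print(estimated_outage(timeout=3.0, takeover_time=2.0))
```

A shorter timeout shrinks the outage window but raises the risk of a false failover when a heartbeat is merely delayed, so the value is a tuning decision.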
Some high availability solutions use a hot-standby or takeover pair, where one member is always active, the other always idle. Others provide load sharing across active cluster members -- for example, assigning primary responsibility for sites A-L to one server, M-Z to another. Some solutions even provide geographic load balancing across locations, adding another availability dimension -- the ability to survive full-site outage or WAN link failure.
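The A-L / M-Z split can be sketched as a small lookup with a fallback to a surviving member. This is a hypothetical illustration: the server names and the choice to partition on the first letter of the site name are assumptions, not a description of how any particular product assigns sites.

```python
from typing import Optional


def primary_for(site: str) -> str:
    """Return the server with primary responsibility for a site name.

    Sites starting with A-L go to server1; everything else (M-Z, digits)
    goes to server2. Both server names are hypothetical.
    """
    first = site[0].lower()
    return "server1" if "a" <= first <= "l" else "server2"


def serving_node(site: str, alive: set) -> Optional[str]:
    """Return the node that should serve `site`, given the live members."""
    primary = primary_for(site)
    if primary in alive:
        return primary
    # Primary is down: a surviving member takes over its sites.
    survivors = sorted(alive)
    return survivors[0] if survivors else None


if __name__ == "__main__":
    print(serving_node("example.com", {"server1", "server2"}))  # normal operation
    print(serving_node("example.com", {"server2"}))  # server1 has failed
```

Unlike a hot-standby pair, both members do useful work in normal operation, but the survivor must have enough headroom to carry the full load after a failure.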
Many high availability systems serve up content -- files, static web pages, dynamic data. Some solutions rely on shared storage; others use synchronization to ensure transparent failover and recovery without loss of data. For example, RAID level 1 mirrors data on redundant drives, sometimes on separate servers (peer-to-peer RAID 1). Real-time synchronization may be irrelevant or critical, depending upon the service. Failover from one mail server to another isn't useful if the current SMTP queue, user accounts, and mailboxes aren't accessible to the standby server. But static web content, updated nightly, may not benefit from 24x7 synchronization.
Each HA solution makes a tradeoff between capital investment, cost of operation, performance overhead, and achieved reliability, availability, and serviceability (RAS). Deciding how much availability is enough for your business and your customers is the first step. Understanding what each alternative does -- and doesn't -- do is the next. To that end, we took a hands-on look at StaQWare in our test lab.