StaQWare: High Availability for Cobalt RaQ3i Servers
Extending the Cobalt RaQ3i Server Clusters
Originally appearing in ISP Planet.
Anyone in the hosting business knows that high availability is absolutely required. The real questions are how much you need, how to ensure it, and how much to spend doing so. A plethora of alternatives exist, from fault-tolerant servers to hardware load balancers to open source software. Those with a bankroll can spend anywhere from $7K to $70K on high-speed switching and traffic management boxes that transparently distribute load and bypass failed nodes. Enterprising admins with time but no cash can dig into heartbeat and takeover code being developed by the High-Availability Linux Project.
Near the low end of this price/feature spectrum is StaQWare, a high-availability solution for Cobalt RaQ3i server clusters. This $999 software add-on provides continuous system monitoring, unattended fail-over, and real-time data synchronization for a redundant pair of identical RaQ3i's. Providers using this high-density 1U Cobalt box to host customer web, file, mail, and name services can employ StaQWare to ensure up to 99.9 percent availability.
The High Availability Landscape
If cost were no object, our goal might be continuous availability -- non-stop service. Instead, most settle for high availability (HA) -- measures taken to promote high-percentage reliability, availability, and serviceability (RAS) despite system and network failures. Very high availability can be achieved with fault-tolerant hardware -- expensive carrier-class boxes with redundant CPUs, storage, power supplies, and NICs.
Alternatively, less expensive "non-fault-tolerant" servers can be clustered to increase availability by using supervisory software to detect an outage and initiate failover. Surviving cluster member(s) assume responsibility for services previously provided by the failed member. Inevitably, there is a brief service outage during failover -- the time it takes to detect failure and complete take-over. TCP connections may or may not survive this lapse. Address takeover or redirection techniques are generally used to make failover transparent to clients.
Some high availability solutions use a hot-standby or takeover pair, where one member is always active, the other always idle. Others provide load sharing across active cluster members -- for example, assigning primary responsibility for sites A-L to one server, M-Z to another. Some solutions even provide geographic load balancing across locations, adding another availability dimension -- the ability to survive full-site outage or WAN link failure.
Many high availability systems serve up content -- files, static web pages, dynamic data. Some solutions rely on shared storage; others use synchronization to ensure transparent failover and recovery without loss of data. For example, RAID level 1 mirrors data on redundant drives, sometimes on separate servers (peer-to-peer RAID 1). Real-time synchronization may be irrelevant or critical, depending upon the service. Failover from one mail server to another isn't useful if the current SMTP queue, user accounts, and mailboxes aren't accessible to the standby server. But static web content, updated nightly, may not benefit from 24x7 synchronization.
Each HA solution makes a tradeoff between capital investment, cost of operation, performance overhead, and achieved RAS. Deciding how much availability is enough for your business and your customers is the first step. Understanding what each alternative does -- and doesn't -- do is the next. To that end, we took a hands-on look at StaQWare in our test lab.
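To put an availability percentage such as the 99.9 percent figure quoted earlier into operational terms, it helps to convert it into permitted downtime per year. The following is a minimal sketch in Python; the function name is our own, and the calculation assumes a 365-day year.

```python
HOURS_PER_YEAR = 365 * 24  # assumption: non-leap year, 8760 hours


def downtime_hours_per_year(availability_percent: float) -> float:
    """Hours of downtime per year allowed by an availability target."""
    return HOURS_PER_YEAR * (1 - availability_percent / 100)


if __name__ == "__main__":
    for pct in (99.0, 99.9, 99.99):
        hours = downtime_hours_per_year(pct)
        print(f"{pct}% availability allows about {hours:.2f} hours of downtime per year")
```

At 99.9 percent, the budget works out to roughly 8.76 hours of downtime a year, which is a useful yardstick when weighing how long any single outage may last.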
Adding HA To Your RaQ3i
StaQWare lets you quickly create a high-availability cluster of two RaQ3i servers, configured as an active/standby pair. The RaQ3i's must have identical internal disks and dual NICs -- only RAM can vary. External storage and non-RaQ3i servers cannot be used with StaQWare. No hardware changes are required to your LAN or the active RaQ3i -- StaQWare is truly a "drop in" upgrade. Because the standby RaQ3i is wiped clean during installation, most will purchase a StaQWare license ($999) and a standby RaQ3i (from $2,499) for each existing RaQ3i to be upgraded with HA.
RaQ3i's have dual 10/100 NICs. "Network 1" NICs connect both servers to your production LAN for normal traffic and availability monitoring. "Network 2" NICs connect the pair back-to-back for real-time data synchronization (peer-to-peer RAID). Cobalt strongly recommends a cross-over cable be used for Network 2. We found no reason not to: RaQ3i's with StaQWare do not listen to any other protocol on this NIC, and synchronization is much slower if a hub is used. Supplied diagrams clearly illustrate cabling alternatives.
Installation is simple. Cable the units and power them on. Use the new (standby) RaQ3i's LCD control panel to configure its Network 1 address. Access the Server Management GUI on the existing (active) RaQ3i at http://activeIP:81/.cobalt/sysManage. Select Maintenance / Install Software and verify that the OS is version 2.0 or later. If not, upgrade the OS using the supplied CD. We subsequently upgraded to version 3.0 and strongly recommend that all users do so before StaQWare installation -- we'll explain why later.
StaQWare is installed just like any other RaQ software upgrade: select the file supplied on the StaQWare CD using Maintenance / Install Software. After reboot, use this page to jump to HA Cluster Management or go to http://activeIP:81/.cobalt/cobalt-ha directly. There, select Cluster Network to launch a 3-step configuration wizard.
To configure StaQWare, you'll need the admin login/password and six (6) IP addresses:
While c) can be private, a) and b) must be routable public addresses. Address a) on the active RaQ3i is assumed by the standby RaQ3i during failover -- this is the external address used by clients to reach the server cluster. The other addresses must be distinct from virtual site addresses hosted by this server.
After initial configuration, the active RaQ3i contacts the standby RaQ3i and begins to mirror stored data to the standby, a process that can take up to 3 hours. During this time, the active RaQ3i remains in service, supporting web, file, and mail clients.
Cluster Status and Control
Once the cluster is created, all HA Cluster Management occurs through the active RaQ3i. The standby cannot be managed through either NIC -- although it can be monitored by pinging address b).
A Cluster Status page indicates whether HA services are available and, if not, why not. As a rule, HA services are unavailable ("degraded") whenever the standby RaQ3i is unreachable, data has not yet been (re)synchronized, or HA has been administratively disabled.
When HA services are available, the active and standby continuously ping each other over Network 1, and synchronize data using peer-to-peer RAID over Network 2. If the standby cannot ping the active for a configurable period (by default, 5 seconds), it initiates failover. If the active cannot ping the standby for this period, it declares HA services unavailable. This "outage tolerance" can be adjusted up to 5 minutes to accommodate frequent-but-normal loss of reachability.
HA services can also be disabled for up to 24 hours to permit maintenance of the standby. For scheduled maintenance of the active, manually initiate failover, then disable HA services.
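The outage-tolerance rule described above can be modelled in a few lines. This is only an illustrative sketch, not StaQWare's actual code: the class `PeerMonitor`, the helper `first_action`, and the choice to start the clock at the first observation are all our assumptions. Only the 5-second default, the 5-minute ceiling, and the two resulting actions come from the behavior described in this article.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Tuple

MAX_TOLERANCE_S = 5 * 60  # the article gives 5 minutes as the upper limit


@dataclass
class PeerMonitor:
    """Hypothetical model of one node watching its peer over Network 1."""

    role: str                        # "active" or "standby"
    tolerance_s: float = 5.0         # default outage tolerance per the article
    last_ok: Optional[float] = None  # time of the last successful ping

    def __post_init__(self) -> None:
        if self.role not in ("active", "standby"):
            raise ValueError("role must be 'active' or 'standby'")
        if not 0 < self.tolerance_s <= MAX_TOLERANCE_S:
            raise ValueError("tolerance must be above 0 and at most 300 seconds")

    def record(self, now: float, ping_ok: bool) -> Optional[str]:
        """Record one ping result; return an action once the peer has been
        unreachable for the whole tolerance period, otherwise None."""
        if ping_ok or self.last_ok is None:
            # Assumption: the tolerance clock starts at the first observation.
            self.last_ok = now
            return None
        if now - self.last_ok >= self.tolerance_s:
            if self.role == "standby":
                return "initiate failover"
            return "declare HA services unavailable"
        return None


def first_action(monitor: PeerMonitor,
                 pings: Iterable[Tuple[float, bool]]) -> Optional[Tuple[float, str]]:
    """Feed (time, ping_ok) samples to the monitor; return the first action."""
    for now, ok in pings:
        action = monitor.record(now, ok)
        if action is not None:
            return now, action
    return None


if __name__ == "__main__":
    # The peer answers at t = 0, 1, 2 seconds and then goes silent.
    samples = [(float(t), t <= 2) for t in range(12)]
    print(first_action(PeerMonitor("standby"), samples))
    print(first_action(PeerMonitor("active", tolerance_s=8.0), samples))
```

With the default tolerance, the simulated standby reacts five seconds after the last good ping. Raising the tolerance delays the reaction, which is the tradeoff the setting exposes: fewer false alarms in exchange for slower detection.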
Synchronization -- Slow But Sure
Active and standby RaQ3i's try to maintain data synchronization all the time, so that the standby can jump in immediately and replace the active without loss of data. Unfortunately, when a failed or disabled unit returns to service, synchronization is lost. If just one block of data differs, the standby must be completely resynchronized with the active. We found 90 minutes typical when syncing 20.4 GB disks over a 100 Mbps cross-connect. When we replaced this cross-connect with a 10 Mbps hub, the period quadrupled.
The good news: the active RaQ3i continues to serve requests during resynchronization, and we encountered no data corruption or loss when we interrupted this process -- it simply restarted from scratch. Web transactions and mail transfers completed before failover were mirrored to the standby and available after failover. The bad news: during resynchronization, HA services are (naturally) unavailable, leaving the active vulnerable to failure.
Failover -- When, Why, and How?
Failover from active to standby is initiated automatically when StaQWare detects a problem (left). What kinds of problems can StaQWare detect, and how quickly can service be restored?
If the standby cannot reach the active over Network 1, or failover is initiated manually, the active is shut down, the standby takes over the active's address, and a mail notification is sent to the administrator. We found web, FTP, DNS and mail service restored within just 1-2 minutes, using default tolerance settings. The active must be manually powered off and on. This implies that HA services remain unavailable until someone physically touches the box to restart it, and resynchronization completes.
If the active cannot reach the standby over Network 1, HA services are unavailable but real-time synchronization continues over Network 2. When Network 1 reachability is restored, HA services resume immediately, without further resynchronization.
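The resynchronization timings reported in the synchronization discussion can be turned into effective throughput with some quick arithmetic. The sketch below is ours, not part of StaQWare. It assumes decimal gigabytes (10^9 bytes), a copy of the entire 20.4 GB disk, and takes "quadrupled" to mean 360 minutes; the function names are hypothetical.

```python
def effective_mbps(disk_gb: float, minutes: float) -> float:
    """Average throughput in Mbps implied by copying disk_gb in the given time.

    Assumes decimal gigabytes (10**9 bytes) and a full-disk copy.
    """
    bits = disk_gb * 1e9 * 8
    return bits / (minutes * 60) / 1e6


def wire_time_minutes(disk_gb: float, link_mbps: float) -> float:
    """Minimum transfer time in minutes if the link ran at its nominal rate."""
    bits = disk_gb * 1e9 * 8
    return bits / (link_mbps * 1e6) / 60


if __name__ == "__main__":
    print(f"cross-connect: {effective_mbps(20.4, 90):.1f} Mbps effective, "
          f"{wire_time_minutes(20.4, 100):.1f} min at nominal 100 Mbps")
    print(f"10 Mbps hub:   {effective_mbps(20.4, 360):.1f} Mbps effective, "
          f"{wire_time_minutes(20.4, 10):.1f} min at nominal 10 Mbps")
```

Under these assumptions, the 90-minute resync corresponds to roughly 30 Mbps against a nominal minimum of about 27 minutes, while the hub case runs at about 7.6 Mbps against a nominal minimum of 272 minutes. One plausible reading, which we did not verify, is that the hub case is close to link-bound whereas the cross-connect case is limited by something other than the wire.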
If the active and standby cannot reach each other over Network 2, the active continues to service clients over Network 1 but drops into a non-HA state. When Network 2 reachability is restored, a full resynchronization is required before HA services are available.
If the active and standby cannot reach the default gateway, there is no point in initiating failover, because both servers are configured with the same gateway. Thus, StaQWare cannot be used to increase availability through diverse routing. StaQWare indicates the problem, but does so in a confusing manner. The standby LCD displays "RaQ is not in HA *check cabling*" -- nearly always sound advice. But GUI status and mail notifications indicate "Standby RaQ has failed or been disconnected". One cannot determine the real problem without actually looking at the standby's LCD panel.
We tested network failure by yanking Ethernet cables. We also tested abrupt power failures. When the active loses power, the standby checks its own file system, then takes over the active's address and restores service (in our experience, within 6-7 minutes). When the standby loses power, HA services become unavailable until the standby is restored and data is completely resynchronized.
If both units lose power? The active and standby search for each other upon restart. After 5 minutes, if the standby cannot locate the active, the standby takes over, restoring service within about 7 minutes. For faster restoration, the standby's LCD panel can be used to boot in active mode.
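The failure scenarios covered in this section can be condensed into a small lookup table for quick reference. This is a summary sketch of what we observed, not a StaQWare interface: the event labels, the `Outcome` type, and the `summarize` helper are hypothetical names chosen for illustration, and `None` marks cases where our tests did not establish whether a full resynchronization follows.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class Outcome:
    failover: bool               # does the standby take over?
    full_resync: Optional[bool]  # None where the article does not say
    note: str


# Keys are hypothetical labels chosen for this sketch.
OUTCOMES: Dict[str, Outcome] = {
    "standby_cannot_reach_active_net1": Outcome(
        True, True, "service back in 1-2 minutes; old active must be power-cycled by hand"),
    "active_cannot_reach_standby_net1": Outcome(
        False, False, "HA unavailable; synchronization continues over Network 2"),
    "network2_down": Outcome(
        False, True, "active keeps serving clients in a non-HA state"),
    "gateway_unreachable": Outcome(
        False, None, "no failover; both servers share the same gateway"),
    "active_power_loss": Outcome(
        True, True, "standby checks its file system, then takes over in 6-7 minutes"),
    "standby_power_loss": Outcome(
        False, True, "HA unavailable until the standby is restored"),
    "both_power_loss": Outcome(
        True, None, "standby takes over if it cannot find the active within 5 minutes"),
}


def summarize(event: str) -> str:
    """Return a one-line description of the observed outcome for an event."""
    outcome = OUTCOMES[event]
    action = "failover" if outcome.failover else "no failover"
    resync = {True: "full resync needed", False: "no resync needed",
              None: "resync not stated"}[outcome.full_resync]
    return f"{event}: {action}, {resync} ({outcome.note})"


if __name__ == "__main__":
    for name in OUTCOMES:
        print(summarize(name))
```

A table like this makes one pattern easy to see: only the loss of Network 1 reachability from the active's side is recoverable without a full resynchronization.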
Out Of Bounds
In short, StaQWare provides system-level monitoring -- it can improve availability in the event of power failure, malfunctioning NICs, failed disk writes (not tested), and faults in directly-connected cabling. But it's also important to understand what an inexpensive "drop-in" solution like StaQWare does not do.
StaQWare does not monitor service availability -- individual web, file, mail, or name services can fail without initiating takeover. It cannot circumvent WAN failures -- unreachable gateways, broken access links. For that kind of high availability, you'll need to consider more expensive and complex solutions -- for example, L4/L7 switches and geographic load balancers. Such products often start around $20K (a few less, others considerably more).
StaQWare also does not provide server load balancing, and cannot be used to increase availability of heterogeneous server clusters. Again, one can consider hardware server load balancers -- or balancing requests across clusters using round-robin DNS. If your budget is limited, consider inexpensive cross-platform HA software like ATG's Mod_Redundancy or TwinCom's Network Disk Mirror. Few software products will integrate as smoothly with Cobalt as StaQWare, but we did find one promising similarly-priced product: PolyServe's LocalCluster for Solaris, NT, FreeBSD, RedHat, SuSE, Slackware, Debian, and Cobalt RaQ, RaQ2, RaQ3 and Qube.
Does Anybody Really Know What Time It Is?
Our test efforts were unexpectedly hampered by an odd problem: whenever failover occurred, the (now active) RaQ3i took on a seemingly random time of day. Because time synchronization is nearly as important as data synchronization, we placed an NTP server on Network 1 and made sure our RaQ3i's could reach it. As long as the RaQ3i was active, it periodically synchronized with the NTP server's clock. But, after failover, the standby RaQ3i clock changed, interfering with timestamps in mail notifications, log files, and service statistics.
Ultimately, Cobalt support provided the answer -- a clock fix in OS 3.0. Once we upgraded, the problem disappeared. But this fix affected firmware, requiring both the active and standby RaQ3i's to be upgraded. We had to disband the cluster and rebuild the standby from the supplied OS Restore CD. To do so, cable a PC with a supported 10/100 NIC to the RaQ3i, boot the PC from the Restore CD, then boot the RaQ3i over this network. Once both RaQ3i's are running OS 3.0, re-enable HA, and wait for full resynchronization. Presumably new units will ship with 3.0 and few of you will share this experience.
Bottom Line
StaQWare fulfilled its promise as a simple, inexpensive HA add-on for RaQ3i server pairs. When used with OS 3.0, StaQWare operated reliably and predictably, with near-zero administration.
Manually restarting failed units and lengthy resynchronization can leave the cluster without a standby for hours. Pulling the wrong cable for just a minute can take HA services down for an hour. An unattended RaQ3i that halts at midnight can leave you without HA services until morning. On the other hand, how many times will the standby fail immediately after the active?
StaQWare is an economical solution for basic no-fuss HA, protecting RaQ3i's against infrequent system failure. When HA services are available, StaQWare takes quick, automated action to restore service in just minutes. If you're using Cobalt servers without high availability today, give StaQWare a closer look.