
StaQWare: High Availability for Cobalt RaQ3i Servers - page 4

Extending the Cobalt RaQ3i Server Clusters

  • September 19, 2000
  • By Lisa Phifer

In short, StaQWare provides system-level monitoring -- it can improve availability in the event of power failure, malfunctioning NICs, failed disk writes (which we did not test), and faults in directly connected cabling. But it's also important to understand what an inexpensive "drop-in" solution like StaQWare does not do.

StaQWare does not monitor service availability -- individual web, file, mail, or name services can fail without initiating takeover. Nor can it circumvent WAN failures such as unreachable gateways or broken access links. For that kind of high availability, you'll need to consider more expensive and complex solutions -- for example, L4/L7 switches and geographic load balancers. Such products often start around $20K (a few cost less, others considerably more).
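To make the gap concrete, here is a minimal sketch of the kind of service-level check that StaQWare leaves to you. It is not part of StaQWare or of any Cobalt tool; the service names and port numbers are illustrative assumptions, and a successful TCP connection shows only that something is listening, not that the service is answering correctly.

```python
import socket

# Hypothetical service list: the names and ports are illustrative assumptions.
SERVICES = {"web": 80, "mail": 25, "dns": 53}

def port_is_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False

def check_services(host, services=SERVICES, timeout=2.0):
    """Map each service name to True (reachable) or False (unreachable)."""
    return {name: port_is_open(host, port, timeout)
            for name, port in services.items()}

def failed_services(results):
    """Return the sorted names of services that did not accept a connection."""
    return sorted(name for name, ok in results.items() if not ok)
```

Run from cron against the cluster's shared address, a check like this could at least alert an administrator when a single service dies while the system itself stays up, which is exactly the case that does not trigger takeover.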

StaQWare also does not provide server load balancing, and cannot be used to increase availability of heterogeneous server clusters. Again, one can consider hardware server load balancers -- or balancing requests across clusters using round-robin DNS. If your budget is limited, consider inexpensive cross-platform HA software like ATG's Mod_Redundancy or TwinCom's Network Disk Mirror. Few software products will integrate as smoothly with Cobalt as StaQWare, but we did find one promising, similarly priced product: PolyServe's LocalCluster for Solaris, NT, FreeBSD, RedHat, SuSE, Slackware, Debian, and Cobalt RaQ, RaQ2, RaQ3 and Qube.
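For readers unfamiliar with round-robin DNS, the idea can be sketched in a few lines: the name server holds several addresses for one hostname and rotates their order on each query, so successive clients start with different servers. The class below is a toy model of that rotation only, not a DNS server, and the addresses are placeholders from a documentation range.

```python
from collections import deque

class RoundRobinRecord:
    """Toy model of one hostname whose addresses are rotated per query."""

    def __init__(self, addresses):
        if not addresses:
            raise ValueError("at least one address is required")
        self._addresses = deque(addresses)

    def answer(self):
        """Return the current ordering, then rotate for the next query."""
        current = list(self._addresses)
        self._addresses.rotate(-1)  # move the first address to the end
        return current

# Placeholder addresses (192.0.2.0/24 is reserved for documentation).
record = RoundRobinRecord(["192.0.2.10", "192.0.2.20", "192.0.2.30"])
first_choices = [record.answer()[0] for _ in range(4)]
```

Note that plain rotation spreads requests but knows nothing about server health: a failed cluster keeps receiving its share of clients until its address is removed from the record.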

Does Anybody Really Know What Time It Is?

Our test efforts were unexpectedly hampered by an odd problem: Whenever failover occurred, the (now active) RaQ3i took on a seemingly random time of day. Because time synchronization is nearly as important as data synchronization, we placed an NTP server on Network 1 and made sure our RaQ3i's could reach it. As long as the RaQ3i was active, it periodically synchronized with the NTP server's clock. But after failover, the clock on the formerly standby RaQ3i changed, interfering with timestamps in mail notifications, log files, and service statistics.
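A clock jump like this is easy to confirm by comparing a machine's clock with the NTP server's. The sketch below sends a bare SNTP client request and reads the server's transmit timestamp; it is our illustration, not a Cobalt or StaQWare utility, and it ignores network delay, so treat the result as a rough offset good for spotting errors of seconds or more. The server name passed to `clock_offset` is whatever NTP host you have deployed.

```python
import socket
import struct
import time

# Seconds between the NTP epoch (1900-01-01) and the Unix epoch (1970-01-01).
NTP_TO_UNIX = 2208988800

def build_request():
    """Build a 48-byte SNTP request: LI=0, version 3, mode 3 (client)."""
    return b"\x1b" + 47 * b"\0"

def transmit_time(packet):
    """Extract the server's transmit timestamp as Unix seconds."""
    if len(packet) < 48:
        raise ValueError("NTP packet too short")
    # Bytes 40-47 hold the transmit timestamp: 32-bit seconds, 32-bit fraction.
    seconds, fraction = struct.unpack("!II", packet[40:48])
    return seconds - NTP_TO_UNIX + fraction / 2**32

def clock_offset(server, port=123, timeout=2.0):
    """Return (server time - local time) in seconds, ignoring network delay."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        sock.sendto(build_request(), (server, port))
        packet, _ = sock.recvfrom(512)
    return transmit_time(packet) - time.time()
```

Logging this offset on both units before and after a forced failover would show whether the newly active server's clock has drifted from the reference.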

Ultimately, Cobalt support provided the answer -- a clock fix in OS 3.0. Once we upgraded, the problem disappeared. But this fix affected firmware, requiring both the active and standby RaQ3i's to be upgraded. We had to disband the cluster and rebuild the standby from the supplied OS Restore CD. To do so, cable a PC with a supported 10/100 NIC to the RaQ3i, boot the PC from the Restore CD, then boot the RaQ3i over this network. Once both RaQ3i's are running OS 3.0, re-enable HA and wait for full resynchronization. Presumably, new units will ship with OS 3.0, and few of you will share this experience.

Bottom Line

StaQWare fulfilled its promise as a simple, inexpensive HA add-on for RaQ3i server pairs. When used with OS 3.0, StaQWare operated reliably and predictably, with near-zero administration.

There are caveats, however. The need to restart failed units manually, combined with lengthy resynchronization, can leave the cluster without a standby for hours. Pulling the wrong cable for just a minute can take HA services down for an hour. An unattended RaQ3i that halts at midnight can leave you without HA services until morning. On the other hand, how many times will the standby fail immediately after the active?

StaQWare is an economical solution for basic no-fuss HA, protecting RaQ3i's against infrequent system failure. When HA services are available, StaQWare takes quick, automated action to restore service in just minutes. If you're using Cobalt servers without high availability today, give StaQWare a closer look.
