StaQWare: High Availability for Cobalt RaQ3i Servers - page 3
Extending the Cobalt RaQ3i Server Clusters
The active and standby RaQ3i units try to keep their data synchronized at all times, so that the standby can jump in immediately and replace the active without loss of data.
Unfortunately, when a failed or disabled unit returns to service, synchronization is lost. If just one block of data differs, the standby must be completely resynchronized with the active. We found 90 minutes to be typical when syncing 20.4 GB disks over a 100 Mbps cross-connect. When we replaced this cross-connect with a 10 Mbps hub, the resynchronization time quadrupled.
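To put those resynchronization times in perspective, here is a rough back-of-the-envelope calculation of the average throughput they imply. It is only a sketch: it assumes decimal gigabytes (1 GB = 10**9 bytes) and that the entire 20.4 GB disk is copied, neither of which we verified. The function name is ours, for illustration only.

```python
# Rough throughput implied by the resynchronization times reported above.
# Assumptions (ours): 1 GB = 10**9 bytes, and the whole disk is transferred.

DISK_BYTES = 20.4e9


def effective_mbps(disk_bytes: float, minutes: float) -> float:
    """Average throughput in megabits per second for a full-disk copy."""
    bits = disk_bytes * 8
    seconds = minutes * 60
    return bits / seconds / 1e6


fast = effective_mbps(DISK_BYTES, 90)      # 100 Mbps cross-connect, 90 minutes
slow = effective_mbps(DISK_BYTES, 90 * 4)  # 10 Mbps hub, four times as long

print(f"100 Mbps cross-connect: about {fast:.1f} Mbps effective")
print(f"10 Mbps hub:            about {slow:.1f} Mbps effective")
```

Under these assumptions the cross-connect carried roughly 30 Mbps on average and the hub roughly 7.6 Mbps. That would explain why a tenfold drop in link speed only quadrupled the time: on the faster link, something other than the wire (disks, CPU, or protocol overhead) was probably the limit.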
The good news: the active RaQ3i continues to serve requests during resynchronization, and we encountered no data corruption or loss when we interrupted this process -- it simply restarted from scratch. Web transactions and mail transfers completed before failover were mirrored to the standby and available after failover. The bad news: during resynchronization, HA services are (naturally) unavailable, leaving the active vulnerable to failure.
Failover -- When, Why, and How?
Failover from active to standby is initiated automatically when StaQWare detects a problem (left). What kinds of problems can StaQWare detect, and how quickly can service be restored?
If the standby cannot reach the active over Network 1, or failover is initiated manually, the active is shut down, the standby takes over the active's address, and a mail notification is sent to the administrator.
We found web, FTP, DNS and mail service restored within just 1-2 minutes, using default tolerance settings. After failover, the former active must be manually powered off and on. This means that HA services remain unavailable until someone physically touches the box to restart it and resynchronization completes.
If the active cannot reach the standby over Network 1, HA services are unavailable but real-time synchronization continues over Network 2. When Network 1 reachability is restored, HA services resume immediately, without further resynchronization.
If the active and standby cannot reach each other over Network 2, the active continues to service clients over Network 1 but drops into a non-HA state. When Network 2 reachability is restored, a full resynchronization is required before HA services are available.
If the active and standby cannot reach the default gateway, there is no point in initiating failover, because both servers are configured with the same gateway. Thus, StaQWare cannot be used to increase availability through diverse routing. StaQWare indicates the problem, but does so in a confusing manner. The standby LCD displays "RaQ is not in HA *check cabling*" -- nearly always sound advice. But GUI status and mail notifications indicate "Standby RaQ has failed or been disconnected". One cannot determine the real problem without actually looking at the standby's LCD panel.
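The network failure cases described in this section can be summarized as a small decision table. The sketch below is our own model of the behavior we observed, not StaQWare code: the class, field, and function names are hypothetical, and the order in which overlapping failures are evaluated is an assumption, since we only tested each failure in isolation.

```python
from dataclasses import dataclass
from enum import Enum


class Outcome(Enum):
    HA_AVAILABLE = "HA services available"
    FAILOVER = "active shut down; standby takes over address; admin mailed"
    HA_PAUSED_SYNC_CONTINUES = "no failover; HA resumes without resync"
    NON_HA_FULL_RESYNC = "no failover; full resync required before HA returns"
    GATEWAY_DOWN = "no failover; standby LCD says to check cabling"


@dataclass(frozen=True)
class Links:
    """Hypothetical snapshot of reachability between the two units."""
    standby_sees_active_net1: bool = True
    active_sees_standby_net1: bool = True
    net2_up: bool = True
    gateway_reachable: bool = True


def observed_outcome(links: Links, manual_failover: bool = False) -> Outcome:
    # Precedence between simultaneous failures is assumed, not tested.
    if manual_failover or not links.standby_sees_active_net1:
        return Outcome.FAILOVER
    if not links.net2_up:
        return Outcome.NON_HA_FULL_RESYNC
    if not links.active_sees_standby_net1:
        return Outcome.HA_PAUSED_SYNC_CONTINUES
    if not links.gateway_reachable:
        return Outcome.GATEWAY_DOWN
    return Outcome.HA_AVAILABLE


if __name__ == "__main__":
    print(observed_outcome(Links(net2_up=False)).value)
```

Only the first row leads to failover. Every other failure leaves the active serving clients, but without HA protection.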
We tested network failure by yanking Ethernet cables. We also tested abrupt power failures. When the active loses power, the standby checks its own file system, then takes over the active's address and restores service (in our experience, within 6-7 minutes). When the standby loses power, HA services become unavailable until the standby is restored and data is completely resynchronized.
If both units lose power? The active and standby search for each other upon restart. After 5 minutes, if the standby cannot locate the active, the standby takes over, restoring service within about 7 minutes. For faster restoration, the standby's LCD panel can be used to boot in active mode.
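The standby's behavior after a restart amounts to a search with a timeout. The following sketch illustrates that idea only. The 5-minute window comes from our tests, but the polling interval, the function name, and the reachability probe are hypothetical stand-ins, and we do not know how StaQWare actually implements the search.

```python
import time
from typing import Callable

SEARCH_TIMEOUT_S = 5 * 60  # the 5-minute search window we observed


def standby_role_after_boot(
    active_reachable: Callable[[], bool],   # hypothetical reachability probe
    timeout_s: float = SEARCH_TIMEOUT_S,
    poll_s: float = 5.0,                    # assumed polling interval
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Return "standby" if the active is found in time, else "active"."""
    deadline = clock() + timeout_s
    while clock() < deadline:
        if active_reachable():
            return "standby"
        sleep(poll_s)
    # The search window expired without finding the active: take over.
    return "active"
```

The clock and sleep functions are passed in as parameters so the logic can be exercised without waiting five real minutes.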