Date: May 11, 2024
Tags: Failovers
What Causes Failovers to Happen?
In support, one of the most common questions we get from customers is: “What prompted the failover from my primary node to the secondary node?”
There are several reasons this might occur. This article explains the most common causes and how you can identify them.
Before we get started, let’s differentiate between a ‘failover’ and a ‘switchover’, since many customers use these terms interchangeably.
A ‘switchover’ is the act of manually moving your hierarchy from the primary node to the secondary node. This can be done through the GUI by performing an ‘In Service’ on the secondary node, or through the command line:
perform_action -a restore -t $LKTag (bring hierarchy into service)
A ‘failover’, on the other hand, is performed without any manual interaction. It is defined as automatic switching to a backup server upon the failure of the previously active server, application, or hardware/network.
Failover and switchover are essentially the same operation, except that failover is automatic and usually operates without warning, while a switchover is intentional and requires human intervention.
The following are the most common ‘failures’ that initiate a ‘failover’:
1. Server Level Causes
Server Failure
- Primary server loses power or is turned off.
- CPU Usage caused by excessive load — Under very heavy I/O loads, delays and low memory conditions can cause the system to become unresponsive such that LifeKeeper may detect a server as down and initiate a failover.
- Quorum/Witness – As part of the I/O fencing mechanism of quorum/witness, when a primary server loses quorum, a “fastboot”, “fastkill” or “osu” is performed (based on settings) and a failover is initiated. When determining when to fail over, the witness server allows resources to be brought in service on a backup server only in cases where it verifies the primary server has failed and is no longer part of the cluster. This prevents failovers caused by simple communication failures between nodes when those failures don’t affect the overall access to, and performance of, the in-service node.
Communication (Heartbeat) Failure
LifeKeeper has a built-in “heartbeat” signal that periodically notifies each server in the configuration that its paired server is operating. By default, LifeKeeper sends the heartbeat between servers every five seconds (this is adjustable for busy clusters). If a communication problem causes the heartbeat to skip two beats but it resumes on the third heartbeat, LifeKeeper takes no action. However, if the communication path remains dead for three beats, LifeKeeper will label that communication path as dead. It will initiate a failover if the redundant communication path is also dead (we recommend two paths).
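The heartbeat rule above can be sketched in a few lines of Python. This is only an illustration of the logic as described in this article, not LifeKeeper code; the function names are hypothetical, and the threshold of three missed beats is the default mentioned above.

```python
def path_is_dead(consecutive_missed: int, max_missed: int = 3) -> bool:
    """A comm path is labeled dead once it has missed max_missed beats in a row."""
    return consecutive_missed >= max_missed


def should_fail_over(missed_per_path: list[int], max_missed: int = 3) -> bool:
    """Failover is initiated only when every configured comm path is dead."""
    if not missed_per_path:
        return False  # no comm paths supplied, nothing to evaluate
    return all(path_is_dead(m, max_missed) for m in missed_per_path)


# Two comm paths, as recommended above.
print(should_fail_over([2, 0]))  # False: first path skipped two beats, then resumed
print(should_fail_over([3, 1]))  # False: one path is dead, the redundant one is alive
print(should_fail_over([3, 3]))  # True: both paths are dead
```

The second case shows why two comm paths are recommended: a single dead path does not trigger a failover as long as the redundant path still carries the heartbeat.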
The following can result in a missed heartbeat:
- Network connection to the primary server is lost.
- Network latency.
- Heavy network traffic on a TCP comm path can result in unexpected behavior, including false failovers and LifeKeeper initialization problems.
- Failed NIC.
- Failed network switch.
- Manually pulling/removing network connectivity.
- Primary server loses power or is turned off.
- CPU Usage caused by excessive load — Under very heavy I/O loads, delays and low memory conditions can cause the system to become unresponsive such that LifeKeeper may detect a server as down and initiate a failover.
Adjusting the heartbeat parameter:
LCMNUMHBEATS=Y (where Y is the number of heartbeats before logging a communication path failed error in the log). The default is 3 and can be changed if your systems are busy or across a WAN to avoid false communication path failures.
LCMHBEATTIME=5 (this is the interval in seconds and this is the default and should not be changed).
These tunables are NOT in the /etc/default/LifeKeeper file by default. You will need to add them to change the heartbeat values.
After adding these tunables and values in /etc/default/LifeKeeper you need to stop LifeKeeper and restart it. You can use the command lkstop -f, which stops LifeKeeper but does not bring down the protected applications.
You need to do this on both systems.
With these settings, LifeKeeper allows 5 times Y seconds (LCMHBEATTIME times LCMNUMHBEATS) before it marks a communication path as failed.
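As a quick worked example of that arithmetic, the sketch below multiplies the two tunables. The function name is hypothetical; the defaults (3 beats, 5 seconds) are the ones given above, and the value 48 is taken from the “missed heartbeat 1 of 48” log excerpt later in this article.

```python
def comm_path_timeout_seconds(num_hbeats: int = 3, hbeat_time: int = 5) -> int:
    """Seconds of silence before a comm path is marked failed.

    num_hbeats corresponds to LCMNUMHBEATS, hbeat_time to LCMHBEATTIME.
    """
    return num_hbeats * hbeat_time


print(comm_path_timeout_seconds())    # 15  (defaults: 3 beats x 5 seconds)
print(comm_path_timeout_seconds(48))  # 240 (LCMNUMHBEATS=48, i.e. 4 minutes)
```

A larger LCMNUMHBEATS value gives a busy or WAN-connected cluster more time to recover from a transient delay, at the cost of a slower reaction to a real outage.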
What is a Split-Brain, and what causes it?
If a single comm path is used and the comm path fails, then LifeKeeper hierarchies may try to come into service on multiple systems simultaneously. This is known as a false failover or a “split-brain” scenario. In the “split-brain” scenario, each server believes it is in control of the application and thus may try to access and write data to the shared storage device. To resolve the split-brain scenario, LifeKeeper may cause servers to be powered off or rebooted or leave hierarchies out-of-service to assure data integrity on all shared data. Additionally, heavy network traffic on a TCP comm path can result in unexpected behavior, including false failovers and the failure of LifeKeeper to initialize properly.
The following are scenarios that can cause split-brain:
- Any of the comm failures listed above
- Improper shutdown of LifeKeeper
- Server resource starvation
- Losing all network paths
- DNS or other network glitch
- System lockup
Using Quorum/Witness to prevent a Split Brain
- The Quorum/Witness Server Support Package for LifeKeeper (steeleye-lkQWK, hereinafter “Quorum/Witness Package”) combined with the existing failover process of the LifeKeeper core allows system failover to occur with a greater degree of confidence in situations where total network failure could be common. This effectively means that local site failovers and failovers to nodes across a WAN can be done while greatly reducing the risk of split-brain situations.
- In a distributed system that takes network partitioning into account, quorum is the concept used to obtain consensus across the cluster. A node that has quorum can obtain the consensus of the cluster and is allowed to bring resources in service. A node that does not have quorum cannot obtain that consensus and is not allowed to bring resources in service. This prevents split brain from happening. Checking whether a node has quorum is called a quorum check. The result is expressed as “quorum check succeeded” if the node has quorum, and “quorum check failed” if it does not.
- When a communication failure occurs, a node can consult one or more other nodes (or other devices) to get a “second opinion” on the status of the node it can no longer reach. The node that provides the “second opinion” is called a witness node (or a witness device), and getting a “second opinion” is called witness checking. When determining when to fail over, the witness node (the witness device) allows resources to be brought in service on a backup server only in cases where it verifies the primary server has failed and is no longer part of the cluster. This prevents failovers caused by simple communication failures between nodes when those failures don’t affect the overall access to, and performance of, the in-service node. During actual operation, the witness node (the witness device) will be consulted when LifeKeeper is started or the failed communication path is restored. Witness checking can only be performed for nodes having quorum.
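To make the quorum idea concrete, here is a minimal sketch that assumes a simple majority rule: a node has quorum when it can reach more than half of the cluster, counting itself. The majority rule and the function name are assumptions made for illustration; the quorum mode actually used depends on how the Quorum/Witness Package is configured.

```python
def has_quorum(reachable_nodes: int, cluster_size: int) -> bool:
    """Majority rule: strictly more than half of the cluster must be reachable.

    reachable_nodes includes the node performing the check.
    """
    return reachable_nodes > cluster_size // 2


# Two servers plus one witness node (cluster_size = 3).
print(has_quorum(1, 3))  # False: an isolated node must not bring resources in service
print(has_quorum(2, 3))  # True: a server that still reaches the witness may proceed

# Two servers and no witness (cluster_size = 2).
print(has_quorum(1, 2))  # False: after a partition neither side has a majority
```

The last case illustrates why a third vote is useful: with only two nodes, a partition leaves neither side with a majority.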
2. Resource Failure Causes
LifeKeeper is designed to monitor individual applications and groups of related applications, periodically performing local recoveries or notifications when protected applications fail. Related applications, for example, are hierarchies where the primary application depends on lower-level storage or network resources. LifeKeeper monitors the status and health of these protected resources. If a resource is determined to be in a failed state, an attempt will be made to restore the resource or application on the current system (in-service node) without external intervention. If this local recovery fails, a resource failover will be initiated.
Application Failure
- An application failure is detected, but the local recovery process fails.
- Remove Failure – During the resource failover process, certain resources need to be removed from service on the primary server and then brought into service on the selected backup server to provide full functionality of the critical applications. If this remove process fails, a reboot of the primary server will be performed resulting in a complete server failover.
Examples of remove failures:
- Unable to unmount file system
- Unable to shut down protected application (Oracle, MySQL, PostgreSQL, etc.)
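The escalation path described in this section (local recovery first, then resource failover, then a reboot if the remove step fails) can be summarized as a small decision function. This is a simplified sketch of the flow as described in this article; the names are hypothetical and do not correspond to LifeKeeper internals.

```python
from enum import Enum


class Outcome(Enum):
    NO_ACTION = "resource healthy, nothing to do"
    LOCAL_RECOVERY = "resource restored on the in-service node"
    RESOURCE_FAILOVER = "hierarchy moved to the backup server"
    SERVER_FAILOVER = "primary rebooted, complete server failover"


def handle_check(healthy: bool, local_recovery_ok: bool, remove_ok: bool) -> Outcome:
    """Return the outcome of one monitoring cycle for a protected resource."""
    if healthy:
        return Outcome.NO_ACTION
    if local_recovery_ok:
        return Outcome.LOCAL_RECOVERY
    # Local recovery failed: the resource must be removed from service here
    # before it can be brought into service on the backup server.
    if remove_ok:
        return Outcome.RESOURCE_FAILOVER
    # The remove failed (for example, a file system could not be unmounted).
    return Outcome.SERVER_FAILOVER


print(handle_check(healthy=False, local_recovery_ok=False, remove_ok=False).value)
```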
File System Issues
- Disk Full — LifeKeeper’s File System Health Monitoring can detect disk full file system conditions which may result in failover of the file system resource.
- Unmounted or Improperly Mounted File System – A user manually unmounts, or changes the mount options on, an in-service file system that is protected by LifeKeeper.
- Remount Failure — The following is a list of common causes for remount failure which would lead to a failover:
- corrupted file system (fsck failure)
- failure to create mount point directory
- mount point is busy
- mount failure
- LifeKeeper internal error
IP Address Failure
When a failure of an IP address is detected by the IP Recovery Kit, the resulting failure triggers the execution of the IP local recovery script. LifeKeeper first attempts to bring the IP address back in service on the current network interface. If the local recovery attempt fails, LifeKeeper will perform a failover of the IP address and all dependent resources to a backup server. During failover, the remove process will un-configure the IP address on the current server so that it can be configured on the backup server. Failure of this remove process will cause the system to reboot.
- IP conflict
- IP collision
- DNS resolution failure
- NIC or Switch Failures
Reservation Conflict
- A reservation to a protected device is lost or stolen
- Unable to regain reservation or control of a protected resource device (caused by manual user intervention, HBA or switch failure)
SCSI Device
- Protected SCSI device could not be opened. The device may be failing or may have been removed from the system.
Resources for identifying the cause of a failover
/var/log/lifekeeper.log
This log file, written by LifeKeeper, should be the first place you look in determining what may have caused a failover.
For example, one of the most common reasons is a comm path failure. Below is an example of the entries you will find in the lifekeeper.log when this occurs:
Sep 21 11:06:57 es1ecc08tev lcm[46893]: INFO:lcm.tli_hand:::005257:missed heartbeat 1 of 48 on dev 10.236.17.226/10.238.17.226 (lcm driver number = 129).
Sep 21 11:06:57 es1ecc08tev lcm[46893]: INFO:lcm.tli_hand:::005257:missed heartbeat 1 of 48 on dev 10.236.17.226/10.237.17.226 (lcm driver number = 1360929).
Sep 21 11:07:02 es1ecc08tev lcm[46893]: INFO:lcm.tli_hand:::005257:missed heartbeat 2 of 48 on dev 10.236.17.226/10.238.17.226 (lcm driver number = 129).
After the missed-heartbeat count reaches the maximum, the failover begins:
Sep 21 11:10:49 es6ecc08tev lcm[9416]: INFO:lcm.tli_hand:::005257:missed heartbeat 47 of 48 on dev 10.237.17.226/10.236.17.226 (lcm driver number = 71).
Sep 21 11:10:49 es6ecc08tev eventslcm[47082]: WARN:lcd.net:::004258:Communication to es1ecc08tev by 10.237.17.226/10.236.17.226 FAILED
Sep 21 11:10:49 es6ecc08tev eventslcm[47082]: WARN:lcd.net:::004261:COMMUNICATIONS failover from system “es1ecc08tev” will be started.
Sep 21 11:10:49 es6ecc08tev lifekeeper[47121]: NOTIFY:event.comm_down:::010466:COMMUNICATIONS es1ecc08tev FAILED
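When the log is long, a short script can pull out the relevant entries. The sketch below is an illustration only: the regular expressions are derived from the sample lines above and assume that message format, and the function name is hypothetical. It reports the highest missed-heartbeat count seen per comm path and any comm paths reported as FAILED.

```python
import re

# Patterns derived from the sample log lines in this article.
MISSED = re.compile(r"missed heartbeat (\d+) of (\d+) on dev (\S+)")
FAILED = re.compile(r"Communication to (\S+) by (\S+) FAILED")


def summarize(lines):
    """Return (worst, failed) for an iterable of lifekeeper.log lines.

    worst  maps comm path -> (highest missed count, configured limit)
    failed is a list of (remote host, comm path) tuples
    """
    worst = {}
    failed = []
    for line in lines:
        m = MISSED.search(line)
        if m:
            count, limit, path = int(m.group(1)), int(m.group(2)), m.group(3)
            if count > worst.get(path, (0, limit))[0]:
                worst[path] = (count, limit)
            continue
        f = FAILED.search(line)
        if f:
            failed.append((f.group(1), f.group(2)))
    return worst, failed


sample = [
    "Sep 21 11:06:57 es1ecc08tev lcm[46893]: INFO:lcm.tli_hand:::005257:missed heartbeat 1 of 48 on dev 10.236.17.226/10.238.17.226 (lcm driver number = 129).",
    "Sep 21 11:07:02 es1ecc08tev lcm[46893]: INFO:lcm.tli_hand:::005257:missed heartbeat 2 of 48 on dev 10.236.17.226/10.238.17.226 (lcm driver number = 129).",
    "Sep 21 11:10:49 es6ecc08tev lcm[9416]: INFO:lcm.tli_hand:::005257:missed heartbeat 47 of 48 on dev 10.237.17.226/10.236.17.226 (lcm driver number = 71).",
    "Sep 21 11:10:49 es6ecc08tev eventslcm[47082]: WARN:lcd.net:::004258:Communication to es1ecc08tev by 10.237.17.226/10.236.17.226 FAILED",
]

worst, failed = summarize(sample)
for path, (count, limit) in worst.items():
    print(f"{path}: missed {count} of {limit}")
for host, path in failed:
    print(f"comm path {path} to {host} FAILED")
```

To run it against a real log, pass an open file instead of the sample list, for example `summarize(open("/var/log/lifekeeper.log"))`.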
/var/log/messages
This Linux-generated file typically contains system messages from the various processes and services running on the system. These messages can include:
- System boot messages: Information about the system boot process, including kernel messages and messages from systemd or other init systems.
- Service startup and shutdown messages: Messages indicating when services are started or stopped, including any errors or warnings encountered during the process.
- Kernel messages: Information about the operation of the Linux kernel, including hardware detection, device initialization, and kernel errors or warnings.
- Network-related messages: Information about network connections, firewall activity, and network configuration changes.
- System performance information: Messages related to system performance monitoring, such as CPU usage, memory usage, and disk I/O statistics.
Reproduced with permission from SIOS