Understanding and Avoiding Split Brain Scenarios

Date: September 23, 2021

Tags: high availability - SAP, Linux, SAP S/4HANA, split brain

Understanding and Avoiding Split Brain Scenarios

Split brain. Most readers of our blogs will have heard the term, in the computing context that is, yet we cannot help but to sympathize with those whose first mental image is of the chaos that would result if someone had two brains, both equally in control at the same time.

What is a Failover Cluster Split Brain Scenario?

In a failover cluster split brain scenario, neither node can communicate with the other, and the standby server may promote itself to become an active server because it believes the active node has failed. This results in both nodes becoming ‘active’ as each would see the other as being failed. As a result, data integrity and consistency is compromised as data on both nodes would be changing. This is referred to as split brain.

There are two types of split-brain scenarios which may occur for an SAP HANA resource hierarchy if appropriate steps are not taken to avoid them.

HANA Resource Split Brain: The HANA resource is Active (ISP) on multiple cluster nodes. This situation is typically caused by a temporary network outage affecting the communication paths between cluster nodes.
SAP HANA System Replication Split Brain: The HANA resource is Active (ISP) on the primary node and Standby (OSU) on the backup node, but the database is running and registered as the primary replication site on both nodes. This situation is typically caused by either a failure to stop the database on the previous primary node during failover, having Autostart enabled for the database, or a database administrator manually running “hdbnsutil -sr_takeover” on the secondary replication site outside of the clustering software environment.

Avoiding Split Brain Issues

Recommendations for avoiding or resolving each type of split-brain scenario in the SIOS Protection Suite clustering environment are given below.

While in a split-brain scenario, a message similar to the following is logged and broadcast to all open consoles every quickCheck interval (default 2 minutes) until the issue is resolved.

EMERG:hana:quickCheck:HANA-SPS_HDB00:136363:WARNING: 
A temporary communication failure has occurred between servers 
hana2-1 and hana2-2. 
Manual intervention is required in order to minimize the risk of 
data loss. 
To resolve this situation, please take one of the following resource 
hierarchies out of service: HANA-SPS_HDB00 on hana2-1 
or HANA-SPS_HDB00 on hana2-2. 
The server that the resource hierarchy is taken out of service on 
will become the secondary SAP HANA System Replication site.

Recommendations for resolution:

Investigate the database on each cluster node to determine which instance contains the most up-to-date or relevant data. This determination must be made by a qualified database administrator who is familiar with the data.
The HANA resource on the node containing the data that needs to be retained will remain Active (ISP) in LifeKeeper, and the HANA resource hierarchy on the node that will be re-registered as the secondary replication site will be taken entirely out of service in LifeKeeper. Right-click on each leaf resource in the HANA resource hierarchy on the node where the hierarchy should be taken out of service and click Out of Service …
Once the SAP HANA resource hierarchy has been successfully taken out of service, LifeKeeper will re-register the Standby node as the secondary replication site during the next quickCheck interval (default 2 minutes). Once replication resumes, any data on the Standby node which is not present on the Active node will be lost. Once the Standby node has been re-registered as the secondary replication site, the SAP HANA hierarchy has returned to a highly available state.

SAP HANA System Replication Split Brain Resolution

While in this split-brain scenario, a message similar to the following is logged and broadcast to all open consoles every quick. Check interval (default 2 minutes) until the issue is resolved.

EMERG:hana:quickCheck:HANA-SPS_HDB00:136364:WARNING: 
SAP HANA database HDB00 is running and registered as 
primary master on both hana2-1 and hana2-2. 
Manual intervention is required in order to 
minimize the risk of data loss. To resolve this situation, 
please stop database instance 
HDB00 on hana2-2 by running the command ‘su – spsadm -c 
“sapcontrol -nr 00 -function Stop”’ 
on that server. Once stopped, 
it will become the secondary SAP HANA System Replication site.

Recommendations for resolution:

Investigate the database on each cluster node to determine whether important data exists on the Standby node which does not exist on the Active node. If important data has been committed to the database on the Standby node while in the split-brain state, the data will need to be manually copied to the Active node. This determination must be made by a qualified database administrator who is familiar with the data.
Once any missing data has been copied from the database on the Standby node to the Active node, stop the database on the Standby node by running the command given in the LifeKeeper warning message:

su – adm -c “sapcontrol -nr <Inst#> -function Stop”

where is the lower-case SAP System ID for the HANA installation and <Inst#> is the instance number for the HDB instance (e.g., the instance number, for instance, HDB00 is 00)
Once the database has been successfully stopped, LifeKeeper will re-register the Standby node as the secondary replication site during the next quickCheck interval (default 2 minutes). Once replication resumes, any data on the Standby node which is not present on the Active node will be lost. Once the Standby node has been re-registered as the secondary replication site, the SAP HANA hierarchy has returned to a highly available state.

Being aware of common split-brain scenarios and taking these steps to mitigate them can save you time and protect data integrity.

Reproduced with permission from SIOS