Date: May 25, 2022
Tags: High Availability, Recovery Point Objective, Recovery Time Objective
High Availability, RTO, and RPO
High availability (HA) is an information technology term that refers to a system or component that is operational and available at least 99.99% of the time. End users of such an application or system experience no more than about 52.6 minutes of service interruption per year. This level of availability is typically achieved through high availability clustering, a configuration that reduces application downtime by eliminating single points of failure through the use of redundant servers, networks, storage, and software.
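That downtime figure follows directly from the availability percentage. The short Python sketch below shows the arithmetic; the names and values are illustrative only, not part of any product:

    # Downtime budget implied by an availability target.
    MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes

    def downtime_budget_minutes(availability_pct: float) -> float:
        """Maximum minutes of downtime per year allowed at a given availability %."""
        return MINUTES_PER_YEAR * (1 - availability_pct / 100)

    print(downtime_budget_minutes(99.99))   # ~52.6 minutes per year ("four nines")
    print(downtime_budget_minutes(99.999))  # ~5.3 minutes per year ("five nines")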
What are recovery time objectives (RTO) and recovery point objectives (RPO)?
In addition to 99.99% availability, high availability environments also meet stringent recovery time and recovery point objectives. The recovery time objective (RTO) is a measure of the time that elapses from application failure to the restoration of application operation and availability; it reflects how long a company can afford to have that application down. The recovery point objective (RPO) is a measure of how up-to-date the data is once application availability has been restored after a downtime issue; it is often described as the maximum amount of data loss that can be tolerated when a failure happens. SIOS high availability clusters deliver an RPO of zero and an RTO of minutes.
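To make the two metrics concrete, here is a small, purely hypothetical example in Python. The timestamps are invented; the calculation only shows what an achieved RTO and RPO look like for a single outage:

    from datetime import datetime

    # Hypothetical timestamps for one outage, used only to illustrate the metrics.
    failure_time       = datetime(2022, 5, 25, 10, 0, 0)   # application goes down
    last_replicated_tx = datetime(2022, 5, 25, 9, 59, 58)  # newest data safely on the standby
    service_restored   = datetime(2022, 5, 25, 10, 3, 30)  # application back online

    achieved_rto = service_restored - failure_time       # downtime experienced: 3 min 30 s
    achieved_rpo = failure_time - last_replicated_tx     # data at risk: 2 s of transactions

    print(f"RTO achieved: {achieved_rto}, RPO achieved: {achieved_rpo}")

An RPO of zero means the last protected copy of the data coincides with the moment of failure, so no committed data is lost.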
What is a high availability cluster?
In a high availability cluster, important applications run on a primary server node, which is connected to one or more secondary nodes for redundancy. Clustering software, such as SIOS LifeKeeper, monitors the clustered applications and their dependent resources to ensure they are operational on the active node. System-level monitoring is accomplished via heartbeat signals exchanged between cluster nodes at regular intervals. If the primary server fails, the secondary server initiates recovery once the heartbeat timeout interval is exceeded. For application-level failures, the clustering software detects that an application is no longer available on the active node and moves the application and its dependent resources to the secondary node(s) in a process called a failover, where operation continues within stringent RTOs.
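The monitoring-and-failover loop can be sketched in a few lines of Python. This is not how SIOS LifeKeeper is implemented; it is only a minimal illustration of the heartbeat-timeout pattern described above, with made-up interval values and stubbed-out checks:

    import time

    HEARTBEAT_INTERVAL = 5    # seconds between heartbeat checks (illustrative value)
    HEARTBEAT_TIMEOUT  = 15   # seconds of silence before recovery begins (illustrative value)

    def primary_responds() -> bool:
        # Stub: a real monitor would probe the primary node and the clustered
        # application over the network.
        return True

    def promote_secondary() -> None:
        # Stub: a real failover would bring the application and its dependent
        # resources (storage, IP address, services) online on this node.
        print("Failover: secondary node is now active")

    last_heartbeat = time.monotonic()
    while True:
        if primary_responds():
            last_heartbeat = time.monotonic()          # heartbeat received, reset the timer
        elif time.monotonic() - last_heartbeat > HEARTBEAT_TIMEOUT:
            promote_secondary()                        # timeout exceeded: take over
            break
        time.sleep(HEARTBEAT_INTERVAL)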
In a traditional failover cluster, all nodes in the cluster are connected to the same shared storage, typically a storage area network (SAN). After a failover, the secondary node is granted access to the shared storage, enabling it to meet stringent RPOs.
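On the storage side, the takeover amounts to the secondary node attaching the same shared volume the primary was using. The fragment below is a hedged sketch: the device path and mount point are hypothetical, and production clustering software performs this step, along with safeguards against both nodes writing at once, automatically:

    import subprocess

    # Hypothetical SAN device and mount point; real values depend on your environment.
    SHARED_DEVICE = "/dev/mapper/san_lun0"
    MOUNT_POINT   = "/var/lib/app-data"

    def take_over_shared_storage() -> None:
        # After failover, mount the shared volume so the application resumes with
        # exactly the data set it had before the failure (an RPO of zero).
        subprocess.run(["mount", SHARED_DEVICE, MOUNT_POINT], check=True)

    take_over_shared_storage()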