Disaster Recovery Fundamentals
Disaster recovery overview
Disaster recovery refers to the ability to quickly restore or repair a system and minimize damage in the event of a sitewide or even regional failure. Disaster recovery is a crucial part of business continuity management, and having a robust disaster recovery protocol in place will help prevent unnecessary data loss and the expense associated with system downtime.
What constitutes the ‘disaster’ part of disaster recovery? It could be a natural disaster such as an earthquake or flood, but it also covers a wide range of events such as “fire,” “terrorism,” “unauthorized intrusion,” “large-scale hacking,” and “long-term large-scale power outages.” In short, a disaster is anything that has the potential to cause catastrophic damage to an IT system.
The Real Impact of System Failure
In addition to the potential physical damage and data loss associated with a system failure, the lack of a disaster recovery plan can cause unrecoverable revenue loss for businesses. Every minute of system downtime means lost sales and opportunities, a potentially negative customer experience, a tarnished business reputation, and high emergency IT repair costs.
The Importance of Disaster Recovery
For a company that provides mission-critical services, building a business continuity system that can handle unexpected system downtime is essential. The ability to prevent failure in the first place, and to recover quickly when a local failure or even a sitewide or regional disaster occurs, will help to protect data, maintain rapport with customers, save time, and avoid potentially devastating financial loss.
It’s important to recognize that catastrophic system failure is something that will happen, not something that may happen, so putting a proper disaster recovery plan in place will protect your business.
Disaster Recovery Challenges
While a disaster recovery protocol is essential, it is not without its challenges to set up and implement. Here are some common barriers to proper disaster recovery implementation:
Challenge 1: Geographic separation
The essence of disaster protection is keeping systems and data in a location that is geographically separated from the primary data center or cloud instance so that, in the event of a disaster or cloud outage, the secondary systems can be brought online and operation can continue.
Challenge 2: Network bandwidth requirements
Replicating data to an offsite location for disaster recovery can mean added network bandwidth requirements and latency issues.
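As a rough illustration of the bandwidth requirement, the arithmetic below estimates the average link speed needed to ship one day's changed data offsite within a replication window. The function name, the decimal unit conversion, and the sample figures are assumptions for illustration only. Real sizing would also account for compression, protocol overhead, and peak change rates.

```python
def required_bandwidth_mbps(changed_gb_per_day: float,
                            replication_window_hours: float = 24.0) -> float:
    """Average megabits per second needed to replicate one day's changes.

    Assumes decimal units (1 GB = 8,000 megabits) and ignores compression
    and protocol overhead, so treat the result as a lower bound.
    """
    if replication_window_hours <= 0:
        raise ValueError("replication window must be positive")
    megabits = changed_gb_per_day * 8 * 1000
    seconds = replication_window_hours * 3600
    return megabits / seconds


# Hypothetical site: 1,080 GB of changed data per day, replicated around the clock.
print(round(required_bandwidth_mbps(1080, 24), 1))  # 100.0
```

Shrinking the window raises the requirement proportionally: the same 1,080 GB shipped in 8 hours needs three times the average bandwidth.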
Challenge 3: Data volume continues to increase
The storage capacity requirements on the disaster recovery site will increase over time. A proper disaster recovery plan needs to establish “protection priority” to clarify which data should be protected and optimize available storage resources.
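One simple way to picture “protection priority” is a greedy plan that protects the most critical datasets first until the disaster recovery site's storage runs out. The sketch below is only an illustration: the `Dataset` fields, the priority scale (1 = most critical), and the sample data are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Dataset:
    name: str
    size_gb: float
    priority: int  # hypothetical scale: 1 = most critical


def plan_protection(datasets, capacity_gb):
    """Return (protected, deferred) dataset names for a given DR capacity.

    Datasets are considered in priority order (smaller ones first within
    the same priority) and protected while capacity remains.
    """
    protected, deferred = [], []
    remaining = capacity_gb
    for ds in sorted(datasets, key=lambda d: (d.priority, d.size_gb)):
        if ds.size_gb <= remaining:
            protected.append(ds.name)
            remaining -= ds.size_gb
        else:
            deferred.append(ds.name)
    return protected, deferred


datasets = [
    Dataset("orders_db", 400, 1),
    Dataset("file_share", 800, 3),
    Dataset("erp", 300, 1),
    Dataset("logs", 200, 2),
]
print(plan_protection(datasets, capacity_gb=1000))
```

With 1,000 GB available, the two priority-1 datasets and the logs fit, while the large low-priority file share is deferred until capacity grows.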
Challenge 4: Recovery procedure at the time of recovery
If a system goes down due to a disaster, service recovery is required. Often, companies find their data is scattered across multiple locations and there are no standardized procedures for recovery, resulting in an immense loss of time and money. Developing a clear, standardized restoration procedure will eliminate this headache and allow for quick action when it matters most.
Data backup vs Availability Protection
Traditionally, data backup (essentially the process of making a copy of data and applications and moving it to an offsite location) has been performed to protect data in case of IT equipment failure and for recordkeeping/archiving in compliance with regulatory requirements such as HIPAA (Health Insurance Portability and Accountability Act). To recover operation, any servers, storage, and other hardware, as well as networking affected by the incident, need to be replaced or repaired. Servers have to be configured, and applications have to be restored, brought back online, and connected to recovered data. These steps can take months.
Without an availability protection process in place, recovery operations with backup alone can be a time-consuming and expensive process. Availability processes keep fully operational systems ready to take over in the event of a disaster, enabling resumption of service in minutes.
Here are some other common reasons an effective disaster recovery plan is important:
Disaster recovery indicators
The main metrics for disaster recovery are “RPO” and “RTO”.
RPO (Recovery Point Objective)
RPO indicates how far back in time, counting from the moment the disaster occurs, data recovery is guaranteed. In other words, it is the maximum window of recent data the business can afford to lose.
If “RPO = < 5 minutes
When aiming for “RPO = 0 (zero data loss)”, an availability protection mechanism such as failover clustering is required.
RTO (Recovery Time Objective)
RTO is an index that shows how much time your business can allow to pass from initial downtime to restoration of operation. If your “RTO = 1 month or more”, you may be able to handle data recovery with only remote backup and a substitute device. But if your “RTO = within minutes”, failover clustering is required.
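The two metrics can be checked against an incident timeline with simple date arithmetic. The following is a minimal sketch, not a standard tool: the function name, its parameters, and the sample timestamps are assumptions chosen for illustration.

```python
from datetime import datetime, timedelta


def meets_objectives(last_recoverable_copy: datetime,
                     failure_time: datetime,
                     service_restored: datetime,
                     rpo: timedelta,
                     rto: timedelta) -> dict:
    """Compare an incident timeline against RPO and RTO targets."""
    # Data written after the last recoverable copy is lost.
    data_loss_window = failure_time - last_recoverable_copy
    # Downtime runs from the failure until service is back.
    downtime = service_restored - failure_time
    return {
        "data_loss_window": data_loss_window,
        "downtime": downtime,
        "rpo_met": data_loss_window <= rpo,
        "rto_met": downtime <= rto,
    }


# Hypothetical incident: last replica at 11:57, failure at 12:00, service back at 12:20.
result = meets_objectives(
    last_recoverable_copy=datetime(2024, 1, 1, 11, 57),
    failure_time=datetime(2024, 1, 1, 12, 0),
    service_restored=datetime(2024, 1, 1, 12, 20),
    rpo=timedelta(minutes=5),
    rto=timedelta(minutes=15),
)
print(result["rpo_met"], result["rto_met"])  # True False
```

In this made-up timeline only three minutes of data are lost, which satisfies a five-minute RPO, but twenty minutes of downtime misses a fifteen-minute RTO.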
Selecting a Disaster Recovery Method
When determining the right disaster recovery method for your business, consider these important factors:
- Criticality of business processes and tolerance for impact
- The data type and capacity that you want to protect
- Recovery requirements – your RPO and RTO
- Budget
Focus on Business Impact
While IT departments take the technical lead in developing disaster recovery measures for IT systems, business owners must consider the impact and extent of system outages (the “business impact of each system stop”) to ensure the least harmful impact to the business.
Protected data type (data integrity)
It is important to classify the type and importance of protected data. For data that does not require very precise consistency (such as file servers), a simple primary storage backup may be sufficient.
On the other hand, ERP systems and databases such as SQL Server, Oracle, and SAP have multiple services and parts that need to be located on specific servers, started up in specific orders, and managed according to a variety of application-specific best practices. They typically require high availability protection and an application-aware clustering solution to orchestrate failover.
——————————————————————————————————————
Key Disaster Recovery Terms
Remote backup
Keeping a copy of applications and data in a geographically separated remote location.
Synchronous Storage Mirroring
Keeping a local and remote copy of storage synchronized for DR protection. In this method, data is written to local storage and immediately replicated to remote storage. The local storage is not “committed” until the process of writing data to the remote location has been completed. This process keeps both locations identical, eliminating discrepancies that may result if data-in-transit at the time of an event fails to write on the remote location. Data integrity is guaranteed between the primary and backup sites.
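Asynchronous Storage Mirroring
This method writes data to the local storage and then replicates it to the remote location, without holding the local write until the remote copy completes. It enables greater network utilization efficiency and reduced bandwidth contention when geographic separation causes latency. The trade-off is that writes not yet replicated when a disaster strikes can be missing from the remote copy.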
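The difference between the two mirroring modes can be sketched with a toy in-memory model. This is not how any particular storage product works: the `MirroredVolume` class, its method names, and the use of Python lists as "storage" are all hypothetical, chosen only to show when the remote copy can lag behind.

```python
from collections import deque


class MirroredVolume:
    """Toy model of a volume mirrored to a remote site (illustrative only)."""

    def __init__(self, synchronous: bool):
        self.synchronous = synchronous
        self.local = []         # writes committed on local storage
        self.remote = []        # writes present on remote storage
        self.pending = deque()  # committed locally, not yet replicated

    def write(self, block: str) -> None:
        if self.synchronous:
            # The remote write completes before the local write is committed.
            self.remote.append(block)
            self.local.append(block)
        else:
            # Commit locally right away and replicate later.
            self.local.append(block)
            self.pending.append(block)

    def replicate_pending(self) -> None:
        """Ship queued writes to the remote site in their original order."""
        while self.pending:
            self.remote.append(self.pending.popleft())

    def blocks_at_risk(self) -> int:
        """Writes that would be lost if the primary site failed right now."""
        return len(self.local) - len(self.remote)


sync_vol = MirroredVolume(synchronous=True)
async_vol = MirroredVolume(synchronous=False)
for block in ("a", "b", "c"):
    sync_vol.write(block)
    async_vol.write(block)

print(sync_vol.blocks_at_risk(), async_vol.blocks_at_risk())  # 0 3
async_vol.replicate_pending()
print(async_vol.blocks_at_risk())  # 0
```

In the synchronous case nothing is ever at risk, at the price of waiting on the remote site for every write. In the asynchronous case the number of blocks at risk depends on how far replication has fallen behind.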
“Cold standby” and “hot standby”
Cold standby
A process of keeping a copy of data or secondary system offline in case of disaster. If the primary system goes down, the systems and software have to be manually started up – in some cases configured – and data has to be restored before operation can continue.
Hot standby
This is a process of keeping secondary systems operational and switching over to them in the event of downtime on the primary system.
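A hot standby switchover is often driven by some form of health check. The sketch below assumes a heartbeat-timeout scheme, which the text above does not specify. The class name, the timeout value, and the rule that the pair never fails back automatically are illustrative assumptions, not a description of any specific clustering product.

```python
class HotStandbyPair:
    """Toy failover controller: switch to the secondary when heartbeats stop."""

    def __init__(self, heartbeat_timeout_s: float):
        self.heartbeat_timeout_s = heartbeat_timeout_s
        self.active = "primary"
        self.last_heartbeat = 0.0

    def record_heartbeat(self, now: float) -> None:
        """Called whenever the primary reports that it is healthy."""
        self.last_heartbeat = now

    def check(self, now: float) -> str:
        """Return the node that should serve traffic at time `now`."""
        silent_for = now - self.last_heartbeat
        if self.active == "primary" and silent_for > self.heartbeat_timeout_s:
            # The secondary is already running, so switching is immediate.
            self.active = "secondary"
        return self.active


pair = HotStandbyPair(heartbeat_timeout_s=5.0)
pair.record_heartbeat(now=0.0)
print(pair.check(now=3.0))  # primary
print(pair.check(now=6.0))  # secondary
```

Because the secondary system is already operational, the switchover takes only as long as it takes to notice the failure. A cold standby would add manual startup and data restore time on top of that.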
Disaster Recovery Method Cost Comparisons
The smaller the RPO and RTO, the shorter the downtime, but the cost will increase accordingly.
Considering the cost and asset value of each type of data, it is necessary to find the method that delivers the required level of protection at an acceptable cost. The balance between in-house implementation and outsourced services will also affect costs.
Reproduced from SIOS