Protect Systems from Downtime

May 1, 2022 by Jason Aw

In today’s business environment, organizations rely on applications, databases, and ERP Systems such as SAP, SQL Server, Oracle and more. These applications unify and streamline your most critical business operations. When they fail, they cost you more than just money. It is critical to protect these complicated systems from downtime.

Proven High Availability & Disaster Recovery

SIOS has 20+ years of experience in high availability and disaster recovery, and we know there isn’t a one-size-fits-all solution. Today’s data systems are a combination of on-premises, public cloud, hybrid cloud, and multi-cloud environments, and the applications themselves can add even more complexity. Configuring open-source cluster software for these environments can be painstaking, time-consuming, and prone to human error.

SIOS has solutions that provide high availability and disaster recovery for critical applications. These solutions have been developed based on our real-world experience across different industries and use cases. Our products include SIOS DataKeeper Cluster Edition for Windows and SIOS LifeKeeper for Linux or Windows. These powerful applications provide failover protection. The Application Recovery Kits included with LifeKeeper speed up application configuration by automating configuration steps and validating inputs.

System Protection On-Premises, in the Cloud or in Hybrid Environments

SIOS provides the protection you need for business-critical applications and reduces the complexity of managing them, whether on-premises, in the cloud, or in hybrid cloud environments. Contact us to learn more about high availability and disaster recovery for your business-critical applications.

Reproduced with permission from SIOS

How to Get the Most from Your Tech Support Call

April 13, 2022 by Jason Aw

Technical support experts share their tips on how to fast-track issue resolution

SIOS provides high availability protection for our customers’ most critical applications, databases, and ERPs. When our customers call tech support, there is no time to waste. We’ve earned a reputation (and several awards) for our HA/DR expertise and support excellence.

We’ve asked our tech support team to share the following five questions that can fast-track your issue resolution.

Fast, Accurate Diagnosis

Thorough and accurate tech support is similar to diagnosing an illness. Imagine asking your doctor to treat a headache. The human body is a complex interaction of multiple systems. The source of your problem may not be obvious or even in your head. To diagnose the issue and recommend a treatment, your doctor typically begins with questions aimed at identifying the circumstances that caused your symptoms.

Failover clustering also involves multiple systems at every layer of the IT infrastructure – network, storage, OS, application, database, and server. And like your real headache, your HA issue is often caused by something unrelated to your HA clustering software. Like your doctor, a good support professional will ask a variety of questions to characterize your issue. The more information you can provide about your support issue, the faster and more effectively it can be diagnosed and resolved.

Fast-Tracking Issue Resolution

As an IT best practice, consider logging key information and system changes on an ongoing basis. Having answers to the following key questions at your fingertips will speed diagnosis and fast-track issue resolution. (It may also help you prevent issues from occurring in the first place.)

Can you describe the error you are receiving? What is the exact symptom you are witnessing that is causing concern?
When did it happen (the time, and the time zone you are in)?
A typical diagnostic method is to examine log files from the machine with issues. Log files can be hundreds of lines of message strings or command output. By tracking the precise time you noticed the problematic symptoms, we can significantly narrow the log file examination.
Have you uploaded the logs, or are you able to upload them?
An explanation and description of the error, along with the timeframe in which it happened, goes a long way toward a diagnosis, provided the logs can be uploaded to the support ticket. In some IT environments, uploading the logs requires using corporate-approved file sharing, while dark sites do not permit electronic distribution of system logs at all. If logs cannot be provided externally, be sure that the full logs are captured and archived for reference and review with the support agent as the case progresses. Applications and systems, especially those under duress, can produce such extensive logs that critical information is overwritten.
Which system was the primary cluster node at the time?
Given the interconnected nature of clustering, it is important to inform your tech support representative of whether the cluster node you are calling about was functioning as the primary or secondary node at the time of the issue.
What have you tried to do to remedy the issue?
Great physicians know that their patients have likely tried a home remedy or over-the-counter medication prior to the visit. Knowing this information is helpful in diagnosis and treatment. The same applies with great support technicians. Sharing not only what you were trying to do at the time of the issue, but how you tried to resolve your errors can help them craft a better treatment and recovery plan, and make sure that their recommendations for recovery protect your critical data and applications.
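
To illustrate why the precise time and time zone matter, here is a minimal Python sketch of narrowing a log to the minutes around a reported incident. The log line format (an ISO 8601 timestamp with a UTC offset at the start of each line), the sample messages, and the function name `lines_near` are all assumptions for illustration; real system or cluster logs use their own formats.

```python
from datetime import datetime, timedelta, timezone

def lines_near(log_lines, incident_local, utc_offset_hours, window_minutes=10):
    """Return log lines stamped within +/- window_minutes of the reported incident time.

    Assumes (hypothetically) that each line starts with an ISO 8601 timestamp
    that includes a UTC offset, followed by a space and the message.
    """
    local_tz = timezone(timedelta(hours=utc_offset_hours))
    incident = incident_local.replace(tzinfo=local_tz)
    window = timedelta(minutes=window_minutes)
    selected = []
    for line in log_lines:
        stamp = line.split(" ", 1)[0]
        try:
            when = datetime.fromisoformat(stamp)
        except ValueError:
            continue  # no parseable timestamp on this line
        if when.tzinfo is None:
            continue  # cannot compare safely without an offset
        if abs(when - incident) <= window:
            selected.append(line)
    return selected

# Hypothetical sample log, stamped in UTC.
sample_log = [
    "2022-04-13T13:30:00+00:00 heartbeat ok",
    "2022-04-13T14:01:45+00:00 resource check failed",
    "2022-04-13T14:02:11+00:00 starting local recovery",
    "2022-04-13T16:00:00+00:00 heartbeat ok",
]

# Incident noticed at 10:05 local time in a UTC-4 time zone (14:05 UTC).
nearby = lines_near(sample_log, datetime(2022, 4, 13, 10, 5), utc_offset_hours=-4)
```

Without the time zone, the same "10:05" would be read as a different instant and the relevant lines would be missed, which is why support asks for both.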

For more than 20 years, the SIOS Customer Experience team has been helping enterprise customers implement HA/DR solutions for a wide range of use cases. We value our customers and encourage them to contact us whenever they have questions about their HA/DR.

Reproduced with permission from SIOS

Two Truths and a Lie: Understanding the Real Truth About Availability

April 9, 2022 by Jason Aw

We played two truths and a lie at a company event years ago. The game involved putting forth two true statements and one untrue statement to see if you could fool the most people. The winner put forth ideas that all seemed believable or unbelievable, depending on your own personal history. Here is what was said:

While growing up, my hometown had no stoplights.
My grandparents met in the second grade and married in their teens.
After graduation, I attended a prestigious out-of-state university in Georgia before transferring back home to attend an in-state university.

I grew up in a small community with no stoplights, so that one seemed possible, but I was skeptical. I’ve heard stories of people who met at an early age and got married in their teens, so that was possible, but it was also the one that I might want to flag. The third one also seemed true, but I wondered who would transfer from a prestigious out-of-state university back to the no-stoplight hometown to attend an in-state college. For what seemed like an eternity, the entire group reasoned and pondered which of the three statements was a lie, and it seemed as if no one could spot it. Several of us reasoned that if the hometown had no stoplights, would it really have a university as well? A few took the line that it was unlikely he attended the prestigious out-of-state university, given his age, years with the company, and multiple degrees. After final deliberation, the group’s verdict was in: the two truths were number one and number two, and the lie was number three.

With all the information swirling around about High Availability, you might feel like you are playing a game of “Two Truths and a Lie.” Depending on where you look, you may find statements about availability that seem believable, but are not completely true when you dig in beneath the surface. For example, the following widely accepted statements are not actually true:

Storage availability is all that is needed for high availability

Applications require access to data to be effective and efficient. Your database will need to have access to storage if you are going to successfully run your enterprise. Your other enterprise applications likewise must have access to configuration files, data stores, and transaction and error log directories to be usable. But, while reliable, readily accessible, and performant storage is essential for all enterprise systems, websites, databases, applications, and interconnects, storage availability alone is not all that is needed for high availability. There are more components that make up a sound, reliable, resilient high availability architecture than just storage.

Platform availability is all that is needed for high availability

With the continued development and growth of cloud computing, many enterprises searching for high availability are confused by the concept of platform availability. Platform availability, sometimes referred to as system availability or infrastructure availability, is the time that the platform (hardware, network, OS, and related components) is accessible and delivering its intended IT service. Applications and databases absolutely need compute, memory, storage, and network resources to operate properly and efficiently. Every service or function in your data center needs a reliable place to execute its logic, and without the underlying platform, these operations are not possible. Because of this, many assume that platform availability is all that is needed for high availability. As VP of Customer Experience, I have helped customers and partners understand the gaps between an available platform and available applications, databases, and client connectivity. In those conversations, we have discussed real examples of platforms showing no downtime or service issues while the enterprise applications running within that data center or cloud infrastructure were unavailable, unstable, or inaccessible to clients due to non-platform issues.

So What’s the Real Truth?

When our co-worker shared his three statements, we all got it wrong. His hometown was a small community whose borders were buffered by larger towns with stoplights, but his own town did not have one. And, as it turned out, he graduated early and went to that well-known, prestigious out-of-state institute of technology in Georgia, before getting homesick and transferring to an in-state university back home. So the lie was about his grandparents. While they may or may not have met at an early age, they definitely did not meet in the second grade.

The truth about high availability is that storage availability and platform or infrastructure availability are not enough on their own. In order to create the most robust, available, resilient, and reliable high availability infrastructure you must also include a commercial-grade solution to provide application-aware monitoring, alerting and recovery. You’ll also want that solution to be knowledgeable of your storage’s high availability capabilities, have a strong awareness of the infrastructure’s nuances and gaps, and have the ability to leverage best practices across the entire architecture to help your applications, databases, and services achieve your business objectives.

Minimizing Downtime with High Availability

January 29, 2022 by Jason Aw

Downtime has become more costly than ever before for modern businesses. The ITIC 2021 Hourly Cost of Downtime Survey found that in 91% of organizations, one hour of downtime in a business-critical system, database, or application costs an average of more than $300,000, and for 18% of large enterprises, the cost of an hour of downtime exceeds $5 million.

High availability (HA) is an attribute of a system, database, or application that’s designed to operate continuously and reliably for extended periods. The goal of HA is to reduce or eliminate unplanned downtime for critical applications. This is achieved by eliminating single points of failure by incorporating redundant components and other technologies in the design of a business-critical system, database, or application.

SLAs and HA Metrics

Service-level agreements (SLAs) are used by service providers to guarantee that a customer’s business-critical systems, databases, or applications are up and running when the business needs them.

IDC has created an SLA model that defines uptime requirements at five levels as follows:

AL4 (Continuous Availability – System Fault Tolerance): No more than 5 minutes and 15 seconds of planned and unplanned downtime per year (99.999% or “five-nines” availability)
AL3 (High Availability – Traditional Clustering): No more than 52 minutes and 35 seconds of planned and unplanned downtime per year (99.99% or “four-nines” availability)
AL2 (Recovery – Data Replication and Backup): No more than 8 hours, 45 minutes, and 56 seconds of planned and unplanned downtime per year (99.9% or “three-nines” availability)
AL1 (Reliability – Hot Swappable Components): No more than 87 hours, 39 minutes, and 29 seconds of planned and unplanned downtime per year (99% or “two-nines” availability)
AL0 (Unprotected Servers): No availability or uptime guarantee
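
The downtime allowances above follow directly from the availability percentages. The sketch below reproduces them in Python; it assumes an average Gregorian year of 365.2425 days and truncation to whole seconds, which is an inference that happens to match the listed figures rather than something stated in the IDC model. The function name `max_downtime` is illustrative.

```python
SECONDS_PER_YEAR = 31_556_952  # assumed: 365.2425 days * 86,400 seconds

def max_downtime(availability_percent):
    """Return the yearly downtime budget as (hours, minutes, seconds), truncated."""
    downtime_seconds = int(SECONDS_PER_YEAR * (100 - availability_percent) / 100)
    hours, remainder = divmod(downtime_seconds, 3600)
    minutes, seconds = divmod(remainder, 60)
    return hours, minutes, seconds

for level, percent in [("AL4", 99.999), ("AL3", 99.99), ("AL2", 99.9), ("AL1", 99.0)]:
    h, m, s = max_downtime(percent)
    print(f"{level}: {percent}% -> {h} h {m} min {s} s per year")
```

Each additional "nine" cuts the allowed downtime by a factor of ten, which is why the step from AL3 to AL4 is so much harder to achieve than the step from AL1 to AL2.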

According to ITIC, 89% of surveyed organizations now require “four-nines” availability for their business-critical systems, databases, and applications, and 35% of those organizations further endeavor to achieve “five-nines” availability.

In addition to uptime and availability, two other important HA metrics are Recovery Time Objectives (RTOs) and Recovery Point Objectives (RPOs). RTO is the maximum tolerable duration of any outage, and RPO is the maximum amount of data loss that can be tolerated when a failure happens. Unlike RTO and RPO metrics for disaster recovery, which are typically defined in hours and days, RTO and RPO metrics for business-critical systems, databases, and applications are often only a few seconds (RTO) and zero (RPO).
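
As a rough illustration of how these two objectives are checked against an actual outage, consider the Python sketch below. It assumes data loss is measured as a time window (the gap between the last replicated write and the failure), which is a common convention but not the only one; the function name and the sample values are hypothetical.

```python
from datetime import datetime, timedelta

def meets_objectives(failure_at, restored_at, last_replicated_at, rto, rpo):
    """Compare one outage against an RTO (outage duration) and RPO (data loss window)."""
    recovery_time = restored_at - failure_at
    data_loss_window = max(failure_at - last_replicated_at, timedelta(0))
    return recovery_time <= rto, data_loss_window <= rpo

failure = datetime(2022, 1, 29, 9, 0, 0)

# HA-style target: seconds of outage, zero data loss.
ha_result = meets_objectives(
    failure_at=failure,
    restored_at=failure + timedelta(seconds=20),
    last_replicated_at=failure,  # every write was replicated before the failure
    rto=timedelta(seconds=30),
    rpo=timedelta(0),
)

# The same targets applied to a recovery from a nightly backup.
backup_result = meets_objectives(
    failure_at=failure,
    restored_at=failure + timedelta(hours=4),
    last_replicated_at=failure - timedelta(hours=9),
    rto=timedelta(seconds=30),
    rpo=timedelta(0),
)
```

Here `ha_result` is `(True, True)` and `backup_result` is `(False, False)`, showing why backup-based recovery alone cannot satisfy HA-level objectives.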

HA Clustering

HA clustering typically consists of server nodes, storage, and clustering software.

Traditional Clustering

A traditional, on-premises HA cluster is a group of two or more server nodes connected to shared storage (typically, a storage area network, or SAN) that are configured with the same operating system, databases, and applications (see Figure 1).

Figure 1: Traditional server clustering with shared storage

One of the nodes is designated as the primary (or active) node and the other(s) are designated as secondary (or standby) nodes. If the primary node fails, clustering allows a system, database, or application to automatically fail over to one or more secondary nodes and continue operating with minimal disruption. Since the secondary node is connected to the same storage, operation continues with zero data loss.

However, the use of shared storage in the traditional clustering model creates several challenges, including:

The shared storage itself is a single point of failure that can potentially take all of the connected nodes in the cluster offline.
SAN storage can also be costly and complex to own and manage.
Shared storage in the cloud can add significant, unnecessary cost and complexity and some cloud providers don’t even offer a shared storage option.

SANless Clustering

SANless or “shared nothing” clusters (see Figure 2) address the challenges associated with shared storage. In these configurations, every cluster node has its own local storage. Efficient host-based, block-level replication is used to synchronize storage on the cluster nodes, keeping them identical. In the event of a failover, secondary nodes access an identical copy of the storage used by the primary node.

Figure 2: HA clustering with SANless or “shared-nothing” storage
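
The idea of keeping local storage identical across nodes can be sketched in a few lines of Python. This is a toy model only: it assumes synchronous mirroring (a write is applied to every replica before it completes) and models storage as a dictionary of blocks. The class names are hypothetical and do not represent how any SIOS product is implemented.

```python
class Node:
    """A cluster node with its own local storage, modeled as block number -> bytes."""
    def __init__(self, name):
        self.name = name
        self.blocks = {}

class MirroredVolume:
    """Toy synchronous mirror: each write is applied to the primary and all secondaries."""
    def __init__(self, primary, secondaries):
        self.primary = primary
        self.secondaries = list(secondaries)

    def write(self, block_number, data):
        self.primary.blocks[block_number] = data
        for node in self.secondaries:
            node.blocks[block_number] = data  # in a real cluster this crosses the network

    def fail_over(self):
        """Promote the first secondary, which already holds an identical copy."""
        self.primary = self.secondaries.pop(0)
        return self.primary

alpha, beta = Node("alpha"), Node("beta")
volume = MirroredVolume(alpha, [beta])
volume.write(0, b"ledger-v1")
volume.write(0, b"ledger-v2")
volume.write(7, b"index")
new_primary = volume.fail_over()
```

After the failover, `new_primary` is `beta`, and its blocks match what `alpha` held at the time of failure, with no shared storage involved.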

Clustering Software

Clustering software lets you configure your servers as a cluster so that multiple servers can work together to provide HA and prevent data loss. A variety of clustering software solutions are available for Windows, Linux distributions, and various virtual machine hypervisors. However, each of these solutions limits your flexibility and deployment options and introduces various challenges such as technical complexity and expensive licensing.

Don’t Wait for Disaster to Strike

HA is crucial for business-critical systems, databases, and applications. But with the myriad platforms available, complexity ramps up significantly. That’s why an application-aware solution makes so much sense. What you need is a trusted partner who has extensive expertise in high availability—a partner like SIOS, which has the technological know-how to ensure that your business stays up and running.

Don’t wait for an outage or disaster to find out if you have the resiliency your business needs. Schedule a personalized demo today at https://us.sios.com to see what SIOS can do for your business.

Reproduced from SIOS

Four Reasons To Use An Avoidance Strategy In High Availability

November 28, 2021 by Jason Aw

Four Avoidance Strategies for Improving Cluster Resilience, Performance, and Outcomes

Simple Steps for Deployment in SIOS Protection Suite Cluster Environment

Avoiding something – we’ve all done it before. An old flame we see in the store while walking with our spouse, a salesperson when we aren’t “ready to buy”, and even a boss while we are out on “vacation”. When I was the manager of a development team, I caught a glimpse of a direct report browsing in a store while they were supposed to be out of the office sick. They ducked between clothing racks and scurried down the next aisle and hurried away. We’ve all done it before, and in some cases, for mental health, physical health, or reasons that remain private and personal, we all need some measures of avoidance. Even in HA. So, how do you add avoidance to your High Availability environment, and why?

Four reasons to use an avoidance strategy in High Availability

Better Performance (minimizing server overload)

One reason to use avoidance strategies in HA is to increase application and server performance. Consider the case of three servers running production workloads; let’s call them Server Alpha, Server Beta, and Server Gamma. Servers Alpha and Beta are running critical applications backed by a database, while Server Gamma is running reports and data transformation jobs. In the event of a failure of Server Alpha, a failover to Server Beta would traditionally occur. However, because Server Beta is already running a large workload, the additional application load might cause an undesirable server overload and poor performance for both applications. So it might be wise to deploy an avoidance strategy to make sure that Server Gamma is chosen as the failover target.
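
The decision described above can be expressed as a small selection rule. The following Python sketch is illustrative only: the load figures (percent of capacity), the 100% capacity limit, and the function name are assumptions, and real cluster software makes this choice through its own configuration rather than code like this.

```python
def pick_failover_target(failed, load_by_server, incoming_load, capacity=100):
    """Pick the surviving server with the lowest resulting load that stays within capacity."""
    candidates = {
        name: load + incoming_load
        for name, load in load_by_server.items()
        if name != failed and load + incoming_load <= capacity
    }
    if not candidates:
        return None  # no server can absorb the workload without overload
    return min(candidates, key=candidates.get)

# Hypothetical current load, as a percentage of each server's capacity.
load = {"Alpha": 60, "Beta": 70, "Gamma": 20}
target = pick_failover_target("Alpha", load, incoming_load=load["Alpha"])
```

With these numbers, Beta would reach 130% and is avoided, so `target` is `"Gamma"`.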

Performance Optimization

Consider again the scenario of three servers, Alpha, Beta, and Gamma. Servers Alpha and Beta are scaled to handle peak workloads, while Server Gamma is a cost-optimized server. In the event of a failure of Server Alpha or Server Beta (or both), a failover will occur to the cost-optimized server, Gamma. However, this server is not scaled to handle peak workloads, nor the workloads of both Server Alpha and Server Beta at the same time. In this instance, an avoidance strategy can be used to optimize performance by automatically moving one or both of the workloads off Server Gamma as soon as another host is available.

HA Optimization

HA optimization is another scenario for deploying avoidance strategies. Like the performance optimization strategy, HA optimization is used to ensure that your environment can survive most failure scenarios and that your applications are optimized to provide the highest level of availability possible at any point in time. HA optimization is important for an application such as SAP with replicated enqueue processes. In any SAP environment, you do not want the ASCS (ABAP SAP Central Services) and ERS (Enqueue Replication Server) instances residing on the same server for extended periods of time because of the risk of lost locks and canceled jobs. To prevent this from occurring, you can use an avoidance strategy that causes the ERS and ASCS instances to always run on opposite cluster nodes. Consider the case of three servers running production workloads; let’s call them Servers Alpha, Beta, and Gamma. Server Alpha is running the ASCS instance, while Server Beta is running the ERS instance. Server Gamma functions as a third node for failovers of both Server Beta (ERS) and Server Alpha (ASCS). If Beta crashes, you wouldn’t want the ERS resource to come up on Alpha, the node already running the ASCS instance. To prevent this, you can deploy an avoidance strategy that automatically checks first and ensures the two applications are on separate servers, maintaining SAP ASCS/ERS best practices for lock failover.
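
A minimal Python sketch of that check follows. It models only the placement decision (prefer any healthy node that is not hosting the peer instance, and share a node only as a last resort); the function name and data structures are hypothetical and are not part of SAP or SIOS tooling.

```python
def pick_node_avoiding(peer_node, healthy_nodes):
    """Choose a node for a resource, preferring nodes that do not host its peer."""
    preferred = [node for node in healthy_nodes if node != peer_node]
    # Fall back to sharing a node only when no other healthy node exists.
    choices = preferred or list(healthy_nodes)
    return choices[0] if choices else None

placement = {"ASCS": "Alpha", "ERS": "Beta"}

# Beta crashes; Alpha and Gamma remain healthy.
placement["ERS"] = pick_node_avoiding(placement["ASCS"], ["Alpha", "Gamma"])
```

After the failover, `placement["ERS"]` is `"Gamma"`, so ASCS and ERS stay on separate nodes. If Alpha were the only healthy node left, the function would return `"Alpha"` rather than leave ERS offline.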

DR Avoidance

Suppose you have two data centers, City Alpha and City Beta, which are about 70 miles apart, with most of your clients centrally located between them. However, due to recent changes in internal organizations, mergers/closures and acquisitions, and governance requirements, your IT team has to add a third data center located in City Gamma, which is about 350 miles from Alpha and Beta. Now the resources that were primarily protected in Alpha and Beta are also extended to the Gamma location. Given that most of the users and teams are near the Alpha and Beta locations, and even the most distant users are located in neighboring cities, your team needs to avoid a failover to the Gamma location whenever possible. Like the other strategies, a DR avoidance strategy seeks to optimize performance, inbound/outbound regional data costs, latency, and client access by avoiding the DR node when only one node within either region fails. It also ensures that even if both nodes fail at different times, failover always occurs to the other node in the cluster or data center before moving to DR.
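
One way to picture DR avoidance is as an ordering of failover candidates in which the DR site always comes last. The Python sketch below assumes one node per city and a simple `is_dr` flag; the node names and the function name are hypothetical.

```python
# Hypothetical inventory: one node per data center, with City Gamma as the DR site.
nodes = [
    {"name": "alpha-1", "site": "City Alpha", "is_dr": False},
    {"name": "gamma-1", "site": "City Gamma", "is_dr": True},
    {"name": "beta-1", "site": "City Beta", "is_dr": False},
]

def failover_candidates(nodes, failed):
    """List surviving nodes with non-DR nodes first; the DR node is a last resort."""
    survivors = [node for node in nodes if node["name"] not in failed]
    # sorted() is stable and False sorts before True, so DR nodes move to the end.
    return [node["name"] for node in sorted(survivors, key=lambda node: node["is_dr"])]

after_alpha_fails = failover_candidates(nodes, failed={"alpha-1"})
after_both_fail = failover_candidates(nodes, failed={"alpha-1", "beta-1"})
```

With only `alpha-1` down, the order is `["beta-1", "gamma-1"]`; the DR node is used only once both nearby nodes have failed.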

So, how do you deploy an avoidance strategy?

Many providers have affinity rules that can be configured, while others use a combination of server priorities or manual steps. In the case of the SIOS Protection Suite for Linux, you can use a number of built-in methods including:

Resource prioritization

In the event of a failure, a resource will fail over to the surviving server on which it has the lowest remaining priority number, and then cascade to any additional servers. Consider three servers: Alpha, Beta, and Gamma. Server Alpha is the primary server for Resource.HR, Server Beta is the primary server for Resource.MFG, and Server Gamma is the backup server for all resources/servers. Using resource prioritization, Resource.HR would have a priority of one (1) on Server Alpha and a priority of two (2) on Server Gamma, while Resource.MFG could have a priority of one (1) on Server Beta and a priority of two (2) on Server Gamma. If customers wanted to optimize the use of the environment, then Resource.HR could have a priority of three (3) on Server Beta and Resource.MFG could have a priority of three (3) on Server Alpha. In the event of a failure of Server Alpha, Resource.HR would fail over to Server Gamma first before trying to come in-service (be restored) on Server Beta.
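
The priority table in this example can be written out and queried as follows. This is a plain Python model of the ordering logic, not SIOS Protection Suite configuration syntax; the dictionary layout and function name are assumptions made for illustration.

```python
# Priority of each resource on each server (1 is tried first), as in the example above.
priorities = {
    "Resource.HR": {"Alpha": 1, "Gamma": 2, "Beta": 3},
    "Resource.MFG": {"Beta": 1, "Gamma": 2, "Alpha": 3},
}

def failover_sequence(resource, failed_servers):
    """Return the surviving servers for a resource, ordered by ascending priority number."""
    ranked = sorted(priorities[resource].items(), key=lambda item: item[1])
    return [server for server, _ in ranked if server not in failed_servers]

hr_after_alpha_fails = failover_sequence("Resource.HR", {"Alpha"})
mfg_after_beta_fails = failover_sequence("Resource.MFG", {"Beta"})
```

Here `hr_after_alpha_fails` is `["Gamma", "Beta"]`: the dedicated backup server is tried first, and the other production server is used only if Gamma is also unavailable.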

SIOS Protection Suite for Linux (UI and CLI) allows users to specify a priority for each server and resource combination.

Policy or affinity rules

Policy rules can also be used to prevent a resource recovery from occurring on a given server, allowing a resource to avoid a specified server that may be running a more critical or resource-intensive workload. Typical policies include:

Constraint policies that will block an application from a specific server by default.
Resource policies that will block an application from a server that does not have sufficient resources.
Temporal policies that define a time period during which resources are allowed on or disallowed from a system.
Custom policies that define preferred servers or possible application ownership abilities within the cluster.
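
To make the policy types above more concrete, here is a hedged Python sketch in which each policy is a small function that either permits or blocks a recovery, and a placement is allowed only if every policy agrees. The policy constructors, server attributes, and application names are all hypothetical; this does not reflect the syntax or behavior of any vendor's policy engine.

```python
from datetime import time

def constraint(blocked_app, blocked_server):
    """Constraint policy: never place blocked_app on blocked_server."""
    return lambda server, app, now: not (app == blocked_app and server["name"] == blocked_server)

def enough_memory(required_gb):
    """Resource policy: the server must have enough free memory for the application."""
    return lambda server, app, now: server["free_gb"] >= required_gb[app]

def blocked_window(start, end):
    """Temporal policy: no recoveries onto the server between start and end."""
    return lambda server, app, now: not (start <= now < end)

def allowed(server, app, now, policies):
    """A placement is allowed only if every policy permits it."""
    return all(policy(server, app, now) for policy in policies)

policies = [
    constraint("reports", "Alpha"),
    enough_memory({"reports": 8, "erp": 32}),
    blocked_window(time(1, 0), time(3, 0)),
]
alpha = {"name": "Alpha", "free_gb": 64}
gamma = {"name": "Gamma", "free_gb": 16}

reports_on_gamma_at_noon = allowed(gamma, "reports", time(12, 0), policies)
erp_on_gamma_at_noon = allowed(gamma, "erp", time(12, 0), policies)
```

In this example, `reports` may be recovered on Gamma at noon, while `erp` is blocked there because Gamma lacks the required memory.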

The SIOS Protection Suite for Linux CLI allows users to specify policy rules that can disable failover of a specific resource to a specified server, apply temporal policies that guard against failures, disable failover for a specific application type, and define constraint policies and custom policies.

Specific Avoidance Resources

The most granular way to establish a resource avoidance strategy is to deploy specific avoidance scripts within each hierarchy. This method allows the user to configure specific applications (e.g., app1 and app2) to avoid one another whenever possible while allowing other applications to run without restriction. In the case of our three servers, Alpha, Beta, and Gamma, and three resources, app1, app2, and app3, this method would provide the greatest flexibility. In this example, app1 and app2 will seek to avoid collocation when a server fails, but app3 will fail over to the next available node based on priorities without any collocation restrictions.

For additional examples of avoidance strategies and resources, consult the SIOS Protection Suite for Linux documentation. For example, if a customer has two applications, app1 and app2, that must run on different nodes whenever possible, the customer can create two avoidance terminal leaf node resources using the SIOS Protection Suite for Linux gen/app resource and the ‘/opt/LifeKeeper/lkadm/bin/avoid_restore’ script.

Reproduced from SIOS