December 8, 2021 |
Achieving IT Resilience with High Availability

What is IT Resilience?
IT resilience is the ability of an organization to maintain acceptable service levels through disruptions to business operations, critical processes, or the IT ecosystem. In this digital age, high availability is critical to your organization’s success. Your customers won’t tolerate a downed website, and you cannot afford a downed ERP, CRM, or other business-critical system either. This is where high availability comes in. Your organization must “check the boxes” on many different technologies and solutions to ensure IT resilience – not least among them ensuring, at a minimum, that you have backup, disaster recovery, cyber resilience, and high availability solutions in place. For the purposes of this article, we will focus on high availability (HA) as one of the key elements required to ensure IT resilience.

What is High Availability?
High availability systems ensure that business operations continue – with total transparency to customers and users – when your systems, applications, or network go down. HA is a component of a technology system that eliminates single points of failure to ensure continuous operations or uptime for an extended period. Highly available systems incorporate five design principles: automatic failover, automatic detection of application-level failures, no data loss, automatic and fast failover to redundant components, and push-button failover and failback for planned maintenance.

IT Resilience and High Availability – A Non-Example
This past August, Nissan Group’s data center in Denver crashed because of a power outage. The system impacted was known internally as NNANet, a Nissan solution used by employees to order cars and parts, manage product rebate sales, get information on vehicle recalls, file the warranty claims needed to price and start service work, and obtain financing information. NNANet is described as Nissan’s lifeblood because everything Nissan does goes through it. The system remained down for four days, impacting operations at many retailers and production systems at two factories. The company, its retailers, and its customers were all affected.

The Impact
Clearly, this is an example where correctly configured, properly located high availability systems would have saved the day, or at least minimized the impact of the crash. What should have been a high availability situation literally turned into a disaster for Nissan, as “commerce among consumers, retailers, distribution networks, manufacturing plants and finance companies” was affected for four days.[1] Nissan reset dealer sales goals by 10 percent for the month as a result of the crash. The total financial impact for Nissan and its dealers, retailers, and partners remains to be seen.

IT Resilience – A Real-World Example
Cayan™ is the leading provider of payment technologies, and its Genius Customer Engagement Platform® aggregates and integrates every conceivable transaction technology, payment type, and customer program – both present and future – into a single platform. The Genius platform, as well as other mission-critical applications at Cayan, runs on SQL Server. Cayan customers include some of the world’s largest online retailers, companies with no tolerance for downtime.
“Our top priority is ensuring that our customers can complete transactions continuously, 24 hours a day, seven days a week,” said Paul Vienneau, Chief Technology Officer, Cayan. Cayan needed a high availability and disaster recovery solution for its SQL Server databases. The company considered a traditional shared-storage cluster, but a SAN solution was expensive, complicated to manage, and introduced the risk associated with a single point of failure. For these reasons, Cayan IT staff decided to use SIOS SANless clusters. SANless clusters use local storage, so there is minimal performance overhead and application response times stay fast. The SIOS software, SIOS DataKeeper, integrates with Windows Server Failover Clustering (WSFC). SIOS uses efficient, real-time data replication to synchronize local storage on the primary and remote cluster nodes, making it appear to WSFC as a virtual SAN.

The Impact
Since deploying SIOS SANless clusters, Cayan has not experienced any downtime or data loss. “We are very pleased with the SIOS DataKeeper software,” comments Paul Vienneau, CTO. “It met or exceeded our expectations. Implementation and ongoing administration were easy, and we have had zero downtime since we implemented our SIOS SANless clusters.” There are no customer satisfaction issues to report, no lost revenue, no unproductive employees, and no disruption to the business.

SIOS: Achieve IT Resilience with High Availability
SIOS DataKeeper™ uses efficient block-level replication to keep local storage synchronized, enabling the secondary nodes in your cluster to continue to operate after a failover with access to the most recent data. SIOS products uniquely protect any Windows- or Linux-based application operating in physical, virtual, cloud, or hybrid cloud environments and in any combination of site or disaster recovery scenarios, enabling high availability and disaster recovery for applications such as SAP S/4HANA and databases including Oracle, SQL Server, DB2, and many others. The “out-of-the-box” simplicity, configuration flexibility, reliability, performance, and cost effectiveness of SIOS products set them apart from other clustering software.

In a Windows environment, SIOS DataKeeper Cluster Edition seamlessly integrates with and extends Windows Server Failover Clustering (WSFC) by providing a performance-optimized, host-based data replication mechanism. While WSFC manages the software cluster, SIOS performs the replication to enable disaster protection and ensure zero data loss in cases where shared storage clusters are impossible or impractical, such as in cloud, virtual, and high-performance storage environments. In a Linux environment, SIOS LifeKeeper™ and SIOS DataKeeper for Linux provide a tightly integrated combination of high availability failover clustering, continuous application monitoring, data replication, and configurable recovery policies, protecting your business-critical applications from downtime and disasters.

Whether you are in a Windows or Linux environment, SIOS products free your IT team from the complexity and challenges of creating and managing high availability computing infrastructures. They provide the intelligence, automation, flexibility, high availability, and ease of use IT managers need to protect business-critical applications from downtime or data loss.
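As a rough illustration of the automatic failure detection and failover behavior described above, the sketch below shows a single monitoring cycle for a two-node cluster. The node names, timeout value, and probe functions are hypothetical placeholders, not SIOS or WSFC APIs.

```python
import time

HEARTBEAT_TIMEOUT = 10  # hypothetical: seconds of silence before failover is triggered

def application_healthy(node: str) -> bool:
    """Placeholder for an application-level probe (e.g., running a test query),
    not just a check that the server is powered on."""
    return True  # assume healthy in this sketch

def last_heartbeat(node: str) -> float:
    """Placeholder: timestamp of the last heartbeat received from the node."""
    return time.time()

def monitor_and_failover(active: str, standby: str) -> str:
    """Return the node that should own the application after one check cycle."""
    silent_for = time.time() - last_heartbeat(active)
    if silent_for > HEARTBEAT_TIMEOUT or not application_healthy(active):
        # Automatic failover: bring the application in service on the standby,
        # which already holds an up-to-date copy of the data via replication.
        print(f"Failing over from {active} to {standby}")
        return standby
    return active

if __name__ == "__main__":
    owner = monitor_and_failover("primary", "standby")
    print(f"Application owner: {owner}")
```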
SIOS = IT Resilience with HA + DR
Backup, high availability, disaster recovery, and cyber resilience are all important elements in achieving IT resilience. With SIOS solutions, you can “check the box” for both high availability and disaster recovery – two solutions in one. With the ability to replicate to multiple targets, you can configure a multi-node failover cluster with nodes located in multiple locations to protect your systems from failures and disasters. For more information, and to ensure IT resilience for your organization, get a free demo of SIOS today.

References:
Reproduced with permission from SIOS
|
December 3, 2021 |
How to Achieve High Availability with Clusters
Reproduced from SIOS
|
November 28, 2021 |
Four Reasons To Use An Avoidance Strategy In High Availability
Four Avoidance Strategies for Improving Cluster Resilience, Performance, and Outcomes
Simple Steps for Deployment in a SIOS Protection Suite Cluster Environment
Avoiding something – we’ve all done it before. An old flame we see in the store while walking with our spouse, a salesperson when we aren’t “ready to buy”, even a boss while we are out on “vacation”. When I was the manager of a development team, I caught a glimpse of a direct report browsing in a store while they were supposed to be out of the office sick. They ducked between clothing racks, scurried down the next aisle, and hurried away. We’ve all done it before, and in some cases – for mental health, physical health, or reasons that remain private and personal – we all need some measure of avoidance. Even in HA. So, how do you add avoidance to your high availability environment, and why?

Four reasons to use an avoidance strategy in High Availability
One reason to use avoidance strategies in HA is to increase application and server performance. Consider the case of three servers running production workloads; let’s call them Server Alpha, Server Beta, and Server Gamma. Servers Alpha and Beta are running critical applications backed by a database, while Server Gamma is running reports and data transformation jobs. In the event of a failure of Server Alpha, a failover to Server Beta would traditionally occur. However, because Server Beta is already running a large workload, the resulting additional application load might overload that server and degrade performance for both applications. So it might be wise to deploy an avoidance strategy to make sure that Server Gamma is chosen as the failover target.
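As a minimal sketch of that kind of load-aware target selection (the node names, utilization figures, and threshold below are illustrative, not a SIOS configuration):

```python
# Hypothetical utilization figures (percent CPU) reported by each candidate node.
utilization = {"Beta": 85.0, "Gamma": 20.0}
OVERLOAD_THRESHOLD = 70.0  # avoid nodes that are already running this hot

def pick_failover_target(candidates: dict[str, float]) -> str:
    """Prefer nodes below the overload threshold; otherwise take the least-loaded one."""
    not_overloaded = {n: u for n, u in candidates.items() if u < OVERLOAD_THRESHOLD}
    pool = not_overloaded or candidates
    return min(pool, key=pool.get)

print(pick_failover_target(utilization))  # -> "Gamma": Beta is avoided while it is busy
```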
A second reason is to protect performance when workloads land on cost-optimized servers. Consider again the scenario of three servers, Alpha, Beta, and Gamma. Servers Alpha and Beta are scaled to handle peak workloads, while Server Gamma is a cost-optimized server. If both Server Alpha and Server Beta fail, a failover will occur to the cost-optimized server, Gamma. However, this server is not scaled to handle peak workloads, nor the workloads of both Server Alpha and Server Beta at the same time. In this instance, an avoidance strategy can be used to optimize performance by automatically moving one or both of the workloads off Server Gamma as soon as another host is available.
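A rough sketch of that rebalancing idea, with hypothetical node classes and workload names rather than product syntax:

```python
# Hypothetical inventory: which nodes are sized for peak load, and which are cost-optimized.
FULL_SIZE_NODES = {"Alpha", "Beta"}
COST_OPTIMIZED_NODES = {"Gamma"}

def rebalance(workload_location: dict[str, str], healthy_nodes: set[str]) -> dict[str, str]:
    """Move any workload stranded on a cost-optimized node back to a healthy full-size node."""
    for workload, node in workload_location.items():
        if node in COST_OPTIMIZED_NODES:
            candidates = FULL_SIZE_NODES & healthy_nodes
            if candidates:
                target = sorted(candidates)[0]
                print(f"Moving {workload} from {node} back to {target}")
                workload_location[workload] = target
    return workload_location

# Example: Alpha has recovered, so its workload is moved off the small node Gamma.
print(rebalance({"app_alpha": "Gamma", "app_beta": "Beta"}, {"Alpha", "Beta", "Gamma"}))
```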
HA optimization is another scenario for deploying avoidance strategies. Like the performance optimization strategy, HA optimization is used to ensure that your environment can survive most failure scenarios and that your applications provide the highest level of availability possible at any point in time. HA optimization is important for applications such as SAP with replicated enqueue processes. In any SAP environment, you do not want the ASCS (ABAP SAP Central Services) and ERS (Enqueue Replication Server) instances residing on the same server for extended periods because of the risk of lost locks and canceled jobs. To prevent this, you can use an avoidance strategy that causes the ERS and ASCS instances to always run on opposite cluster nodes. Consider the case of three servers running production workloads; let’s call them Servers Alpha, Beta, and Gamma. Server Alpha is running the ASCS instance, while Server Beta is running the ERS instance. Server Gamma functions as a third node for failovers of both Server Beta (ERS) and Server Alpha (ASCS). If Beta crashes, you wouldn’t want the ERS resource running on the same node as the ASCS instance. To ensure this, you can deploy an avoidance strategy that automatically checks first that the two applications are on separate servers, maintaining SAP ASCS/ERS best practices for lock failover.
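Conceptually, the anti-collocation check might look like the sketch below; the node names and placement table are hypothetical, and this is not how SIOS implements the rule:

```python
# Hypothetical cluster state: which node currently runs each SAP instance.
placement = {"ASCS": "Alpha", "ERS": "Beta"}

def pick_ers_target(placement: dict[str, str], healthy: list[str]) -> str:
    """Choose a failover target for ERS that avoids the node running ASCS."""
    allowed = [n for n in healthy if n != placement["ASCS"]]
    if not allowed:
        # Last resort: better to run collocated than not to run at all.
        allowed = healthy
    return allowed[0]

# Beta has failed; ERS should land on Gamma, not on Alpha where ASCS lives.
print(pick_ers_target(placement, ["Alpha", "Gamma"]))  # -> "Gamma"
```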
Suppose you have two data centers, City Alpha and City Beta, which are about 70 miles apart, with most of your clients centrally located between them. However, due to recent organizational changes, mergers, closures, acquisitions, and governance requirements, your IT team has to add a third data center in City Gamma, about 350 miles from Alpha and Beta. The resources that were primarily protected in Alpha and Beta are now also extended to the Gamma location. Given that most users and teams are near the Alpha and Beta locations, and even the most remote users are located in neighboring cities, your team needs to avoid a failover to the Gamma location. Like the other strategies, DR avoidance seeks to optimize performance, regional data ingress/egress costs, latency, and client access by avoiding the DR node when only one node within either region fails. It also ensures that even if both nodes fail at different times, failover always occurs to the other node in the cluster or data center before moving to DR.

So, how do you deploy an avoidance strategy?
Many providers have affinity rules that can be configured, while others use a combination of server priorities or manual steps. In the case of the SIOS Protection Suite for Linux, you can use a number of built-in methods, including:
Resource prioritization. In the event of a failure, resources fail over to the surviving server where they have the lowest priority number (highest precedence) and cascade to any additional servers (Alpha, Beta, and Gamma). Server Alpha is the primary server for Resource.HR, Server Beta is the primary server for Resource.MFG, and Server Gamma is the backup server for all resources. Using resource prioritization, Resource.HR would have a priority of one (1) on Server Alpha and a priority of two (2) on Server Gamma, while Resource.MFG could have a priority of one (1) on Server Beta and a priority of two (2) on Server Gamma. If customers wanted to optimize the use of the environment, then Resource.HR could have a priority of three (3) on Server Beta and Resource.MFG could have a priority of three (3) on Server Alpha. In the event of a failure of Server Alpha, Resource.HR would fail over to Server Gamma first before trying to come in service (be restored) on Server Beta. SIOS Protection Suite for Linux (UI and CLI) allows users to specify a priority for each server and resource combination.
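The sketch below illustrates how such a priority table drives target selection. The priorities mirror the example above, but the code is a hypothetical illustration rather than SIOS Protection Suite behavior:

```python
# Hypothetical priority table: lower number = more preferred server for that resource.
priorities = {
    "Resource.HR":  {"Alpha": 1, "Gamma": 2, "Beta": 3},
    "Resource.MFG": {"Beta": 1, "Gamma": 2, "Alpha": 3},
}

def next_server(resource: str, failed: set[str]):
    """Return the highest-precedence surviving server for a resource, or None if none remain."""
    surviving = {s: p for s, p in priorities[resource].items() if s not in failed}
    return min(surviving, key=surviving.get) if surviving else None

# Alpha fails: Resource.HR cascades to Gamma (priority 2) before Beta (priority 3).
print(next_server("Resource.HR", failed={"Alpha"}))           # -> "Gamma"
print(next_server("Resource.HR", failed={"Alpha", "Gamma"}))  # -> "Beta"
```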
Policy rules can also be used to prevent a resource recovery from occurring on a given server, allowing a resource to avoid a specified server that may be running a more critical or resource-intensive workload. The SIOS Protection Suite for Linux CLI allows users to specify policy rules that can disable failover of a specific resource to a specified server, apply temporal policies that govern when failovers may occur, disable failover for a specific application type, and define constraint policies and custom policies.
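As an illustration of how such policies might be evaluated before a failover is allowed (a hypothetical sketch, not the SIOS CLI or its policy format):

```python
from datetime import datetime, time

# Hypothetical policies, loosely modeled on the categories described above.
POLICIES = [
    {"type": "disable", "resource": "Resource.MFG", "server": "Gamma"},
    # Temporal policy: block automatic failovers during a nightly batch window.
    {"type": "temporal", "start": time(1, 0), "end": time(3, 0)},
]

def failover_allowed(resource: str, target: str, now: datetime = None) -> bool:
    """Apply the policy list before allowing a failover of `resource` to `target`."""
    now = now or datetime.now()
    for policy in POLICIES:
        if policy["type"] == "disable" and policy["resource"] == resource and policy["server"] == target:
            return False
        if policy["type"] == "temporal" and policy["start"] <= now.time() <= policy["end"]:
            return False
    return True

print(failover_allowed("Resource.MFG", "Gamma"))  # -> False: explicitly disabled
print(failover_allowed("Resource.HR", "Gamma"))   # -> depends on the time of day
```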
The most granular way to establish a resource avoidance strategy is to deploy specific avoidance scripts within each hierarchy. This method allows the user to configure specific applications (e.g., app1 and app2) to avoid one another whenever possible while allowing other applications to run without restriction. In the case of our three servers, Alpha, Beta, and Gamma, and three resources, app1, app2, and app3, this method provides the greatest flexibility: app1 and app2 will seek to avoid collocation when a server fails, while app3 will fail over to the next available node based on priorities, without any collocation restrictions. If a customer has two applications, app1 and app2, that must run on different nodes whenever possible, the customer can create two avoidance terminal leaf node resources using the SIOS Protection Suite for Linux gen/app resource and the ‘/opt/LifeKeeper/lkadm/bin/avoid_restore’ script. For additional examples of avoidance strategies and resources, consult the SIOS Protection Suite for Linux documentation.
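As a rough sketch of the idea (not the actual avoid_restore script or gen/app resource), the following hypothetical restore hook blocks a local restore when the paired application is already running on the node, so the resource comes in service elsewhere:

```python
# Hypothetical avoidance pairs: app1 and app2 should not be collocated; app3 is unrestricted.
AVOID_PAIRS = {"app1": "app2", "app2": "app1"}

def local_restore_allowed(app: str, apps_running_here: set[str]) -> bool:
    """Return False to push the restore to another node when the paired app is local."""
    peer = AVOID_PAIRS.get(app)
    return peer not in apps_running_here

print(local_restore_allowed("app1", {"app2", "app3"}))  # -> False: avoid collocation
print(local_restore_allowed("app3", {"app2"}))          # -> True: app3 has no restriction
```

Reproduced from SIOS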
|
November 23, 2021 |
Introduction To Clusters – Part 2

What Types of Clusters Are There and How Do They Work?
An Overview of HA Clusters and Load Balancing Clusters

Clustering helps improve the reliability and performance of software and hardware systems by creating redundancy to compensate for unforeseen system failure. If a system is interrupted due to hardware or software failure or a natural disaster, this can have a major impact on business and revenue, wasting crucial time and expense to get things back up and running. This is where clustering comes in. There are three main types of clustering solutions – HA clusters, load balancing clusters, and HPC clusters. Which type will best increase system availability and performance for your business? Let’s look at the three types of clustering solutions in more detail below.

What is HA Clustering?
High availability clustering, also known as HA clustering, is effective for mission-critical business applications, ERP systems, and databases such as SQL Server, SAP, and Oracle that require near-continuous availability. HA clustering can be divided into two types: the active-active configuration and the active-standby configuration. Let’s take a look at the difference between these two HA clustering types.

HA Clustering Type 1: Active-Active Configuration
In the active-active configuration, processing is performed on all nodes in the cluster. For example, in the case of two-node clustering, both nodes are active. If one node stops, its processing will be taken over by the other. However, if each node is operating at close to 100% capacity and one node stops, it will be difficult for the other node to take on the additional processing load. Therefore, capacity planning with a margin is important for HA clustering.

HA Clustering Type 2: Active-Standby Configuration
Let’s use our two-node example again. In the active-standby configuration, one node is configured as the active node and the other node is configured as the standby node. The active node and the standby node exchange signals called “heartbeats” to constantly check whether they are operating normally. If the standby node cannot receive the heartbeat of the active node, the standby node determines that the active node has stopped and takes over the processing of the active node. This mechanism is called “failover.” Conversely, the mechanism that recovers the stopped node and transfers processing back to the recovered active node is called “failback.” In an active-standby configuration, when a failure occurs, the simple switch from the active node to the standby node makes recovery relatively easy. However, keep in mind that the standby node’s resources sit idle, effectively wasted, while the active node is operating normally.

Two Components of HA Clustering: Application and Storage
For an HA cluster to be effective, two areas need to be addressed: application orchestration and storage protection. Clustering software monitors the health of the application being protected and, if it detects an issue, moves operation of that application over to the standby node. The standby node needs access to the most up-to-date version of the data – preferably identical to the data the primary node was accessing before the incident. This can be accomplished in two ways: shared storage or shared-nothing storage. In the shared storage model, both cluster nodes access the same storage – typically a SAN.
In shared-nothing (aka SANless) configurations, local storage on all nodes is mirrored using replication software. Clustering software products vary widely in their ability to monitor and detect issues that may cause application failure and in their ability to orchestrate failovers reliably. Many clustering products only detect whether the application server is operational but do not detect the wide range of software, service, network, and other issues that can cause application failure.

Application Awareness is Essential
Similarly, complex ERP and database applications have multiple components that have to be stored on the correct server or instance, started up in the right order, and brought online in accordance with complex best practices. Choose clustering software with specialized software, called application recovery kits, designed specifically to maintain best practices for application- and database-specific requirements.

There are multiple ways to configure an HA cluster:

Traditional Two Node Cluster with Shared Storage
Two Node SANless Cluster
Clusters can be configured using the local LAN and high-speed, synchronous, block-level replication. Real-time replication can be used to synchronize storage on the primary server with storage on a standby server located in the same data center, in your disaster recovery site, or both. This lets you build high availability and disaster recovery configurations flexibly, as two-node or multi-node clusters. SIOS block-level replication is highly optimized for performance. You can even use super-fast, locally attached storage such as PCIe flash storage devices on your physical servers to achieve very low cost, high performance, high availability configurations. Both your data on the flash device and your application are protected.
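Conceptually, synchronous block-level replication means a write is not acknowledged until both copies hold it, which is why the standby node can take over with current data. A toy sketch follows, with byte arrays standing in for block devices; this is an assumption-laden illustration, not SIOS DataKeeper code:

```python
# Toy model of synchronous block-level mirroring between a primary and a standby volume.
primary_volume = bytearray(1024)
standby_volume = bytearray(1024)

def replicated_write(offset: int, data: bytes) -> bool:
    """Write to local storage, ship the same block to the standby, then acknowledge."""
    primary_volume[offset:offset + len(data)] = data  # local write on the primary
    standby_volume[offset:offset + len(data)] = data  # synchronous replication step
    return primary_volume == standby_volume           # acknowledge only when both copies match

print(replicated_write(0, b"order-42"))  # -> True: standby holds identical data for failover
```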
Third Node for Disaster Protection
This configuration uses a SAN-based cluster and adds a third, SANless node in a remote data center or the cloud to achieve full disaster recovery protection. In the event of a disaster, the standby remote physical server is brought into service automatically with no data loss, eliminating the hours needed for restoration from backup media.

What is a Load Balancing Cluster?
Load balancing clustering uses a load balancer to distribute processing across multiple nodes that operate as a single system, improving performance. While it can isolate a failed node to prevent a node failure from affecting the entire system, the load balancer itself is a critical single point of failure, so it is not a high availability option. It is effective only for workloads such as web server load balancing; if the load balancer itself fails, the entire system stops.

What is HPC Clustering?
You can also use clustering for performance instead of high availability. High-performance computing (HPC) clusters combine the processing power of multiple nodes (sometimes thousands) to get the CPU performance needed in compute-intensive environments such as scientific and technical computing requiring large-scale simulations, CAE analysis, and parallel processing.

Are you ready to find the right HA clustering solution for your business? Learn more about SIOS high availability clustering here.

Reproduced with permission from SIOS |
November 18, 2021 |
Introduction To Clusters – Part 1

What is clustering in the first place?
Clustering is a technology that connects multiple servers so that they act as a single functional unit. |