May 25, 2022
High Availability, RTO, and RPO

High availability (HA) is an information technology term for a system or software component that is operational and available more than 99.99% of the time. End users of such an application or system experience less than 52.5 minutes of service interruption per year. This level of availability is typically achieved through high availability clustering, a configuration that reduces application downtime by eliminating single points of failure through redundant servers, networks, storage, and software.

What are recovery time objectives (RTO) and recovery point objectives (RPO)?

In addition to 99.99% availability, high availability environments also meet stringent recovery time and recovery point objectives. The recovery time objective (RTO) is the time elapsed from application failure to restoration of application operation and availability; in other words, how long a company can afford to have that application down. The recovery point objective (RPO) measures how up-to-date the data is once application availability has been restored after a downtime issue. It is often described as the maximum amount of data loss that can be tolerated when a failure happens. SIOS high availability clusters deliver an RPO of zero and an RTO of minutes.

What is a high availability cluster?

In a high availability cluster, important applications run on a primary server node, which is connected to one or more secondary nodes for redundancy. Clustering software, such as SIOS LifeKeeper, monitors the clustered applications and their dependent resources to ensure they are operational on the active node. System-level monitoring is accomplished via periodic heartbeats between cluster nodes; if the primary server fails, the secondary server initiates recovery once the heartbeat timeout interval is exceeded. For application-level failures, the clustering software detects that an application is not available on the active node and moves the application and its dependent resources to the secondary node(s) in a process called a failover, where operation continues and meets stringent RTOs. In a traditional failover cluster, all nodes are connected to the same shared storage, typically a storage area network (SAN). After a failover, the secondary node is granted access to the shared storage, enabling it to meet stringent RPOs.

Reproduced with permission from SIOS
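To make the heartbeat-and-failover process described above more concrete, here is a minimal, purely illustrative shell sketch of a standby node's monitor loop. It is not SIOS LifeKeeper's implementation; the peer address and the promote-to-active.sh recovery script are hypothetical placeholders.

#!/bin/bash
# Illustrative heartbeat monitor, run on the standby node (not production code).
PRIMARY_IP="10.0.1.10"      # hypothetical address of the active node
INTERVAL=5                  # seconds between heartbeats
MAX_MISSED=3                # heartbeat timeout = INTERVAL * MAX_MISSED seconds
missed=0

while true; do
    if ping -c 1 -W 2 "$PRIMARY_IP" > /dev/null 2>&1; then
        missed=0                                # primary answered; reset the counter
    else
        missed=$((missed + 1))
        echo "Missed heartbeat $missed of $MAX_MISSED"
    fi

    if [ "$missed" -ge "$MAX_MISSED" ]; then
        echo "Heartbeat timeout exceeded; initiating failover"
        /usr/local/bin/promote-to-active.sh     # hypothetical: bring apps, IPs, and storage online here
        break
    fi

    sleep "$INTERVAL"
done

A real clustering product adds much more (redundant heartbeat paths, quorum or witness checks to avoid split-brain, and application-level health checks), but the basic detect-then-recover loop is the same idea.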
May 21, 2022
SIOS Protection Suite for Linux Evaluation Guide for AWS Cloud Environments

Get Started Evaluating SIOS Protection Suite for Linux in AWS

Use this step-by-step guide to configure and test a two-node cluster in AWS to protect resources such as Oracle, SQL Server, PostgreSQL, NFS, SAP, and SAP HANA.

Before You Begin Your Evaluation

Review these links to understand key concepts you’ll need before you begin your failover clustering project in AWS.
Configuring Network Components

This section outlines the computing resources required for each node, the network structure, and the process required to configure these components.
Creating an Instance on AWS EC2 from Scratch
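As a rough illustration of this step (not taken from the SIOS guide itself), a Linux node for the evaluation cluster can be launched from the command line with the AWS CLI; every identifier below is a placeholder to replace with values from your own account and VPC.

# Launch one Linux instance for the cluster (all IDs and names are placeholders).
aws ec2 run-instances \
  --image-id ami-0123456789abcdef0 \
  --instance-type t3.large \
  --key-name my-keypair \
  --subnet-id subnet-0123456789abcdef0 \
  --security-group-ids sg-0123456789abcdef0 \
  --tag-specifications 'ResourceType=instance,Tags=[{Key=Name,Value=lk-node-1}]'

Repeat for the second node, ideally in a different Availability Zone, then continue with the steps below.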
Configure Linux Nodes to Run SIOS Protection Suite for Linux

Install SIOS Protection Suite for Linux

Login and Basic Configuration

Protecting Critical Resources
Once the IP resource is protected, initiate a switchover (where the “standby” node becomes the “active” node) to test the functionality.
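A simple, vendor-neutral way to observe the switchover from a client's point of view is to watch the protected (virtual) IP while you initiate the switchover; the address below is a placeholder for the IP resource you created.

# Watch the protected IP during the switchover; the gap in replies shows the effective RTO.
VIRTUAL_IP="10.0.2.100"      # placeholder: the IP resource protected by the cluster
while true; do
    if ping -c 1 -W 1 "$VIRTUAL_IP" > /dev/null 2>&1; then
        echo "$(date '+%H:%M:%S')  $VIRTUAL_IP reachable"
    else
        echo "$(date '+%H:%M:%S')  $VIRTUAL_IP NOT reachable"
    fi
    sleep 1
done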
May 16, 2022
Performance Of Azure Shared Disk With Zone Redundant Storage (ZRS)

On September 9th, 2021, Microsoft announced the general availability of Zone-Redundant Storage (ZRS) for Azure Disk Storage, including Azure Shared Disk. What makes this interesting is that you can now build shared-storage-based failover cluster instances that span Availability Zones (AZs). With cluster nodes residing in different AZs, users can now qualify for the 99.99% availability SLA. Prior to support for Zone-Redundant Storage, Azure Shared Disks only supported Locally Redundant Storage (LRS), limiting cluster deployments to a single AZ and leaving users susceptible to outages should an AZ go offline. There are, however, a few limitations to be aware of when deploying an Azure Shared Disk with ZRS.
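For readers who want to try this themselves, a ZRS shared disk can be created and attached to two cluster VMs with the Azure CLI roughly as follows; the resource group, disk, and VM names are placeholders, and current SKU and region support should be confirmed in the Azure documentation.

# Create a premium shared disk backed by zone-redundant storage (placeholder names).
az disk create \
  --resource-group my-rg \
  --name zrs-shared-disk \
  --size-gb 1024 \
  --sku Premium_ZRS \
  --max-shares 2

# Attach the shared disk to both cluster nodes, which can reside in different AZs.
az vm disk attach --resource-group my-rg --vm-name cluster-node-1 --name zrs-shared-disk
az vm disk attach --resource-group my-rg --vm-name cluster-node-2 --name zrs-shared-disk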
I also found an interesting note in the documentation. “Except for more write latency, disks using ZRS are identical to disks using LRS, they have the same scale targets. Benchmark your disks to simulate the workload of your application and compare the latency between LRS and ZRS disks.” While the documentation indicates that ZRS will incur some additional write latency, it is up to the user to determine just how much additional latency they can expect. A link to a disk benchmark document is provided to help guide you in your performance testing. Following the guidance in that document, I used DiskSpd to measure the additional write latency you might experience. Of course, results will vary with workload, disk type, instance size, etc., but here are my results.
The DiskSpd test that I ran used the following parameters:

diskspd -c200G -w100 -b8K -F8 -r -o5 -W30 -d10 -Sh -L testfile.dat

I wrote to a P30 disk with ZRS and a P30 disk with LRS, each attached to a Standard DS3 v2 (4 vCPUs, 14 GiB memory) instance. The shared ZRS P30 was also attached to an identical instance in a different AZ and added as shared storage to an empty cluster application. In my test, the ZRS disk showed roughly 2% higher write latency than the LRS disk; a 2% overhead seems like a reasonable price to pay to have your data distributed synchronously across two AZs. However, I did wonder what would happen if you moved the clustered application to the remote node, effectively putting your disk in one AZ and your instance in a different AZ.
In that scenario I measured a 25% write latency increase. If you experience a complete failure of an AZ, both the storage and the instance will fail over to the secondary AZ, and you shouldn’t experience this increase in latency at all. However, other failure scenarios that aren’t AZ-wide could very well leave your clustered application running in one AZ with your Azure Shared Disk in a different AZ. In those scenarios you will want to move your clustered workload back to a node that resides in the same AZ as your storage as soon as possible to avoid the additional overhead. Microsoft documents how to initiate a storage account failover to a different region when using GRS, but there is no way to manually initiate the failover of a storage account to a different AZ when using Zone-Redundant Storage. You should monitor your failover cluster instance to ensure you are alerted any time a cluster workload moves to a different server, and plan to move it back as soon as it is safe to do so. You can find yourself in this situation unexpectedly, but it will also certainly happen during planned maintenance when you do a rolling update of the clustered application servers. Awareness is the key to minimizing the amount of time your storage is performing in a degraded state. I hope that in the future Microsoft allows users to initiate a manual failover of a ZRS disk, as they already do with GRS. The reason they added that feature to GRS was to put the power in the hands of users in case automatic failover did not happen as expected. In the case of Zone-Redundant Storage, I could see people wanting to tie storage and application together, ensuring they are always running in the same AZ, similar to how host-based replication solutions like SIOS DataKeeper do it.
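On the Windows failover clustering side, checking where the clustered role is currently running (and moving it back once it is safe) can be done with the standard FailoverClusters PowerShell cmdlets; the role and node names below are placeholders.

# Check which node currently owns the clustered role (placeholder role name).
Get-ClusterGroup -Name "SQL Server (MSSQLSERVER)" | Select-Object Name, OwnerNode, State

# Once it is safe, move the role back to the node in the same AZ as the storage's current primary copy.
Move-ClusterGroup -Name "SQL Server (MSSQLSERVER)" -Node "cluster-node-1"

Pairing a check like this with your monitoring or alerting tool is one way to get the awareness described above.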
May 14, 2022
Availability SLAs: FT, High Availability and Disaster Recovery – Where to start

It’s fair to say that in this modern era, where many aspects of our lives are technology-driven, we live in a very instantaneous world. For example, at the click of a button, our weekly grocery order arrives on our doorstep. We can instantly purchase tickets for events or travel, or even, these days, order a brand-new car without having to go anywhere near a showroom and deal with a pushy salesperson. We are spoilt in this world of convenience. But let’s spare a thought for all the vendors and service providers who must underpin this level of service. They have to maintain a high level of investment to ensure that their underlying infrastructures (and specifically their IT infrastructures) are built and operated in a way that can support this “always-on” expectation. Applications and databases have to be always running to meet both customer demand and maximise company productivity and revenue. The importance of IT business continuity is as critical as it’s ever been. Many IT availability concepts are floated about, such as fault tolerance (FT), high availability (HA) and disaster recovery (DR). But this can raise further questions. What’s the difference between these availability concepts? Which of them will be right for my infrastructure? Can they be combined or interchanged? The first and foremost step for any availability initiative is to establish a clear application/database availability service level agreement (SLA). This then defines the most suitable availability approach.

What is an SLA?

To some extent, we all know what an SLA is, but for this discussion, let’s make sure we’re all on the same wavelength. The availability SLA is a contract between a service provider and their end user that defines the expected level of application/database uptime and accessibility the vendor is to ensure, and outlines the penalties involved (usually financial) if the agreed-upon service levels are not met. In the IT world, the SLA is forged from two measures of criticality to the business: Recovery Time Objectives (RTO) and Recovery Point Objectives (RPO). Very simply, the RTO defines how quickly application operation needs to be restored in the event of a failure, and the RPO defines how current our data needs to be in a recovery scenario. Once you can identify these metrics for your applications and databases, they will define your SLA. The SLA is measured as a percentage, so you may come across terms such as 99.9% or 99.99% available. These are measures of how many minutes of uptime and availability IT will guarantee for the application in a given year. In general, more protection means more cost. It’s therefore critical to estimate the cost of an hour of downtime for the application or database and use the SLA as a tool for selecting a solution that makes good business sense. Once we have our SLA, we can make a business decision about which type of solution (FT, HA, DR, or a combination thereof) is the most suitable approach for our availability needs.

What is Fault Tolerance (FT)?

FT delivers a very impressive availability SLA at 99.999%. In real-world terms, an FT solution will guarantee no more than 5.25 minutes of downtime in one year.
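As a quick sanity check on these numbers, the downtime budget behind any availability percentage is simple arithmetic (a year has 525,600 minutes); a one-off shell calculation:

# Yearly downtime budget for common availability SLAs.
awk 'BEGIN {
  minutes_per_year = 365 * 24 * 60                  # 525,600 minutes
  n = split("0.999 0.9999 0.99999", sla, " ")
  for (i = 1; i <= n; i++)
    printf "%.3f%% available -> %.2f minutes of downtime per year\n",
           sla[i] * 100, (1 - sla[i]) * minutes_per_year
}'

This reproduces the roughly 52 minutes per year for 99.99% and 5.25 minutes per year for 99.999% quoted in these posts.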
Essentially, two identical servers run in parallel, processing transactions on both servers at the same time in an active-active configuration, in what’s referred to as a “lockstep” process. If the primary server fails, the secondary server continues processing without any interruption to the application or any data loss. The end user will be blissfully unaware that a server failure has occurred. This sounds fantastic! Why would we need anything else? But hold on… as awesome as FT sounds on paper, there are some caveats to consider. The “lockstep” process is a strange beast. It’s very fussy about the type of server hardware it can run on, particularly in terms of processors. This limited hardware compatibility list pushes FT solutions into the higher end of the cost bracket, which could very much be in the hundreds of thousands of dollars by the time you factor in two or more FT clusters with associated support and services.

Software Error Vulnerability

FT solutions are also designed with hardware fault tolerance in mind and don’t pay much attention to potential application errors. Remember, FT solutions are running the same transactions and processes at the same time, so if there’s an application error on the primary server, it will be replicated on the secondary server too.

What is High Availability (HA)?

For most SLAs and average use cases, FT is simply too expensive to purchase and manage. In most cases, HA solutions are a better option: they provide nearly the same level of protection at a fraction of the cost. HA solutions provide a 99.99% SLA, which equates to about 52 minutes of downtime in one year, by deploying in an active-standby manner. The slightly reduced SLA arises because there is a small period of downtime while the active server’s workload switches over to the standby server before operations resume. OK, this is not as impressive as an FT solution, but for most IT requirements HA meets SLAs, even for supercritical applications such as CRM and ERP systems. Equally important, high availability solutions are more application agnostic and can manage the failover of servers in the event of an application failure as well as hardware or OS failures. They also allow a lot more configuration flexibility: there is no FT-like hardware compatibility list to deal with, as on most occasions they will run on any platform where the underlying OS is supported.

How does Disaster Recovery (DR) fit into the picture?

Like FT and HA, DR can be used to support critical business functions; however, DR is typically used in conjunction with FT or HA rather than instead of them. Fault tolerance and high availability are focussed on maintaining uptime at a local level, such as within a datacentre (or cloud availability zone). DR delivers a redundant site or datacentre to fail over to in the event a disaster hits the primary datacentre.

What does it all mean?

At the end of the day, there’s no wrong or right availability approach to take. It boils down to the criticality of the business processes you’re trying to protect and the basic economics of the solution. In some scenarios, it’s a no-brainer. For example, if you’re running a nuclear power plant, I’d feel more comfortable knowing that the critical operations are being protected by an FT system. Let’s face it, you probably don’t want any interruptions in service there. But for most IT environments, critical uptime can also be delivered with HA at a much more digestible price point.

How to choose: FT, HA and DR?
IT systems are robust, but they can go wrong at the most inconvenient times. FT, HA and DR are your insurance policies, protecting you as you deliver SLAs to customers in this instant, convenience-led world.

Reproduced with permission from SIOS
May 9, 2022
How to Avoid IO Bottlenecks: DataKeeper Intent Log Placement Guidance for Windows Cloud Deployments

To ensure optimal application performance when deploying SIOS DataKeeper, it is important to place the intent log (bitmap file) on the lowest-latency disk available, avoiding an IO bottleneck. In AWS, GCP, and Azure, the lowest-latency disk available is an ephemeral drive. In Azure, however, the difference between using an ephemeral drive and a Premium SSD is minimal, so it is not necessary to use the ephemeral drive when running DataKeeper there. In AWS and GCP it is imperative to relocate the intent log to the ephemeral drive; otherwise write throughput will be significantly impacted. When leveraging an ephemeral disk for the bitmap file there is a tradeoff: by nature, the data stored on an ephemeral drive is not guaranteed to be persistent. In fact, if the cloud instance is stopped from the console, the ephemeral drive attached to the instance is discarded and a new drive is attached. In this process the bitmap file is discarded and a new, empty bitmap file is put in its place. There are certain scenarios where, if the bitmap file is lost, a complete resync will occur. For instance, if the primary server of a SANless cluster is shut down from the console, a failover will occur, but when the server comes back online a complete resync will occur from the new source of the mirror to the old source. This happens automatically, so the user does not have to take any action, and the active node stays online during the resync period. There are other scenarios where bitmap file placement can also impact performance. For instance, if you are replicating NVMe drives, you will want to carve out a small partition on the NVMe drive to hold the bitmap file. A general rule of thumb is that the bitmap file should be on the fastest, lowest-latency disk available on the instance, and it should be located on a disk that is not overly taxed with other IO operations. Information on how to relocate the intent log, and additional information on how the intent log is used, can be found in the DataKeeper documentation.

Reproduced with permission from SIOS
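If you are unsure which disk on a given instance really has the lowest write latency, a quick DiskSpd comparison in the spirit of the Azure benchmarking post above can settle it. The drive letters and file paths below are placeholders; on AWS or GCP the ephemeral (local SSD) volume would typically be the first candidate.

:: Compare small-write latency on two candidate bitmap locations (placeholder drive letters).
:: Run each test separately, then compare the per-IO write latency reported by -L.
diskspd -c1G -w100 -b4K -o1 -t1 -W10 -d30 -Sh -L Z:\bitmap-latency-test.dat
diskspd -c1G -w100 -b4K -o1 -t1 -W10 -d30 -Sh -L D:\bitmap-latency-test.dat

Whichever location shows the lowest write latency, and is not already saturated by other IO, is the better home for the bitmap file.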