November 20, 2022 |
The simple days of HA & DR are gone

Flipping through the TV channels, I stumbled on the scene in the movie “He’s Just Not That Into You” with Drew Barrymore, saying what most of us in 2022 are feeling about technology, and especially high availability and disaster recovery: “I miss the days when you had one phone number and one answering machine and that one answering machine had one cassette tape and that one cassette tape either had a message from a guy or it didn’t. And now you just have to go around checking all these different portals just to get rejected by seven different technologies. It’s exhausting.”

Sometimes, don’t you wish there was only one cloud, or maybe even no cloud platform; one DB running on one OS; and only a front-end application to worry about? But the world has changed. It is moving faster and becoming more complicated. Advances in technology, the fallout of mergers and acquisitions, and the increasing appetites and pace of our 24/7 society, with billions of consumers looking for the latest deal and the best experience, mean that the simple days are gone.

4 hard truths about your availability
Of course your enterprise environment isn’t simple. You have legacy systems and applications, the kind that have been around almost since punch cards. You have new systems, made for the new generation of applications and databases. In addition, you have solutions that were created a decade ago to bridge the gap while migrating from one platform to another; despite your best efforts, these systems linger. Added to these challenges is a growing set of systems and IT resources from the merger and acquisition of Company U. Delivering HA in the new era is not as simple as you think.
As VP of Customer Experience, I’ve seen the damage caused by bad architecture. While deploying HA software can definitely help improve application and database availability, HA software will never fully overcome incomplete requirements, poor networking, lack of redundant hardware, or other missing architectural components. Our team once worked with a customer to correct an undersized environment that left their system unstable during peak operating times. Because of their bad architecture, which included networking and hardware instability, their teams frequently found themselves scrambling to recover from avoidable downtime. To have a complete, highly available, and resilient solution, you need to deploy great software as part of a sound architecture.
Developing an enterprise-grade, highly available, resilient solution, built on a solid architecture with the ability to grow, is not a simple process. Designing and architecting for resilience and for application and data availability is not as easy as grabbing a box of cake mix off the shelf. Throw in an array of tools, processes from different teams, a mixture of SLAs, and the varieties of OS, applications, databases, and platforms, and you have a recipe for needing help. Recently, I interviewed a 20-year veteran working in an enterprise support environment. He described how many of his peers, and at times even he himself, have not been able to handle the weight of maintaining critical enterprise availability. Your admins need help not only when they have been up since 2am dealing with a catastrophic, multi-system, multi-application, nearly complete data center collapse, but also in the day-to-day hard work of enterprise availability in one of the most technologically complex eras ever.
“While public cloud providers typically guarantee some level of availability in their service level agreements, those SLAs only apply to the cloud hardware.” There are many other reasons for application downtime that aren’t covered by cloud provider SLAs including:
As VP of Customer Experience, I’ve seen a thing or two, including a denial of service attack caused by a failed exit in a recursion routine, system exhaustion, security software quarantine of healthy, critical applications, kernel panics, and virtual machines that randomly reboot. If your HA strategy relies solely on the SLAs of your hypervisor, your solution may not be as highly available as you think. You need to protect critical applications with clustering software that can monitor and detect issues, respond to problems reliably, and, if necessary, move operations to a standby server to ensure that your products and services remain reliable and available when and where they are needed.

Our single data center has become a series of cloud platforms, spanning dozens of data centers. Our skunkworks application has become a part of the bevy of critical front-end, middleware, and back-end solutions that we must manage across Windows, Linux, and a few different *Nix varieties. The march of technology means that our high availability has become more complex and requires better architecture. It also means that our teams need more help to manage it all, and if we aren’t careful, it could mean that we remain vulnerable and exposed. Which of the four truths is your team facing most?

Cassius Rhue, VP Customer Experience
Reproduced with permission from SIOS |
November 15, 2022 |
What Does the New Driver in SIOS LifeKeeper for Windows Do For You?

Making data protection in shared and SAN-less environments stronger for years to come.

What do Coca-Cola, KitKat, Salesforce, and SIOS LifeKeeper for Windows have in common? Here are a few hints:
These companies made significant improvements to their iconic products, services, and solutions to better serve their customers, adapt and prepare for the future, and capitalize on their strengths. In a similar fashion, SIOS has made dramatic improvements to our SIOS LifeKeeper for Windows product.

Prior to LifeKeeper for Windows version 8.9.0, shared storage functionality, including I/O fencing and drive identification and management, was handled by the NCR_LKF driver. Starting with SIOS LifeKeeper for Windows version 8.9.0, SIOS Technology Corp. redesigned the shared storage driver architecture. Beginning with the current release, the NCR_LKF driver has been removed and replaced by the SIOS ExtMirr driver, the engine behind the SANless storage replication of SIOS DataKeeper / SIOS DataKeeper Cluster Edition.

Five significant benefits of the driver architecture change in SIOS LifeKeeper for Windows:
The ExtMirr driver provides a more modern filter driver to manage the shared storage functionality. While the NCR_LKF driver focused on “keeping the lights on” and the “data safe”, the architecture of the driver lagged behind more modern drivers. The ExtMirr driver maintains that data protection, while being more compatible, more modern, and more easily supported in newer versions of the Windows OS.
The driver used in both SIOS DataKeeper and SIOS DataKeeper Cluster Edition includes a robust fencing architecture. While the NCR_LKF driver was capable of I/O fencing, the new driver is more robust and has been tested in SAN and SANless environments. The enhanced I/O fencing leverages volume lock and node ownership information within the protected volume.
Leveraging the I/O fencing of the ExtMirr driver used in the DataKeeper products means that the LifeKeeper for Windows solution becomes more tightly integrated with the DataKeeper product line. The ExtMirr driver also includes the latest Microsoft driver signing and works seamlessly with operating systems that enforce driver signing and Secure Boot.
The ExtMirr driver gives customers and administrators a large set of command-line utilities for obtaining and administering the status of the volume. The emcmd commands are native to both of the SIOS DataKeeper products. They can now be used for easier administration of SIOS LifeKeeper shared volume configurations. Customers and partners who leverage both shared storage and replicated configurations with the LifeKeeper for Windows products now have a single set of command-line tools to know and use. The emcmd tools replace the previous volume.exe, volsvc, and similar NCR_LKF filter driver tools for administration (lock, unlock, etc.).
With the addition of the ExtMirr driver into SIOS LifeKeeper for Windows, the shared storage configurations, as well as replication configurations, will now see a boost in updates, new features, and fixes. While the NCR_LKF driver provided a solid foundation and stable base for I/O fencing, switching to the ExtMirr driver means that customers will see the same strength and stability, with faster updates for new product support.

Aligning the two products to a single driver may not be as flashy as the Salesforce Classic to Lightning update, but it adds significant functionality, increases the strength and longevity of both the SIOS DataKeeper and SIOS LifeKeeper solutions, and will make data protection in shared and SAN-less environments stronger for years to come.

Cassius Rhue, VP Customer Experience
Reproduced with permission from SIOS |
November 11, 2022 |
How to recreate the file system and mirror resources to ensure the size information is correct

When working with high availability (HA) clustering, it’s essential to ensure that the configurations of all nodes in the cluster match one another. These ‘mirrored’ configurations help to minimize the failure points on the cluster, providing a higher standard of HA protection. For example, we have seen situations in which the mirror size was updated on the source node but the same information was not updated on the target node. The mirror size mismatch prevented LifeKeeper from starting on the target node in a failover.

Below are the recommended steps for recreating the mirror resource on the target node with the same size information as the source:

Steps:
Then, select the File System resource (/mnt/sps) for the Child Resource Tag. This will result in two hierarchies, one with the IP resource (VIP) and one with the file system resource (/mnt/sps) and the mirror resource (datarep-sps).
Example: mount /dev/sdb1 /mnt/sps
When the resource “extend” is done, select “Finish” and then “Done”.
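The mismatch described above can be caught before a failover with a simple comparison. The sketch below is illustrative only: it assumes you have collected the size of the underlying device on each node yourself (for example with `blockdev --getsize64 /dev/sdb1`), and the function names and node names are hypothetical. It does not read LifeKeeper's own stored mirror size, which this article does not show how to query.

```python
def parse_getsize64(output):
    """Parse the byte count printed by `blockdev --getsize64 <device>`."""
    return int(output.strip())


def find_size_mismatches(source_size, target_sizes):
    """Return the target nodes whose device size differs from the source.

    source_size  -- size in bytes reported on the source node
    target_sizes -- hypothetical mapping of target node name -> size in bytes
    """
    return {
        node: size
        for node, size in target_sizes.items()
        if size != source_size
    }


# Example with made-up numbers: the target still reports the old size.
source = parse_getsize64("10737418240\n")
mismatches = find_size_mismatches(source, {"node2": 5368709120})
for node, size in mismatches.items():
    print(f"{node}: {size} bytes, expected {source} bytes")
```

An empty result means the sizes agree; any entry in the result is a node to fix before relying on it as a failover target.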
Reproduced with permission from SIOS |
November 9, 2022 |
Explaining the Subtle but Critical Difference Between Switchover, Failover, and Recovery

High availability is a speciality and, like most specialities, it has its own vocabulary and terminology. Our customers are typically very knowledgeable about IT, but if they haven’t been working in an HA environment, some of our common HA terminology can cause a fair amount of confusion, for them and for us. The terms sound simple but have very specific meanings in the context of HA. Three of these terms are discussed here: switchover, failover, and recovery.

What is a Switchover?

A switchover is a user-initiated action via the high availability (HA) clustering solution user interface or CLI. In a switchover, the user manually initiates the action to change the source or primary server for the protected application. In a typical switchover scenario, all running applications and dependencies are stopped in an orderly fashion, beginning with the parent application and concluding when all of the children/dependencies are stopped. Once the applications and their dependencies are stopped, they are then restarted in an orderly fashion on the newly designated primary or source server.

For example, suppose you have resources Alpha, Beta, and Gamma. Resource Alpha depends on resources Beta and Gamma. Resource Beta depends on resource Gamma. In a switchover event, resource Alpha is stopped first, followed by Beta, and then finally Gamma. Once all three are stopped, the switchover continues to bring the resources into an operational state on the intended server. The process starts with resource Gamma, followed by Beta, and then finally the start-up operations complete for resource Alpha. Traditionally, a switchover operation requires more time, as resources must be stopped in a graceful and orderly manner.
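The stop and start ordering in the Alpha/Beta/Gamma example can be sketched as a topological sort of the dependency graph. This is a minimal illustration using Python's standard `graphlib` module (Python 3.9+), not how any particular HA product implements it; the function names and the node names `NodeA` and `NodeB` are assumptions made for the example.

```python
from graphlib import TopologicalSorter

# Each resource maps to the set of resources it depends on.
dependencies = {
    "Alpha": {"Beta", "Gamma"},
    "Beta": {"Gamma"},
    "Gamma": set(),
}


def start_order(deps):
    """Dependencies come up first, the parent application last."""
    return list(TopologicalSorter(deps).static_order())


def stop_order(deps):
    """Stopping is the reverse: the parent first, dependencies last."""
    return list(reversed(start_order(deps)))


def switchover_plan(deps, source, target):
    """Gracefully stop everything on the source, then start on the target."""
    stops = [("stop", res, source) for res in stop_order(deps)]
    starts = [("start", res, target) for res in start_order(deps)]
    return stops + starts


def failover_plan(deps, target):
    """After a crash there is nothing left to stop, so only starts remain."""
    return [("start", res, target) for res in start_order(deps)]


for step in switchover_plan(dependencies, "NodeA", "NodeB"):
    print(step)
```

The switchover plan has six steps and the failover plan only three, which mirrors why a switchover traditionally takes longer than a failover.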
A switchover is often performed when there is a need to update software versions while maintaining uptime, perform maintenance work (via rolling upgrades) on the primary production node, or do DR testing.

Key Takeaway: If there was no failure to cause the action, then it was a switchover.

What is a Failover?

A failover operation is typically a non-user-initiated action in response to a server crash or unexpected/unplanned reboot. Consider the scenario of an HA cluster with two nodes, Node A and Node B. In this scenario, all critical applications Alpha, Beta, and Gamma are started and operational on Node A. A failover is what takes place when Node A experiences an unexpected/unplanned reboot, power-off, halt, or panic. Once the HA software detects that Node A is no longer functioning and operationally available within the cluster (as defined by the solution), it will trigger a failover operation to restore access to the critical applications, resources, services, and dependencies on the available cluster node, Node B in this case.

In a failover scenario, because Node A has experienced a crash (or other simulated immediate failure), there are no processes to stop on Node A. Consequently, once proper detection and fencing actions have been processed, Node B will immediately begin the process of restoring resources. As in the switchover case, the process starts with resource Gamma, followed by Beta, and then finally the start-up operations complete for resource Alpha. Traditionally, a failover operation requires less time than a switchover. This is because the processing of a failover does not require any resources to be stopped (or quiesced) on the previous primary (in-service or active) node.

Key Takeaway: A failover occurs in response to a system failure.

What is Recovery?

A recovery event is easy to confuse with a failover.
A recovery event occurs when a process, server, communication path, disk, or even cluster resource fails and the high availability software operates in response to the identified failure. Most HA software solutions support multiple ways of handling a recovery event. The most prominent methods include:
Due to the number of variations in recovery policy, it is easy to see a recovery event that resembles the behavior of a switchover. This is often the case in methods 1 and 5. In these scenarios, applications and services are gracefully stopped in an orderly fashion before being started on the remote node. With methods 2 and 3, customers will often see behavior similar to a failover: the primary server is restarted or fenced by the HA software, which creates an observable behavior similar to a failover.

Method 4 is rarely used, but is a hybrid of both a switchover and a failover. Method 4 begins with a graceful stop of the applications and services, followed by a restart of the applications and services (much like a switchover). However, if the local restart of the applications and services fails, the system will be restarted (much like a failover), but without actually failing over to the remote cluster node. When it is used, Method 4 is often invoked in cases where an unbalanced cluster is present, or with a policy-based methodology.

Key Takeaway: A recovery event depends on the method chosen.

HA terminology is an area where common terms can take on different meanings between vendors. As you deploy and maintain your cluster solution with enterprise applications, be sure that you understand the solution provider’s terms for failover, switchover, and recovery. And, while you are at it, make sure you know whether the restaurant will put the sauce on the side (in a saucer), or on the side (your mashed potatoes).

Reproduced with permission from SIOS |
November 3, 2022 |
Best Practices for Downloading SAP Products

This blog is an attempt to demystify some of the steps required to download SAP and related applications and patches, as the process can be complicated for the inexperienced user. An SAP Support login will be required before you can proceed with the steps outlined below.

It’s a good idea to download and install the “SAP Download Manager”, which is found at the bottom of the page below. The Download Manager allows you to select multiple packages to be downloaded at the same time, which allows unattended download of multiple packages. Follow this link for SAP instructions on how to install and configure the software download manager.

Once you download and execute the DLManager.jar, you will be prompted with the configuration assistant:
Click Next.
Enter your SAP login credentials. If you need a proxy, you can configure it.
Enter the location where downloads will be saved.
Click Finish.

Now the Download Manager is running, and you can add files into the basket to download them. Click the double green >> arrow to download all items in the Download Manager.

Installations & Upgrades

Scroll to the top of software downloads. What we’re interested in here is primarily “Installations and Upgrades”. This is where complete SAP version images are available.

For HANA, I select “H” and then look for “SAP HANA Platform Edition 2.0”. There are lots of HANA entries; find and select “SAP HANA PLATFORM EDITION”. Clicking on this gives me the option to select “Installation”.

Now we are presented with a list of available current software releases; for HANA it’s currently either version 2.0 SP5 or SP6. You need to select the hardware platform you want, in our case Linux x86_64. If we wanted to use the Download Manager, we would simply click the shopping cart (circled red), or we can download directly through our browser by clicking the link (circled green).
HANA comes in the form of a ZIP that needs to be uploaded to your Linux VM and then unpacked using unzip. Most of the SAP packages come in .SAR format, which requires SAPCAR to extract. SAPCAR is the SAP utility that’s used to compress or uncompress files. You can search for SAPCAR and download the version appropriate for your platform. SAPCAR is typically used with the -xvf options, e.g. ./SAPCAR -xvf SAP.SAR

Support Packages & Patches

“Support Packages and Patches” gets you specific patch levels that can be applied to base product levels. “Databases” is used to support a third-party database for use with SAP (other than HANA).

Once we select “Support Packages and Patches”, we are presented with several options for how to locate software. I normally use “By Alphabetical Index (A-Z)”:
Select “H” for SAP HANA.
Then select the software component you want to patch, e.g. SAP HANA PLATFORM EDITION.
Again, select which subcomponent you want to patch, e.g. SAP HANA PLATFORM EDITION 2.0.
Finally, choose the exact patch level you want for that selected subcomponent.

Now you are ready for the fun part… installing SAP!

If you need help with ensuring your SAP infrastructure is highly available, please reach out to SIOS. We would be glad to speak with you.

Reproduced with permission from SIOS |
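The unpacking step described earlier in this post (unzip for the HANA ZIP, SAPCAR for .SAR packages) can be scripted when you have many downloads to process. The sketch below is one possible approach, not an SAP-provided tool: the function names are hypothetical, the SAPCAR location defaults to `./SAPCAR` as in the example above, and it assumes SAPCAR unpacks into the directory it is run from.

```python
import subprocess
import zipfile
from pathlib import Path


def sapcar_command(archive, sapcar="./SAPCAR"):
    """Build the invocation quoted in the post: SAPCAR -xvf <archive>."""
    return [str(sapcar), "-xvf", str(archive)]


def extract_package(archive, dest, sapcar="./SAPCAR"):
    """Unpack a downloaded package into dest, choosing the tool by extension."""
    archive = Path(archive).resolve()
    dest = Path(dest)
    suffix = archive.suffix.lower()
    if suffix not in (".zip", ".sar"):
        raise ValueError(f"unsupported package type: {archive.name}")

    dest.mkdir(parents=True, exist_ok=True)
    if suffix == ".zip":
        # HANA media: a plain ZIP, handled by the standard library.
        with zipfile.ZipFile(archive) as zf:
            zf.extractall(dest)
    else:
        # .SAR packages need SAPCAR. Run it from dest (assumed to be where
        # it unpacks) and resolve its path first so the change of working
        # directory does not break a relative path such as ./SAPCAR.
        sapcar_path = Path(sapcar).resolve()
        subprocess.run(sapcar_command(archive, sapcar_path), cwd=dest, check=True)
    return dest
```

For example, `extract_package("SAP.SAR", "/tmp/sap_unpacked")` would run SAPCAR from `/tmp/sap_unpacked`; both paths here are placeholders.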