March 8, 2022 |
Highly Available or Highly Vulnerable? A Checklist for High Availability

It’s no secret that businesses of all sizes have an ever-growing need for IT systems. But IT systems are only effective for these businesses and their clients if they are operational, resilient and highly available. As enterprises build out their availability, having a baseline for weighing and assessing vulnerability can make the difference in successfully combining infrastructure, software, services and support. Sometimes the most basic of checklists can help you sort out whether your solution is highly available or highly vulnerable.

Does your organization have the proper infrastructure to support high availability?
Many organizations deploy software but have instability within the network infrastructure, the servers, and the datacenter itself. Cloud addresses a lot of the infrastructure issues, but not all cloud platforms are architected the same. Be sure to understand your datacenter, whether it is on-premises or in the cloud.

Does your organization have a runbook (or playbook) in place that covers design, architecture, and process?
If you answered, “What is a runbook or playbook?” then your first step is to find or create one. A runbook (or playbook) helps your organization maintain systems and processes with respect to the highly available system architecture. Some companies use automated tools to create scripts that deploy and configure servers; others use a version-controlled document to outline how everything works together to provide resilience and success. Your team needs a place where newcomers and existing team members can go to understand the environment, the process, and the tools being used.

Does your organization have resources dedicated to maintaining high availability best practices?
“I didn’t set these systems up,” the IT Admin stated. “I just inherited these systems with some other servers.” The lament was honest, and it describes a phenomenon often observed in organizations. Whether turnover is the result of mergers and acquisitions, cost reductions, outsourcing, or general staff changes, a key component of a highly available enterprise is sufficient staffing. A hallmark of a highly vulnerable enterprise is insufficient, undertrained, or undersupported staff.

Does your organization have proper change management controls in place?
Change management controls and policies are an absolute must in reducing risk and making sure that your systems are available. A user without proper restraints can add packages or updates that destroy stability, or make changes that disrupt the organization for hours. In addition, not having a defined policy often creates drift between what is expected (documented) and what is actually in place. Change management is also critical to ensure that your standby cluster is at the same patch and software levels as the primary/source system, and that QA (or Pre-Production) does not deviate grossly from Production.

Does your organization have proper access controls in place?
Our Services team joined a customer call and waited, and waited, and waited for the one administrator with permission to run elevated commands to join the session so that the software could be configured and updated. Weeks later, our team joined a different customer call and watched in horror as multiple users, all with administrative privileges, ran a bevy of commands on the same cluster. The difference between the two calls showed with stunning clarity that access controls are important. A highly available enterprise needs proper access controls that prevent users from running elevated commands that could damage the configuration or diminish its operation, without leaving the team dependent on a single person. Be sure that users have limits on what they can do based on their roles, needs, and even experience.

Does your company have a regular test process?
Testing takes time, but in my role of assisting customers with their cloud migrations and high availability deployments, the time has always been well spent. Often, the difference between the highly available and the highly vulnerable comes down to the customer or partner’s test process. As solutions become more complex, testing and validation become more and more essential to reducing risk and vulnerabilities. If everything goes straight from design to production, you’re running a highly vulnerable system. But if you have tests, checkpoints, and a process to verify changes before they make it into production, your risks are significantly reduced. During my time as VP of Customer Experience, our services team worked with a banner customer who ran their systems in QA for an entire year before completing their go-live migration. Over that year they simulated outages, disasters, customer loads, downtime, maintenance, patching strategies, backups, recovery from backup, and a bevy of other test suites. Consequently, they’ve had remarkable results in performance, process adherence, high availability, and enterprise success.

While no checklist can cover every potential vulnerability in high availability, answering these questions will give you a strong foundation for understanding whether your enterprise is highly available or highly vulnerable.

Reproduced with permission from SIOS
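As an illustration of the change management point in the checklist above (keeping the standby at the same patch and software levels as the primary), here is a minimal Python sketch of a drift check. The package names, version strings, and the `find_drift` helper are hypothetical and exist only for this example; a real inventory would come from your own package manager or configuration management tooling.

```python
from typing import Dict, List, Tuple

MISSING = "<missing>"


def find_drift(primary: Dict[str, str], standby: Dict[str, str]) -> List[Tuple[str, str, str]]:
    """Return (package, primary_version, standby_version) for every mismatch.

    A package present on only one node is reported with the MISSING marker
    on the other side.
    """
    drift = []
    for name in sorted(set(primary) | set(standby)):
        p = primary.get(name, MISSING)
        s = standby.get(name, MISSING)
        if p != s:
            drift.append((name, p, s))
    return drift


# Hypothetical inventories (assumed data, not taken from any real system).
primary_node = {"kernel": "5.14.0-70", "postgresql": "13.5", "openssl": "1.1.1k"}
standby_node = {"kernel": "5.14.0-70", "postgresql": "13.4", "openssl": "1.1.1k"}

for name, p, s in find_drift(primary_node, standby_node):
    print(f"DRIFT {name}: primary={p} standby={s}")
```

The same comparison could be run between QA and Production inventories. The point is only that drift becomes visible when the expected and actual states are compared on a schedule rather than discovered during a failover.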
March 3, 2022 |
Disney’s Encanto – Lessons on High Availability, IT Teams & downtime

Lessons on High Availability, IT Teams, and defeating downtime from Disney’s Encanto

Over the weekend I joined the masses of people who have tuned in to Disney’s Encanto and became a fan of the story, a student of its lessons and opportunities, and an absolute fan of Lin-Manuel Miranda. What does Disney’s Encanto offer in relation to High Availability, Clustering, and Resiliency?

In Encanto you quickly learn that the Family Madrigal is a special family. In one of the opening songs, “The Family Madrigal,” we learn that the members of the family have unique and special gifts: superhuman strength, the ability to hear for miles, prophecy and prediction, the power to conjure beautiful flowers and plants, the ability to shape-shift, the ability to heal, and the ability to control the weather. Everyone, it seems, has a ‘gift’ except Mirabel.

Lesson 1: You don’t need superhuman gifts to make a difference.

Mirabel, while not gifted like the other members of the family, is the central figure in understanding the family’s health and its disease. Moreover, she is able to help the family put things back together when it all falls apart, without any of the other gifts. You need High Availability, but you don’t have to break the budget, develop supernatural abilities, or depend on a miracle to achieve it.

As the movie continues, Pepa’s youngest son Antonio is readied for his gift ceremony. However, during the party and celebration Mirabel notices cracks in the foundation of Casita. But her warnings go unheeded.

Lesson 2: Don’t ignore the cracks.

When Mirabel sees the cracks, it leads her on a quest to find out what is endangering Casita and how she can help. Initially, she is ignored by the others and even rebuked.
How will you respond if you see cracks or shortcomings in your IT infrastructure, or cracks in your architecture and design? Will you ignore the cracks, pretend you haven’t seen them, or even rebuke the team for finding them? Don’t ignore the cracks. Responding to the first sign of an issue is most often the best way to prevent a greater one.

On her quest to find answers and save the miracle’s magic, Dolores tells Mirabel to talk to her super-strong older sister, Luisa, who initially suggests that everything is okay and that there is absolutely nothing wrong. But Luisa eventually begins to reveal that the weight of knowing there is an issue is becoming too much for her to carry alone.

Lesson 3: The weight of HA is too big for a single person or team.

As Luisa put it, “It is pressure that breaks the camel’s back, pressure that’ll never stop.” Developing a High Availability solution, and designing and architecting for resilience and data availability, is not a simple process, and it is definitely not a task for a single person or a single team. Your DBA, IT Admin, and ERP Administrators cannot handle the weight of maintaining critical enterprise availability alone. Likewise, a one-dimensional approach cannot carry the weight of four (4) nines of availability. Instead, it takes a fully aligned team working in concert with a complete HA solution to understand, design, develop, and deploy the tools and techniques. How well are the roles and responsibilities on your IT teams distributed and defined? Ensure no one is bearing the responsibility for HA alone.

When Mirabel seeks Bruno for the answers she is looking for, everyone says, “We Don’t Talk About Bruno.” Bruno’s gift is precognition, but because of his warnings and seemingly negative visions, he disappeared.

Lesson 4: Don’t be afraid of the person who sees trouble ahead.

As VP of Customer Experience, I’ve helped customers perform health assessments for their infrastructure and clustering solutions.
When the health check completes, not all customers are happy to hear that they have issues to resolve. We all do what we can to avoid bad news. But ignoring upgrades, forgetting to do maintenance, and downplaying risks identified by the Bruno of your team will not make the trouble disappear. In fact, it may make your worst fears a reality.

Mirabel eventually finds a secret passage leading to Bruno and discovers that Bruno never left, but felt that he had to destroy his vision of her to protect her and himself.

Lesson 5: Corporate culture can crush or create higher availability

Your culture can either crush or create a space for higher availability and resiliency.

Mirabel asks Bruno if he has been patching the cracks in Casita, but Bruno replies that he is afraid of the cracks.

Lesson 6: Don’t be afraid of the cracks

HA requires continuous, coordinated effort. An essential part of that effort is finding solutions and fixes for the IT cracks that could jeopardize your application, and for the gaps between architecture and execution.

Even as Bruno (or Hernando) tries to patch the cracks, it is apparent that the foundational issues are too much for spackle and superficial solutions.

Lesson 7: Spackle won’t fix a foundational problem

Take a look at your infrastructure and at the ways in which problems are being addressed. Are you deploying workarounds, band-aids, and temporary “hacks”, or are you looking at architectural and foundational solutions that address the root cause of the problem with your clusters, enterprise availability, and execution during disasters?

Lesson 8: Find your Jorge

If you’ve been deploying more hacks and workarounds than root cause solutions, find your Jorge. Find a skilled team member, partner, or solution provider and give them permission to grapple with implementing the foundational solution that will fix the problem or strengthen the infrastructure.

Bruno sees another vision that Casita could be saved if Mirabel hugged Isabela.
Mirabel offers Isabela an opportunity to blossom, but Abuela doesn’t see it that way. An argument between Mirabel and Abuela ensues, and Abuela blames Mirabel for the cracks in ‘Casita’. Mirabel blames Abuela for her impossible demands, unrealistic expectations, and misplaced hopes.

Lesson 9: Blame creates more problems

Pass the Blame is a great party game, but it is not great for HA, cluster resilience, or data protection. I once helped a customer whose organization illustrated the unproductiveness of blame. After a proof-of-concept cluster hit an issue causing a delay, the Project Manager blamed the application team for the delay. The application team blamed the backup administrator, who in turn blamed the infrastructure admin. Throughout the blaming session, their cluster remained unavailable, the proof of concept remained stalled, and the only progress being made was in the cracks of anger growing between teams. It was only when they put these differences aside that they could make the adjustments they needed to resolve their issue and continue with a successful POC.

‘Casita’ collapses and Mirabel runs away. Later, Alma finds Mirabel, and after reconciling they join the family and village in building Casita back better than ever.

Lesson 10: Build it back stronger

Of course, the final scenes of Encanto are filled with lessons in the confession of Alma (Abuela) such as:
But the most important of the final lessons is to build back better, stronger, and together. After every unplanned or planned outage, there will be lessons learned from root cause analysis, from experience, and from fresh understanding. As a result, there will also be an opportunity to build back a stronger solution and architecture for your high availability and disaster recovery. Consider the customer who created a standard deployment pipeline and QA system after discovering that an outage was caused by code deployed directly to production. Or another customer who uncovered that disk and database warnings had been suppressed for weeks before the outage. Don’t waste the time and opportunity that come with downtime. Be sure to work together to avoid silos, dependencies on single strengths, and placing the hope of your infrastructure on the wrong thing.

Of course, you should watch the whole movie for yourself. There are even more lessons for HA as you walk through the magic and music of the movie and pick up on the lives and lessons of a few of the other characters.
The movie closes with a great reunion, and Mirabel and the Madrigals stand in front of the finished house. When Mirabel touches the doorknob to the door, ‘Casita’ springs back to life, and the home returns along with the magical gifts of the family.

Try these ten lessons for High Availability from Encanto, enjoy the movie, and remember “There is nothing you can’t do… together” with your team of customers, partners, solution providers, and administrators.
February 27, 2022 |
How To Activate a License for SIOS Protection Suite for Linux

Once you have acquired your SIOS Protection Suite for Linux software, you will need to activate your license. This seven-minute video will help you get started. It walks you through all of the steps needed to begin running your SIOS Protection Suite for Linux software. Watch as a SIOS support representative demonstrates the steps that are necessary to install SIOS licenses: how to insert entitlement/activation IDs, how to obtain and insert host IDs, and how to download the activation file.

The video illustrates where to access software for download, how to view and validate the host name and ID from purchased or trial entitlements, and how to download the activation files referenced in your welcome email to complete the process. You will also learn how to access the SIOS Documentation portal, where you can find release notes, installation guides, technical documentation and in-depth information on SIOS Protection Suite for Linux, as well as a wide range of topics for every SIOS product. Receive helpful tips and convenient insights on how to complete the steps quickly and easily. See how simple it is to start running SIOS Protection Suite for Linux.

Reproduced with permission from SIOS
February 23, 2022 |
How To Install A SIOS Protection Suite for Linux License Key

Once you have installed SIOS Protection Suite for Linux software and have activated your license, you will need to install your license key before you can begin to run it. This four-minute video reviews how to install SIOS Protection Suite for Linux software and demonstrates how to activate your license so you can start using the software. Watch as a SIOS support representative shows you how to check that your SPS image file is mounted, how to confirm that you have the license file, and how to install it by entering the complete path name.

Use our simple license key manager to validate your activated licenses from purchased entitlements, download and apply license keys, and start your SIOS Protection Suite for Linux software. This video also walks through how to access the SIOS Documentation portal, where you can find release notes, installation guides, technical documentation and detailed information on SIOS Protection Suite for Linux, as well as a wide range of topics on everything SIOS. View tips and convenient insights on how to complete the steps quickly and simply. Now you can begin protecting your critical applications with SIOS Protection Suite for Linux.
February 19, 2022 |
How to Eliminate Single Points of Failure in the Cloud with High Availability Clustering

When providing high availability protection, it is a general principle to make all components redundant to avoid Single Points of Failure (SPOFs). That is, ensure that no single element causes the entire system to stop if it fails. However, it is important to note that the underlying operational infrastructure is hard to see into in the public cloud. In a cloud-based high availability cluster, there is a possibility that the standby node(s) will be located on the same host server, in the same rack, and on the same network switch as the active node. Unless these elements are redundant, any of them could be a SPOF and put the application at risk of catastrophic failure. To avoid this, place cluster nodes in different cloud “availability zones” or “regions,” which physically separate the data center and operational infrastructure, in the case of regions across different geographic locations.

What are the main principles for ensuring availability?

You cannot expect the various components that make up a physical IT infrastructure to operate according to specifications forever. Parts wear out, systems become incompatible, and settings change. Although regular maintenance can reduce the risk of downtime, it’s likely that something will fail over the course of the product lifecycle. In some rare cases, a serious bug that is latent in the OS or embedded software may cause the application to stop working. A High Availability cluster configuration applies the redundancy principle directly: it eliminates a single point of failure by making the important server and its resources redundant to the active (production) system. However, it is important to remember two things. One, the server hardware is not the only critical component.
Two, other critical SPOF components may be invisible to you in a public cloud infrastructure.

Beware of the pitfalls of a single point of failure hidden in the cloud’s invisible infrastructure

Most public clouds operate in a so-called “multi-tenant” mode. That is, they run the VMs of multiple companies on the same physical host server. And with a regular contract, you can’t specify which host server your system runs on. This may cause problems, because the standby node in your cloud cluster may be placed on the same host server that runs the active node. Even if you configure an HA cluster, if that host server goes down, the active node and the standby node will both go down too. In this scenario, your cloud operator decides when and how your system will be restored.

Alternatively, the host server that runs the active node and the host server that runs the standby node may be in the same rack. In this case, the rack becomes a SPOF: if a failure occurs there, both the active and standby nodes in it will fail. Furthermore, the active node and the standby node may share the same components in the upper layers of the infrastructure, such as network switches that aggregate multiple racks, gateways and routers, and the power supply units of the data center. If these key components aren’t redundant, then you have an inescapable single point of failure.

Again, for a company that is a public cloud user, such a data center infrastructure is a black box. It may be impossible to see into the detailed configuration to identify SPOFs.

Public cloud availability zones and regions should be leveraged for availability

How can we explicitly avoid hidden single points of failure in the public cloud? The most robust method is to use the “Availability Zones” and “Regions” that the cloud provider offers. An Availability Zone is a physically separate set of infrastructure within a region, typically one or more data centers with independent power, cooling, and networking.
Regions are geographically separated areas, each with its own independent data centers. Major public clouds let you deliberately choose the Availability Zone or region in which each node is deployed. For example, Amazon Web Services (AWS) and Microsoft Azure each operate more than 20 regions worldwide. By constructing an HA cluster configuration in which active nodes and standby nodes are distributed across different availability zones, or across two or more regions, almost all hidden SPOFs can be avoided. If you adhere to these best practices, you can be far more confident in your availability, DR (Disaster Recovery) and BCP (Business Continuity Planning).
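To make the placement rule above concrete, here is a minimal Python sketch that flags cluster nodes sharing an availability zone. It is an illustration under stated assumptions, not a provider API: the `Node` record, the `shared_zones` helper, and the region and zone names are all hypothetical, and in practice the placement data would come from your cloud provider’s inventory tooling.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Node:
    """Hypothetical record of where one cluster node is placed."""
    name: str
    region: str
    zone: str


def shared_zones(nodes: List[Node]) -> Dict[str, List[str]]:
    """Map each 'region/zone' that hosts more than one node to those node names.

    An empty result means every node sits in its own availability zone.
    """
    by_zone = defaultdict(list)
    for node in nodes:
        by_zone[(node.region, node.zone)].append(node.name)
    return {
        f"{region}/{zone}": names
        for (region, zone), names in by_zone.items()
        if len(names) > 1
    }


# Assumed example placements; the region and zone names are made up.
risky_cluster = [
    Node("active", "region-a", "zone-1"),
    Node("standby", "region-a", "zone-1"),
]
spread_cluster = [
    Node("active", "region-a", "zone-1"),
    Node("standby", "region-a", "zone-2"),
    Node("dr", "region-b", "zone-1"),
]

print(shared_zones(risky_cluster))   # both nodes share region-a/zone-1
print(shared_zones(spread_cluster))  # no shared zones
```

A check like this only covers the zone level. It cannot detect a shared host or rack inside a zone, which is exactly the part of the infrastructure the article describes as a black box, so spreading nodes across zones or regions remains the primary safeguard.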