April 24, 2023 |
Ten Questions to Consider for Better High Availability Cluster Maintenance
Maintenance is a part of every company’s lifecycle. Every infrastructure is constantly moving and changing, even those that are moving towards end of life. Your team has likely had a lot of success doing what you’ve done in the past, but as systems become more complicated and complex, what you have deemed success in the past may need a refresh. Here are ten questions to improve cluster maintenance, maximize high availability, and minimize downtime.
How to Ensure High Availability During System Maintenance
Different from unplanned downtime, these are windows in which multiple teams, systems, and interconnected resources are simply not available for planned activities. For example, one company is required to do monthly system compliance checks and safety inspections. During this time, the business operations are shuttered by inspectors, auditors, and similar.
As VP of Customer Experience, I have worked closely with a number of teams that have blackout dates for certain events and activities. Your team is likely responsible for more than one set of systems and servers, and reports to multiple teams with critical applications and infrastructure. Understanding which days are best for the team helps you avoid distractions, conflicts, and lost time due to known resource constraints.
Critical infrastructure typically involves many providers and vendors who are not on the company’s own staff. These resources include OS, security, and HA vendors and consultants, as well as architects from the infrastructure providers and other partners. Understanding in advance which days are best, or which days are covered by your support tiers, is critical to proper scheduling and staffing. With the rise of global teams, finding the right time for all of these resources is another important question to answer. What is the best time for resources in EST, IST, EMEA, and other regions?
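As a rough illustration of the time zone question, the sketch below lists the UTC hours on a given day when every team is inside its coverage hours. The team names, fixed UTC offsets, and coverage hours are assumptions made up for this example. A real schedule would need daylight saving rules and each vendor's actual support hours.

```python
from datetime import datetime, time, timedelta, timezone

# Hypothetical teams: (fixed UTC offset, local coverage start, local coverage end).
# Fixed offsets ignore daylight saving time; they are an assumption for this sketch.
TEAMS = {
    "EST": (timedelta(hours=-5), time(8, 0), time(18, 0)),
    "IST": (timedelta(hours=5, minutes=30), time(9, 0), time(21, 0)),
    "CET": (timedelta(hours=1), time(8, 0), time(18, 0)),
}


def shared_utc_hours(year, month, day):
    """Return the UTC hours (0-23) at which every team is within coverage hours."""
    shared = []
    for hour in range(24):
        moment = datetime(year, month, day, hour, tzinfo=timezone.utc)
        if all(
            start <= moment.astimezone(timezone(offset)).time() < end
            for offset, start, end in TEAMS.values()
        ):
            shared.append(hour)
    return shared


print(shared_utc_hours(2023, 4, 24))
```

With these assumed hours, only a short block of the day works for all three regions, which is the kind of constraint worth knowing before a window is announced.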
Think beyond simple maintenance of the application to include the entire environment where it is running. Recently, a customer who was planning to upgrade their application decided to upgrade their OS at the same time. Unfortunately, this slight change in scope came with larger than expected consequences. Their application did not support the newly upgraded OS and problems ensued. Be sure that the scope of the maintenance window is well-defined and that outcomes for that scope are detailed. It is not enough to say “the environment works.” Detail expected versions, behavior, and measurable outcomes wherever possible. See more about IT Resilience.
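One way to make "expected versions" concrete is to check the planned end state against a support matrix before the window opens. The sketch below is illustrative only: the matrix contents, version labels, and the `scope_problems` helper are hypothetical, and a real matrix would come from the application vendor's published compatibility list.

```python
# Hypothetical support matrix: application version -> OS releases it is certified on.
SUPPORT_MATRIX = {
    "app 4.2": {"os 8.6", "os 8.8"},
    "app 5.0": {"os 8.8", "os 9.2"},
}


def scope_problems(planned):
    """Return human-readable problems with the planned post-maintenance state."""
    problems = []
    app, os_release = planned["application"], planned["os"]
    supported = SUPPORT_MATRIX.get(app)
    if supported is None:
        problems.append(f"{app} is not in the support matrix")
    elif os_release not in supported:
        problems.append(f"{app} is not certified on {os_release}")
    return problems


# An OS upgrade added to the scope without a matching application upgrade.
print(scope_problems({"application": "app 4.2", "os": "os 9.2"}))
```

Running a check like this whenever the scope changes would flag the application and OS mismatch described above before the maintenance starts.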
Ideally we’d all love to have unlimited time to perform maintenance, but having customers located around the world means there is little tolerance for planned downtime windows, even for critical tasks. As you plan for maintenance, what length of downtime is anticipated? Can you realistically meet your maximum allowed windows? If not, then you will need to replan the maintenance events.
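The downtime question comes down to simple arithmetic: add up the step estimates, add a reserve for rollback, and compare the total to the maximum allowed window. A minimal sketch follows, with made-up step names and durations. The `fits_window` helper and the idea of a fixed rollback reserve are assumptions for illustration.

```python
from datetime import timedelta


def fits_window(step_estimates, max_window, rollback_reserve):
    """Return (fits, total): total is the summed estimates plus the rollback reserve."""
    total = sum(step_estimates.values(), timedelta()) + rollback_reserve
    return total <= max_window, total


# Hypothetical estimates for a single maintenance event.
steps = {
    "stop application": timedelta(minutes=10),
    "apply patch": timedelta(minutes=45),
    "restart and verify": timedelta(minutes=20),
}

fits, total = fits_window(
    steps,
    max_window=timedelta(minutes=90),
    rollback_reserve=timedelta(minutes=30),
)
print(fits, total)
```

Here the steps alone fit in 90 minutes, but not once time to roll back is reserved, so the event would need to be replanned or split.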
While we hope nothing goes wrong, we should be aware that we are dealing with software, complex environments and configurations, and lots of moving pieces being handled by numerous teams. A rollback plan – that is, a means of returning the systems to the pre-maintenance versions and settings – is essential. Be sure that if something goes wrong you have a rollback plan, for example full backups or machine images. See more about disaster recovery.
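A rollback plan is only usable if the backups or images it relies on were actually created. The sketch below checks that every rollback step's prerequisite appears in the maintenance task list. The step names, the `ROLLBACK_REQUIRES` mapping, and the `missing_prerequisites` helper are all hypothetical.

```python
# Hypothetical mapping: rollback step -> task that must have been done beforehand.
ROLLBACK_REQUIRES = {
    "restore server from cloned image": "clone server image",
    "restore data from backup": "take full data backup",
}


def missing_prerequisites(task_list, rollback_steps):
    """Return prerequisite tasks the rollback plan relies on that are not scheduled."""
    scheduled = set(task_list)
    missing = []
    for step in rollback_steps:
        prerequisite = ROLLBACK_REQUIRES.get(step)
        if prerequisite is not None and prerequisite not in scheduled:
            missing.append(prerequisite)
    return missing


tasks = ["take full data backup", "stop application", "upgrade application"]
rollback = ["restore server from cloned image", "restore data from backup"]
print(missing_prerequisites(tasks, rollback))
```

In this example the data backup is scheduled but the image clone is not, so the restore-from-image step would have nothing to restore from.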
As VP of Customer Experience, I saw our team take part in a maintenance activity that hit an unforeseen delay because key team members were missing. As you lay out your plan and architecture, be sure to identify the team members as well as the IT roles and responsibilities required. As Sr. Support Engineer Greg Tucker reminds customers, HA touches every layer of your environment including storage, network, compute, OS, security, policies, etc.
Success is wonderful, but it can also make you complacent or comfortable. After years of success, your process may no longer be well documented or actively being followed. Answering these questions can make sure your team continues to have success.
Kudos for continuing to test maintenance steps. Be sure that issues resolved in test environments are properly added to the production maintenance plans. The SIOS Customer Success team has seen customers perform QA tests, uncover false assumptions and make necessary corrections, but fail to place those corrections in their production checklist.
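A quick way to catch corrections that never made it out of the test environment is to diff the runbook that QA validated against the production checklist. The sketch below uses Python's standard `difflib.unified_diff`. The step texts and the `runbook_diff` helper are invented for illustration.

```python
import difflib


def runbook_diff(production_steps, qa_steps):
    """Return unified-diff lines; '+' lines are QA steps missing from production."""
    return list(
        difflib.unified_diff(
            production_steps,
            qa_steps,
            fromfile="production checklist",
            tofile="QA-validated runbook",
            lineterm="",
        )
    )


# Hypothetical runbooks: QA added a verification step after finding a false assumption.
production = ["stop application", "apply patch", "start application"]
qa = ["stop application", "verify replication is in sync", "apply patch", "start application"]

for line in runbook_diff(production, qa):
    print(line)
```

An empty result means the two lists match. Any line starting with `+` is a step that QA relied on and production does not yet include.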
Now that you’ve looked over the plans, timing, teams, roles, and architecture, one last question remains: who or what is missing? As a last step, look over your plans and ask the question: “Who is missing from our plans?” Also, consider asking “What is missing from our plans?” As VP of Customer Experience I have worked with our team to review activity plans for countless customers. One of the most memorable maintenance plan reviews uncovered a series of steps within the rollback plan that included restoring servers from cloned images and data from backup. However, the image cloning and data backup steps were not included in the task list. They had been overlooked and assumed to have been done earlier in the process.
System Maintenance is a Critical Element to Maintaining High Availability
System maintenance is a critical and necessary part of maintaining computer systems. The maintenance could be to correct errors, introduce new software functionality, or adapt a system to a new use case. When the systems in question are business critical systems that are essential for the organization to maintain business continuity, having a well-thought-out plan is essential. Consider these ten questions and others of your own to make sure that your maintenance satisfies the needs of the business without unnecessary risk or delay. Contact SIOS today for High Availability and Disaster Recovery Solutions.
Reproduced with permission from SIOS |
April 18, 2023 |
Video: HANA Multitarget Feature Demo
The HANA application recovery kit (ARK) for SIOS LifeKeeper for Linux provides HANA-specific intelligence and automation of manual configuration and administration tasks, saving time and eliminating risk of human error. SIOS LifeKeeper auto-validates user input during configuration and automates monitoring, failover, and replication management to deliver reliable failover in compliance with SAP best practices.
Reproduced with permission from SIOS |
April 15, 2023 |
We Built HANA Multitarget to be a Game-changer
On behalf of the SIOS engineering team that created the new HANA Multitarget feature in SIOS LifeKeeper for Linux v. 9.7.0, I can say that we are proud of, and excited about, our accomplishment. It took an experienced team of software developers months of planning before we even started the implementation. We worked through a number of customer use cases, technical requirements, and an impossible list of interdependencies to create a feature that is both unique and powerful.
An Engineer’s Perspective on the HANA Multitarget Feature
HANA clustering environments are intrinsically complicated. That’s why customers who want to add a third node to their HANA cluster using competing clustering software have to use some pretty complex scripting and continue to script any changes to the cluster in the event of a failover or failback. With these products, after a failover occurs you have to do a lot of manual verification steps to be sure it’s ok to perform a takeover. Unlike those products, LifeKeeper 9.7.0 accesses detailed information about all the HANA nodes in your cluster, which makes for a much more stable and reliable HA environment. For example, it can determine which nodes are available and capable of a takeover and can also see if there was data loss after the failover occurred. This is very important, especially if multiple nodes have failed.
The complexity of managing both failover and replication reliably in a multinode environment increases exponentially with every added node. For example, how will the clustering software choose which node to fail over to? With data stored in three nearly identical locations, which storage is most current and accurate? How do you guard against a “split brain” scenario where data on different nodes diverges? What should the failover and replication steps be if two nodes fail? Three?
We faced the challenge of thinking through the various combinations of failure scenarios and ensuring that SIOS maintained data protection and a reliable failover in each of them. LifeKeeper monitors the environment at a deeper level than competing products and has stringent requirements for managing failovers. The new 9.7.0 version of LifeKeeper has an enhanced ability to keep track of the HSR hierarchy, and to manage failovers of complex three and four-node HSR clusters to ensure they are fast and highly reliable. We set out to create the most automated and reliable multitarget clustering environment for HANA in the industry and I believe we succeeded.
Why HANA Multitarget Provides the Most Reliable Clustering Environment
We believe the new LifeKeeper HANA Multitarget is a game-changer that gives customers the most automated, reliable failover clustering solution in the industry. Watch a demo of the new feature to see its capabilities. Contact SIOS today for more information on the HANA Multitarget feature for LifeKeeper.
Reproduced with permission from SIOS |
April 12, 2023 |
The Essential Need for Always-On Applications
From Mission-Critical to Everyday Operations
In today’s “always-on” world, technology plays a vital role in organizations’ efficiency and competitiveness. Some applications are classified as “mission critical,” signifying their essentiality to an organization’s core operations. These applications require high reliability and availability since any downtime or malfunction can have significant consequences. Additionally, everyday applications used in day-to-day business operations are critical to an organization’s success. Therefore, guaranteeing the high availability and reliability of all critical applications is crucial. In this blog, we will delve into why always-on applications are fundamental in today’s fast-paced and competitive business landscape. Additionally, we will explore what organizations can do to ensure their applications are highly available and reliable.
Customers and Employees Expect 24/7 Application Availability
First and foremost, customers and employees expect applications to be available 24/7, from any device and any location. In today’s digital age, application downtime or slow performance can lead to lost revenue, damage to reputation, and even the loss of customers. For example, consider an e-commerce website that experiences downtime during a critical sales period. Customers may become frustrated and abandon their shopping carts, resulting in lost revenue and potentially damaging the company’s reputation. Similarly, if an employee cannot access an essential application, they may not be able to complete their work, leading to lost productivity and potentially impacting the overall success of the organization.
Furthermore, as more organizations move their operations to the cloud, ensuring the availability and reliability of applications has become even more critical. In a cloud environment, complex systems, such as ERPs, may be running across multiple servers, data centers, and even geographic regions. This complexity can make it more challenging to identify and address issues quickly, making it essential for organizations to have robust monitoring and alerting systems in place to ensure they can respond promptly to any problems that arise.
How Organizations Can Ensure Applications Are Always-On
So, what can organizations do to ensure their applications are always-on? One approach is to implement a robust disaster recovery plan that includes redundant systems and failover mechanisms. This approach can help ensure that if one component fails, another can take over seamlessly without causing downtime or disruption. Organizations must also invest in the necessary infrastructure and tools to monitor their applications continuously and proactively address any issues before they become critical. Additionally, organizations can leverage technologies such as automation to improve the availability and reliability of their applications. For example, automating routine tasks can help reduce the risk of human error, and address issues quickly before they become critical.
Whether they’re mission-critical or not, it’s vital to ensure that applications are available and reliable to maintain productivity, efficiency, and customer satisfaction. Organizations need to invest in the necessary infrastructure, tools, and processes to guarantee that their applications are highly available and reliable. They should also be ready to act quickly in response to any issues that may arise. Ultimately, an always-on approach to applications is a key factor for organizations to succeed in today’s fast-paced and highly competitive business environment.
Reproduced with permission from SIOS |
April 6, 2023 |
Webinar: High Availability for Financial Services
Register for the On-Demand Webinar
Minutes or even seconds of downtime can be critical for businesses providing financial services and 24-hour transactions. Watch this webinar to learn cost-efficient best practices to ensure your transactional, processing and administrative financial systems in Windows or Linux environments will be protected and will continue to operate through hardware failures, administrator errors, routine maintenance and site-wide disasters.
Reproduced with permission from SIOS |