Ten Questions to Consider for Better High Availability Cluster Maintenance
Maintenance is part of every company’s lifecycle. Every infrastructure is constantly changing, even one that is moving toward end of life. Your team has likely had a lot of success with its existing approach, but as systems grow more complex, what counted as success in the past may need a refresh. Here are ten questions to improve cluster maintenance, maximize high availability, and minimize downtime.
How to Ensure High Availability During System Maintenance
-
What are the best days for the business stakeholders?
Unlike unplanned downtime, these are windows in which multiple teams, systems, and interconnected resources are already unavailable because of planned activities. For example, one company is required to perform monthly system compliance checks and safety inspections. During this time, business operations are shut down while inspectors, auditors, and similar parties do their work.
-
What are the best dates for the team to schedule maintenance?
As VP of Customer Experience, I have worked closely with a number of teams that have blackout dates for certain events and activities. Your team is likely responsible for more than one set of systems and servers, and reports to multiple teams with critical applications and infrastructure. Understanding which days are best for the team helps you avoid distractions, conflicts, and lost time due to known resource constraints.
-
What dates and times coordinate best with partners, consultants, and non-company contractors?
Critical infrastructure typically depends on many providers and vendors who are not on the company’s own staff. These resources include OS, security, and HA vendors and consultants, as well as architects from the infrastructure providers and other partners. Understanding in advance which days work best for them, and which days are covered by your support tiers, is critical to proper scheduling and staffing.
With the rise of global teams, finding a time that works for all of these resources is another important question to answer. What is the best time for resources working in EST, in IST, across EMEA, and in other regions?
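The time-of-day question can be checked mechanically. The following is a minimal sketch, not a scheduling tool: the team labels, the acceptable local hours, and the `shared_start_hours` helper are all hypothetical, and the fixed UTC offsets ignore daylight saving time.

```python
from datetime import datetime, timedelta, timezone

# Fixed offsets for illustration only (assumption). Real schedules should use
# full time zone rules so that daylight saving changes are handled.
EST = timezone(timedelta(hours=-5), "EST")
IST = timezone(timedelta(hours=5, minutes=30), "IST")
CET = timezone(timedelta(hours=1), "CET")


def shared_start_hours(day_utc, teams):
    """Return UTC start times on day_utc at which every team is available.

    teams maps a label to (tzinfo, first_hour, last_hour): the team can start
    work locally from first_hour up to, but not including, last_hour.
    """
    candidates = []
    for hour in range(24):
        start = day_utc.replace(hour=hour, minute=0, second=0, microsecond=0)
        if all(first <= start.astimezone(tz).hour < last
               for tz, first, last in teams.values()):
            candidates.append(start)
    return candidates


# Hypothetical teams and acceptable local hours.
teams = {
    "US East": (EST, 6, 22),
    "India": (IST, 7, 23),
    "Europe": (CET, 7, 23),
}
day = datetime(2024, 3, 2, tzinfo=timezone.utc)
slots = shared_start_hours(day, teams)
for start in slots:
    print(start.isoformat())
```

With these assumed hours, the only start times acceptable to all three teams fall between 11:00 and 17:00 UTC, which shows how quickly a shared window narrows as regions are added.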
-
What is the intended scope of the maintenance? What is the desired outcome of the maintenance activities? Think holistically.
Think beyond simple maintenance of the application to include the entire environment where it is running. Recently, a customer who was planning to upgrade their application decided to upgrade their OS at the same time. Unfortunately, this slight change in scope came with larger than expected consequences. Their application did not support the newly upgraded OS, and problems ensued. Be sure that the scope of the maintenance window is well defined and that outcomes for that scope are detailed. It is not enough to say, “the environment works.” Detail expected versions, behavior, and measurable outcomes wherever possible. See more about IT Resilience.
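One way to make "expected versions" measurable is to write them down before the window and compare them with what is actually installed afterward. This is a minimal sketch under stated assumptions: the component names, the version strings, and the `scope_deviations` helper are hypothetical, and how the observed versions are collected is left out.

```python
def scope_deviations(expected, observed):
    """Return {component: (expected_version, observed_version)} for mismatches.

    A component missing from observed is reported with None as its version.
    """
    deviations = {}
    for component, want in expected.items():
        got = observed.get(component)
        if got != want:
            deviations[component] = (want, got)
    return deviations


# Hypothetical scope: upgrade the application, leave the OS untouched.
expected = {"application": "5.2", "os": "8.6"}
# Hypothetical post-maintenance inventory: the OS was upgraded as well.
observed = {"application": "5.2", "os": "9.0"}

deviations = scope_deviations(expected, observed)
for component, (want, got) in deviations.items():
    print(f"{component}: expected {want}, found {got}")
```

An empty result means the environment matches the documented scope. Anything else is a change that was never planned or tested.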
-
What is the length of time for the maintenance window (anticipated, allowed)?
Ideally, we would all have as much time as we need to perform maintenance, but having customers located around the world means there is little tolerance for planned downtime windows, even for critical tasks. As you plan for maintenance, what length of downtime is anticipated? Can you realistically stay within the maximum window you are allowed? If not, you will need to replan the maintenance events.
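The comparison between anticipated and allowed time is simple arithmetic, sketched below. The step names, the durations, the rollback reserve, and the `fits_window` helper are all hypothetical. Reserving time for a possible rollback inside the window is an assumption of this sketch, not a rule from the article.

```python
from datetime import timedelta


def fits_window(step_estimates, allowed, rollback_reserve):
    """Return (fits, margin) for a planned maintenance window.

    margin is the time left over; it is negative when the plan overruns.
    """
    planned = sum(step_estimates.values(), timedelta())
    needed = planned + rollback_reserve
    return needed <= allowed, allowed - needed


# Hypothetical step estimates for one maintenance event.
steps = {
    "stop applications": timedelta(minutes=15),
    "apply patches": timedelta(minutes=60),
    "restart and verify": timedelta(minutes=30),
}
fits, margin = fits_window(
    steps,
    allowed=timedelta(hours=2),
    rollback_reserve=timedelta(minutes=45),
)
print("fits" if fits else "replan", margin.total_seconds() / 60)
```

Here the steps total 105 minutes and the reserve adds 45, so the plan overruns a two-hour window by 30 minutes and would need to be replanned or split.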
-
What’s the rollback plan?
While we hope nothing goes wrong, we are dealing with software, complex environments and configurations, and many moving pieces handled by numerous teams. A rollback plan (that is, a means of returning the systems to their pre-maintenance versions and settings) is essential. Make sure the plan rests on something concrete, for example full backups or machine images. See more about disaster recovery.
-
Who are the individual team members involved, what are their roles and responsibilities? Are all the required roles and responsibilities clearly defined?
As VP of Customer Experience, I have seen our team involved in a maintenance activity that encountered an unforeseen delay because key team members were missing. As you lay out your plan and architecture, be sure to identify the team members as well as the IT roles and responsibilities required. As Sr. Support Engineer Greg Tucker reminds customers, HA touches every layer of your environment, including storage, network, compute, OS, security, policies, etc.
-
Where is the maintenance plan documented? When was the last time the plan was reviewed, updated, and tested?
Success is wonderful, but it can also make you complacent. After years of success, your process may no longer be well documented or actively followed. Answering these questions helps make sure your team continues to succeed.
-
What issues were resolved in test/QA prior to the production plans?
Kudos for continuing to test maintenance steps. Be sure that issues resolved in test environments are properly added to the production maintenance plans. The SIOS Customer Success team has seen customers perform QA tests, uncover false assumptions and make necessary corrections, but fail to place those corrections in their production checklist.
-
Who or what is missing from your plans?
Now that you’ve looked over the plans, timing, teams, roles, and architecture, one last question remains. As a final step, review your plans and ask, “Who is missing from our plans?” and “What is missing from our plans?” As VP of Customer Experience, I have worked with our team to review activity plans for countless customers. One of the most memorable maintenance plan reviews uncovered a series of steps within the rollback plan that included restoring servers from cloned images and data from backup. However, the image cloning and data backup steps were not included in the task list. They had been overlooked and assumed to have been done earlier in the process.
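A gap like the one in that review can be caught by checking that everything the rollback plan relies on is produced by some step in the task list. This is a minimal sketch: the task structure, the "produces" and "requires" keys, and the `missing_prerequisites` helper are hypothetical, not part of any SIOS tooling.

```python
def missing_prerequisites(tasks, rollback_steps):
    """List items the rollback plan requires that no task produces."""
    produced = set()
    for task in tasks:
        produced.update(task.get("produces", []))

    missing = []
    for step in rollback_steps:
        for item in step.get("requires", []):
            if item not in produced and item not in missing:
                missing.append(item)
    return missing


# Hypothetical task list with no cloning or backup step.
tasks = [
    {"name": "stop application"},
    {"name": "upgrade application"},
    {"name": "verify application"},
]
# Hypothetical rollback plan that assumes images and backups already exist.
rollback_steps = [
    {"name": "restore servers", "requires": ["server image"]},
    {"name": "restore data", "requires": ["data backup"]},
]

gaps = missing_prerequisites(tasks, rollback_steps)
print(gaps)
```

Adding a task such as `{"name": "clone servers", "produces": ["server image"]}` removes the first gap, which makes the dependency explicit rather than assumed.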
System Maintenance is a Critical Element to Maintaining High Availability
System maintenance is a critical and necessary part of running computer systems. The maintenance could be to correct errors, introduce new software functionality, or adapt a system to a new use case. When the systems in question are business-critical and essential to the organization’s business continuity, a well-thought-out plan is essential. Consider these ten questions, and others of your own, to make sure that your maintenance satisfies the needs of the business without unnecessary risk or delay.
Reproduced with permission from SIOS