High Availability Archives

Disney’s Encanto – Lessons on High Availability, IT Teams & downtime

March 3, 2022 by Jason Aw

Lessons on High Availability, IT Teams, and defeating downtime from Disney’s Encanto

Over the weekend I joined the masses of people who have tuned in to Disney’s Encanto and became a fan of the story, a student of its lessons and opportunities, and an absolute fan of Lin-Manuel Miranda. What can Disney’s Encanto teach us about High Availability, clustering, and resiliency?

[Warning: movie spoilers ahead]

In Encanto you quickly learn that the Family Madrigal is a special family. In one of the opening songs, “The Family Madrigal,” we learn that all of the members of the family have unique and special gifts: superhuman strength, the ability to hear for miles, prophecy and prediction, the power to conjure beautiful flowers and plants, the ability to shape-shift, the ability to heal, and the ability to control the weather. Everyone, it seems, has a ‘gift’ except Mirabel.

Lesson 1: You don’t need superhuman gifts to make a difference.

Mirabel, while not gifted like her siblings and the other members of the family, is the central figure in understanding the health and disease of the family. Moreover, she is able to help the family put things back together when it all falls apart, without any of the other gifts. You need High Availability, but you don’t have to break the budget, develop supernatural abilities, or depend on a miracle to achieve it.

As the movie continues, Pepa’s youngest son Antonio is readied for his gift ceremony. However, during the party and celebration, Mirabel notices cracks in the foundation of Casita. But her warnings go unheeded.

Lesson 2: Don’t ignore the cracks.

When Mirabel sees the cracks, it leads her on a quest to find out what is endangering Casita and how she can help. Initially, she is ignored by the others and even rebuked. How will you respond if you see cracks or shortcomings in your IT infrastructure, or cracks in your architecture and design? Will you ignore the cracks, pretend you haven’t seen them, or even rebuke the team for finding them? Don’t ignore the cracks. Responding to the first sign of an issue is most often the best way to prevent a greater issue.

On her quest to find answers and save the miracle’s magic, Mirabel is told by Dolores to talk to her super-strong older sister, Luisa, who initially suggests that everything is okay and that there is absolutely nothing wrong. But Luisa eventually begins to reveal that the weight of knowing there is an issue is becoming too much for her to carry alone.

Lesson 3: The weight of HA is too big for a single person or team.

As Luisa put it, “It is pressure that breaks the camel’s back, pressure that’ll never stop.” Developing a High Availability solution, and designing and architecting for resilience and data availability, is not a simple process, and it is definitely not a task for a single person or single team. Your DBA, IT Admin, and ERP Administrators cannot handle the weight of maintaining critical enterprise availability alone. Likewise, a one-dimensional approach cannot carry the weight of four nines (99.99%) of availability. Instead, it takes a fully aligned team working in concert with a complete HA solution to understand, design, develop, and deploy the tools and techniques. How well are the roles and responsibilities on your IT teams distributed and defined? Ensure no one is bearing the responsibility for HA alone.

When Mirabel seeks Bruno for the answers she is looking for, everyone says, “We Don’t Talk About Bruno.” Bruno’s gift is precognition, but because of his warnings and seemingly negative visions, he disappeared.

Lesson 4: Don’t be afraid of the person who sees trouble ahead.

As VP of Customer Experience, I’ve helped customers perform health assessments for their infrastructure and clustering solutions. When the health check completes, not all customers are happy to hear that they have issues to resolve. We all do all we can to avoid the bad news. But, ignoring upgrades, forgetting to do maintenance, and downplaying risks identified by the Bruno of your team will not make the trouble disappear. In fact, it may make your worst fears a reality.

Mirabel eventually finds a secret passage leading to Bruno and discovers that Bruno never left, but felt that he had to destroy his vision about her to protect her and himself.

Lesson 5: Corporate culture can crush or create higher availability

Your culture can either crush or create a space for higher availability and resiliency.

Mirabel asks Bruno if he has been patching the cracks in Casita, but Bruno replies that he is afraid of the cracks.

Lesson 6: Don’t be afraid of the cracks

HA requires continuous, coordinated ongoing effort. An essential part of the effort is finding solutions and fixes for those IT cracks that could jeopardize your application or the gaps between architecture and execution.

Even as Bruno (or Hernando) tries to patch the cracks, it is apparent that the foundational issues are too much for spackle and superficial solutions.

Lesson 7: Spackle won’t fix a foundational problem

Take a look at your infrastructure and look at the ways in which problems are being addressed. Are you deploying workarounds, band-aids, and temporary “hacks”, or are you looking at architectural and foundational solutions that address the root cause of the problem with your clusters, enterprise availability, and execution during disasters?

Lesson 8: Find your Jorge

If you’ve been deploying more hacks and workarounds than root cause solutions, find your Jorge. Find a skilled team member, partner, or solution provider and give them permission to grapple with implementing the foundational solution that will fix the problem or strengthen the infrastructure.

Bruno sees another vision that Casita could be saved if Mirabel hugged Isabela. Mirabel offers Isabela an opportunity to blossom, but Abuela doesn’t see it that way. An argument between Mirabel and Abuela ensues, and Abuela blames Mirabel for the cracks in ‘Casita’. Mirabel blames Abuela for her impossible demands, unrealistic expectations, and misplaced hopes.

Lesson 9: Blame creates more problems

Pass the Blame is a great party game, but it is not great for HA, cluster resilience, or data protection. I once helped a customer whose organization illustrated the unproductiveness of blame. After a proof-of-concept cluster hit an issue causing a delay, the Project Manager blamed the application team for the delay. The application team blamed the backup administrator, who in turn blamed the infrastructure admin. Throughout the blaming session, their cluster remained unavailable, the proof of concept remained stalled, and the only progress being made was in the cracks of anger growing between teams. It was only when they put these differences aside that they could make the adjustments they needed to resolve their issue and continue with a successful POC.

‘Casita’ collapses and Mirabel runs away. Later, Alma finds Mirabel, and after reconciling they join the family and village in building Casita back better than ever.

Lesson 10: Build it back stronger

Of course, the final scenes of Encanto are filled with lessons in the confession of Alma (Abuela) such as:

Don’t hide what you see or ignore the cracks
Tell the truth to the people who matter (or in your case the team and business)
Build a culture that allows others to be more than their role or gift
Seek (and accept) help earlier
Pain doesn’t need to be a prison
Leaders have power to bring people together or push them away

But the most important of the final lessons is to build back better, stronger, and together. After every unplanned or planned outage, there will be lessons learned from root cause analysis, experiences and fresh understanding. As a result of this, there will also be an opportunity to build back a stronger solution and architecture for your high availability and disaster recovery.

Consider the case of a customer who was able to create a standard deployment pipeline and QA system after discovering an outage was caused by code deployed directly to production. Or another customer who uncovered that disk and database warnings were being suppressed for weeks before the outage. Don’t waste the time and opportunity that comes when you have downtime. Be sure to work together to avoid the silos, dependencies on single strengths, or placing the hope of your infrastructure on the wrong thing.

Of course, you should watch the whole movie for yourself, but there are even more lessons for HA as you walk through the magic and music of the movie and pick up on the lives and lessons of a few of the other characters:

Camilo: Be adaptable and flexible
Luisa: Strength is good, but don’t deny the places where you feel weak and vulnerable
Pepa: You’ve got control of your attitudes and actions
Dolores: Listen to everything, but also take action
Bruno: If you love what you do, you’ll keep looking for solutions
Isabela: It doesn’t have to be perfect to work. HA is an ongoing battle: design, develop, test, deploy, repeat
Kid with the coffee: Too much coffee isn’t good for you
Mirabel: You need someone with hope and vision to succeed
Alma: Don’t blow your second chances and don’t lose sight of what really matters

The movie closes with a great reunion and Mirabel and the Madrigals stand in front of the finished house. When Mirabel touches the doorknob to the door the ‘Casita’ springs back to life and the home along with the magical gifts of the family all return. Try these ten lessons for High Availability from Encanto, enjoy the movie, and remember “There is nothing you can’t do… together” with your team of customers, partners, solution providers, and administrators.

How to Eliminate Single Points of Failure in the Cloud with High Availability Clustering

February 19, 2022 by Jason Aw


When providing high availability protection, it is a general principle to ensure all components are redundant to avoid Single Points of Failure (SPOF). That is, ensure that no single element causes the entire system to stop if it fails. However, it is important to note that in the public cloud the underlying operational infrastructure is hard to see into.

In a cloud-based high availability cluster, there is a possibility that the standby node(s) will be located on the same host server, in the same rack, and using the same network switch as the operating node. Unless you configure these elements with redundancy, any of them could be a SPOF and put the application at risk for catastrophic failure.

It is necessary to ensure cluster nodes are placed in different cloud “regions” or “availability zones,” which physically separate the data centers and operational infrastructure across different locations.

What are the main principles for ensuring availability?

You cannot expect the various components that make up a physical IT infrastructure to operate according to specifications forever. Parts wear out, systems become incompatible, and settings change. Although regular maintenance can reduce the risk of downtime, it’s likely that something will fail over the course of the product lifecycle.

In some rare cases, you may have a serious bug that is latent in the OS or embedded software that causes the application to stop working.

As you may have already noticed, the High Availability cluster configuration is exactly in line with the principle of planning for failure: a single point of failure is eliminated by making the important server and its resources redundant to the active (production) system. However, it is important to remember two things. First, the server hardware is not the only critical component. Second, other critical SPOF components may be invisible to you in a public cloud infrastructure.

Beware of the pitfalls of a single point of failure hidden in the cloud’s invisible infrastructure

Most public clouds operate in a so-called “multi-tenant” mode. That is, they run the VMs of multiple companies on the same physical host server. And with a regular contract, you can’t specify which host server your system runs on. This may cause problems as the standby node in your cloud cluster may be placed on the same host server that operates the active node. Even if you configure an HA cluster configuration, if the host server goes down, the operating node and the standby node will both go down too. In this scenario, your cloud operator decides when and how your system will be restored.

The host server that operates the active node and the host server that operates the standby node may be in the same rack. In this case, the rack becomes a SPOF, so if a failure occurs there both the active and standby nodes under it will also fail.

Furthermore, in the upper layers of your infrastructure, such as the network switches that bundle multiple racks, gateways and routers, and power supply units in data centers, the active node and the standby node may depend on the same component. And if these key components aren’t redundant, then you have an inescapable single point of failure. Again, for a company that is a public cloud user, such a data center infrastructure is a black box. It may be impossible to see into the detailed configuration to identify SPOFs.
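
Where placement details are visible (for example in a private cloud, or when a provider exposes them), the layer-by-layer check described above can be automated. The following is only a minimal Python sketch of the idea, not a cloud provider API: the node names, layer names, and component IDs are all hypothetical.

```python
from collections import defaultdict

# Hypothetical placement data: node -> the component it depends on at each layer.
PLACEMENT = {
    "active":  {"host": "host-17", "rack": "rack-3", "switch": "sw-1"},
    "standby": {"host": "host-42", "rack": "rack-3", "switch": "sw-1"},
}


def find_spofs(placement):
    """Return sorted (layer, component) pairs that every node depends on."""
    users = defaultdict(set)
    for node, layers in placement.items():
        for layer, component in layers.items():
            users[(layer, component)].add(node)
    all_nodes = set(placement)
    # A component is a single point of failure if all cluster nodes rely on it.
    return sorted(key for key, nodes in users.items() if nodes == all_nodes)


if __name__ == "__main__":
    for layer, component in find_spofs(PLACEMENT):
        print(f"SPOF: all nodes share {layer} {component}")
```

In this made-up example the two nodes sit on different hosts but share a rack and a switch, so the rack and the switch are both reported.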

Public cloud availability zones and regions should be leveraged for availability

How can we explicitly avoid hidden single points of failure in the public cloud? The most robust method is to use the “Availability Zones” and “Regions” that the cloud provider offers.

An Availability Zone is a physically separate part of the provider’s infrastructure within a region, typically one or more data centers with their own power, cooling, and networking. Regions are geographically separated areas, each with its own data centers. Some public clouds allow you to deliberately place workloads in specific Availability Zones or regions.

For example, Amazon Web Services (AWS) and Microsoft Azure each operate dozens of regions worldwide. By constructing an HA cluster in which the active and standby nodes are distributed across different availability zones, or across two or more regions, nearly all infrastructure-level SPOFs can be avoided. Adhering to these best practices gives you a sound basis for availability, DR (Disaster Recovery), and BCP (Business Continuity Planning).
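
To see why spreading nodes across zones helps, here is a rough worked calculation in Python. It assumes that nodes in different zones fail independently and that failover is instantaneous and always succeeds, which real systems only approximate. The 99.9% per-node figure is an illustrative assumption, not a provider guarantee.

```python
def combined_availability(node_availability, nodes):
    """Availability of a cluster that stays up while at least one node is up,
    assuming node failures are independent of each other."""
    return 1 - (1 - node_availability) ** nodes


if __name__ == "__main__":
    for n in (1, 2, 3):
        print(n, f"{combined_availability(0.999, n):.7f}")
```

The independence assumption is the important part. If the active and standby nodes share a host, rack, or switch, their failures are correlated and the formula overstates the real availability, which is exactly the hidden-SPOF problem described above.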

Reproduced with permission from SIOS

How to Protect Applications in Cloud Platforms – Clusters for Microsoft Azure High Availability

February 15, 2022 by Jason Aw

Clusters for Microsoft Azure High Availability

High Availability & Clustering Solutions for Azure

What is Azure Clustering?

An Azure cluster is a set of technologies that are configured to ensure high availability protection for applications running in Microsoft Azure cloud environments. In an Azure cluster environment, two or more nodes are configured in a failover cluster and monitored with clustering software. The application runs on a primary node in the cluster. If the clustering software detects an application operation failure, it orchestrates a failover of the application operation to the secondary node(s) in the cluster. SIOS DataKeeper Cluster Edition clustering software is a unique add-on to Microsoft Windows Server Failover Clustering (WSFC) that enables Microsoft clusters to run in Azure and Azure Stack. SIOS Protection Suite for Linux protects critical Linux applications like SAP, HANA, Oracle, MySQL, or Postgres in Azure and Azure Stack. SIOS clusters uniquely enable cluster failover across Azure regions and availability zones for true 99.99% uptime and disaster recovery protection.
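
The monitor-and-failover behavior described above can be pictured with a toy model. This Python sketch is purely illustrative and is not how SIOS software or WSFC is implemented; the `Node` and `FailoverCluster` names are hypothetical, and real clustering software also has to handle storage replication, quorum, and split-brain conditions.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    healthy: bool = True


class FailoverCluster:
    """Toy model: the first node is primary; the rest are standbys, in order."""

    def __init__(self, nodes):
        if len(nodes) < 2:
            raise ValueError("a failover cluster needs at least two nodes")
        self.nodes = list(nodes)
        self.active = self.nodes[0]

    def check(self):
        """One monitoring pass: fail over if the active node is unhealthy."""
        if self.active.healthy:
            return self.active
        for candidate in self.nodes:
            if candidate.healthy:
                self.active = candidate  # the application now runs here
                return self.active
        raise RuntimeError("no healthy node available")


if __name__ == "__main__":
    primary, secondary = Node("primary"), Node("secondary")
    cluster = FailoverCluster([primary, secondary])
    primary.healthy = False      # simulate an application failure
    print(cluster.check().name)  # the standby takes over
```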

Microsoft Azure-Certified Software for HA Clusters w/WSFC

SIOS DataKeeper Cluster Edition software is Microsoft Azure-certified and available in the Azure Marketplace. It is the only Azure-certified software that enables customers to create a SANless high availability cluster in Azure or Azure Stack using Microsoft Windows Server Failover Clustering (WSFC).

By adding SIOS DataKeeper software to WSFC, customers can quickly and easily protect business-critical Windows environments from downtime and data loss in the cloud or in any combination of physical, virtual, or hybrid cloud environments. Now, for the first time, customers using SAN-based Windows Server failover clusters to protect their most important applications are free to move them to Azure or Azure Stack and achieve the high availability protection they need.

Find a step-by-step guide to creating an HA failover cluster in Azure here.

Find SIOS DataKeeper in the Azure Marketplace here.

Azure Site Recovery Compatibility for High Availability and Disaster Protection

SIOS DataKeeper Cluster Edition is the only high availability solution certified for use with Microsoft Azure Site Recovery for cost-efficient high availability and disaster recovery protection for business-critical applications in Azure.

SIOS DataKeeper’s compatibility enables customers to protect important applications, including SAP, SQL Server, and Oracle, in Azure cloud environments. SIOS DataKeeper Cluster Edition provides a simple way to use Windows Server Failover Clustering – including SQL Server Always On Failover Clustering – in a cloud environment. Customers can replicate the cluster to a geographically separated location using Azure Site Recovery for cost-efficient, robust disaster protection. Learn more about SQL Server High Availability in Azure.

Together SIOS DataKeeper and Microsoft Azure Site Recovery enable the only option for local high availability protection along with disaster recovery in a highly flexible and on-demand solution.

Protect Linux Applications in Azure

SIOS Protection Suite for Linux lets you run your business-critical applications in Azure or Azure Stack without sacrificing performance, high availability or disaster protection.

Learn more about SIOS SANless Software for Cloud High Availability.

Protect SAP Applications in Azure

SIOS Protection Suite and SIOS DataKeeper Cluster Edition provide comprehensive, fully SAP-certified protection for your SAP applications and data, including high availability, data replication, and disaster recovery in an easy, cost-efficient solution that can operate in the cloud, on-premises or in hybrid cloud configurations.

Learn more about High Performance and High Availability for SAP on Azure

Microsoft High Availability for SAP HANA database on Azure using SIOS Protection Suite

Learn more about SIOS Protection Suite for SAP.

See our latest blog posts about cloud high availability here.

Reproduced with permission from SIOS

How to Protect Applications in Cloud Platforms – AWS EC2 High Availability Clustering

February 11, 2022 by Jason Aw


Clusters for AWS High Availability

High Availability and Clustering Solutions for Applications in AWS

What is AWS Clustering?

An AWS cluster is a set of technologies that are configured to ensure high availability protection for applications running in AWS EC2 environments. In an AWS cluster environment, two or more nodes in AWS are configured in a failover cluster and monitored with clustering software. The application runs on a primary node in the cluster. If the clustering software detects an application operation failure, it orchestrates a failover of the application operation to the secondary node(s) in the cluster. To simplify and accelerate the deployment of high availability clusters in AWS, SIOS high availability clustering software is available on the AWS Marketplace. It can be deployed automatically using an AWS Quick Start or via a bring-your-own perpetual licensing model. SIOS clusters uniquely enable cluster failover across AWS regions and availability zones for true 99.99% uptime and disaster recovery protection.

SIOS Delivers High Availability in AWS

To simplify and accelerate the deployment of high availability clusters in the cloud, SIOS High Availability Clustering Software is available on AWS Marketplace. It can be deployed automatically using an AWS Quick Start. The AWS Quick Start deployment is ideal for organizations making their first venture into high availability clusters in the cloud.

AWS High Availability with SIOS DataKeeper

SIOS DataKeeper Cluster Edition is the first high availability and disaster recovery solution to combine fully automated, application-centric clustering and efficient data replication. Seamless integration into Windows Server Failover Clustering environments enables high availability clusters to work in a cloud where shared storage is not possible. SIOS DataKeeper synchronizes local storage in real time using highly efficient block-level replication to create a SANless cluster. System administrators and managers can try the AWS Quick Start program and use the SIOS Amazon Machine Images (AMIs) on AWS Marketplace to see firsthand how easy it is to deploy a two-node SQL Server cluster in the cloud with SIOS DataKeeper. The SIOS AMIs on AWS Marketplace provide an easy, convenient way for customers to purchase SIOS DataKeeper software to protect business-critical applications in AWS.

SIOS Technology has achieved AWS Microsoft Workloads Competency Status. This designation recognizes that SIOS provides proven technology and deep expertise in helping customers in the migration, deployment and management of Microsoft-based applications on AWS, specifically with workloads based on Microsoft SQL Server.

SIOS Protection Suite for Linux Provides Real High Availability for AWS

Cloud providers like AWS provide availability options. But they do not provide the level of high availability and breadth of protection across the whole application infrastructure that customers demand and that they achieved by using clusters before there were clouds. AWS customers are aware of this. They know that they need real availability and clustering software tools that provide actual levels of high availability (at least 99.99% uptime). As a result, AWS has partnered with SIOS, and our SIOS Protection Suite for Linux achieves these desired levels of high availability with Linux clustering for our mutual customers and the critical applications they are moving to the AWS cloud.
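
For a sense of scale, an uptime percentage can be converted into a yearly downtime budget. The short Python calculation below is a back-of-the-envelope sketch that assumes a 365-day year and counts all downtime, planned or unplanned, against the budget; the function name is just illustrative.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes; ignores leap years


def downtime_budget_minutes(availability_percent):
    """Maximum downtime per year, in minutes, for a given uptime percentage."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


if __name__ == "__main__":
    for target in (99.9, 99.99, 99.999):
        minutes = downtime_budget_minutes(target)
        print(f"{target}% uptime allows about {minutes:.1f} minutes of downtime per year")
```

At 99.99% the budget works out to roughly 52.6 minutes per year, so a single slow manual recovery can consume the entire allowance.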

SIOS Protection Suite for Linux provides a tightly integrated combination of high availability failover clustering, continuous application monitoring, data replication, and configurable recovery policies. SIOS Protection Suite for Linux includes SIOS LifeKeeper, SIOS DataKeeper, and multiple Application Recovery Kits (ARKs) to protect your business-critical applications and data from downtime and disasters.

SIOS Quick Start deployments for AWS

SIOS delivers the same High Availability capabilities that are available through a Windows Server Failover Cluster in the cloud and configurable on AWS quickly and easily – saving months of effort, improving operational flexibility and drastically lowering costs to set up and maintain. Take a look at how our customers, Gulliver and Epicure, use SIOS High Availability Clustering Software in AWS.

The AWS Marketplace offering and the SIOS DataKeeper and SIOS Protection Suite Quick Start deployments for AWS are comprehensive solutions that help simplify the transition to operating high availability in the cloud, ultimately freeing IT staff to support additional business-driving initiatives.

AWS Quick Starts are automated reference deployments for key workloads on AWS. Each Quick Start launches, configures, and runs the AWS services required to deploy a specific workload on AWS, following AWS best practices for security and availability. Quick Starts eliminate manual steps with a single click. They are fast, low-cost, and customizable. Several SIOS DataKeeper AMIs are available for purchase on AWS Marketplace. These enable customers to add high availability to an existing deployment or to deploy a two-node SQL Server cluster in AWS. Read the white paper.

Purchase SIOS DataKeeper through the AWS Marketplace
View the Quick Start for SIOS Protection Suite for Linux on AWS
Learn more about high availability for SQL on AWS
Request a Free Trial of SIOS High Availability Clustering Software

See our latest blog posts about cloud high availability here.

Reproduced with permission from SIOS

Seven Essentials in High Availability Team Transition

February 3, 2022 by Jason Aw

Seven Essentials in High Availability Team Transition (Navigating the Great Resignation)

Unless you’ve been under a rock or frozen in time you’ve likely heard from one source or another that employers and employees are in the midst of a trend being called “The Great Resignation”. As reported in US News and World Report, “According to the U.S. Bureau of Labor Statistics, 4 million Americans quit their jobs in July 2021 and the trend isn’t slowing down.” No matter your company size or current revenue stream, if it hasn’t already, this trend will impact your IT team in the near future. Yes, let that sink in. The same team that is responsible for ensuring your mission-critical application availability is vulnerable in one way or another to the effects of “The Great Resignation.”

So, how do you recognize the warning signs, come to terms with the reality, and navigate with empathy and clarity through “The Great Resignation” so that it doesn’t cause a “Great Disaster” for your critical applications?

Here are technical and non-technical tips for sound High Availability (HA) best practices in the midst of change:

1. Don’t Quit

Don’t quit. Seriously! As colleagues and good people are choosing to change jobs, careers, or otherwise leave the workforce, it can be tempting to quit, especially when you begin to consider the prospect of carrying your already heavy workload with an even shorter bench. But don’t quit.

2. Identify Key Risks to High Availability

Of course, this process of identifying risks is two-pronged. After a resignation, your team is at risk from further personnel changes. But your High Availability is also at risk due to a loss of capacity, technical knowledge, or expertise. To prevent your enterprise from experiencing unplanned downtime in the wake of new team resignations, you’ll need to identify key areas of risk. Some technical risks include:

Cloud expertise and knowledge
Database Administration
Storage Administration and configuration
Tacit High Availability product knowledge
Emergency Coverage (Staffing)
Technical Leadership
Documentation

3. Managers: Assess Your Company

Many times, as people begin to leave a company, it is very easy to say that it is “them, not us!” We want to focus on all the reasons why their issues led to them leaving, quitting, or choosing a different career or job. It is quite possible that their reason for leaving is entirely personal. However, sometimes the issue is in the mirror, and it is not them, but us. Why does figuring out whether the problem is with them or with you matter for HA? Well, if the problem is with your company, such as its mission, vision, culture around HA and IT, or hiring and staffing issues for IT and HA system management, then simply adding headcount will be a temporary fix. In addition, team morale, commitment, and knowledge transfer may be further eroded as the focus remains on blame shifting rather than issue resolution.

4. Team Leads: Assess Your Team

Almost every company has had someone quit their team over the past two years. No matter whether they were seeking higher pay, staying at home to care for family members, retiring or pursuing other options, they have left. If you’ve lost a team member, it is essential to assess the remaining team. This assessment will be both technical and non-technical in nature. Technically, you will need to:

a. Identify current skills, abilities and knowledge gaps

What skills remain on the team, and what is the level of technical expertise and ability? Where are the knowledge gaps, especially those between theory and practice?

b. Understand both existing and missing roles.

Many of your team members may be covering multiple roles and responsibilities. The loss of a single team member may actually mean the loss of coverage for multiple roles and responsibilities.

c. Evaluate immediate training or augmentation needs

Where are you covered, but needing additional training to stabilize and solidify the team? In what areas do you lack coverage that could be mitigated by training existing personnel or by some form of contract professional services? As VP of Customer Experience, I see this firsthand. Our team recently worked with a company needing professional services after losing key team members responsible for their HA environment.
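
The role-coverage question in step b lends itself to a quick inventory. The Python sketch below is one hypothetical way to list roles that only a single person covers, and therefore what would be lost if that person resigned. The names and roles are made up for illustration.

```python
from collections import defaultdict

# Hypothetical team: person -> roles they currently cover.
COVERAGE = {
    "Ana": {"DBA", "cloud", "on-call"},
    "Ben": {"storage", "on-call"},
    "Chi": {"cloud"},
}


def single_person_roles(coverage):
    """Return {role: person} for roles that only one person covers."""
    owners = defaultdict(list)
    for person, roles in coverage.items():
        for role in roles:
            owners[role].append(person)
    return {role: people[0] for role, people in owners.items() if len(people) == 1}


def roles_lost_if_leaves(coverage, person):
    """Roles left with no coverage at all if `person` resigns."""
    solo = single_person_roles(coverage)
    return sorted(role for role, owner in solo.items() if owner == person)


if __name__ == "__main__":
    for name in COVERAGE:
        print(name, roles_lost_if_leaves(COVERAGE, name))
```

Roles that show up here are natural candidates for the immediate training or augmentation described in step c.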

Non-Technically, you will need to:

a. Understand how remaining team members feel

Even prior to the COVID pandemic and period of “The Great Resignation,” many teams were running on fumes. A 24/7 world of HA leaves a lot of work to be done with normal team numbers, norms, and tasks. If your team has been impacted, it is as critical as a down production server to check in and listen to the stories of remaining team members. Find out who is depleted, burned out, confused, nearing a collapse or, conversely, fully alive and ready for a new challenge. Be sure to listen to verbal and non-verbal cues, and empathize (not just with the loss of a colleague, but with their emotions, concerns, and fears).

b. Understand the reasons that the remaining team members are still on board

Knowing how team members feel is both a technical and non-technical necessity, but nearly equal to this task is discovering their reasons for staying. Of course, some reasons may surprise you. Author and speaker Carey Nieuwhof states that some team members are only staying because they “feel trapped on the team because they didn’t leave first.” Other reasons team members stay may not surprise you, but whatever the reason (comfort, opportunity, salary, location, stock options, passion, teamwork, or culture), all of the reasons your team members stay are important.

c. Evaluate the impact of being short-handed

There is obviously a technical component of being short-handed, as previously discussed: assessing skills gaps and so on. But there is a non-technical corollary to that technical assessment. Be sure to assess and evaluate the impact that being short-handed, even if only momentarily, will have on the mental, emotional, and personal health of remaining team members. Early in my career as a manager, our team dealt with a downsizing event that left several employees emotionally vulnerable and mentally exhausted. This led to higher fatigue, more mental fog, and increased rates of defects and mistakes by those team members. If your team is severely impacted mentally and physically by being short-handed, the risk to your HA could increase. Your team may be scrambling to pick up the slack, and they may rally quickly to cover for the leader or team member who has resigned, but it is critical that you understand whether those who remain are also exhausted, feeling trapped, or at risk of leaving.

5. Identify the Critical Technical Tasks, Priorities and Assign Responsibilities

Years ago, a senior executive left the company. Despite having transitioned his roles and tasks throughout nearly a year of transition, there were still roles and tasks that surprised the remaining staff. In today’s wave of resignations you don’t have a full year of transition. Furthermore, if your team has experienced more than one resignation, you probably haven’t completed the analysis and transition of the first person. It is therefore critical to identify and prioritize the most important tasks and assign responsibilities. Be sure to list out tasks such as security scans, updates, maintenance, backups, tests, new application deployments, cost analysis, cloning and redeployment of images, patch application, and vulnerability remediation. These tasks will all remain necessary despite the losses and can have devastating effects if left to linger.

6. Make a Short-term Plan for Maintenance and Operation

Tasks, roles and responsibilities still need to be covered. Critical issues will need to be addressed. Unplanned downtime will not wait to happen after you have rebuilt your staff, trained existing personnel, and fitted your company to be more resilient to the transitions and changes of the Great Resignation. In order to navigate in the short term, you will need to develop a smart, realistically achievable short term plan. This plan should map out the procedures, tasks and processes identified so that maintenance and operation can continue. Furthermore, it should define how existing critical infrastructure policies can be managed carefully through the tumultuous seasons to come.

7. Focus on the Future

The previous steps have led up to this. With an assessment of the current team, and identification of your key risks, and a transition plan in place the next step is to focus on the future. You still have a mission. You still have critical applications that need to be highly available. You still have data that needs to be protected, mined, replicated, and available for your business. Start making plans for the future team.

Build roles and responsibilities
Update architectures and documentation
Evaluate opportunities for growth and alignment
Plan for new hires, including time for onboarding
Allocate time and resources to creating and updating onboarding materials
Focus on team health
Apply risk mitigation strategies for the near term and plan for the long term

Not all of the news about “The Great Resignation” is bad news for your team and HA. In the wake of team members leaving for new or different positions and opportunities, you have a real and rare opportunity to take all the information of your assessments and turn them into tools for growth and alignment and a better HA future. Building this brighter future includes defining the duties, roles, and skills needed, updating architectures and designs, planning for new hires and services engagements, and focusing on building a healthier team.

I discussed this subject in more detail in this recent TFir interview.

-Cassius Rhue, VP, Customer Experience

Reproduced from SIOS