
How COVID-19 Impacts High Availability

April 17, 2022 by Jason Aw

Compared to friends, family, and those who have required treatment, hospitalization, or intensive care, my COVID symptoms have been mild. This is likely the result of reasonably good health, both doses of the vaccine, a booster shot, and early detection and treatment. My heart goes out to every family who has lost a loved one to any aspect of this pandemic, and to all those who have lost opportunities and special moments. As I and several members of our SIOS team recover from COVID-19, we wanted to share five things that your IT team may be dealing with as they fight COVID and enterprise downtime, and five things you can do to help them.

Five COVID Concerns Facing IT Teams

Personal and Family Concerns and Fears

Initially, my symptoms were barely noticeable: a slight irritation in my throat and a little sinus drainage, which I self-diagnosed as seasonal allergies. But when the issues worsened, accompanied by a bad cough, I became worried. Of course, we’d all like to think that our work performance and responsibilities remain unchanged, but the reality may be a little harder to assess. Despite initial negative tests, I continued to develop symptoms that eventually impacted my ability to work, increased my personal health concerns, and raised a number of fears. If your team has been directly affected by COVID-19, understand that they are likely dealing with personal concerns, fears, and worries in addition to the real health challenges that may impact their schedules, tasks, and activities.

In the midst of their personal concerns, each team member is likely also dealing with larger concerns, namely concerns about family. During my illness, thankfully, my children all remained well. However, my wife was not so lucky. She became ill three days after my symptoms began and remained ill longer, with more severe symptoms and setbacks. While we have the benefits of a large family unit, a licensed teenage driver, and an extra car not driven by COVID-positive parents, your team may not have these luxuries. And even if they do, it does not give them freedom from concern or reduce the amount of time and mental energy they need to sanitize the home, keep their children in school and healthy, and deal with regulations, mandates, and close contact issues. Not to mention concerns over income and expenses. Team members facing personal and family concerns may experience difficulty concentrating, short tempers, and difficulty meeting deadlines and schedules.
FODO – Fear of Disappointing Others

Even without COVID-19 illness, businesses worldwide are feeling the impact of a smaller workforce. The events aptly described as the “Great Shift,” “Great Resignation,” or “Great Shuffle” have already dramatically reshaped workforces, including those dealing with HA, leaving teams with fewer people to carry on critical tasks. This deficit in team members can lead those with COVID to battle a Fear of Disappointing Others (FODO). Sick team members may continue to try to work out of loyalty to the team or a fear of disappointing bosses, peers, or stakeholders. This FODO often leads workers who are already functioning in a stressed environment (see the personal and family concerns above) to attempt to maintain pre-COVID levels of activity. While heroic, it is also counterproductive to personal and professional recovery.
Fatigue

As I continue to deal with COVID-19 symptoms, one of the biggest issues I continue to face is fatigue. Initially, FODO made that fatigue worse by preventing me from getting adequate rest and recovery. Because I had seen how shorthanded our team was and witnessed others try to brave their illness to keep up with demand, I tried to do the same. But without warning, I found myself drained, not at the end of the day, but for periods of time throughout the day. For me, starting the day before 5 AM and continuing to focus on work, tasks, strategy, and personnel matters for 8 to 12 hours was normal. (We can debate later if that was ever healthy.) Now some days felt like climbing Everest before 8 AM. The best advice I received was from a friend and co-worker who said, “Don’t fight it. When your body says rest, rest!”
Brain Fog

Around the same time that I started feeling sick, a colleague shared that they felt like they were in a fog following their bout with COVID symptoms. Like me, they were fully vaccinated, and their symptoms were mild and short-lived. In fact, they never actually tested positive. Nevertheless, they spent days with what we both termed “brain fog,” an experience that we describe as slowness to recall details: a sense of knowing the answer but lacking some mental sharpness, in a way that is somehow different from physical and mental fatigue. In some instances, it appears as a slower response to a question, a pause in the keystrokes, or a delay before the light comes on in the room.
Failed Recovery

Five days into COVID, I woke up from an early night’s rest feeling better than ever. I jumped into my regular routine and by noon discovered that I had not fully recovered. Instead I was exhausting a small store of energy gained by sleeping well the night before. Trying to fight through this exhaustion created a new setback in my recovery. The following day I felt worse than before. The agony of a failed recovery and a concern about how to avoid more setbacks was added to my fatigue and fog.

So, what should IT team leads, stakeholders, and managers do when their teams experience an issue with COVID-19?

Five Ways to Help IT Teams Battling COVID

Practice Empathy

Be mindful that COVID affects each person and family differently. Some of your coworkers and administrators will have minor issues, no symptoms, and no complications. Others, such as single parents, multi-generational families, or families with children or vulnerable persons, will have many more issues and concerns. Know that the virus also impacts each person uniquely. Even within my own family, my symptoms and those of my wife were different. While I experienced greater fatigue, she experienced more headaches. Have patience for coworkers who may be dealing with brain fog, juggling work schedules, caring for sick loved ones, or dealing with myriad issues related to COVID.
Assess needs

Unlike the flu or common cold, COVID recovery is irregular. A team member may show up at work one day feeling much improved and stay home sick the next. Your business still has technical needs and requirements for high availability and disaster recovery. However, with people in and out of availability due to illness, be sure to understand the current roles and responsibilities required within the team. When an individual is out sick, be sure to assess their role, their impact on the team, their level of responsibility for the infrastructure, etc. You may also need to assess who within the team or organization can provide coverage in the event of a critical downtime event.
Prioritize issues

Help your team by prioritizing key issues. Under normal circumstances, your IT team is balancing dozens of requests ranging from the trivial (USB keyboard) to the critical (issues related to downtime, security threats, or storage issues). While it may be obvious to you and the team, other stakeholders may need to understand the status of the IT team and how operations will be handled until a return to more “normal” staffing occurs.
Be sure your Processes are up-to-date

As team members swap in and out, it is critical that IT maintenance and management processes are kept up to date. These processes will help each member of the team service your enterprise effectively and efficiently when performing a task that is not their normal responsibility. It will also reduce the amount of time each team member needs to spend researching the status of the systems they are covering while a coworker recuperates.
Give People Time

I’ve rushed back into the routine more than I should have, only to suffer the consequences of setbacks and greater fatigue on the following day. As a leader or individual contributor on a team, be sure to give yourself and your team time to “get back to normal.”

As the pandemic continues, we all hope for a future that greatly resembles normalcy, including less illness, fear and worry. In the meantime, being more aware of the concerns your team members are facing during COVID illness and recovery will greatly help you proactively prepare and weather the current storm. In addition, key lessons learned from this pandemic can be applied across a number of other organizational, employee life, and global concerns.

Reproduced with permission from SIOS

How to Get the Most from Your Tech Support Call

April 13, 2022 by Jason Aw

Technical support experts share their tips on how to fast-track issue resolution

SIOS provides high availability protection for our customers’ most critical applications, databases, and ERPs. When our customers call tech support, there is no time to waste. We’ve earned a reputation (and several awards) for our HA/DR expertise and support excellence.

We’ve asked our tech support team to share the following five questions that can fast-track your issue resolution.

Fast, Accurate Diagnosis

Thorough and accurate tech support is similar to diagnosing an illness. Imagine asking your doctor to treat a headache. The human body is a complex interaction of multiple systems. The source of your problem may not be obvious or even in your head. To diagnose the issue and recommend a treatment, your doctor typically begins with questions aimed at identifying the circumstances that caused your symptoms.

Failover clustering also involves multiple systems at every layer of the IT infrastructure – network, storage, OS, application, database, and server. And like your real headache, your HA issue is often caused by something unrelated to your HA clustering software. Like your doctor, a good support professional will ask a variety of questions to characterize your issue. The more information you can provide about your support issue, the faster and more effectively it can be diagnosed and resolved.

Fast-Tracking Issue Resolution

As an IT best practice, consider logging key information and system changes as an ongoing business exercise. By putting answers to the following key questions at your fingertips, this process will speed the diagnosis and fast-track issue resolution. (It may also help you prevent issues from occurring in the first place).

Can you describe the error you are receiving? What is the exact symptom you are witnessing that is causing concern?
When did it happen (the time, and the time zone you are in)?
A typical diagnostic method is to examine log files from the machine with issues. Log files can be hundreds of lines of message strings or command output. By tracking the precise time you noticed the problematic symptoms, we can significantly narrow the log file examination.
Have you uploaded the logs, or are you able to?
An explanation and description of the error, along with the timeframe in which it happened, goes a long way in diagnosis, provided the logs can be uploaded to the support ticket. In some IT environments, uploading the logs requires using corporate-approved file sharing, while dark sites allow no electronic distribution of system logs. If logs cannot be provided externally, be sure that the full logs are captured and archived for reference and review with the support agent as the case progresses. Applications and systems, especially those under duress, can produce exhaustive and extensive logs that can overwrite critical information.
Which system was the primary cluster node at the time?
Given the interconnected nature of clustering, it is important to inform your tech support representative of whether the cluster node you are calling about was functioning as the primary or secondary node at the time of the issue.
What have you tried to do to remedy the issue?
Great physicians know that their patients have likely tried a home remedy or over-the-counter medication prior to the visit. Knowing this information is helpful in diagnosis and treatment. The same applies with great support technicians. Sharing not only what you were trying to do at the time of the issue, but how you tried to resolve your errors can help them craft a better treatment and recovery plan, and make sure that their recommendations for recovery protect your critical data and applications.
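
To illustrate why the exact time of the symptom matters, here is a minimal Python sketch of narrowing a log to the lines around an incident. It is not a SIOS tool; the function name, the ten-minute window, the sample log lines, and the assumed line layout (a leading "YYYY-MM-DD HH:MM:SS" timestamp in the same time zone as the reported incident time) are all assumptions made for the example.

```python
from datetime import datetime, timedelta

# Assumed log line layout: "YYYY-MM-DD HH:MM:SS message". Real logs may differ.
TIMESTAMP_FORMAT = "%Y-%m-%d %H:%M:%S"
TIMESTAMP_LENGTH = 19


def lines_near_incident(lines, incident_time, window_minutes=10):
    """Return the log lines stamped within window_minutes of incident_time."""
    window = timedelta(minutes=window_minutes)
    selected = []
    for line in lines:
        try:
            stamp = datetime.strptime(line[:TIMESTAMP_LENGTH], TIMESTAMP_FORMAT)
        except ValueError:
            continue  # skip lines that do not start with a timestamp
        if abs(stamp - incident_time) <= window:
            selected.append(line)
    return selected


# Made-up sample lines, for illustration only.
sample_log = [
    "2022-04-13 09:00:01 heartbeat ok",
    "2022-04-13 09:41:57 resource check failed",
    "2022-04-13 09:42:10 failover started",
    "2022-04-13 11:15:00 heartbeat ok",
]
incident = datetime(2022, 4, 13, 9, 45, 0)
for entry in lines_near_incident(sample_log, incident):
    print(entry)
```

With a precise time and time zone, a filter like this leaves only a handful of lines to read instead of the whole file.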

For more than 20 years, the SIOS Customer Experience team has been helping enterprise customers implement HA/DR solutions for a wide range of use cases. We value our customers and encourage them to contact us whenever they have questions about their HA/DR.

Reproduced with permission from SIOS

Two Truths and a Lie: Understanding the Real Truth About Availability

April 9, 2022 by Jason Aw

We played two truths and a lie at a company event years ago. The game involved putting forth two true statements and one untrue statement to see if you could fool the most people. The winner put forth ideas that all seemed believable or unbelievable, depending on your own personal history. Here is what was said:

While growing up, my hometown had no stoplights.
My grandparents met in the second grade and married in their teens.
After graduation, I attended a prestigious out-of-state university in Georgia before transferring back home to attend an in-state university.

I grew up in a small community with no stoplights, so that one seemed possible, but I was skeptical. I’ve heard stories of people who met at an early age and got married in their teens, so that was possible but also the one that I might want to flag. The third one also seemed true, but I wondered who would transfer from a prestigious out-of-state university back to the no-stoplight hometown to attend an in-state college. For what seemed like an eternity, the entire group reasoned and pondered which of the three statements was a lie. And it seemed as if no one could spot it. Several of us reasoned that if the hometown had no stoplights, would it really have a university as well? A few took the line that it was unlikely he attended the prestigious out-of-state university, given his age, years with the company, and multiple degrees. After final deliberation, the group’s verdict was in: the two truths were number one and number two, and the lie was number three.

With all the information swirling around about High Availability, you might feel like you are playing a game of “Two Truths and a Lie.” Depending on where you look, you may find statements about availability that seem believable, but are not completely true when you dig in beneath the surface. For example, the following widely accepted statements are not actually true:

Storage availability is all that is needed for high availability

Applications require access to data to be effective and efficient. Your database will need to have access to storage if you are going to successfully run your enterprise. Your other enterprise applications likewise must have access to configuration files, data stores, and transaction and error log directories to be usable. But while reliable, readily accessible, and performant storage is essential for all enterprise systems, websites, databases, applications, and interconnects, storage availability alone is not all that is needed for high availability. There are more components that make up a sound, reliable, resilient high availability architecture than just storage.
Platform availability is all that is needed for high availability

With the continued development and growth of cloud computing, many enterprises searching for high availability are confused by the concept of platform availability. Platform availability, sometimes referred to as system availability or infrastructure availability, relates to the time that the platform (hardware, network, OS, and related components) is accessible and delivers its intended IT service. Applications and databases absolutely need compute, memory, storage, and network resources to operate properly and efficiently. Every service or function in your data center needs a reliable place to execute its logic, and without the underlying platform, these operations are not possible. Because of this, many consider that platform availability is all that is needed for high availability. As VP of Customer Experience, I have helped customers and partners understand the gaps between an available platform and available applications, databases, and client connectivity. In those conversations, we have discussed real examples of platforms showing no downtime or service issues while, at the same time, the enterprise applications running within that data center or cloud infrastructure were unavailable, unstable, or inaccessible to clients due to non-platform issues.

So What’s the Real Truth?

When our co-worker shared his three statements, we all got it wrong. His hometown was a small community whose borders were buffered by larger towns with stoplights, but his own town did not have one. And, as it turned out, he graduated early and went to that well-known, prestigious out-of-state institute of technology in Georgia before getting homesick and transferring to an in-state university back home. So the lie was about his grandparents. While they may or may not have met at an early age, they definitely did not meet in the second grade.

The truth about high availability is that storage availability and platform or infrastructure availability are not enough on their own. In order to create the most robust, available, resilient, and reliable high availability infrastructure you must also include a commercial-grade solution to provide application-aware monitoring, alerting and recovery. You’ll also want that solution to be knowledgeable of your storage’s high availability capabilities, have a strong awareness of the infrastructure’s nuances and gaps, and have the ability to leverage best practices across the entire architecture to help your applications, databases, and services achieve your business objectives.
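
The gap between an available platform and an available application can be sketched in a few lines of Python. This is only an illustration of the idea, not how any particular product works; the `ProbeResults` fields and both function names are hypothetical, and in practice each flag would come from a real check (for example, a host ping, a read from the data volume, and a real query against the database).

```python
from dataclasses import dataclass


@dataclass
class ProbeResults:
    """Outcome of three hypothetical health probes, one per layer."""
    platform_reachable: bool   # the VM or host answers
    storage_accessible: bool   # the data volume can be read
    application_healthy: bool  # the application answers a real request


def service_is_available(probes: ProbeResults) -> bool:
    """The service counts as available only when every layer is healthy."""
    return (
        probes.platform_reachable
        and probes.storage_accessible
        and probes.application_healthy
    )


def diagnose(probes: ProbeResults) -> str:
    """Name the lowest failing layer, checking from the platform upward."""
    if not probes.platform_reachable:
        return "platform outage"
    if not probes.storage_accessible:
        return "storage outage"
    if not probes.application_healthy:
        return "application outage on a healthy platform"
    return "available"


# Platform and storage report healthy, yet clients cannot use the application.
probes = ProbeResults(
    platform_reachable=True,
    storage_accessible=True,
    application_healthy=False,
)
print(service_is_available(probes), "-", diagnose(probes))
```

A monitor that only looks at the first two fields would report this case as healthy, which is exactly the blind spot that application-aware monitoring is meant to close.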

Improving Your Cloud Adoption Journey

March 19, 2022 by Jason Aw

In one way or another, the world-changing events of 2020 and 2021 have reshaped nearly everything that we knew, and high availability was no exception, as many companies fast-tracked their cloud adoption journey. Despite closures and restrictions, many IT teams traded on-prem data centers for the cloud. Many are asking, “Now what?” Here are five things to do to fix your cloud journey in 2022.

1. Add high availability to the cloud

In the push to the cloud, many IT and business leaders found themselves rushing to move services and applications from data centers that they were closing due to COVID-19 into the cloud. Others rushed to the cloud, not because of data center closures, but to deal with the wave of exploding demand from the sudden increase in remote working. For some, the journey to the cloud was so fast that high availability wasn’t included. Now they’ve discovered (the hard way) that applications still crash in the cloud and that unexpected outages and unplanned downtime are still the nemesis of AWS, Azure, and GCP deployments, just as they were in their previous data center.

The first step in fixing your cloud journey is to add high availability. This will mean several things to your enterprise:

Designing and architecting a highly available and redundant architecture
Choosing software and services that will protect critical components and applications
Defining and documenting associated processes and procedures, and at least a minimal governance
Deploying production copies for quality assurance, procedural testing, and chaos testing

2. Expand for higher availability for disaster recovery

Of course not everyone made the move to cloud without considering some form of high availability. Some IT teams had the foresight to not leave HA on-premises, but in the rush to cloud moved all of their critical servers to the same cloud Availability Zone. While having some HA protections is better than complete vulnerability, if you’ve only deployed your servers and applications in a single Availability Zone (AZ), now is the time to expand to multi-AZ for your standby cluster node, or even build in disaster recovery by deploying a third node in a different region. SIOS has helped dozens of customers plan multiple-AZ architectures and add disaster recovery solutions.
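
As a rough illustration of the placement advice above, the following Python sketch flags a cluster whose nodes all sit in one Availability Zone or one region. The function name, the node dictionaries, and the region and zone labels are all hypothetical; a real check would read placement from your cloud provider's inventory rather than a hand-written list.

```python
def placement_warnings(nodes):
    """Return warnings for a cluster described as a list of node dicts.

    Each node is a dict with "name", "region", and "zone" keys.
    """
    warnings = []
    zones = {(node["region"], node["zone"]) for node in nodes}
    regions = {node["region"] for node in nodes}
    if len(zones) < 2:
        warnings.append("all nodes share one Availability Zone")
    if len(regions) < 2:
        warnings.append("no node in a second region for disaster recovery")
    return warnings


# Hypothetical two-node cluster: multi-AZ, but still in a single region.
cluster = [
    {"name": "node-primary", "region": "region-a", "zone": "zone-1"},
    {"name": "node-standby", "region": "region-a", "zone": "zone-2"},
]
for warning in placement_warnings(cluster):
    print(warning)
```

Adding a third node in a different region to the `cluster` list would clear the remaining warning.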

3. Build your cloud journey team

Overnight some companies, and their IT teams, went from being fully on-premises to wrestling with Cloud Formation Templates, QuickStart Guides, IAM roles, internal load balancers, Overlay IPs, and deciphering what exactly that VM size means. Now is the time to build a team to support the journey to the cloud. This will mean several things:

a. Adding capacity. Unless you were able to pull off a complete lift and shift, you likely have the same staff managing cloud and on-premises applications. Legacy solutions are known for being temperamental and requiring a lot of work to keep them stable and available. To navigate the cloud journey ahead, you’ll need capacity capable of addressing availability requirements, understanding cloud architecture, and plotting the course forward for enterprise needs.

b. Augmenting skills with training. Give your IT team training for the cloud. To manage and plan the course forward, look for ways to augment the IT excellence within your organization with additional training on cloud solutions, architecture, best practices, and trade-offs. A confidently trained staff will not only pay dividends in increased availability, but they will also pay dividends by addressing availability, maintenance, and growth in an economic, scalable and logical way. Translation: they’ll avoid wasting money as they build out the rest of your cloud infrastructure.

4. Integrate automation and analytics to ensure uptime

As VP of Customer Experience at SIOS Technology Corp., I have worked with several companies that made the move to cloud in 2021 without sacrificing HA, DR, or their team. If you took achieving the required number of nines of uptime (99.99%) seriously and treated having a disaster plan as non-negotiable, then it’s time to add the rigor of analytics and additional monitoring. Ensure that your availability solution has application-aware automation and orchestration for recovery in the event of a disaster or unplanned downtime. Add analytics and automation to solidify your solution and take your cloud migration up another notch, from reactive failovers to proactive notification and mitigation of a failure before it occurs. Imagine being notified of underperforming applications, or of increasing latency, errors, or VM non-responsive behavior, in time to avoid downtime in peak business times. Analytics are also important because they can reveal systems and applications that may have escaped your original availability architecture.
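
To make the "number of nines" concrete, here is a small Python sketch that converts an uptime target into a yearly downtime budget. The function name is made up for the example, and the calculation assumes a 365-day year. At 99.99%, the budget works out to roughly 52.56 minutes of downtime per year.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def allowed_downtime_minutes(availability_percent):
    """Minutes of downtime per year permitted by an uptime target."""
    return MINUTES_PER_YEAR * (100 - availability_percent) / 100


for target in (99.0, 99.9, 99.99, 99.999):
    minutes = allowed_downtime_minutes(target)
    print(f"{target}% uptime allows about {minutes:.2f} minutes of downtime per year")
```

Each extra nine cuts the budget by a factor of ten, which is why a single slow manual recovery can use up a whole year's allowance.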

5. Update IT processes and governance

Many things we think of as a failure are rooted in a failure of process. Make sure that your organization’s processes are up to date, well-documented, properly communicated, and adhered to. These processes should contain a few key minimums related to who, what, when, where and how all tied back to the business strategies, goals, and organizational needs as they pertain to the customer.

Make sure that ownership and sign-off processes for your new cloud environment are well-documented. I have seen firsthand the frustration that comes from conflicting, clashing, or unresolved roles and responsibilities for customers who have moved from hardware teams that acquire infrastructure to cloud teams. Muddling through a migration is one set of pain points; digging out of a disaster without clear governance is a much bigger, more costly issue.

If you’ve made the leap to cloud, staying there and making it work for you is the next part of the journey. If your cloud journey was sudden or rocky, consider these five points for improving your cloud journey and know that SIOS Technology can help you improve not only your high availability in the cloud, but also your processes for running in the cloud.

-Cassius Rhue, VP, Customer Experience

Reproduced with permission from SIOS

Highly Available or Highly Vulnerable? A Checklist for High Availability

March 8, 2022 by Jason Aw

It’s no secret that businesses of all sizes have an ever-growing need for IT systems. But IT systems are only effective for these businesses and their clients if they are operational, resilient and highly available. As enterprises look to build out their enterprise availability, having a baseline for weighing and assessing your vulnerability can be the difference that produces a successful merger of infrastructure, software, services and support that increases your success.

Sometimes, the most basic of checklists can help you sort through whether your solution is highly available or highly vulnerable.

Does your organization have the proper infrastructure to support high availability?

Do your data centers have environmental sensors in place to measure building systems?
Do your data centers have 24x7x365 operations?
Does your data center include redundant power and network connectivity from diverse sources?
Does your data center include multiple layers of host and storage services?
As VP of Customer Experience, I have seen customers attempt to create a highly available solution without addressing fundamental foundational issues within their infrastructure.

They deploy software but have instability within the network infrastructure, servers, and datacenter itself. Cloud addresses a lot of the infrastructure issues, but not all cloud platforms are architected the same. Be sure to understand your datacenter, on-premises or cloud.

Does your organization have a runbook (or playbook) in place that covers design, architecture, and process?

Is your runbook well documented, publicized and easily accessible?
Are routine parts of your runbook sufficiently automated?
Who has access to your enterprise runbook?
Is it current and currently maintained?
Is there version control for your runbook and any automation tools therein?

If you answered, “What is a runbook or playbook?” then your first step is to find or create one. A runbook (or playbook) helps your organization maintain systems and processes with respect to the highly available system architecture. Some companies use automated tools to create scripts that deploy and configure servers; others use a version-controlled document to outline how all things work together to provide resilience and success. Your team needs to have a place where newcomers and existing team members can go to understand the environment, the process, and the tools being used.

Does your organization have resources dedicated to maintaining high availability best practices?

Does your organization give these employees and contractors support and training?
Does your organization give these teams autonomy to adapt and create better best practices?

“I didn’t set these systems up,” the IT Admin stated. “I just inherited these systems with some other servers.” The lament was honest and reflects an often observed phenomenon in organizations. Whether it is the result of mergers and acquisitions, cost reductions, outsourcing, or general staff turnover, a key component of a highly available enterprise is sufficient staffing. A key to a highly vulnerable enterprise is staffing that is insufficient, undertrained, or undersupported.

Does your organization have proper change management controls in place?

Do you have a regular update policy and schedule?
Do you have a defined process on patch maintenance?
Do you have a review process in place for patches (vulnerabilities, threats, etc)?

Change management is important. Change management controls and policies are an absolute must in reducing risk and making sure that your systems are available. A user without proper restraints can add packages or updates that destroy stability, or make changes that disrupt the organization for hours. In addition, not having a defined policy often creates drift between what is expected (documented) and the actual (what is in place). Change management is also critical to ensure that your standby cluster is at the same patch and software levels as the primary/source system, and that QA (or Pre-Production) is not grossly deviating from Production.
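
One simple way to picture drift between a primary and a standby node is to compare their installed package versions. The Python sketch below does that for two hand-written inventories; the function name, package names, and version strings are invented for the example, and gathering the real inventories (from a package manager or a configuration management tool) is left out.

```python
def find_drift(primary, standby):
    """Compare two package-to-version mappings.

    Returns a dict of package -> (primary_version, standby_version) for every
    package whose version differs or that is missing on one side (None).
    """
    drift = {}
    for package in sorted(set(primary) | set(standby)):
        primary_version = primary.get(package)
        standby_version = standby.get(package)
        if primary_version != standby_version:
            drift[package] = (primary_version, standby_version)
    return drift


# Hypothetical inventories, for illustration only.
primary_packages = {"kernel": "5.4.0-100", "database": "13.6", "cluster-agent": "9.5"}
standby_packages = {"kernel": "5.4.0-96", "database": "13.6"}

for package, (on_primary, on_standby) in find_drift(primary_packages, standby_packages).items():
    print(f"{package}: primary={on_primary} standby={on_standby}")
```

Running a comparison like this on a schedule, and after every patch window, is one way to catch drift before a failover exposes it.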

Does your organization have proper access controls in place?

Do you have account management tiers for server administration?
Do you have controls to prevent accidental downtime?

Our Services team joined a customer call and waited, and waited, and waited for the administrator with permissions to run a set of elevated commands to join the session to configure and update their software. Weeks later, our team joined a different customer call and watched in horror as multiple users, all with administrative privileges, ran a bevy of commands on the same cluster. The difference between the two calls pointed out with stunning clarity that access controls are important. A highly available enterprise needs to ensure that proper access controls are in place to prevent users from running elevated commands that could damage the configuration or diminish its operation. Be sure that users have limits on what they can do based on their roles, needs, and even experience.
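
The idea of limiting actions by role can be sketched as a small permission table. The Python below is an assumption-laden illustration, not a description of any product's access model: the role names, the action names, and the `is_allowed` helper are all hypothetical.

```python
# Hypothetical roles and the cluster actions each one may perform.
ROLE_PERMISSIONS = {
    "viewer": {"status"},
    "operator": {"status", "switchover"},
    "admin": {"status", "switchover", "reconfigure"},
}


def is_allowed(role, action):
    """Return True when the role's permission set includes the action."""
    return action in ROLE_PERMISSIONS.get(role, set())


for role, action in [("viewer", "reconfigure"), ("operator", "switchover"), ("admin", "reconfigure")]:
    verdict = "allowed" if is_allowed(role, action) else "denied"
    print(f"{role} -> {action}: {verdict}")
```

Unknown roles fall through to an empty permission set, so the default is to deny. Pairing a table like this with a named backup for each elevated role addresses both calls described above: nobody waits on a single administrator, and not everyone holds administrative privileges.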

Does your company have a regular test process?

Does your organization test in a pre-production or QA environment prior to production?
Does your organization perform regular backups and backup testing?
Does your organization practice disaster recovery scenarios and chaos testing for continuous improvement?

Testing takes time, but in my role of assisting customers with their cloud migrations and high availability deployments, the time has always been well spent. Often, the difference between the highly available and the highly vulnerable can come down to the customer or partner’s test process. As solutions become more complex, testing and validation are becoming more and more essential to reducing risk and vulnerabilities. If everything goes straight from design to production, you’re running a highly vulnerable system. But if you’ve got tests and checkpoints, a process to verify changes before they make it into production, your risks are significantly reduced. During my time as VP of Customer Experience, our services team worked with a banner customer who deployed their systems for an entire year in QA before completing their go-live migration. Over that year they simulated outages, disasters, customer loads, downtime, maintenance, patching strategies, backups, recovery from backup, and a bevy of other test suites. Consequently, they’ve had remarkable results in performance, process adherence, high availability, and enterprise success.

While no checklist will be able to cover every potential vulnerability in high availability, answering these questions will give you a strong foundation for understanding if your enterprise is highly available or highly vulnerable.

Reproduced with permission from SIOS