SIOS APAC Portal

May 26, 2024	SIOS Technology helps strike the balance between high availability and cloud costs
Finding the right balance between high availability and cost optimization can be challenging. Dave Bermingham, Senior Technical Evangelist at SIOS Technology, talks about some of the key factors influencing cloud costs and some of the strategies for optimizing them. He says, “We focus on practical and effective strategies that will help reduce the costs associated with not only deploying high availability, but also in minimizing unexpected downtime besides minimizing downtime associated with planned maintenance.”
Key factors influencing cost in cloud environments
Key factors in controlling cloud costs in high availability configurations include efficient resource management, strategic architecture decisions, and continuous monitoring. Bermingham discusses why it is crucial to choose the right instance type and size to match the workload requirements and how autoscaling can help reduce costs and optimize cloud spend. He also highlights the importance of considering data transfer costs when high availability solutions are deployed across multiple time zones, along with strategies you can use to minimize those charges. Other key considerations include optimizing storage and implementing effective governance and cost management policies.
Finding the balance between high availability and cost optimization in the cloud
Bermingham explains that although high availability incurs some expense, that expense offsets the costs associated with downtime, which can be substantial. It is important to strike a balance between high availability and minimizing cloud costs by creating systems that are modular and scalable, backed by an operational strategy that embraces a DevOps culture and uses CI/CD practices.
Cloud cost optimization and high availability challenges
Bermingham highlights common pitfalls of optimizing costs without compromising high availability, such as underestimating the complexity of cloud cost management and neglecting the importance of application performance monitoring. Inadequate training on cloud cost optimization best practices and implementing HA solutions can often lead to inefficient resource utilization and unplanned downtime.
How SIOS Technology’s high availability solutions help
Bermingham explains how SIOS Technology can help address these challenges with HA solutions that simplify and automate HA in different cloud environments to minimize costs, minimize downtime, and manage maintenance.
Reproduced with permission from SIOS
May 22, 2024	SIOS LifeKeeper for Linux v 9.8.1 improves the way companies manage HA/DR
In today’s tech-driven landscape, companies are seeking innovative solutions to effectively maintain their complex application environments. In this video, Todd Doane, sales engineer at SIOS Technology, explains how the latest version of SIOS LifeKeeper for Linux helps companies safeguard critical enterprise systems against downtime and disasters. “The release features a new Web Management Console. It’s self-contained and does not require additional installations or third-party add-ons,” says Doane.
Reproduced with permission from SIOS
May 17, 2024	Choosing Between GenApp and QSP: Tailoring High Availability for Your Critical Applications
GenApp or QSP? Both solutions are supported by LifeKeeper and help protect against downtime for critical applications, but understanding the nuances between them is important to choosing the correct one for your specific needs. Here are some features, benefits and potential use cases to help you decide which may work best in your environment.
GenApp, short for Generic Application, is a resource type that allows you to manage custom applications within LifeKeeper. With the flexible framework you can use your own scripts to do a variety of tasks that your application might require to automate the failover and recovery process. This flexibility allows granular control of how LifeKeeper handles startup, shutdown, monitoring, logging actions and more to ensure your applications’ high availability.
QSP, or Quick Service Protection, is designed to be a quick and easy way to protect an OS service. QSP automates the monitoring, failover and recovery of these services with built-in adjustable timeouts for these actions. Additionally, you can create a dependency relationship so that services can be started and stopped in conjunction with other applications that require the service.
How do I choose the right solution?
The first thing you need to determine is whether your application can be recovered by stopping and restarting the service or daemon. If so, then QSP is probably the best and quickest solution for keeping your application up and running. This is because it requires no coding, and within minutes you can add the application as a QSP resource within the LifeKeeper GUI. Additionally, it is part of the core product and any coding updates are included in new product releases.
However, if your application requires anything other than simple health check and restart capabilities at the OS service level to recover properly, then you will want to explore GenApps. Creating the custom scripts for the GenApp resource type will require more in-depth technical skill and long-term maintenance. However, the flexibility to do whatever tasks are needed to keep your application running smoothly is critical, especially for niche applications. These tasks could be anything from monitoring and logging to cleanup tasks or configuration changes.
Want more technical details?
GenApps and QSP are supported on both LifeKeeper for Linux and LifeKeeper for Windows. More technical details can be found at the links below.
GenApp for LifeKeeper for Linux
GenApp for LifeKeeper for Windows
QSP for LifeKeeper for Linux
QSP for LifeKeeper for Windows
Reproduced with permission from SIOS
May 11, 2024	What Causes Failovers to Happen?
Working in support, one of the most common questions we get from customers is “What prompted the failover from my primary node to the secondary node?” There are several reasons this might occur, and we will attempt to explain the most common causes and how you can identify them.
Before we get started, let’s differentiate between a ‘failover’ and a ‘switchover’, since many customers use these terms interchangeably. A ‘switchover’ is the act of manually moving your hierarchy from the primary node to the secondary node. This can be done through the GUI, by performing an ‘In Service’ on the secondary node, or through the command line:
perform_action -a restore -t $LKTag (bring hierarchy into service)
A ‘failover’, on the other hand, is performed without any manual interaction and is defined as automatic switching to a backup server upon the failure of the previously active server, application, or hardware/network. Failover and switchover are essentially the same operation, except that failover is automatic and usually operates without warning, while a switchover is intentional and requires human intervention.
The following are the most common ‘failures’ that initiate a ‘failover’:
1. Server Level Causes
Server Failure
● Primary server loses power or is turned off.
● CPU usage caused by excessive load: under very heavy I/O loads, delays and low memory conditions can cause the system to become unresponsive such that LifeKeeper may detect a server as down and initiate a failover.
● Quorum/Witness: as part of the I/O fencing mechanism of quorum/witness, when a primary server loses quorum, a “fastboot”, “fastkill” or “osu” is performed (based on settings) and a failover is initiated. When determining when to fail over, the witness server allows resources to be brought in service on a backup server only in cases where it verifies the primary server has failed and is no longer part of the cluster.
This will prevent failovers from happening due to simple communication failures between nodes when those failures don’t affect the overall access to, and performance of, the in-service node.
Communication (Heartbeat) Failure
LifeKeeper has a built-in “heartbeat” signal that periodically notifies each server in the configuration that its paired server is operating. By default, LifeKeeper sends the heartbeat between servers every five seconds (this is adjustable for busy clusters). If a communication problem causes the heartbeat to skip two beats but it resumes on the third heartbeat, LifeKeeper takes no action. However, if the communication path remains dead for three beats, LifeKeeper will label that communication path as dead. It will initiate a failover if the redundant communication path is also dead (we recommend two paths).
The following can result in a missed heartbeat:
● Network connection to the primary server is lost.
● Network latency. Heavy network traffic on a TCP comm path can result in unexpected behavior, including false failovers and LifeKeeper initialization problems.
● Failed NIC.
● Failed network switch.
● Manually pulling/removing network connectivity.
● Primary server loses power or is turned off.
● CPU usage caused by excessive load: under very heavy I/O loads, delays and low memory conditions can cause the system to become unresponsive such that LifeKeeper may detect a server as down and initiate a failover.
Adjusting the heartbeat parameter:
● LCMNUMHBEATS=Y (where Y is the number of missed heartbeats before a communication path failed error is written to the log). The default is 3 and can be changed if your systems are busy or across a WAN to avoid false communication path failures.
● LCMHBEATTIME=5 (the interval in seconds; this is the default and should not be changed).
These tunables are NOT in the /etc/default/LifeKeeper file by default. You will need to add them to change the heartbeat values.
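As a rough illustration of how the two tunables above combine, the following Python sketch multiplies the heartbeat interval by the number of allowed missed beats to estimate how long a comm path can stay silent before it is marked failed. The helper names (parse_tunables, comm_path_timeout) and the sample text are hypothetical and not part of LifeKeeper; the sketch only assumes the KEY=VALUE form shown above.

```python
# Hypothetical helpers for illustration only; they are not LifeKeeper tools.

def comm_path_timeout(lcm_hbeat_time: int = 5, lcm_num_hbeats: int = 3) -> int:
    """Seconds of silence before a comm path would be marked failed.

    Defaults mirror the values quoted in the article (5 seconds, 3 beats).
    """
    if lcm_hbeat_time <= 0 or lcm_num_hbeats <= 0:
        raise ValueError("both tunables must be positive")
    return lcm_hbeat_time * lcm_num_hbeats


def parse_tunables(text: str) -> dict:
    """Pick the two heartbeat tunables out of KEY=VALUE text."""
    wanted = {"LCMHBEATTIME", "LCMNUMHBEATS"}
    found = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if "=" not in line:
            continue
        key, value = line.split("=", 1)
        if key.strip() in wanted:
            found[key.strip()] = int(value.strip())
    return found


# Sample text standing in for lines added by hand to /etc/default/LifeKeeper.
sample = """
# heartbeat tunables (not present in the file by default)
LCMHBEATTIME=5
LCMNUMHBEATS=48
"""

settings = parse_tunables(sample)
window = comm_path_timeout(
    settings.get("LCMHBEATTIME", 5),
    settings.get("LCMNUMHBEATS", 3),
)
print(f"comm path marked failed after about {window} seconds")
```

With the defaults the window is 15 seconds; with LCMNUMHBEATS=48, as in this sample, it grows to 240 seconds, which trades slower failure detection for fewer false comm path failures on busy or WAN-connected systems.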
After adding these tunables and values in /etc/default/LifeKeeper, you need to stop LifeKeeper and restart it. You can use the command lkstop -f, which stops LifeKeeper but does not bring down the protected applications. You need to do this on both systems. This will allow 5 times Y seconds before LifeKeeper marks the communication paths as failed.
What is a Split-Brain, and what causes it?
If a single comm path is used and the comm path fails, then LifeKeeper hierarchies may try to come into service on multiple systems simultaneously. This is known as a false failover or a “split-brain” scenario. In the “split-brain” scenario, each server believes it is in control of the application and thus may try to access and write data to the shared storage device. To resolve the split-brain scenario, LifeKeeper may cause servers to be powered off or rebooted or leave hierarchies out-of-service to assure data integrity on all shared data. Additionally, heavy network traffic on a TCP comm path can result in unexpected behavior, including false failovers and the failure of LifeKeeper to initialize properly.
The following are scenarios that can cause split-brain:
● Any of the comm failures listed above
● Improper shutdown of LifeKeeper
● Server resource starvation
● Losing all network paths
● DNS or other network glitch
● System lockup
Using Quorum/Witness to prevent a Split Brain
The Quorum/Witness Server Support Package for LifeKeeper (steeleye-lkQWK, hereinafter “Quorum/Witness Package”) combined with the existing failover process of the LifeKeeper core allows system failover to occur with a greater degree of confidence in situations where total network failure could be common. This effectively means that local site failovers and failovers to nodes across a WAN can be done while greatly reducing the risk of split-brain situations. In a distributed system that takes network partitioning into account, there is a concept called quorum to obtain consensus across the cluster.
A node having quorum is a node that can obtain the consensus of the cluster and is allowed to bring resources in service. On the other hand, a node not having quorum is a node that cannot obtain the consensus of the cluster and is not allowed to bring resources in service. This will prevent split brain from happening. Checking whether a node has quorum is called a quorum check. The result is expressed as “quorum check succeeded” if the node has quorum, and “quorum check failed” if it does not.
In case of a communication failure, consulting one or more other nodes (or other devices) in addition to the node where the failure occurred allows a node to get a “second opinion” on the status of the failing node. The node that provides the “second opinion” is called a witness node (or a witness device), and getting a “second opinion” is called witness checking. When determining when to fail over, the witness node (the witness device) allows resources to be brought in service on a backup server only in cases where it verifies the primary server has failed and is no longer part of the cluster. This will prevent failovers from happening due to simple communication failures between nodes when those failures don’t affect the overall access to, and performance of, the in-service node. During actual operation, the witness node (the witness device) will be consulted when LifeKeeper is started or the failed communication path is restored. Witness checking can only be performed for nodes having quorum.
2. Resource Failure Causes
LifeKeeper is designed to monitor individual applications and groups of related applications, periodically performing local recoveries or notifications when protected applications fail. Related applications, for example, are hierarchies where the primary application depends on lower-level storage or network resources. LifeKeeper monitors the status and health of these protected resources.
If the resource is determined to be in a failed state, an attempt will be made to restore the resource or application on the current system (in-service node) without external intervention. If this local recovery fails, a resource failover will be initiated.
Application Failure
● An application failure is detected, but the local recovery process fails.
● Remove Failure: During the resource failover process, certain resources need to be removed from service on the primary server and then brought into service on the selected backup server to provide full functionality of the critical applications. If this remove process fails, a reboot of the primary server will be performed, resulting in a complete server failover.
Examples of remove failures:
● Unable to unmount file system
● Unable to shut down protected application (oracle, mysql, postgres, etc.)
File System Issues
● Disk Full: LifeKeeper’s File System Health Monitoring can detect disk full file system conditions which may result in failover of the file system resource.
● Unmounted or Improperly Mounted File System: User manually unmounts or changes options on an in-service and LK protected file system.
● Remount Failure: The following is a list of common causes for remount failure which would lead to a failover: corrupted file system (fsck failure); failure to create mount point directory; mount point is busy; mount failure; LifeKeeper internal error.
IP Address Failure
When a failure of an IP address is detected by the IP Recovery Kit, the resulting failure triggers the execution of the IP local recovery script. LifeKeeper first attempts to bring the IP address back in service on the current network interface. If the local recovery attempt fails, LifeKeeper will perform a failover of the IP address and all dependent resources to a backup server. During failover, the remove process will un-configure the IP address on the current server so that it can be configured on the backup server.
Failure of this remove process will cause the system to reboot.
● IP conflict
● IP collision
● DNS resolution failure
● NIC or Switch Failures
Reservation Conflict
● A reservation to a protected device is lost or stolen
● Unable to regain reservation or control of a protected resource device (caused by manual user intervention, HBA or switch failure)
SCSI Device
● Protected SCSI device could not be opened. The device may be failing or may have been removed from the system.
Resources for identifying the cause of a failover
/var/log/lifekeeper.log
This log file, written by LifeKeeper, should be the first place you look in determining what may have caused a failover. For example, one of the most common reasons is a comm path failure. Below is an example of the entries you will find in the lifekeeper.log when this occurs:
Sep 21 11:06:57 es1ecc08tev lcm[46893]: INFO:lcm.tli_hand:::005257:missed heartbeat 1 of 48 on dev 10.236.17.226/10.238.17.226 (lcm driver number = 129).
Sep 21 11:06:57 es1ecc08tev lcm[46893]: INFO:lcm.tli_hand:::005257:missed heartbeat 1 of 48 on dev 10.236.17.226/10.237.17.226 (lcm driver number = 1360929).
Sep 21 11:07:02 es1ecc08tev lcm[46893]: INFO:lcm.tli_hand:::005257:missed heartbeat 2 of 48 on dev 10.236.17.226/10.238.17.226 (lcm driver number = 129).
After it reaches the maximum number of missed heartbeats, the failover begins:
Sep 21 11:10:49 es6ecc08tev lcm[9416]: INFO:lcm.tli_hand:::005257:missed heartbeat 47 of 48 on dev 10.237.17.226/10.236.17.226 (lcm driver number = 71).
Sep 21 11:10:49 es6ecc08tev eventslcm[47082]: WARN:lcd.net:::004258:Communication to es1ecc08tev by 10.237.17.226/10.236.17.226 FAILED
Sep 21 11:10:49 es6ecc08tev eventslcm[47082]: WARN:lcd.net:::004261:COMMUNICATIONS failover from system “es1ecc08tev” will be started.
Sep 21 11:10:49 es6ecc08tev lifekeeper[47121]: NOTIFY:event.comm_down:::010466:COMMUNICATIONS es1ecc08tev FAILED
/var/log/messages
This Linux-generated file typically contains system messages generated by various processes and services running on the system. These messages can include:
● System boot messages: Information about the system boot process, including kernel messages and messages from systemd or other init systems.
● Service startup and shutdown messages: Messages indicating when services are started or stopped, including any errors or warnings encountered during the process.
● Kernel messages: Information about the operation of the Linux kernel, including hardware detection, device initialization, and kernel errors or warnings.
● Network-related messages: Information about network connections, firewall activity, and network configuration changes.
● System performance information: Messages related to system performance monitoring, such as CPU usage, memory usage, and disk I/O statistics.
SIOS High Availability and Disaster Recovery
SIOS Technology Corporation provides High Availability and Disaster Recovery products that protect & optimize IT infrastructures with cluster management for your most important applications. Contact us today for more information.
Reproduced with permission from SIOS
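As a companion to the lifekeeper.log excerpt quoted in the failover article above, here is a minimal Python sketch that scans log lines for missed-heartbeat and comm path failure entries. The regular expressions are modeled only on the sample entries shown in that article, so the exact message format is an assumption, and the function name summarize is hypothetical rather than part of any SIOS tool.

```python
import re

# Patterns inferred from the sample entries in the article (an assumption,
# not a documented log format).
MISSED = re.compile(r"missed heartbeat (\d+) of (\d+) on dev (\S+)")
FAILED = re.compile(r"Communication to (\S+) by (\S+) FAILED")


def summarize(lines):
    """Return (worst, failed).

    worst maps each comm path to (highest missed count seen, limit);
    failed lists (remote host, comm path) pairs reported as FAILED.
    """
    worst = {}
    failed = []
    for line in lines:
        m = MISSED.search(line)
        if m:
            count, limit, path = int(m.group(1)), int(m.group(2)), m.group(3)
            if count > worst.get(path, (0, limit))[0]:
                worst[path] = (count, limit)
            continue
        f = FAILED.search(line)
        if f:
            failed.append((f.group(1), f.group(2)))
    return worst, failed


# Entries copied from the article's excerpt.
sample = [
    "Sep 21 11:06:57 es1ecc08tev lcm[46893]: INFO:lcm.tli_hand:::005257:missed heartbeat 1 of 48 on dev 10.236.17.226/10.238.17.226 (lcm driver number = 129).",
    "Sep 21 11:07:02 es1ecc08tev lcm[46893]: INFO:lcm.tli_hand:::005257:missed heartbeat 2 of 48 on dev 10.236.17.226/10.238.17.226 (lcm driver number = 129).",
    "Sep 21 11:10:49 es6ecc08tev lcm[9416]: INFO:lcm.tli_hand:::005257:missed heartbeat 47 of 48 on dev 10.237.17.226/10.236.17.226 (lcm driver number = 71).",
    "Sep 21 11:10:49 es6ecc08tev eventslcm[47082]: WARN:lcd.net:::004258:Communication to es1ecc08tev by 10.237.17.226/10.236.17.226 FAILED",
]

worst, failed = summarize(sample)
for path, (count, limit) in worst.items():
    print(f"{path}: missed up to {count} of {limit} heartbeats")
for host, path in failed:
    print(f"comm path {path} to {host} FAILED")
```

In practice the same function could be fed the lines of /var/log/lifekeeper.log; a path whose missed count climbs toward its limit shortly before a FAILED entry points to a comm path failure as the trigger for the failover.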
May 5, 2024	Three Tips for Better Support
Betsy was a 1999 Amazon Green Ford F-150, the first vehicle I ever purchased. I’m not sure how my truck got the name Betsy or why it stuck, but it did. For over 17 years, Betsy did everything from cruise the beach to race on the race strip, haul tons of landscaping supplies, and take my growing family across the southeast. After a lot of miles and years of learning how to care for a truck, she began showing the wear. On a particular afternoon drive, I noticed the temperature gauge creeping up to H (High). After a few conversations, I took Betsy to the Service department at a local dealership for the start of a week-long, self-inflicted ordeal. On the first visit, I hastily provided the high-level details. “After a few minutes, the truck runs hot,” I said. Six hours and $100 later, I retrieved my truck. The technician couldn’t reproduce the issue. So, I was sent home with a diagnostic fee and a request to come back if it happens again. On the second visit, I hastily added that the problem happened after 18 minutes or 14 miles of driving more than 45 minutes on the commute. Six hours and about $375 later, I retrieved my truck. The technician was able to reproduce the problem with the new details, and they replaced the thermostat and the hoses. On the third visit, the call from the technician came early, “Mr. Rhue, you are going to need a new radiator.” That’s the short version of the story. The longer version includes my failure to explain to the service technician that in between the first and second visit I had already replaced the thermostat. It also leaves out the fact that I performed a flush and fill of the radiator fluid and most likely left the hose clamp loose in the process. Most of all, it leaves out the fact that my neighbor, a mechanic, told me before the truck ever had this problem, to replace the radiator and perform other preventative maintenance.
Now, what does any of this have to do with better Customer Experience? Here are three lessons from my self-inflicted ordeal that will improve your customer experience, not just your next automotive service. First, get and give all the details. On my first visit, I hastily provided the minimum details to the service technician. As a result, the proper resolution could not be achieved. Many events in the world occur at the most inopportune times, and bring with them a lot of pressure and time constraints, but it is still a best practice to provide your Customer Experience team with as many details as possible. When did you notice the issue, or when did the problem happen? What did you notice or what were the symptoms of the issue? What other things were going on at the time? Give thought to any other supporting details that you may be able to provide, including error messages and error codes, software system logs, client logs, and any pictures capturing error conditions or symptoms. Many times we like to think things in software are unrelated, when in fact they are very much related. Second, describe what you have done (good or bad). When I came in for the second visit, I did myself and the technicians another great disservice. Rather than explaining all the things that I had already tried (good and bad), and sharing about the failed attempts to resolve the problem, I delayed my resolution. If I had shared the fact that I had already replaced the thermostat, performed flushing and refilling of the radiator, perhaps the technician would have looked elsewhere for the problem. When you share what you have done to remedy the problem, and what you may have done to make it worse, it helps your Customer Experience team improve their responses, hone in on other problem areas, eliminate spurious red-herrings (unrelated issues or things masquerading as real problems), and provide an overall more excellent experience. Lastly, execute on previous recommendations. 
Before the problem surfaced, my neighbor provided recommendations based on his years of experience and the age of my truck. He told me to replace the radiator, perform some preventative maintenance, and do routine checkups for the overall health of the truck. Most likely, your Customer Experience team has recommendations in their knowledge base related to your product and years of experience that relate to operating in an enterprise availability requirement. Use those for preventative maintenance, proactive adjustments, and checking your availability environment for its adherence to those best practices. But most importantly, when they make a recommendation, execute it. In the end, you’ll save a lot of time, money and hassle. Two days after the third visit, the backorder for a new radiator arrived and I replaced my radiator. I continued to drive Betsy for several more years before finally exchanging it for a family SUV. Reproduced with permission from SIOS
