Date: March 1, 2017
Tags: #over provisioning, Machine Learning, rogue VMDKs, snapshot waste
Right-Sizing VMware Environments with Machine Learning
According to leading analysts, today’s virtual data centers are as much as 80 percent overprovisioned, an issue that wastes tens of thousands of dollars annually. The risks of overprovisioning virtual environments are urgent and immediate. IT managers face a variety of challenges in correctly provisioning a virtual infrastructure. They need to stay within budget while avoiding downtime, delivering high performance for end-user productivity, ensuring high availability and meeting a variety of other service requirements. IT teams often deal with the fear of application performance issues by simply throwing hardware at the problem to avoid any possibility of under-provisioning. However, this strategy drives costly overspending and drains precious IT time. Even worse, when it comes time to compare the economics of on-premises hosting vs. cloud, the costs of on-premises infrastructure are greatly inflated when resources aren’t being used efficiently. This can lead to poor decisions when planning a move to the cloud.
With all of these risks in play, how do IT teams know when their VMware environment is optimized?
Having access to accurate information that is simple to understand is essential. The first step in right-sizing application workloads is understanding the patterns of the workloads and the resources they consume over time. However, most tools take a simplistic approach when recommending resource optimization: they use simple averages of metrics about a virtual machine. This approach doesn’t give accurate information. Averages hide the peaks and valleys of usage and ignore the interrelationships among resources, so reconfiguring a VM based on them can cause unanticipated consequences for other applications. To get the right information and make the right decisions for right-sizing, you need a solution such as SIOS iQ. SIOS iQ applies machine learning to learn patterns of behavior of interrelated objects over time and across the infrastructure to accurately recommend optimizations that help operations, not hurt them. Intelligent analytics beats averaging every time.
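To see why averages mislead, consider a minimal Python sketch. It is an illustration only, not how SIOS iQ works: the sample data and the sizing_summary helper are hypothetical. It compares the mean CPU utilization of one VM with its 95th percentile and its peak.

```python
from statistics import mean, quantiles


def sizing_summary(cpu_percent_samples):
    """Compare the mean with the 95th percentile and peak for one VM's CPU samples."""
    avg = mean(cpu_percent_samples)
    # quantiles(n=20) returns 19 cut points; the last one is the 95th percentile.
    p95 = quantiles(cpu_percent_samples, n=20, method="inclusive")[-1]
    return {"mean": avg, "p95": p95, "peak": max(cpu_percent_samples)}


# Hypothetical VM: near idle for 22 hourly samples, busy during a two-hour batch window.
samples = [10.0] * 22 + [90.0] * 2
summary = sizing_summary(samples)
print(summary)
```

Sizing this VM from its mean would leave it starved during the batch window, while the percentile and peak show the demand that actually has to be served. Even this view is limited to a single metric on a single VM and says nothing about how resources interact.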
The second step towards a right-sizing strategy is eliminating the fear of dealing with performance issues when a problem happens or even preventing one in the first place. This means having confidence that you have the accurate information needed to rapidly identify and fix an issue instead of simply throwing hardware at it and hoping it goes away.
Today’s tools are not very accurate. They lead IT through a maze of graphs and metrics without clear answers to key questions. IT teams typically operate and manage environments in separate silos (storage, networks, applications and hosts), each with its own tools. Understanding the relationships among all the infrastructure components requires a lot of manual work and digging. Further, these tools don’t deliver information; they only deliver marginally accurate data, and they require IT to do a lot of work to get that inaccurate data. That’s because they are threshold-based. IT has to set individual thresholds for each metric they want to measure, such as CPU utilization, memory utilization and network latency. A single environment may need to set, monitor, and continuously tune thousands of individual thresholds. Every time the environment is changed, such as when a workload is moved or a new VM is created, the thresholds have to be readjusted. When a threshold is exceeded, these tools often create thousands of alerts, burying important information in “alert storms” with no root cause identified or resolution recommended.
Even more importantly, because these alerts are triggered off measurements of a single metric on a single resource, IT has to interpret their meaning and importance. Ultimately, the accuracy of that interpretation is left to the skill and experience of the admin. When systems are changing and growing so fast that IT simply can’t keep up with it all, the easiest course of action is to over-provision, wasting time and money in the process. Moreover, the actual root cause of the problem is often never fully addressed.
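The alert-storm problem can be shown with a small Python sketch. It is a deliberately simplified illustration under stated assumptions, not SIOS iQ's algorithm: the data, the function names, the 80 percent threshold, the 12-sample window and the multiplier k are all hypothetical. It contrasts a fixed threshold with a check against the metric's own recent behavior.

```python
from statistics import mean, stdev


def static_alerts(samples, threshold):
    """Indices where a sample exceeds a fixed, hand-set threshold."""
    return [i for i, value in enumerate(samples) if value > threshold]


def baseline_alerts(samples, window=12, k=3.0):
    """Indices where a sample deviates more than k standard deviations
    from the mean of the trailing window of samples."""
    alerts = []
    for i in range(window, len(samples)):
        history = samples[i - window:i]
        mu, sigma = mean(history), stdev(history)
        # Floor sigma at 1.0 so a nearly flat history does not flag tiny wobbles.
        if abs(samples[i] - mu) > k * max(sigma, 1.0):
            alerts.append(i)
    return alerts


# Hypothetical workload that normally runs at 84-86% CPU, with one real spike to 99%.
samples = [84.0, 86.0] * 12 + [99.0] + [84.0, 86.0] * 6

print(len(static_alerts(samples, 80.0)), "alerts from the fixed threshold")
print(baseline_alerts(samples), "flagged by the behavior-based check")
```

The fixed threshold fires on every sample because this workload is always busy, while the behavior-based check flags only the sample that departs from the learned pattern. A real analytics product would also have to correlate many metrics across related components, which this single-metric sketch does not attempt.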
IT teams need smart tools that leverage advanced machine learning analytics to provide an aggregated, analyzed view of their entire infrastructure. A solution such as SIOS iQ helps to optimize provisioning, characterize underlying issues and identify and prioritize problems in virtual environments. SIOS iQ doesn’t use thresholds. It automatically analyzes the dynamic patterns of behavior between the related components in your environment over time. It automatically identifies a wide variety of wasted resources (rogue VMDKs, snapshot waste, idle VMs). It also recommends changes to right-size all over- and under-provisioned VMs.
When it detects anomalous patterns of behavior, it provides a complete analysis of the root cause of the problem, the components affected by the problem, and recommended solutions to fix the problem. It not only recommends optimal provisioning of vCPU, vMem, and VMs, but also provides a detailed analysis of cost savings that its recommendations can deliver. Learn more about the SIOS iQ Savings and ROI calculator.
Here are four ways machine learning analytics can help avoid overprovisioning:
- Understand the causes of poor performance: By automatically and continuously observing resource utilization patterns in real time, machine learning analytics can identify over- and undersized VMs and recommend configuration settings to right-size each VM for performance. If there’s a change, machine learning can dynamically update the recommendations.
- Reduce dependency on IT teams for resource sizing: App owners often request as much storage capacity as possible, while VMware admins want to limit storage as much as possible. Machine learning analytics takes the guesswork out of resource sizing and eliminates the finger-pointing that often happens among enterprise IT teams when there’s a problem.
- Eliminate unused or wasted IT resources: SIOS iQ provides a savings and ROI analysis of wasted resources, including over-provisioned VMs, rogue VMDKs, unused VMs, and snapshot waste. It also provides recommendations for eliminating them and calculates the associated cost savings in both CapEx and OpEx.
- Determine whether a cluster can tolerate host failure: With machine learning analytics, IT pros can easily right-size CPU and storage without putting SQL Server or end-user productivity at risk. IT teams gain a deeper understanding of the capacity of the organization’s hosts and know whether a cluster can tolerate a host failure or other issues.
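As a rough illustration of the last point, the simplest form of a host-failure check asks whether current demand still fits on the cluster after its largest host is lost. The Python sketch below is an assumption-laden simplification: the function name and the capacity figures are hypothetical, and it considers CPU only, ignoring memory, reservations and workload peaks that a real analysis would need to include.

```python
def tolerates_host_failure(host_capacity_ghz, demand_ghz):
    """Return True if total demand still fits after losing the largest host."""
    if len(host_capacity_ghz) < 2:
        # A single-host cluster has nowhere to restart its VMs.
        return False
    surviving_capacity = sum(host_capacity_ghz) - max(host_capacity_ghz)
    return demand_ghz <= surviving_capacity


# Hypothetical three-host cluster with 40 GHz of usable CPU per host.
hosts = [40.0, 40.0, 40.0]
print(tolerates_host_failure(hosts, 75.0))  # demand fits on the two surviving hosts
print(tolerates_host_failure(hosts, 90.0))  # demand exceeds the surviving capacity
```

The usefulness of a check like this depends entirely on the demand figure fed into it, which is why accurate, pattern-based measurements of what workloads actually consume matter more than the arithmetic itself.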
To learn more about how right-sizing your VMware environment with machine learning can save time and resources, check out our webinar: “Save Big by Right Sizing Your SQL Server VMware Environment.”