SIOS Technologies and Storage Switzerland recently joined forces for a highly attended webinar. The attendees asked some outstanding questions about flexible HA and DR for virtual servers and cloud environments. Joining Storage Switzerland Founder and Chief Steward George Crump on the webinar was Tony Tomarchio, Director of Field Engineering at SIOS.
Question 1
“What are the connectivity requirements between two geographically dispersed data centers, such as bandwidth and latency, in order for your high availability solution to work?”
Tony: It all comes down to the workload that you need to protect. We don’t have a minimum requirement per se. It depends on the I/O activity on your system, specifically the data rate of change, which is how rapidly your disks are being written. Let’s say you have one server that on average is writing 3MB/sec to disk. SIOS software wants to replicate that data out as fast as you write it to disk locally. So you need to look at the servers you want to protect. In Windows this is very easy to do. You can pull up PerfMon, look at the disk stats and let that run for some representative time period. That will tell you exactly how much bandwidth you need to support real-time replication.
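The sizing exercise Tony describes can be sketched in a few lines of Python. This is only an illustration, not part of any SIOS tooling: the function name, the sample values and the 25% headroom factor are assumptions made for the example, and the arithmetic ignores protocol overhead and compression. It simply converts disk write rates in megabytes per second (as you might record them from PerfMon) into link bandwidth in megabits per second.

```python
def required_bandwidth_mbps(write_samples_mb_per_sec, headroom=1.25):
    """Estimate replication bandwidth from observed disk write rates.

    write_samples_mb_per_sec: write rates in megabytes/sec, sampled over a
        representative period (hypothetical input, e.g. exported from PerfMon).
    headroom: assumed safety multiplier applied to the peak rate.

    Returns (average_mbps, peak_with_headroom_mbps) in megabits/sec.
    """
    if not write_samples_mb_per_sec:
        raise ValueError("need at least one sample")
    average = sum(write_samples_mb_per_sec) / len(write_samples_mb_per_sec)
    peak = max(write_samples_mb_per_sec)
    # 1 byte = 8 bits, so MB/sec * 8 gives Mbit/sec.
    return average * 8, peak * 8 * headroom


# Illustrative samples averaging 3 MB/sec, as in Tony's example.
samples = [2.0, 3.0, 4.0, 3.0]
average_mbps, peak_mbps = required_bandwidth_mbps(samples)
print(f"average: {average_mbps} Mbit/s, peak with headroom: {peak_mbps} Mbit/s")
```

With these sample numbers, an average write rate of 3MB/sec works out to roughly 24 Mbit/sec of sustained bandwidth, and sizing for the peak gives a higher figure. Whether to size for the average or the peak depends on how much replication lag you can tolerate during bursts.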
As far as the latency aspect of this question goes, we support both synchronous and asynchronous replication. Generally, you would go with synchronous replication if you have a high-speed, low-latency network connection. Synchronous gives you maximum data protection and zero data loss because it’s a double commit. The write isn’t considered complete until it’s made it to both source and target. But you do have to factor in the round-trip latency between source and target, as that will have an effect on your write performance.
If you have higher latency and you’re more performance sensitive, you might go with asynchronous. But you have to understand that in the event of a failure there could be some in-flight data that might not make it from source to target. So there could be some data loss. That is the classic trade-off between synchronous and asynchronous.
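A rough back-of-the-envelope model makes the trade-off concrete. The sketch below is not how SIOS software is implemented; it assumes a deliberately simple, conservative model in which a synchronous write pays the local write time plus one full network round trip, while an asynchronous write pays only the local write time but leaves some amount of unreplicated data exposed. All function names and numbers are hypothetical.

```python
def write_latency_ms(local_write_ms, round_trip_ms, synchronous):
    """Approximate latency of one write as seen by the application.

    Assumed model: a synchronous write is acknowledged only after the
    target confirms it, so it pays the network round trip on top of the
    local write. An asynchronous write is acknowledged once it is local.
    """
    if synchronous:
        return local_write_ms + round_trip_ms
    return local_write_ms


def async_exposure_mb(write_rate_mb_per_sec, replication_lag_sec):
    """Approximate in-flight data (MB) at risk if the source fails.

    Assumed model: data written during the replication lag has not yet
    reached the target.
    """
    return write_rate_mb_per_sec * replication_lag_sec


# Hypothetical figures: 1 ms local write, 20 ms round trip between sites.
print(write_latency_ms(1.0, 20.0, synchronous=True))   # sync pays the round trip
print(write_latency_ms(1.0, 20.0, synchronous=False))  # async does not
# 3 MB/sec of writes with the target 2 seconds behind the source.
print(async_exposure_mb(3.0, 2.0))
```

Under these assumptions the synchronous write takes about 21 ms instead of 1 ms, while the asynchronous configuration keeps the 1 ms write time but could lose around 6MB of in-flight data in a failure. Real figures depend on the network, the storage and the replication software.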
To summarize, we don’t have a minimum requirement as far as bandwidth and latency. It really comes down to how busy your servers are, how rapidly they write to disk, what your performance tolerances are, and how much data loss, if any, you can withstand.
Question 2
“How are the SAN-less clusters beneficial in a VMware environment?”
Tony: If you deploy VMware, it’s got built-in features such as VMware HA. That is a partial solution from an HA perspective. If you look at what VMware HA does, it protects you against host failures. If a host fails, it reboots a virtual machine onto another physical host in the VMware cluster. But if you have an issue with the networking or the application inside the virtual machine, that type of issue won’t necessarily be protected, because the virtual machine is essentially a black box. Adding application-level availability and clustering at the guest level can provide you with a higher level of availability.
The other challenge that I mentioned before with doing guest-level clustering in that type of environment is that you have to pass the storage up to the virtual machine. Usually you’ll have to configure raw device mapping (RDM), and then you’ll lose things like vMotion. You’re giving up some virtualization features for HA, but with a SAN-less cluster solution from SIOS you can have both. We’re doing everything from inside the guest. There are no specific changes you need to make at the hypervisor level.
Question 3
“What storage configuration provides the best HA?”
George: I’m a SAN guy. But clearly, the ability to use external drives has to be appealing from both a cost and a familiarity perspective. What is your stance on that?
Tony: Certainly SANs are robust and have a lot of redundancy, such as redundant controllers, disks and so on. But at the end of the day shared storage in your cluster represents a single point of failure. Again, it may not be a hardware failure. Many times it’s a configuration or user error that causes connectivity loss to the SAN that can take down your entire cluster.
By going to a SAN-less configuration, you’re eliminating storage as a single point of failure and achieving a higher level of availability.
If you have already made the investment in a SAN, I’m not saying don’t use it. You can certainly use existing storage and server resources that you have in your infrastructure. But if that’s the way you were clustering in the past, let’s say two servers and a SAN, you might consider adding a third node with its own independent storage. It could be a different SAN or it could be local storage. That way you’ve now got another node in the cluster so you can withstand one more host failure. You’re also eliminating storage as a failure point, so technically you’re stepping up one notch in the availability chain.
George: I wrote a paper on this and there was a section called, “What can go wrong with your SAN Array.”
One of the interesting things that can go wrong is that most systems have RAID or something similar to protect them in case of a drive failure. But one of the things I find that catches people off guard is what the performance is like while the RAID rebuild is happening. It typically leaves you with two choices. You can turn down the speed at which the rebuild is happening, leaving you exposed for a longer period of time, or you can speed up the RAID rebuild, which typically hurts the disk I/O performance.
It sounds to me that in your environment, I can fail over to a separate standalone system and let the RAID rebuild happen all by itself on the original primary system. That would be able to work, wouldn’t it?
Tony: Yes, you could certainly do that if performance during that operation was a concern. Basically what happens at the physical level, like a RAID rebuild, is transparent to our software. That’s one of the reasons why you can mix and match servers.
One of the requirements with our Windows solution is that drive letters match and all of the volumes are the same size. Whether it’s a single disk or it’s a RAID 0, 1, 5 or 10, under the covers that is all transparent. So yes, if that was a concern, you could potentially fail over to another node in the cluster and let everything run off of that while your RAID rebuilds on the other side.
Question 4
“Can you use your software for anything other than SQL?”
Tony: Yes, you can use it with any cluster-able service or application. Most commonly we’re protecting SQL. We have solutions for Linux where we do a lot of SAP, Oracle and NFS type clusters. It’s really all over the map.
We can also protect custom applications. That is one of the benefits of having a block-level replication technology. It is server, storage and even application agnostic. You just tell us which partitions or volumes you want to replicate, and whatever data happens to live in there, we’re going to protect. In that regard, this can be used for much more than just SQL.
Question 5
“How is application performance impacted by running the SAN-less software?”
Tony: This comes back to the mode of replication. We support both synchronous and asynchronous replication. With asynchronous replication, you’re not going to see any kind of performance impact. With synchronous replication, you’re only going to see an impact on writes to disk because it’s a double commit. Reads are not impacted.
If you go with synchronous replication, you’ll want to have a low latency network connection to minimize the overhead that synchronous replication imposes on writes.