High Availability and Load Balancing in the Cloud
Doesn’t it feel like there is little information out there on how to set up your built-on-the-cloud services to ensure high uptime and performance? The cloud certainly makes things easier, but it is not a magic bullet that gives you uptime for free without architecting for it. The old rules still apply: avoid single points of failure and scale horizontally as much as possible.
I co-founded a VPN service (shameless plug: https://lokun.is) on top of the GreenQloud cloud, where uptime and performance matter a lot. Our customers are constantly connected, and even 30 seconds of downtime will affect them. They also have strict expectations of steady performance. To live up to those requirements, we built a system using a large number of small instances with as few single points of failure as possible. I’m in somewhat of a unique situation in that, after deploying some of this, I joined team GreenQloud (Iceland is a small place) as a network and infrastructure engineer, so I got to see our setup from the other side too and learned a lot in the process. In this blog post I have documented some of the things I found out and the lessons I learned. While it is written with GreenQloud as the example, it should be general enough to apply to most cloud/VPS providers.
Individual instances can and will go down
I once read a survey claiming that the average instance in a public compute cloud has less than 6 months of uptime before an unscheduled service disruption or reboot. That sounds about right to me, and with some cheaper “VPS” providers it feels closer to 2 months or less. This needs to be planned for: if an instance goes down, it should be automatically removed from the load balancing pool. Your customers never even have to notice.
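As a minimal sketch of what “automatically removed from the pool” can mean, here is a small Python health-check loop. The hostnames, the /health endpoint and the idea of rebuilding the pool from whichever instances answer are all illustrative assumptions, not lokun’s actual code:

    # Health-check sketch: probe each instance and keep only the responsive
    # ones in the load balancing pool. Hostnames and the endpoint are made up.
    import urllib.request

    INSTANCES = ["vpn01.example.is", "vpn02.example.is", "vpn03.example.is"]

    def is_healthy(host, timeout=3):
        """Return True if the instance answers its health endpoint in time."""
        try:
            with urllib.request.urlopen("http://%s/health" % host, timeout=timeout) as r:
                return r.status == 200
        except OSError:
            return False

    def current_pool():
        """Rebuild the pool from the instances that are actually up."""
        return [h for h in INSTANCES if is_healthy(h)]

    if __name__ == "__main__":
        print("Serving traffic from:", current_pool())

Run something like this every few seconds (or let your load balancer’s built-in health checks do it) and a dead instance simply stops receiving new connections.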
Individual instances can vary in performance, especially when cloud providers oversell badly
Load balancing methods that spread load randomly or round-robin are not optimal when you can’t guarantee that all your servers are equally powerful. While GreenQloud makes this a bit more comfortable in that we don’t oversell and run identical hardware, I have still measured up to a 25% performance difference between identical instances. At another provider that does oversell, I have measured up to a 400% difference! The effect can become even more dramatic with I/O. The best solution is to load balance based on the measured load of the instances in the pool: have instances report their load frequently to whatever load balancing scheme you use and send new connections to those with sufficient capacity left, as in the sketch below.
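For illustration, a dispatcher that keeps the most recent load report from each instance and sends new connections to the least loaded one could look roughly like this. The load metric, the reporting interval and the freshness window are assumptions; the real scheme can be whatever fits your application:

    # Least-loaded dispatch sketch. Each instance is assumed to push a load
    # figure (e.g. 1-minute load average or connection count) every few seconds.
    import time

    class LoadTracker:
        def __init__(self, max_age=30):
            self.max_age = max_age      # ignore reports older than this (seconds)
            self.reports = {}           # host -> (load, timestamp)

        def report(self, host, load):
            """Called whenever an instance phones in its current load."""
            self.reports[host] = (load, time.time())

        def pick(self):
            """Return the least loaded instance with a fresh report, or None."""
            now = time.time()
            fresh = [(load, host) for host, (load, ts) in self.reports.items()
                     if now - ts <= self.max_age]
            return min(fresh)[1] if fresh else None

    tracker = LoadTracker()
    tracker.report("vpn01", 0.42)
    tracker.report("vpn02", 0.17)
    print(tracker.pick())   # -> "vpn02", the instance with the most headroom

The freshness check doubles as a crude health check: an instance that stops reporting stops receiving new connections.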
Make sure your instances are spread across different physical hosts
While individual instances and even physical hosts can and will go down, multiple physical hosts failing at once is an extremely unlikely, disaster-type scenario. By spreading your instances across different physical hosts you make sure a single failed server isn’t going to take down your service. Cloud providers do not normally expose which physical host an instance ended up running on, but we, and most others who care about their customers, will answer if you ask and move instances around as you wish. Where a new instance ends up running is largely up to the allocation algorithm the cloud provider uses. At GreenQloud, we pool together the hosts with the most free resources and select a new host randomly within that pool, making it possible but not likely that a new instance lands on the same host as the last one. The other popular scheme, which many providers use and we even used at one point in the past, is trying to “fill up” one host before moving on to the next one. This makes it very likely that two or more instances created around the same time will end up on the same host; the toy model below illustrates the difference. Better safe than sorry, especially if you don’t know which scheme your provider uses: send in a support ticket and ask for your instances to be distributed as evenly as possible across the physical infrastructure. It is also a good test of what your provider’s customer service is ready to do for you.
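Here is a toy model of the two placement schemes. The host names and capacities are made up, and real allocators are obviously more involved, but it shows why “fill up first” tends to put instances created back-to-back on the same host:

    # Toy comparison of two placement schemes. Hosts and capacities are made up.
    import random

    HOSTS = {"host-a": 10, "host-b": 10, "host-c": 10}   # free capacity per host

    def place_spread(free):
        """Spread-style: choose randomly among the hosts with the most room."""
        most = max(free.values())
        return random.choice([h for h, slots in free.items() if slots == most])

    def place_fill_up(free):
        """Fill-up style: keep packing the same host until it is full."""
        for h in sorted(free):
            if free[h] > 0:
                return h

    for scheme in (place_spread, place_fill_up):
        free = dict(HOSTS)
        placements = []
        for _ in range(2):              # two instances created back to back
            host = scheme(free)
            free[host] -= 1
            placements.append(host)
        print(scheme.__name__, placements)

With the spread scheme the second instance almost always lands on a different host; with fill-up, both land on host-a every time.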
No matter how redundant and reliable the physical network core is, there probably is a single point of failure for some or all of your instances
Here is a well-known secret in the cloud industry: virtual network appliances are almost never deployed in a redundant configuration. Virtual switches are usually one per account per physical host, and virtual routers are usually one per account. The same goes for firewalls and load balancers. At GreenQloud, every account gets its own private 802.1Q VLAN that is connected to the network core through that account’s virtual router, which is hosted on a compute host just like a normal instance. “VPS” providers that don’t include a private network for each account usually do something similar, except one virtual router may be shared by a large number of customers. The way around this is simple: use more than one account! Accounts should be redundant just like everything else. Which brings me to the last point.
There can be more (and surprising) single points of failure for each account
Cloud providers use software like OpenStack, CloudStack, Eucalyptus or other options to manage what happens in the cloud. There are many things that can go wrong, but any problems are usually limited to the account they happen in. Having multiple accounts can even protect against mistakes by the so-called humans in the billing department. (Just kidding, the guys at billing are great!)
The setup I ended up being happy with
We at lokun opt to use DNS load balancing because it gives us lots of control and happens to work well with our application. It also allows the “load balancer” to be hosted far away from our actual instances, since it doesn’t need to sit between them and the users. Another option is to use two traditional load balancers, which would sit near the virtual routers (GreenQloud’s included load balancers actually run as a part of the account’s virtual router). Below is a diagram of lokun’s basic structure. Assuming the network core is redundant and reliable, you can see there is no single point of failure.
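Finally, to make the DNS load balancing idea a bit more concrete, here is a rough sketch assuming you run your own authoritative nameserver: regenerate the pool’s A records from the healthy-instance list with a short TTL, so clients that ask again after a failure only get the surviving servers. The zone name, file path and the NSD reload command are illustrative assumptions, not a description of lokun’s actual setup:

    # DNS load balancing sketch: rewrite the pool records from the healthy list.
    # Zone name, output path and nameserver reload command are illustrative only.
    import subprocess

    TTL = 60  # short TTL so failed instances drop out of answers quickly

    def write_zone_fragment(healthy_ips, path="vpn.example.is.pool"):
        """Emit one low-TTL A record per healthy instance."""
        records = ["vpn.example.is. %d IN A %s" % (TTL, ip) for ip in healthy_ips]
        with open(path, "w") as f:
            f.write("\n".join(records) + "\n")

    def reload_nameserver():
        """Tell the authoritative nameserver to load the new records."""
        subprocess.call(["nsd-control", "reload"])

    if __name__ == "__main__":
        write_zone_fragment(["192.0.2.10", "192.0.2.11"])
        reload_nameserver()

Combine this with the health checks and load reports from earlier and the “load balancer” is just a script plus a nameserver, which can live anywhere, far away from the instances it steers traffic to.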