Why there is no High Availability Shared Hosting

andreas July 25, 2012 Hosting

Over the last few months, we've been working on putting together a high-availability shared hosting platform. It was going to be a ground-breaking first of its kind.

The way high availability (HA) works is that each service is performed by a cluster of servers. In its simplest form, you would have two identical web servers with the same content and a load balancer up front, splitting incoming requests evenly across both servers.

If one of the servers were to fail, the other would be able to continue serving the same content on its own until the failed server came back online.
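The two-server arrangement described above can be sketched in a few lines of Python. This is only a toy model, not our actual setup: the class name `RoundRobinBalancer`, its methods, and the backend names `web1` and `web2` are all made up for illustration, and a real load balancer would detect failures through health checks rather than a manual call.

```python
from itertools import cycle


class RoundRobinBalancer:
    """Toy load balancer: rotates over backends, skipping any marked down.

    All names here are hypothetical and for illustration only.
    """

    def __init__(self, backends):
        self.backends = list(backends)
        self.healthy = set(self.backends)
        self._ring = cycle(self.backends)

    def mark_down(self, backend):
        # In a real balancer a failed health check would trigger this.
        self.healthy.discard(backend)

    def mark_up(self, backend):
        if backend in self.backends:
            self.healthy.add(backend)

    def pick(self):
        # Look at each backend at most once per request.
        for _ in range(len(self.backends)):
            candidate = next(self._ring)
            if candidate in self.healthy:
                return candidate
        raise RuntimeError("no healthy backends")


lb = RoundRobinBalancer(["web1", "web2"])
both_up = [lb.pick() for _ in range(4)]   # requests alternate between servers
lb.mark_down("web1")                      # simulate web1 failing
one_down = [lb.pick() for _ in range(4)]  # web2 now takes every request
```

While both servers are up, requests alternate between them. Once `web1` is marked down, every request lands on `web2`, which is exactly the situation that matters later in this post.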

The setup to do this is extremely complicated. To eliminate any single point of failure, all data has to be replicated in real time across multiple servers. Databases are even more complicated, because MySQL's replication system is far from perfect and is not guaranteed to recover from a failure.

Clustering is an excellent way to deal with hardware failures. In our setup, though, we were looking at cloud servers, which pretty much eliminate any hardware failure concerns.

So all that was left was to handle service failures. These happen fairly often, unfortunately. Most often it's Apache (the web server), MySQL (the database), or one of the many mail services.

Either way, the idea was that if a service failed on one server, it would continue to run on the other servers in the cluster. We were well on the way to achieving true 100% uptime for shared hosting.

After spending a month building a proof of concept, we had a preliminary setup and were able to do some testing. Initial testing was awesome. Switch off any server, and everything would continue to work.

Stress testing was another matter altogether, though. What we had failed to do was really investigate WHY services fail in a shared hosting environment in the first place. It's almost always a combination of bad code and sudden load.

And by "bad code", I don't necessarily mean BAD code - just code that wasn't written to handle sudden load, or that doesn't protect its services under high load. These are things we have no control over. WordPress, with its myriad plugins and updates, is a prime example.

What happens if you have two identical servers, and one of them stops due to bad code and sudden load? The remaining server has to pick up ALL of the load and runs the exact same code, so this server's conditions are now bad code and DOUBLE the sudden load. It would only take a second or two for this server to fail as well. It doesn't even matter how many servers are in the cluster: every failure pushes more load onto the servers that are left. It's a domino effect.
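To make the domino effect concrete, here is a rough Python simulation. It rests on simplifying assumptions that are mine, not measurements from our tests: load is split evenly across the servers still running, and a server fails the moment its share exceeds a fixed capacity. The function name `simulate_cascade` and the numbers are purely illustrative.

```python
def simulate_cascade(capacities, total_load):
    """Toy model of a cascading failure.

    Assumptions (illustrative only): load is split evenly over the servers
    still running, and a server fails as soon as its share exceeds its
    capacity. Returns (failure_order, surviving_server_indexes).
    """
    alive = dict(enumerate(capacities))
    failed = []
    while alive:
        share = total_load / len(alive)
        overloaded = [i for i, cap in alive.items() if share > cap]
        if not overloaded:
            break  # the remaining servers can carry the load
        # One server drops out; its share is redistributed on the next pass.
        victim = min(overloaded, key=lambda i: alive[i])
        failed.append(victim)
        del alive[victim]
    return failed, sorted(alive)


# Two identical servers, each good for 100 units, hit by 220 units of load.
print(simulate_cascade([100, 100], 220))
# Four identical servers under a spike they can absorb, then one they cannot.
print(simulate_cascade([100] * 4, 380))
print(simulate_cascade([100] * 4, 420))
```

In this model a four-server cluster survives 380 units of load untouched, but at 420 units the first failure raises each survivor's share from 105 to 140, then 210, then 420, and the whole cluster goes down one server after another.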

It's a pity we didn't figure this out earlier, because it would have saved us over a month of fruitless work. On the other hand, it was a great learning opportunity, and it got us to design a new and better hosting environment for our clients.

We are no longer concerned with duplicating servers. The added complexity of the setup alone creates so much management overhead that the solution would not be affordable for anyone. Not to mention that it wouldn't work anyway. LOL.

Our new environment (currently in beta) consists of multiple individual-service cloud servers, with fail-over load balancers up-front.

Improvements from our old Plesk setup:

  • each service is on its own server, reducing the risk of an outage
  • in the event of an outage, fail-over to the static cache is now instant, as it happens in the load balancer instead of at the DNS level (although we're planning on keeping the DNS fail-over as well)
  • the environment scales both horizontally and vertically, as server sizes can be adjusted easily and new servers can be introduced quickly
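The fail-over to the static cache mentioned in the list above boils down to a simple rule: try the live service, and if it is unreachable, hand out the last cached copy instead. The Python sketch below only illustrates that rule. The function `serve`, the two stand-in backends, and the dictionary used as a cache are hypothetical, and in practice this logic lives in the load balancer's configuration rather than in application code.

```python
def serve(path, backend, static_cache):
    """Try the live backend first; fall back to a cached copy on failure.

    Hypothetical helper for illustration. Returns (body, source).
    """
    try:
        return backend(path), "live"
    except (ConnectionError, TimeoutError):
        if path in static_cache:
            return static_cache[path], "cache"
        raise  # nothing cached for this path, so the error is passed on


def healthy_backend(path):
    # Stand-in for a working web server.
    return f"<html>fresh {path}</html>"


def failing_backend(path):
    # Stand-in for a web server that has gone down.
    raise ConnectionError("backend down")


static_cache = {"/": "<html>cached home page</html>"}

live_result = serve("/", healthy_backend, static_cache)
cached_result = serve("/", failing_backend, static_cache)
```

Because the decision is made per request at the balancer, visitors get the cached page right away instead of waiting for a DNS change to propagate.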

For applications that require true 100% uptime, where a static cache is not good enough, the environment does allow for a fully redundant system. Please bear in mind, however, that such applications have to be coded for a multi-server setup.

We're in beta now, and should be in full production by late August / early September. Stay tuned.


