Failure is NOT an Option – the Importance of Uptime

This phrase, associated with the US Apollo space program and made famous by Ed Harris’s character in the 1995 movie ‘Apollo 13’, is still relevant when you are building an enterprise-level service offering. I strongly feel that a service really needs to aim for 100% uptime. Anything short of that is simply not going to cut it. If cloud computing is going to become a utility, then we need to provide a utility level of availability. When was the last time there was no water when you turned on the faucet, or no gas when you lit the stove?

I have a poster in our office that reads “Failure is not an option (It comes bundled with the software)”. Leaving aside the dark humor for a moment, I think the more important underlying question is how to provide a service that is 100% reliable when none of the individual components is 100% reliable. That is where redundancy comes in. You need to design the overall system to handle individual failures and keep working. Network links will fail, hard disks will crash, and software will behave unpredictably. How do you keep things working in the face of these kinds of failures? At Aryaka, we decided early on to design the overall system to be resilient to them. The service simply has to keep working no matter what.
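To make the redundancy argument concrete, here is a minimal Python sketch of the standard availability arithmetic for redundant components. It rests on assumptions that are not from this post: failures are independent, any single healthy replica can carry the full load, and the 99% per-component figure is purely illustrative. The function names are hypothetical as well.

```python
def parallel_availability(component_availability: float, replicas: int) -> float:
    """Availability of a group of redundant replicas.

    Assumes failures are independent and that one healthy replica is
    enough to keep the service up (both are simplifying assumptions).
    """
    if not 0.0 <= component_availability <= 1.0:
        raise ValueError("component_availability must be between 0 and 1")
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    # The group is down only when every replica is down at the same time.
    return 1.0 - (1.0 - component_availability) ** replicas


def downtime_hours_per_year(availability: float) -> float:
    """Expected downtime in hours over a 365-day year."""
    return (1.0 - availability) * 365 * 24


if __name__ == "__main__":
    # Illustrative figure only: each component is up 99% of the time.
    for replicas in (1, 2, 3):
        availability = parallel_availability(0.99, replicas)
        hours = downtime_hours_per_year(availability)
        print(f"{replicas} replica(s): {availability:.6f} available, "
              f"about {hours:.3f} hours of downtime per year")
```

Under these assumptions each added replica shrinks expected downtime by a factor of 100, yet the result only approaches 100% and never reaches it. Correlated failures, such as a shared power feed or a common software bug, make real systems worse than this model suggests, which is one reason redundancy alone is not the whole answer.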

It is not just a matter of providing redundancy. How do you effectively monitor the system? How do you plan for the unknown? Failures will happen; the important thing is how you react to them. Does a software crash cause a blip in your graphs or does it leave behind a crater? We need to learn from the utilities how to build and operate fail-safe systems. We need some cross-pollination of systems and operations expertise to make cloud computing truly utility-like in terms of reliability.

There are, of course, many issues that we need to address before computing/networking can become a true utility. I believe we will truly have arrived when, as with other utilities, I don’t need to ask for (or provide!) a support escalation matrix because it is simply not needed anymore.

-Vikas

What have been your experiences with uptime and redundancy in enterprise architectures? Would you like us to share more thoughts on building a reliable cloud service for the enterprise?
Let us know in the comments below.