Tuesday, September 16, 2008

Is 99.999% reliability good enough?

According to Reuven Cohen in his recent post, Cloud Failure: The Myth of Nines, the whole concept of reliability may be meaningless.

"In the case of a physical failure such as Flexiscale's recent one, the hardware downtime might be small, but the time to restore from a backup might be considerably longer. A minor cloud failure could cause a cascading series of software failures causing further application outage of hours or even days for those who depended on the availability of the given cloud. Meaning your cloud may achieve five nines, but your application hosted on it doesn't."

I agree. When dealing with a system of systems, like the cloud, individual component and function SLAs are meaningless. The cloud architect must brush up on Bayesian probability theory, plan for failure, and ensure that no matter what happens, users can complete whatever workflow they request.
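To put rough numbers on why component SLAs don't add up to an application SLA, here is a minimal sketch. It assumes the application needs every component to be up at once and that the components fail independently. Both are my simplifications, not anything from Cohen's post, and the count of ten components is made up for illustration. Cascading failures of the kind he describes would make the real figure worse.

```python
import math

MINUTES_PER_YEAR = 365 * 24 * 60  # ignoring leap years


def serial_availability(availabilities):
    """Availability of a chain in which every component must be up.

    Assumes independent failures, so the probabilities simply multiply.
    """
    return math.prod(availabilities)


def downtime_minutes_per_year(availability):
    """Expected minutes of downtime per year at a given availability."""
    return (1.0 - availability) * MINUTES_PER_YEAR


# Hypothetical application that depends on ten components,
# each one advertised at five nines.
components = [0.99999] * 10
composite = serial_availability(components)

print(f"single component: {downtime_minutes_per_year(0.99999):.2f} min/year")
print(f"whole application: {composite:.7f} availability, "
      f"{downtime_minutes_per_year(composite):.2f} min/year")
```

Under these assumptions, ten five-nines components give the application roughly four nines, or about ten times the yearly downtime of any single component, and that is before counting the time to restore from backup.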

"One of the major benefits to using cloud computing is that you can make these types of failover assumptions well before they happen using an emerging global toolset of cloud components. It's not a matter of if, but a matter of when, when you take into consideration that application components will fail then you can build an application that features 'failure as service'. One that is always available, one with Zero Nines."
