Every time a major cloud services provider has a problem, it creates a field day for the press and for doomsayers who want to bag Cloud at every opportunity. Why the preoccupation with Cloud one can only guess, but Cloud is now very much part of the telecoms and business world, and highlighting its shortcomings should be part of a constructive assessment aimed at preventing similar events from recurring.

Amazon’s Elastic Compute Cloud (EC2) hosts not only some very big names, including Netflix, Instagram and Pinterest, but also many other commercial and enterprise sites. When a massive electrical storm in the area last week caused widespread power outages, Amazon’s site simply went down. This event follows a six-and-a-half-hour outage on EC2 a few weeks earlier.

The key thing to note here is that in the latest event, the backup generator also failed. Isn’t one of the selling points of the Cloud its built-in redundancy, which is meant to prevent just such occurrences? It may be that sites like Netflix, Instagram and Pinterest did not opt for a fully redundant service, as banks would, and that was a risk they were willing to accept.

Is this really “a small step backwards, perhaps, for cloud computing,” as Forbes magazine asked? With downtimes of up to 24 hours, some of Amazon’s customers are themselves exposed not only to loss of business but also to loss of credibility with their own customers. Blaming Amazon for the failure does not exactly instil confidence in customers who have every right to believe that the services they pay for should be uninterrupted, no matter where they are hosted.

So, what’s different about the Cloud when we talk about risk? Surely there was always an inherent risk of failure in installing your own servers in remote data centers and arranging communications with two suppliers to guarantee constant accessibility. In fact, everything was duplicated in order to mitigate that risk. Wasn’t one of the biggest selling points of Cloud the reduction in duplicate or redundant services? How come Amazon isn’t able to manage risk, and is it a sign that other Cloud providers may also be cutting corners at the expense of their customers and their customers’ customers? Surely not?

Maybe it’s just a mentality thing. Years ago, when discussing revenue assurance with a mixed audience of CSPs and ISPs at a conference, The Insider raised the importance of tracking, measuring and rating each and every voice call or data packet on the network in order to maximize revenue, and noted that CSPs strive for 99.999% uptime for their network and IT operations. The ISPs in the room couldn’t understand the fuss, claiming that lost packets would be resent and that losing some was no issue because “there were plenty more coming along!”

In fact, it seems a long time since anyone has used the old five nines (99.999%) term in public. What was ‘de rigueur’ in the fixed line world seems to have been discounted in mobile and almost ignored in the IP world. This may come back to haunt us all. One distinct advantage that CSPs may have in selling Cloud services is that almost anal obsession with five nines. Now is not the time to stop striving to achieve it. Where Amazon has failed, CSPs may be able to capitalize.
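To put the five nines figure in perspective, the arithmetic is simple enough to sketch. The short Python example below is purely illustrative: the function names are made up for this purpose, it assumes a 365-day year, and it treats a single 24-hour outage as the only downtime in that year. Under those assumptions, 99.999% uptime leaves room for a little over five minutes of downtime per year, while one 24-hour outage on its own pulls a service down to roughly 99.7%.

```python
# Illustrative arithmetic only: a 365-day year is assumed, and the
# function names are hypothetical, not taken from any provider's SLA.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes


def allowed_downtime_minutes(availability_percent: float) -> float:
    """Minutes of downtime per year permitted by an availability target."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


def achieved_availability_percent(outage_hours: float) -> float:
    """Yearly availability if the given outage is the only downtime."""
    return 100 * (1 - (outage_hours * 60) / MINUTES_PER_YEAR)


if __name__ == "__main__":
    for target in (99.9, 99.99, 99.999):
        minutes = allowed_downtime_minutes(target)
        print(f"{target}% uptime allows about {minutes:.1f} minutes of downtime per year")

    # The 24-hour figure is the worst-case downtime mentioned in the article.
    print(f"A single 24-hour outage leaves about {achieved_availability_percent(24):.3f}% availability")
```

Seen this way, a day-long outage uses up more than two centuries' worth of a five nines downtime allowance, which is the gap a risk-obsessed CSP could point to.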

The telco industry’s obsession with risk can now become one of its biggest weapons in fighting for the trust of Cloud customers. Cloud is here to stay, despite the glitches, but those providers that have a reputation for risk mitigation, or can prove they understand risk, will surely have a big advantage.

First published at TM Forum as The Insider, 9 July 2012