Clouds rain on Amazon

If you were hoping to access Reddit, Quora, FourSquare, Hootsuite, SCVNGR, Heroku, Wildfire, parts of the New York Times, ProPublica and about 70 other sites last Thursday, you may have been out of luck. It was no fault of their own that they went offline, some for up to 36 hours; it was a cloud problem. More specifically, it was a problem with Amazon’s cloud services, otherwise known as Amazon Elastic Compute Cloud or EC2. Even so, Amazon’s damage control has kept the news pretty quiet, considering the number of sites and people affected. Most people would not even know that Amazon was hosting their favorite site.

In what news services are describing as a killer blow to the blossoming cloud services industry, the unquestioned leader of the pack failed, and with it the promise of scalable, flexible, cost-effective and particularly efficient solutions for enterprises lost considerable credibility.

CNN likened it to the Titanic of online services sinking. Mashable called it a Cloudgate or Cloudpocalypse. Not wanting to be as over-dramatic as these reputable news services, The Insider is more concerned with the repercussions it will have on a burgeoning sector of our industry. Most concerns to date have been over data security, with less attention paid to the reliability of the systems themselves. It seems quite incredible that Amazon, with all its brilliant, tried and tested technology, could have suffered such a high-profile glitch.

The trouble was apparently due to “excessive re-mirroring of its Elastic Block Storage (EBS) volumes.” The crash started at Amazon’s northern Virginia data center, located in one of its East Coast availability zones. In its status log, Amazon said that a “networking event” caused a domino effect across other availability zones in that region, in which many of its storage volumes created new backups of themselves. That filled up Amazon’s available storage capacity and prevented some sites from accessing their data.

These Availability Zones are supposed to be able to fail independently without bringing the whole system down. Instead, there was a single point of failure that shouldn’t have been there. Amazon has been tight-lipped about the incident, and the company said it won’t be able to fully comment on the situation until it does a “post-mortem.” Amazingly, the theories expressed above have come from external cloud experts who have already managed to work it out from the available evidence.

The fact that it has taken Amazon so long to explain itself is cause for greater concern. It’s one thing to keep customers from running their business for up to 36 hours; it’s another to keep the reasons from them for days more. No doubt many lawyers spent their Easter break going over SLAs with “a fine-toothed comb.” The outage also highlights the need for customers to do their own redundancy planning rather than leaving everything to the cloud provider. Such planning could also alleviate concerns about migrating data should the need arise to change service providers.

Nevertheless, those reporting the failures are also emphasizing that this industry sector is still in its infancy and events like this should not damn cloud services completely, although they do highlight the need to expect the unexpected, even with such a reputable partner as Amazon.

For CSPs rushing to get into the cloud space, it highlights the ever-present concern that they don’t know what they don’t know. If Amazon, with all its experience, can have a catastrophic boo-boo, then CSPs should be doubly prepared for any event.