


Opinion, arguments & analyses from the editors of Scientific American
Observations

Amazon explains recent outage that took down Foursquare and Reddit

The views expressed are those of the author and are not necessarily those of Scientific American.


Amazon Web Services LLC (AWS), the cloud computing arm of online marketplace Amazon, on Friday explained what happened during last week’s service outage, which disrupted many of its customers’ Web sites. AWS, formed by Amazon in 2006 to capitalize on the cloud computing hype, ran into problems on April 21 with a network configuration change that took several days to fix, slowing or disabling access to sites run by location-based social network Foursquare, fellow cloud service provider Engine Yard, social news outlet Reddit and several others.

"The trigger for this event was a network configuration change," the company confirmed in a message on its Web site. "We will audit our change process and increase the automation to prevent this mistake from happening in the future."

During AWS’s disruption the company’s Elastic Block Store (EBS) data storage service became unable to perform certain functions. This storage consists of computer clusters that store, manage and back up customer data. The clusters themselves are made up of individual node computers, and these nodes are connected via two networks: a primary high-bandwidth network that manages normal traffic and a lower-capacity backup network. The problem began on April 21 while Amazon was attempting to upgrade capacity in the network serving the eastern U.S. The company incorrectly shifted network traffic from the primary network to the backup network, which could not adequately handle the volume of activity.
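The capacity mismatch described above can be illustrated with a rough sketch. This is a toy model, not AWS's network: the `Network` class, the `route` function and all capacity and load figures are hypothetical, chosen only to show what happens when normal traffic lands on a much smaller link.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class Network:
    name: str
    capacity: int  # traffic units carried per tick (illustrative number, not an AWS figure)


def route(load: int, network: Network) -> Tuple[int, int]:
    """Return (delivered, dropped) for one tick of traffic sent over `network`."""
    delivered = min(load, network.capacity)
    return delivered, load - delivered


# Hypothetical capacities: the backup link is an order of magnitude smaller.
primary = Network("primary", capacity=1000)
backup = Network("backup", capacity=100)

normal_load = 800
print(route(normal_load, primary))  # (800, 0): primary carries everything
print(route(normal_load, backup))   # (100, 700): most traffic cannot get through
```

In this sketch the same load that the primary network absorbs easily leaves most traffic undelivered on the backup network, which is the situation the article describes.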

Once the error was realized and traffic was shifted back to the primary network, the storage nodes on the primary were overwhelmed by the barrage of data and could not find enough space to hold it all. Like a game of musical chairs, some data was left in limbo, continuously searching for free storage space. This backed up new requests for storage space coming into the system, slowing or shutting down parts of Web sites using Amazon’s services.
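The "musical chairs" effect can be sketched with a small simulation. It assumes, as a simplification of the description above, that displaced data retries for free space on every tick and that new storage requests wait behind it. The function name, parameters and numbers are hypothetical and do not reflect how EBS is actually implemented.

```python
from typing import List, Tuple


def simulate(free_slots: int, displaced: int,
             new_requests_per_tick: int, ticks: int) -> List[Tuple[int, int]]:
    """Toy model: return (still_searching, backlog) after each tick."""
    searching = displaced  # chunks of data looking for free space
    backlog = 0            # new storage requests waiting to be served
    history = []
    for _ in range(ticks):
        # Displaced data grabs whatever free space exists.
        placed = min(searching, free_slots)
        free_slots -= placed
        searching -= placed
        # New requests arrive, but are served only once nothing is still searching.
        backlog += new_requests_per_tick
        if searching == 0:
            served = min(backlog, free_slots)
            free_slots -= served
            backlog -= served
        history.append((searching, backlog))
    return history


# Too little free space: 3 chunks stay in limbo and the backlog keeps growing.
print(simulate(free_slots=5, displaced=8, new_requests_per_tick=2, ticks=3))
# [(3, 2), (3, 4), (3, 6)]

# Enough free space: everything is placed and new requests are served each tick.
print(simulate(free_slots=20, displaced=8, new_requests_per_tick=2, ticks=3))
# [(0, 0), (0, 0), (0, 0)]
```

Under these assumptions, the only way out of the first scenario is to stop the inflow of new requests or to add free space.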

The company corrected this by disabling new storage requests, but the damage was already done. Overwhelmed nodes began to fail, exacerbating the problem of having too much data and not enough available storage space. AWS was able to address this over the next few days by adding storage capacity to the network and tweaking its storage management software.

Image of storm clouds over Denmark courtesy of Malene Thyssen, via Wikimedia Commons


2 Comments

  1. byronraum 6:22 pm 04/29/2011

    I am not really sure what the problem is. AWS already demonstrated that they are not a serious contender in the cloud computing space when they threw out Wikileaks at the complaint by Senator Lieberman. If services can be arbitrarily cut without an appropriate court order, then it is obvious that AWS cannot be considered reliable. Anyone who tries to use it for serious purposes is asking for trouble.

  2. cloudprovider 7:23 am 05/3/2011

    Cloud computing is considered a complex and dynamic IT service production and distribution system. AWS must have learned from these errors.
    Cloud Computing Experts,
