April 29, 2011
Amazon Web Services LLC (AWS), the cloud computing arm of online marketplace Amazon.com, on Friday explained what happened during last week’s service outage, which disrupted many of its customers’ Web sites. AWS, formed by Amazon in 2006 to capitalize on the cloud computing hype, ran into problems on April 21 with a network configuration change that took several days to fix, slowing or disabling access to sites run by location-based social network Foursquare, fellow cloud service provider Engine Yard, social news outlet Reddit and several others.
"The trigger for this event was a network configuration change," the company confirmed in a message on its Web site. "We will audit our change process and increase the automation to prevent this mistake from happening in the future."
During the disruption, AWS's Elastic Block Store (EBS) data service became unable to read and write customer data reliably. This storage consists of computer clusters that store, manage and back up customer data. The clusters themselves are made up of individual node computers, and these nodes are connected via two networks—a primary high-bandwidth network that carries normal traffic and a lower-capacity network held in reserve as a backup. The problem began on April 21 while Amazon was attempting to upgrade capacity in the network serving the eastern U.S. The company incorrectly shifted network traffic from the primary network to the backup network, which could not adequately handle the volume of activity.
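The core of the mistake is a capacity mismatch: the backup network was never sized to carry the primary's load. A minimal sketch makes the arithmetic concrete—all capacity and traffic figures here are illustrative assumptions, not AWS's real numbers.

```python
# Toy model of the failover mistake: EBS traffic meant for a
# high-bandwidth primary network is routed onto the low-capacity
# backup network. All numbers are illustrative assumptions.

PRIMARY_CAPACITY = 100.0  # arbitrary bandwidth units
BACKUP_CAPACITY = 10.0    # backup network is far smaller by design
NORMAL_TRAFFIC = 80.0     # steady-state load the primary handles easily

def utilization(traffic, capacity):
    """Fraction of a network's capacity consumed by the given traffic."""
    return traffic / capacity

# Under normal operation the primary network has headroom.
print(f"primary: {utilization(NORMAL_TRAFFIC, PRIMARY_CAPACITY):.0%}")
# → primary: 80%

# The configuration change routes ALL traffic onto the backup network,
# which is then oversubscribed eight times over.
print(f"backup: {utilization(NORMAL_TRAFFIC, BACKUP_CAPACITY):.0%}")
# → backup: 800%
```

Any network pushed far past 100 percent utilization drops or delays traffic, which is why the storage nodes behind it effectively went dark.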
Once the error was realized and traffic was shifted back to the primary network, the storage nodes on the primary were overwhelmed by the barrage of data and could not find enough space to hold it all. Like a game of musical chairs, some data was left in limbo, continuously searching for free storage space. This backed up new requests for storage space coming into the system, slowing or shutting down parts of Web sites using Amazon’s services.
The company contained the problem by disabling new storage requests, but the damage was already done: overwhelmed nodes began to fail, removing capacity from a system that already had more data than places to put it. AWS worked through the backlog over the next few days by adding storage capacity to the network and tweaking its storage-management software.
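The "musical chairs" dynamic above, and why disabling new requests helped, can be sketched as a toy simulation. The node counts, free-slot figures and request volumes are invented for illustration; they are not AWS's actual cluster parameters.

```python
# Toy simulation of the "musical chairs" effect and the mitigation
# of disabling new storage requests. All numbers are illustrative
# assumptions, not AWS's actual cluster parameters.
import random

def simulate(nodes, free_per_node, stuck_volumes, new_requests,
             accept_new=True):
    """Return how many volumes end up with nowhere to go.

    nodes: number of storage nodes in the cluster
    free_per_node: free replica slots on each node
    stuck_volumes: volumes re-copying data after the network shift
    new_requests: fresh storage requests arriving during the event
    accept_new: False models AWS disabling new storage requests
    """
    free = [free_per_node] * nodes
    demand = stuck_volumes + (new_requests if accept_new else 0)
    placed = 0
    for _ in range(demand):
        # Each volume searches the cluster for any node with space.
        candidates = [i for i, f in enumerate(free) if f > 0]
        if not candidates:
            break  # no chairs left: the volume stays in limbo
        free[random.choice(candidates)] -= 1
        placed += 1
    return demand - placed  # volumes still searching for space

random.seed(0)
# 10 nodes x 5 slots = 50 free slots, but 40 + 30 = 70 volumes want one.
print(simulate(10, 5, 40, 30))                    # → 20 stranded
# Disabling new requests caps demand at 40, so every volume finds space.
print(simulate(10, 5, 40, 30, accept_new=False))  # → 0 stranded
```

The sketch shows why the fix worked even before new capacity arrived: cutting off incoming demand let the backlog of displaced data drain into the space that remained.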
Image of storm clouds over Denmark courtesy of Malene Thyssen, via Wikimedia Commons