AWS outage causes damage worth $150 mn, Amazon escapes largely unscathed

Cloud computing provider Amazon Web Services (AWS) has explained that the outage that occurred on Tuesday was caused by an incorrectly entered command input.

AWS issued the following explanation:

“The Amazon Simple Storage Service (S3) team was debugging an issue causing the S3 billing system to progress more slowly than expected. At 9:37AM PST, an authorized S3 team member using an established playbook executed a command which was intended to remove a small number of servers for one of the S3 subsystems that is used by the S3 billing process. Unfortunately, one of the inputs to the command was entered incorrectly and a larger set of servers was removed than intended.”

The servers that were inadvertently removed supported two other S3 subsystems. One of these subsystems, the index subsystem, manages the metadata and location information of all S3 objects in the region. This subsystem is necessary to serve all GET, LIST, PUT, and DELETE requests.

The second subsystem, the placement subsystem, manages allocation of new storage and requires the index subsystem to be functioning properly to correctly operate. The placement subsystem is used during PUT requests to allocate storage for new objects.
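
For readers less familiar with the S3 API, the short sketch below shows the request types those two subsystems sit behind, using the boto3 Python SDK and a hypothetical bucket and key chosen purely for illustration. During the outage, every one of these calls against US-EAST-1 would have failed.

```python
import boto3

# Hypothetical bucket and key names, used only to illustrate the API.
s3 = boto3.client("s3", region_name="us-east-1")

# PUT: the placement subsystem allocates storage for the new object.
s3.put_object(Bucket="example-bucket", Key="reports/2017-02-28.txt",
              Body=b"hello, s3")

# GET / LIST / DELETE: all depend on the index subsystem, which tracks
# the metadata and location of every object in the region.
obj = s3.get_object(Bucket="example-bucket", Key="reports/2017-02-28.txt")
print(obj["Body"].read())

listing = s3.list_objects_v2(Bucket="example-bucket", Prefix="reports/")
print([o["Key"] for o in listing.get("Contents", [])])

s3.delete_object(Bucket="example-bucket", Key="reports/2017-02-28.txt")
```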

Removing a significant portion of the capacity caused each of these systems to require a full restart. While these subsystems were being restarted, S3 was unable to service requests. Other AWS services in the US-EAST-1 Region that rely on S3 for storage, including the S3 console, Amazon Elastic Compute Cloud (EC2) new instance launches, Amazon Elastic Block Store (EBS) volumes (when data was needed from an S3 snapshot), and AWS Lambda were also impacted while the S3 APIs were unavailable.

The US-EAST-1 region is a massive datacenter location and also Amazon’s oldest. Disruption to S3 affected websites, apps and smart devices across the region. In the statement, Amazon also apologised to its customers.

Bizjournals.com reported that the mistake had a cascading effect, leading to widespread problems across Amazon’s massive network of servers, which forms a large part of the internet’s infrastructure.

According to analysis by Cyence, a startup that models the economic impact of cyber risk, S&P 500 companies lost $150 million to $160 million during the four-hour disruption.

“While this does impact an estimated 20 percent of the internet, there are many businesses hosted on Amazon that are not having these issues. The difference is that the ones who have fully embraced Amazon’s design philosophy to have their website data distributed across multiple regions were prepared,” said Shawn Moore, CTO at Solodev.
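
The multi-region design Moore describes can take many forms; one minimal sketch, assuming hypothetical primary and replica buckets kept in sync (for example via S3 Cross-Region Replication or application-level copies), is a client that falls back to a second region when the first one fails:

```python
import boto3
from botocore.exceptions import ClientError, EndpointConnectionError

# Hypothetical region/bucket pairs; the replica bucket is assumed to hold
# a copy of the same objects as the primary.
REGIONS = [
    ("us-east-1", "example-assets-use1"),
    ("us-west-2", "example-assets-usw2"),
]

def fetch_with_failover(key: str) -> bytes:
    """Try each region in order and return the first successful read."""
    last_error = None
    for region, bucket in REGIONS:
        try:
            s3 = boto3.client("s3", region_name=region)
            return s3.get_object(Bucket=bucket, Key=key)["Body"].read()
        except (ClientError, EndpointConnectionError) as exc:
            last_error = exc  # region unavailable or object missing; try the next one
    raise RuntimeError("all regions failed") from last_error
```

A client built this way would have kept serving reads from us-west-2 while the US-EAST-1 S3 APIs were down, which is essentially the difference Moore points to between affected and unaffected businesses.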

Citing web monitoring company Apica, Business Insider reported that 54 of the top 100 internet retailers saw site performance drop by 20 percent or more, and three websites went down completely: Express, Lululemon, and One Kings Lane.

AWS said it is making several changes as a result of this operational event. It has already modified the tool that removed too much capacity too quickly.

“We have modified this tool to remove capacity more slowly and added safeguards to prevent capacity from being removed when it will take any subsystem below its minimum required capacity level. This will prevent an incorrect input from triggering a similar event in the future.”
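
AWS has not published the tooling itself, so the following is only a minimal sketch of the kind of safeguard it describes: remove servers in small batches with a pause between them, and refuse to start at all if the removal would take a subsystem below an assumed minimum capacity. The subsystem names, capacity floors and function are illustrative assumptions, not AWS’s actual implementation.

```python
import time

# Assumed minimum server counts per subsystem (illustrative values only).
MIN_CAPACITY = {"index": 120, "placement": 60}

def remove_capacity(subsystem: str, active_servers: list, count: int,
                    batch_size: int = 2, pause_s: float = 30.0):
    """Remove servers gradually, never dropping below the subsystem's floor."""
    floor = MIN_CAPACITY[subsystem]
    if len(active_servers) - count < floor:
        raise ValueError(
            f"refusing removal: {subsystem} would fall below its minimum "
            f"required capacity of {floor} servers")

    removed = []
    while count > 0:
        batch = active_servers[:min(batch_size, count)]
        for server in batch:
            active_servers.remove(server)   # decommission one server
            removed.append(server)
        count -= len(batch)
        time.sleep(pause_s)                 # slow removal so impact can be observed
    return removed
```

Under this scheme, the mistyped input from the incident would have been rejected up front instead of pulling a large set of servers out of service at once.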

Further, AWS has decided to audit its other operational tools to ensure they have similar safety checks, and to make changes that improve the recovery time of key S3 subsystems.

“We employ multiple techniques to allow our services to recover from any failure quickly,” AWS stated.

One important step is breaking services into small partitions which AWS calls cells. By factoring services into cells, engineering teams can assess and thoroughly test the recovery processes of even the largest service or subsystem.
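
The core idea behind cells can be illustrated with a simple hash-based assignment; the cell count and key names below are assumptions for illustration, not a description of S3’s internal partitioning.

```python
import hashlib

NUM_CELLS = 8   # assumed cell count, for illustration only

def cell_for_key(key: str) -> int:
    """Map an object key to a fixed cell so each cell owns a slice of the data."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % NUM_CELLS

# A failure or restart in one cell then touches only the keys that hash to it,
# shrinking the blast radius compared with one region-wide subsystem.
print(cell_for_key("reports/2017-02-28.txt"))
```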

As S3 has scaled, the team has done considerable work to refactor parts of the service into smaller cells to reduce blast radius and improve recovery.

During this event, recovery of the index subsystem still took longer than AWS expected. The S3 team had planned further partitioning of the index subsystem later this year. “We are reprioritizing that work to begin immediately,” the cloud infrastructure provider assured.

This is a wake-up call for those hosted on AWS and other providers to take a deeper look at how their infrastructure is set up, and it underscores the need for redundancy, a capability that AWS offers but, as the outage revealed, few customers were actually using.

Arya MM
[email protected]