Amazon Web Services (AWS), an Amazon.com company, has revealed the reasons for the outage that impacted several leading global websites and apps last week.
AWS said the addition of new servers to the network triggered the massive outage.
AWS said it had confirmed a root cause, and that the outage was not driven by memory pressure, as had been reported earlier.
Rather, the new capacity had caused all of the servers in the fleet to exceed the maximum number of threads allowed by an operating system configuration, the cloud company said in a technical blog post.
As this limit was being exceeded, cache construction was failing to complete and front-end servers were ending up with useless shard-maps that left them unable to route requests to back-end clusters, AWS said.
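To illustrate why a capacity addition could push every server over the limit at once, here is a minimal sketch. It assumes each front-end server keeps one thread per peer in the fleet, which is not stated in this report; the function names and all numbers are hypothetical.

```python
def threads_needed(fleet_size: int, base_threads: int) -> int:
    """Threads one server uses: its own baseline plus one per peer (assumed model)."""
    return base_threads + (fleet_size - 1)


def exceeds_limit(fleet_size: int, base_threads: int, max_threads: int) -> bool:
    """True if a server in a fleet of this size would pass the OS thread limit."""
    return threads_needed(fleet_size, base_threads) > max_threads


# Hypothetical figures: every server has the same count, so all cross the limit together.
before = exceeds_limit(fleet_size=100, base_threads=50, max_threads=200)
after = exceeds_limit(fleet_size=160, base_threads=50, max_threads=200)
```

Under this assumed model the thread count depends only on fleet size, so growing the fleet affects existing servers as much as new ones.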
The capacity addition was being made to the front-end fleet of AWS.
Each server in the front-end fleet maintains a cache of information, including membership details and shard ownership for the back-end clusters, called a shard-map.
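A shard-map of this kind can be pictured as two lookups: which back-end cluster owns a shard, and which hosts belong to that cluster. The sketch below is illustrative only; the class, field and host names are hypothetical and do not reflect the actual AWS implementation.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class ShardMap:
    # cluster name -> member hosts (membership details)
    members: Dict[str, List[str]] = field(default_factory=dict)
    # shard id -> owning cluster name (shard ownership)
    shard_owner: Dict[str, str] = field(default_factory=dict)

    def route(self, shard_id: str) -> Optional[str]:
        """Return a back-end host for the shard, or None if the map cannot route it."""
        cluster = self.shard_owner.get(shard_id)
        if cluster is None:
            return None
        hosts = self.members.get(cluster)
        if not hosts:
            return None
        return hosts[0]


complete = ShardMap(
    members={"cluster-a": ["host-1", "host-2"]},
    shard_owner={"shard-001": "cluster-a"},
)
incomplete = ShardMap()  # cache construction never finished

routed = complete.route("shard-001")
unrouted = incomplete.route("shard-001")
```

A front-end server holding the incomplete map has no owner recorded for any shard, so every lookup fails and requests cannot be forwarded.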
Several apps and services posted on Twitter that they were experiencing problems with AWS services. Among them were Acorns, Adobe Spark, Autodesk, Coinbase, Glassdoor, Flickr, iRobot and The Washington Post.
Amazon said it mitigated the impact on the subsystem within its Kinesis Data Streams APIs that is responsible for processing incoming requests, and on other dependent services.
“We continued to slowly add traffic to the front-end fleet with the Kinesis error rate steadily dropping from noon onward. Kinesis fully returned to normal,” AWS said.