Fastly, the CDN company behind a major global internet outage, said the incident was caused by a bug in its software that was triggered when one of its customers changed their settings.
Tuesday’s outage raised questions about the reliance of the internet on a few infrastructure companies. Fastly’s issue knocked out high-traffic sites including news providers such as The Guardian and The New York Times, as well as British government sites, Reddit and Amazon.com.
“This outage was broad and severe, and we’re truly sorry for the impact to our customers and everyone who relies on them,” Fastly said in a blog post authored by Nick Rockwell, its senior engineering and infrastructure executive.
He said the problem should have been anticipated.
Fastly operates a group of servers strategically placed around the world to help customers move and store content close to their end users quickly and safely.
The company’s post gave a timeline of events and promised to examine and explain why Fastly had failed to detect the software bug during its own testing process.
Fastly said the bug was in a software update shipped to customers on May 12 but lay dormant until one unidentified customer made a settings change that triggered it, “which caused 85 percent of our network to return errors.”
Fastly noticed the outage within a minute of it occurring at 0947 GMT, and engineers worked out the cause at 1027 GMT. Once they disabled the settings that triggered the problem, most of the company’s network quickly recovered.
“Within 49 minutes, 95 percent of our network was operating as normal,” Fastly said. Its networks were fully recovered at 1235 GMT and it began rolling out a permanent software fix at 1725 GMT, Fastly said.
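The gaps between those milestones can be worked out directly from the GMT times quoted above. The following is a minimal sketch, assuming every time falls on the same day and measuring from the 0947 start of the outage; the helper names are illustrative and do not come from Fastly.

```python
from datetime import datetime

def gmt(hhmm):
    """Parse a four-digit GMT time such as '0947' (same day assumed)."""
    return datetime.strptime(hhmm, "%H%M")

def minutes_between(start, end):
    """Whole minutes elapsed between two same-day GMT times."""
    return int((gmt(end) - gmt(start)).total_seconds() // 60)

OUTAGE_START = "0947"

print(minutes_between(OUTAGE_START, "1027"))  # cause identified: 40
print(minutes_between(OUTAGE_START, "1235"))  # full recovery: 168
print(minutes_between(OUTAGE_START, "1725"))  # permanent fix rollout begins: 458
```

On these figures, engineers found the cause 40 minutes in, the network was fully recovered after 2 hours 48 minutes, and the permanent fix began rolling out 7 hours 38 minutes after the outage started.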
It’s important to understand why the impact was so extensive. Fastly is a CDN – a content delivery network. A CDN keeps replicas of its customers’ original websites so that traffic can be balanced across them.
Instead of everyone around the world hitting one centralized server and overloading it, CDNs spread the load across different replicas. For example, the original server could sit in San Francisco, but there are replicas in Paris, Manhattan, Tel Aviv and Hong Kong. Everyone is routed to the server nearest to their device. When a CDN fails, all the replicas become unavailable and no one is able to see the content from the original server.
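The routing idea can be sketched in a few lines of code. This is a simplified illustration only, not how Fastly works: it assumes "nearest" means shortest straight-line distance on the globe, whereas real CDNs typically rely on DNS or network-level routing. The coordinates are approximate and the function names are hypothetical.

```python
import math

# Approximate (latitude, longitude) for the replica cities in the example.
REPLICAS = {
    "Paris": (48.86, 2.35),
    "Manhattan": (40.78, -73.97),
    "Tel Aviv": (32.09, 34.78),
    "Hong Kong": (22.32, 114.17),
}

def haversine_km(a, b):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))

def route(user_location, healthy):
    """Return the nearest healthy replica, or None if none is available."""
    candidates = [name for name in REPLICAS if name in healthy]
    if not candidates:
        return None  # every replica is down: the visitor gets an error
    return min(candidates,
               key=lambda name: haversine_km(user_location, REPLICAS[name]))

london = (51.51, -0.13)
print(route(london, healthy=set(REPLICAS)))  # Paris
print(route(london, healthy=set()))          # None
```

With every replica healthy, a visitor in London is sent to Paris. With none healthy, the situation during a CDN-wide failure, there is nowhere to send the visitor at all.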
So it may look as though Amazon, Reddit, Twitch and other big sites were attacked in unison, but they were not. These companies had no outage of their own. The only outage was at Fastly, the CDN that serves them.
“We don’t yet know the reason for that and there are many possible answers, but it reminds us of a similar incident from October 2016, where the Mirai botnet infected several high-profile targets with distributed denial-of-service (DDoS) attacks. Mirai was an IoT botnet that took control of cameras and other such devices, making them send requests to take down Dyn, the DNS company that served many brands, including Twitter, BBC, Visa and Reddit,” Lotem Finkelstenn, Head of Threat Intelligence at Check Point Software Technologies, said.