
Amazon crashed part of the Internet last Tuesday, and it explains why

By Doroteya Borisova

Published: Dec 13, 2021, 5:56 AM



Most of us know Amazon best for its e-commerce services, which let us order nearly anything off the internet these days, from food to clothes and furniture, with free shipping and just a few clicks on Amazon Prime. This is the business that made Jeff Bezos the richest man in the world (until recently), and it continues to rake in the most cash; but Amazon does much, much more than retail.

In fact, roughly 33% of the internet runs on AWS (Amazon Web Services) servers, placing Amazon high above even Google and Microsoft when it comes to lucrative web services.

And last Tuesday, a portion of the internet, together with Amazon.com, disappeared for a while when Amazon's servers in Northern Virginia (home to the first AWS data center ever, and one of the biggest) experienced an unexpected crash. The downtime lasted about seven hours, starting at around 7:30 AM PST, with the network fully restored by 2:22 PM PST.

During the prolonged outage, the whole event was shrouded in mystery: few details were shared about what exactly had caused it or when things would be back to normal. A few days later, however, Amazon released a more detailed report on what happened on December 7.

As it turns out, it was a very unusual crash that also affected the AWS monitoring systems, which Amazon says significantly delayed its own teams' ability to understand and diagnose the issue for the first few hours. Moreover, Amazon says that "the networking congestion impaired our Service Health Dashboard tooling from appropriately failing over to our standby region."

Amazon says it is hard at work updating its systems so that the tech team (and consequently, AWS customers) will not be left in the dark again should future technical issues or outages occur.

Apart from sending significant portions of the internet offline, the Amazon outage also affected large-scale services such as Netflix, Disney+, Ticketmaster, and others.

Many smart devices that rely on an internet connection to function also stopped working temporarily, including the Alexa smart assistant, Roomba vacuums (via CNBC), security cameras, smart cat litter boxes, and even baby monitors. The last of these, all other annoyances aside, posed a significant safety concern.

Here is part of Amazon's post on its website, published on Friday:

At 7:30 AM PST, an automated activity to scale capacity of one of the AWS services hosted in the main AWS network triggered an unexpected behavior from a large number of clients inside the internal network. This resulted in a large surge of connection activity that overwhelmed the networking devices between the internal network and the main AWS network, resulting in delays for communication between these networks.

These delays increased latency and errors for services communicating between these networks, resulting in even more connection attempts and retries. This led to persistent congestion and performance issues on the devices connecting the two networks.

This congestion immediately impacted the availability of real-time monitoring data for our internal operations teams, which impaired their ability to find the source of congestion and resolve it.

Operators instead relied on logs to understand what was happening and initially identified elevated internal DNS errors. Because internal DNS is foundational for all services and this traffic was believed to be contributing to the congestion, the teams focused on moving the internal DNS traffic away from the congested network paths. At 9:28 AM PST, the team completed this work and DNS resolution errors fully recovered. [...]

We have taken several actions to prevent a recurrence of this event. We immediately disabled the scaling activities that triggered this event and will not resume them until we have deployed all remediations. Our systems are scaled adequately so that we do not need to resume these activities in the near-term. Our networking clients have well tested request back-off behaviors that are designed to allow our systems to recover from these sorts of congestion events, but, a latent issue prevented these clients from adequately backing off during this event.

This code path has been in production for many years but the automated scaling activity triggered a previously unobserved behavior. We are developing a fix for this issue and expect to deploy this change over the next two weeks. We have also deployed additional network configuration that protects potentially impacted networking devices even in the face of a similar congestion event. These remediations give us confidence that we will not see a recurrence of this issue.
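Amazon's post does not describe how its "request back-off behaviors" are implemented, so the following is only an illustrative sketch of one common approach: capped exponential backoff with random jitter. The function names (`backoff_delays`, `call_with_backoff`) and the default values for `base`, `cap`, and `max_retries` are assumptions made for this example, not anything taken from AWS. The idea is that each failed attempt waits a randomized and, on average, longer interval before retrying, so that a large number of clients do not all reconnect at once and prolong the congestion.

```python
import random
import time


def backoff_delays(max_retries, base=0.5, cap=30.0, rng=None):
    """Yield one randomized wait (in seconds) per retry.

    Hypothetical helper: the upper bound doubles on every retry,
    is capped at `cap`, and the actual wait is drawn uniformly
    between 0 and that bound (often called "full jitter").
    """
    rng = rng or random.Random()
    for attempt in range(max_retries):
        ceiling = min(cap, base * (2 ** attempt))
        yield rng.uniform(0.0, ceiling)


def call_with_backoff(operation, max_retries=5, base=0.5, cap=30.0,
                      sleep=time.sleep, rng=None):
    """Call `operation`, retrying on ConnectionError with backoff.

    Makes at most max_retries + 1 attempts, then re-raises the
    last ConnectionError instead of retrying forever.
    """
    delays = backoff_delays(max_retries, base, cap, rng)
    while True:
        try:
            return operation()
        except ConnectionError:
            delay = next(delays, None)
            if delay is None:
                raise  # retries exhausted: give up rather than add load
            sleep(delay)
```

A caller would wrap a network request, for example `call_with_backoff(lambda: fetch_status())`, where `fetch_status` stands in for any function that raises `ConnectionError` on failure. Clients that retry immediately and without limit produce the opposite effect, which is the kind of feedback loop the excerpt above describes: more errors lead to more connection attempts, which lead to more congestion.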