Why a Vital Amazon Web Services Region Went Down on Dec. 7

Amazon Web Services (AWS) suffered a significant outage in its US-East-1 region on the morning of Tuesday, Dec. 7, 2021. The outage affected many popular websites and services, including Netflix, Airbnb, Expedia, and Reddit.
Early speculation centered on human error. The most commonly suggested scenario was that an engineer at Amazon made a configuration change to one or more servers in US-East-1, which set off a chain of automated recovery events.
AWS's own account, quoted below, instead attributes the outage to an automated scaling activity.
What Amazon Says
“At 7:30 AM PST, an automated activity to scale capacity of one of the AWS services hosted in the main AWS network triggered an unexpected behavior from a large number of clients inside the internal network. This resulted in a large surge of connection activity that overwhelmed the networking devices between the internal network and the main AWS network, resulting in delays for communication between these networks. These delays increased latency and errors for services communicating between these networks, resulting in even more connection attempts and retries. This led to persistent congestion and performance issues on the devices connecting the two networks.”
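The feedback loop in that statement, where errors trigger retries and retries add to the congestion, is commonly damped on the client side with capped exponential backoff plus random jitter. The Python sketch below is only an illustration of that general technique, not a description of what AWS changed. The names `backoff_delays` and `call_with_retries` and the default values are assumptions made for this example.

```python
import random
import time


def backoff_delays(max_retries, base=0.1, cap=10.0):
    """Yield one wait time per retry: capped exponential backoff with full jitter."""
    for attempt in range(max_retries):
        # The ceiling doubles on every retry but never exceeds `cap` seconds.
        ceiling = min(cap, base * (2 ** attempt))
        # Picking a random point below the ceiling spreads clients out in time,
        # so they do not all reconnect at the same instant.
        yield random.uniform(0, ceiling)


def call_with_retries(operation, max_attempts=5, base=0.1, cap=10.0, sleep=time.sleep):
    """Call operation(); on ConnectionError, wait a jittered delay and try again.

    `operation` is a hypothetical zero-argument callable supplied by the caller.
    """
    delays = backoff_delays(max_attempts - 1, base, cap)
    last_error = None
    for _ in range(max_attempts):
        try:
            return operation()
        except ConnectionError as err:
            last_error = err
            delay = next(delays, None)
            if delay is None:
                break  # retry budget exhausted; give up instead of retrying forever
            sleep(delay)
    raise last_error
```

Bounding the number of attempts matters as much as the delay: a client that retries forever keeps adding load no matter how well it spaces its attempts.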
A human-error scenario is still worth planning for. For example, a single mistyped command, intended to remove a few servers, could instead take far more capacity offline than planned.

The outage is a reminder of the importance of redundancy and disaster recovery planning. Businesses need a plan for how they will keep operating when an incident like this happens. AWS offers several options for business continuity, including multi-Region deployments, which place resources in multiple AWS regions so that if one region goes down, the others can continue to operate.

Companies should develop a business continuity plan even if they aren't using a cloud infrastructure provider like Amazon Web Services. That means replicating key production data in real time to a secondary site and keeping staff on call who can help get the company back up and running during an outage.

The US-East-1 outage is also a reminder of the importance of testing your disaster recovery plan regularly. A plan may look sound on paper, but if it has never been exercised under realistic conditions, you may find that it doesn't work when you need it most.

AWS is working to restore service to US-East-1 and expects full recovery by the end of the day. In the meantime, businesses impacted by the outage should take this time to assess their own business continuity plans and test them in a safe environment.
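To make the multi-region idea above concrete, here is a minimal Python sketch of client-side failover: try a primary region first and fall back to the next one when a call fails. It assumes the application already runs in every listed region. The function `call_with_failover` and the `operation` callable are hypothetical names for this example, not part of any AWS SDK.

```python
def call_with_failover(regions, operation):
    """Try operation(region) in priority order.

    Returns a (region, result) pair from the first region that succeeds.
    `operation` is a hypothetical callable that performs one request
    against the given region and raises OSError (which includes
    ConnectionError and TimeoutError) when the region is unreachable.
    """
    errors = {}
    for region in regions:
        try:
            return region, operation(region)
        except OSError as err:
            # Remember the failure and move on to the next region.
            errors[region] = err
    raise RuntimeError(f"all regions failed: {sorted(errors)}")
```

A caller might pass `["us-east-1", "us-west-2"]` as the priority list. Running the same function against a deliberately failing primary is one inexpensive way to rehearse a failover in a safe environment.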