Recent Major Website Outage Was Caused By A Simple Mistake
Last week, Amazon Web Services (AWS) suffered an 11-hour outage that caused dramatic slowdowns, and in some cases complete unavailability, for more than 100 large online retailers and a number of the web’s top sites, including Amazon itself, Netflix, Imgur, Reddit and a host of others.
The situation has since been resolved, and Amazon has published an incident postmortem revealing the root cause: a typo.
An Amazon employee performing routine maintenance intended to remove a small number of servers from one of the S3 subsystems used for billing. Unfortunately, the command was entered incorrectly and inadvertently took down a much larger set of servers, which took far longer to restart than the company anticipated.
According to the postmortem, some of the affected servers had not been fully restarted in several years, which further complicated the recovery. Amazon has since modified the tool used to take servers offline so that it removes capacity more slowly, giving staff more time to intervene if they notice any unanticipated complications.
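Amazon has not published the exact mechanics of that tooling change, but as a rough sketch of the general idea, a capacity-removal script can enforce a per-run limit, refuse to drop a fleet below a minimum healthy size, and pause between removals so operators have a window to abort. The names, thresholds, and structure below are hypothetical and are not Amazon's actual tool:

```python
import time

# Hypothetical illustration (not Amazon's real tooling): a capacity-removal
# helper that caps how much one command can remove, protects a minimum fleet
# size, and spaces out removals so operators can stop the run if needed.

MIN_FLEET_SIZE = 50        # assumed floor the subsystem needs to stay healthy
MAX_REMOVALS_PER_RUN = 5   # assumed cap on capacity removed by a single command
PAUSE_SECONDS = 30         # assumed delay between removals to allow intervention


def remove_capacity(fleet: list[str], requested: int) -> list[str]:
    """Take `requested` servers out of service, applying safety checks first."""
    if requested > MAX_REMOVALS_PER_RUN:
        raise ValueError(
            f"Refusing to remove {requested} servers; per-run limit is {MAX_REMOVALS_PER_RUN}"
        )
    if len(fleet) - requested < MIN_FLEET_SIZE:
        raise ValueError(
            f"Removal would leave {len(fleet) - requested} servers, "
            f"below the minimum of {MIN_FLEET_SIZE}"
        )

    removed = []
    for server in fleet[:requested]:
        print(f"Draining and shutting down {server} ...")
        removed.append(server)
        time.sleep(PAUSE_SECONDS)  # slow rollout: operators can abort between steps
    return removed
```

The value of pacing removals this way is that a mistyped parameter shows up as a slow, visible drift in capacity rather than the instantaneous loss of a large chunk of the fleet.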
While this is not the first time a cloud services provider has experienced an outage, it was easily the largest we’ve seen, and it has raised questions about the reliability of those services.
Given the number of companies migrating to cloud-based service providers, the ripple effects can be enormous when one of those providers suffers an outage.
That said, cloud-based providers have an exceptionally good track record to this point, and no company is immune to outages. Amazon’s recent incident isn’t a compelling argument against cloud migration. At the end of the day, equipment is still equipment, and no matter who manages it, there’s always a chance, however slight, that a complication will result in downtime.