One tiny update, one huge outage: how AWS brought the U.S. to a screeching halt

Jayden Thomas ’27

For students and faculty, the morning of October 20th, 2025 began like any other. At around 9:00 a.m., as teachers prepared their morning classes and students checked their assignments, the Canvas learning-management system suddenly stopped working. We rely on Canvas for every subject, though for some more than others. In languages such as English, Latin, and Spanish, teachers can return to print texts and translation, and biology, physics, and chemistry can fall back on laboratory work, but computer science depends on access to Canvas and other online tools. Luckily, the response to this “educational emergency” was well organized, with teachers sending students study guides and important tests via Gmail, which fortunately was unaffected by the Amazon Web Services (AWS) outage.

In this modern era of technology, the outage raises two questions: what exactly happened on October 20th, 2025, and could it happen again?

Let’s start by defining AWS: it is a vast collection of services, run out of Amazon’s data centers, that thousands of companies use. By hosting large, complex, and expensive resources such as storage, user databases, and computing power, AWS allows smaller organizations and individuals to avoid buying and managing physical infrastructure of their own. AWS is a form of “cloud computing,” constantly handling a colossal amount of data every millisecond across an interconnected network of services.
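To make “renting” Amazon’s infrastructure concrete, here is a minimal Python sketch using boto3, Amazon’s real software development kit. The “students” table and the record in it are hypothetical, and the code assumes AWS credentials are already configured.

```python
import boto3  # Amazon's official Python SDK for AWS

# Connect to the DynamoDB database service in one AWS region.
dynamodb = boto3.resource("dynamodb", region_name="us-east-1")

# "students" is a hypothetical table assumed to already exist in the account.
table = dynamodb.Table("students")

# One line stores a record on Amazon's servers: no database hardware,
# backups, or maintenance on our end.
table.put_item(Item={"id": "jt27", "name": "Jayden", "grade": 10})
```

All of the hard parts (the servers, the disks, the replication) happen on Amazon’s side; the developer pays only for what they use.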

The Amazon Web Services (AWS) office at CityCentre Five, 825 Town and Country Lane, Houston, Texas.

One of the core issues with cloud computing is the single point of failure. If even a small part of Amazon’s infrastructure fails, it can trigger a cascade of problems for millions of businesses, institutions, and individuals. This web of dependence is cost-effective, but it creates a vulnerability, one that, as the outage on October 20th demonstrated, can ripple across the global economy.

In this particular incident, the damage wasn’t isolated to AWS customers: services that didn’t even use AWS were affected, because they depended on other services that did.

Imagine an internal company messaging service like Slack. When Slack went down due to the AWS outage, companies that used Slack to reach their employees, even a hypothetical firm like ACME with no AWS account of its own, found their operations grinding to a halt.

The origin of this outage, uncovered after fourteen hours of frantic troubleshooting, was a faulty update to a core system called DynamoDB, the database service that stores and retrieves data for Amazon and its clients. A small error in the update rendered DynamoDB completely unusable in the US-EAST-1 region: services could no longer look up the correct server addresses for much of their data, leaving applications and websites in the dark.
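To see what “couldn’t find the correct server addresses” means in practice, here is a rough Python sketch of the name lookup every application performs before it can talk to DynamoDB. The endpoint name is DynamoDB’s real US-EAST-1 address; the failure scenario is illustrative, not Amazon’s actual code.

```python
import socket

def can_reach_dynamodb() -> bool:
    """Check whether DynamoDB's name resolves to usable server addresses."""
    try:
        # Translate the service's name into server addresses (a DNS lookup).
        socket.getaddrinfo("dynamodb.us-east-1.amazonaws.com", 443)
        return True
    except socket.gaierror:
        # During the outage, lookups like this returned no usable addresses,
        # so requests failed before any data was ever sent or received.
        return False

print(can_reach_dynamodb())  # True on a normal day; False during the outage
```

No matter how healthy an application’s own servers were, when this first step failed, none of its data could be reached.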

Amazon, in its ongoing effort to keep its services running smoothly and up to date, rolls out thousands of updates every day, far too many to test each one thoroughly before release. The sheer scale of AWS’s infrastructure demands rapid deployment to keep up with demand, but as we’ve seen, that speed can leave vulnerabilities overlooked.
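One common industry safeguard is a “canary” rollout: release an update to a tiny fraction of servers first, and roll it back automatically if errors spike. The toy Python simulation below illustrates the general technique; every name and number in it is invented, and it is not a description of Amazon’s actual pipeline.

```python
import random

def passes_health_check(update_is_buggy: bool) -> bool:
    # In this toy model, a buggy update fails its health check 90% of the time.
    return (not update_is_buggy) or random.random() > 0.9

def canary_deploy(fleet_size: int, update_is_buggy: bool) -> str:
    canary_size = max(1, fleet_size // 100)  # try just 1% of servers first
    if all(passes_health_check(update_is_buggy) for _ in range(canary_size)):
        return f"deployed to all {fleet_size} servers"
    return f"rolled back after testing {canary_size} canary servers"

print(canary_deploy(10_000, update_is_buggy=True))   # almost always rolls back
print(canary_deploy(10_000, update_is_buggy=False))  # deploys fleet-wide
```

A canary catches many bad updates cheaply, but no safeguard is perfect; some flaws only surface once an update touches the whole system.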

To prevent such disruptions, Amazon could implement a multi-region cloud setup, where user requests are rerouted to a healthy region whenever one region goes down. However, things can get tricky: if the failure-detection mechanism doesn’t work, or worse, if the interconnectedness of services lets the failure spread unnoticed, then simply rerouting requests to another region won’t help. In this case, the interdependence of AWS’s services became a single point of failure for the whole system.
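Here is a minimal Python sketch of the failure-detection half of that idea: poll a health-check address in each region and route traffic to the first one that answers. The URLs are hypothetical; real deployments usually rely on DNS-based routing services rather than hand-rolled polling.

```python
import urllib.request

# Hypothetical health-check addresses, one per region.
REGIONS = {
    "us-east-1": "https://api.us-east-1.example.com/health",
    "us-west-2": "https://api.us-west-2.example.com/health",
}

def pick_healthy_region() -> str | None:
    """Return the first region whose health check answers, else None."""
    for region, url in REGIONS.items():
        try:
            with urllib.request.urlopen(url, timeout=2) as response:
                if response.status == 200:
                    return region
        except OSError:
            continue  # this region is unreachable; try the next one
    return None  # every region failed its check
```

The catch is exactly the one described above: if the health check itself depends on the broken service, a failing region can look fine, and the rerouting never happens.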

Another factor in Amazon’s ongoing review is cost: is paying two million dollars instead of one million for a multi-region setup, along with a 20 percent increase in development time, worth a 40 percent increase in the speed and resilience of its cloud services?

So, all this information leads to one final question: could this happen again? Absolutely. The fact that a single update could bring down a critical system, affecting not just AWS clients but unrelated services as well, is a clear warning sign that the very thing that makes cloud computing attractive (its interconnectedness and scalability) is also its Achilles’ heel.

Education and technology go hand in hand, and Haverford should consider how to future-proof itself against future Canvas outages by creating a flexible plan of action that works for every subject, teacher, and student.