By Jordan Valinsky, CNN

Amazon Web Services (AWS), the cloud computing platform that powers much of the internet, went down for several hours, making several major websites and apps inoperable.

From banking services to social networks to airline booking sites to online shopping, thousands of services were disrupted as millions of people worldwide – many of whom were on their way to work on the US East Coast – were unable to mobile-order coffee or access key apps.

The latest outage, on Monday (US time), serves as a reminder of how fragile the internet’s backbone can be, even if the disruption is brief, and how reliant the world has become on these online services.

Although AWS and its competitors are generally robust, the internet is a complex web of overlapping services that are only as reliable as their weakest code. The root cause of Monday’s outage remains unknown, but a service that converts friendly web names into IP addresses was unable to communicate with thousands of companies’ massive databases hosted by Amazon.

Past outages on this scale have been caused by a wide variety of errors, including faulty updates, the accidental injection of bad code, or a change to third-party software that doesn’t play nicely with a service. Rarely, internet cable cuts, cyberattacks or distributed denial-of-service attacks can bring down or overload servers that host key apps.

But the relative frequency of these events shows the lack of necessary redundancies and competitive services. Too often, some internet experts say, companies put all their eggs in one cloud services basket.

There’s “no sign” that this was a cyberattack, according to Rob Jardin, chief digital officer at cybersecurity firm NymVPN, adding that it “looks like a technical fault affecting one of Amazon’s main data centers.”

“The internet was originally designed to be decentralized and resilient, yet today so much of our online ecosystem is concentrated in a small number of cloud regions,” he said in a note. “When one of those regions experiences a fault, the impact is immediate and widespread.”

Jardin said “these issues can happen when systems become overloaded or a key part of the network goes down; and because so many websites and apps rely on AWS, the impact spreads quickly.”

AWS doesn’t often experience major disruptions like this, with the last one occurring in 2021.

“That’s on par with the other major cloud providers and, in fact, it’s amazing that they’re able to run at the scale they do without more frequent disruptions,” said Mike Chapple, a cybersecurity expert and IT professor at the University of Notre Dame’s Mendoza College of Business.

“The reason these events attract much more notice is because of their impact,” he told CNN. “If a single company experiences an issue in their data center, it causes issues for that company’s products and services.”

In 2024, the largest-ever IT outage brought down large portions of the internet when a devastating CrowdStrike software glitch crashed computers, led to flight cancellations and disrupted hospitals around the globe, creating $5 billion in direct business losses. A bug in CrowdStrike’s cloud-based testing system pushed a problematic update to computers around the world.

Also last year, AT&T’s network went down several times, including an 11-hour meltdown that prevented many gig workers from doing their jobs.

So, what went wrong?

AWS is a cloud computing provider that hosts many of the world’s most-used online services. In Amazon’s infancy, the company needed excess server capacity to ensure it had enough computing power to handle the massive amounts of traffic that came to its site during the holiday season rush. Amazon realized that during the rest of the year, it could use those servers to support other companies’ online needs, and out of that AWS was born.

Among AWS’ many offerings is DynamoDB, a database that hosts information for companies, including customer data. Amazon said Monday that its customers couldn’t access the data stored in DynamoDB, because the Domain Name System (DNS) – a kind of phone book for the internet – had encountered a problem.

DNS is like an internet location engine, converting user-friendly web addresses like amazon.com into IP addresses – a series of numbers that other websites and applications can understand.
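
As a rough illustration of that lookup, the sketch below uses Python's standard `socket.getaddrinfo` call to turn a web name into IP addresses. The helper name `resolve_hostname` and its injectable `resolver` parameter are assumptions made for this example, not anything Amazon has described. When the lookup fails, the function returns an empty list, which is roughly the position apps were in on Monday: the servers existed, but there was no address to reach them.

```python
import socket


def resolve_hostname(hostname, resolver=socket.getaddrinfo):
    """Return the sorted, de-duplicated IP addresses for a hostname.

    `resolver` defaults to the standard library lookup. It is injectable
    (a hypothetical convenience for this sketch) so the function can be
    exercised without network access.
    """
    try:
        # Each result is (family, type, proto, canonname, sockaddr),
        # and sockaddr[0] holds the IP address as a string.
        results = resolver(hostname, None)
    except socket.gaierror:
        # The name could not be resolved. The server may be healthy,
        # but the caller has no address to reach it.
        return []
    return sorted({result[4][0] for result in results})
```

With a working network connection, `resolve_hostname("amazon.com")` would return whatever addresses the DNS currently publishes for that name; the exact values change over time.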

“Amazon had the data safely stored, but nobody else could find it for several hours, leaving apps temporarily separated from their data,” said Chapple. “It’s as if large portions of the internet suffered temporary amnesia.”

It’s not clear what caused the DNS outage, but it lasted only a few hours. By 6.35am ET (11.35pm NZT), Amazon had fixed the DNS problem and recommended companies dump their cache – temporary storage files – to help speed up the restoration of their services.
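
One reason clearing a cache can help: software often remembers the result of a lookup for a while so it does not have to repeat it, and a remembered failure can outlive the fault that caused it. The toy cache below sketches that idea only; it is not how AWS or any real DNS resolver is implemented, and `LookupCache`, `ttl_seconds` and `flush` are hypothetical names chosen for this example.

```python
import time


class LookupCache:
    """Toy cache that remembers lookup results for a fixed time (hypothetical)."""

    def __init__(self, lookup, ttl_seconds=300.0, clock=time.monotonic):
        self._lookup = lookup      # function that performs the real lookup
        self._ttl = ttl_seconds    # how long a remembered answer stays valid
        self._clock = clock        # injectable clock, so behavior is testable
        self._entries = {}         # name -> (expiry time, cached result)

    def get(self, name):
        entry = self._entries.get(name)
        if entry is not None and self._clock() < entry[0]:
            # Still fresh: reuse the remembered answer, even if it is
            # a failure recorded while the service was down.
            return entry[1]
        result = self._lookup(name)
        self._entries[name] = (self._clock() + self._ttl, result)
        return result

    def flush(self):
        """Forget everything, so the next get() asks the real service again."""
        self._entries.clear()
```

In this sketch, a failed lookup recorded during the outage keeps being returned until its time limit expires, while calling `flush()` forces a fresh lookup straight away.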

Amazon said the outage continued to affect other AWS services, including EC2 – a kind of virtual server many companies use to build their online applications.

The company will likely conduct a postmortem and explain what went wrong with its DNS in the coming days.

– CNN