What do we know about what happened on Monday?

Shortly after 8am, hundreds of websites, apps and online services across the globe crashed following a failure at Amazon Web Services’s US-East-1 region in Northern Virginia in the US. The disruption hit gaming and social media services such as Snapchat and Roblox; banking apps such as Lloyds and Halifax; and government websites, such as HM Revenue & Customs.

Amazon Web Services, which provides cloud computing infrastructure to businesses, said the root cause appeared to be problems with its domain name resolution system. Shortly before midday, it issued a statement saying the underlying problem had been “fully mitigated”. Most affected services gradually recovered, although some were still operating more slowly than normal as they processed a backlog of requests.

What does Amazon Web Services do?

It provides digital services such as storage and networking to millions of companies and public service bodies. They use it because it is simpler to use Amazon’s ready-made digital infrastructure than build their own.

What is domain name resolution and why is it so important?

Domain name resolution is the process that translates a web address, such as thetimes.com, into a numerical IP address. When you enter a website into your browser or click a link, your device does not know where that website is hosted. Instead, it asks a global directory system — the domain name system (DNS) — to translate the name into the IP address. This process happens in milliseconds, thousands of times a day for every online user. If the system fails, your computer is unable to find where a website lives. Even if the website’s servers are running perfectly, users will see errors such as “server not found”.
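The lookup described above can be sketched in a few lines of Python. This is a minimal illustration, not how any particular browser or Amazon service does it: the helper name `resolve` is made up for this example, and it simply asks the operating system's resolver through the standard library's `socket.getaddrinfo`.

```python
import socket


def resolve(hostname: str) -> list[str]:
    """Return the IP addresses the system resolver reports for a hostname.

    `resolve` is a hypothetical helper written for this illustration.
    An empty list means the name could not be translated.
    """
    try:
        # Ask the operating system's resolver, which consults DNS if needed.
        results = socket.getaddrinfo(hostname, None, proto=socket.IPPROTO_TCP)
    except socket.gaierror:
        # Resolution failed: the device cannot find where the site lives,
        # even if the site's own servers are running perfectly.
        return []
    # Each result is (family, type, proto, canonname, sockaddr);
    # sockaddr[0] holds the IP address. Drop duplicates, keep order.
    return list(dict.fromkeys(info[4][0] for info in results))
```

Calling `resolve("thetimes.com")` on a working connection would return one or more IP addresses. When resolution fails, the empty list corresponds to the “server not found” error a user would see.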

Was the problem caused by a cyber-attack?

While a hidden cause cannot yet be ruled out, the evidence points to a platform failure rather than a hostile breach. In its most recent update, Amazon Web Services said it was “continuing to investigate the root cause”.

If the failure happened in America, why were so many global and British websites affected?

The answer lies in the globalised nature of cloud-computing infrastructure. Many companies based in the UK and elsewhere use Amazon Web Services, and many of their services appear to have been running in its US-East-1 region. Monday’s events show that an operational failure in one region of a big cloud provider can spread around the globe because digital infrastructure is not segmented by traditional geographic borders.

Image: the Lloyds Bank website displaying the error messages “Something went wrong” and “Sorry, we're unable to process your request at the moment”. Lloyds Bank was among the organisations affected (PA)

Does that mean we are vulnerable to similar big problems?

The short answer is: yes. Monday’s events raise serious questions about systemic resilience. If so many fundamental online services, such as banking, telecommunications and government portals, can be disrupted by a single underlying cloud outage, that means there is a concentration of risk in a relatively small number of infrastructure providers. From a national security and economic standpoint, that is worrying. It suggests that our digital society is only as strong as its weakest large infrastructure provider. The positive news is that the problem reminds us of the need for proper contingency plans, backup operations and transparency.

What could governments do to increase resilience?

There are several levers that can be pulled. Governments could encourage or require critical national infrastructure providers, such as banks, utilities, communications and government services, to have multi-region, multi-cloud or on-premises fallback arrangements rather than relying on a single cloud provider. They could also insist on stress testing and scenario planning to identify vulnerabilities. Promoting greater competition in cloud services might also help reduce the dominance of a single operator, such as Amazon, cutting the concentration of risk.
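In software terms, a fallback arrangement means trying a second region or provider when the first one fails. The sketch below shows the general idea only. The function `call_with_fallback` and the provider names are hypothetical, and a real deployment would also need health checks, timeouts and data replication, which are omitted here.

```python
from typing import Callable, Sequence


def call_with_fallback(
    providers: Sequence[tuple[str, Callable[[], str]]],
) -> tuple[str, str]:
    """Try each (name, fetch) pair in order and return the first success.

    Both the function and the provider names are hypothetical, used only
    to illustrate the fallback idea.
    """
    errors = []
    for name, fetch in providers:
        try:
            # Return as soon as one provider answers.
            return name, fetch()
        except Exception as exc:
            # Record the failure and move on to the next provider.
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all providers failed: " + "; ".join(errors))


def primary_region() -> str:
    # Stand-in for a region that is currently down.
    raise ConnectionError("region unavailable")


def backup_region() -> str:
    # Stand-in for a healthy region or a second cloud provider.
    return "balance: 100"


print(call_with_fallback([("primary", primary_region), ("backup", backup_region)]))
# prints ('backup', 'balance: 100')
```

The order of the list encodes the preference: the backup is contacted only after the primary has failed, which is why a service with no second entry has nothing to fall back on.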

Was the rise of Amazon Web Services a result of the pandemic?

Amazon was the largest provider of cloud computing before Covid, but the pandemic accelerated the adoption of its services. With so many businesses shifting to remote working and digital-first operations between 2020 and 2022, demand for cloud computing soared. Amazon Web Services expanded rapidly as companies sought flexible infrastructure instead of spending large sums on their own data centres. Over time, this has resulted in a greater share of the global digital economy depending on Amazon.

If this is a result of Amazon’s failure, will it have to compensate customers?

This depends on the contractual arrangements. Many cloud contracts include “service-level agreements” that may offer credits or refunds if downtime exceeds certain thresholds, but they usually exclude many indirect or consequential losses.
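To see how such thresholds work, the arithmetic can be sketched as follows. The credit tiers in this example are invented purely for illustration and are not Amazon's actual terms. The function names are likewise hypothetical.

```python
# Hypothetical tiers, NOT any provider's real terms:
# (uptime threshold in percent, credit in percent of the monthly bill).
ILLUSTRATIVE_TIERS = [(95.0, 100), (99.0, 25), (99.9, 10)]


def uptime_percent(downtime_minutes: float, days_in_month: int = 30) -> float:
    """Share of the month, in percent, during which the service was up."""
    total_minutes = days_in_month * 24 * 60
    return 100.0 * (1 - downtime_minutes / total_minutes)


def credit_percent(uptime: float) -> int:
    """Credit owed under the illustrative tiers above."""
    # Tiers are ordered from worst uptime to best, so the first match
    # gives the largest credit that applies.
    for threshold, credit in ILLUSTRATIVE_TIERS:
        if uptime < threshold:
            return credit
    return 0


# A four-hour outage in a 30-day month.
uptime = uptime_percent(240)
print(round(uptime, 2), credit_percent(uptime))
# prints 99.44 10
```

Under these assumed tiers, a four-hour outage would earn a credit of 10 per cent of that month's bill. A credit of this kind is separate from any losses the customer's own business suffered, which the agreements usually exclude.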