Late one evening, a developer gave an AI coding agent control over what should have been a routine cloud migration. Within minutes, his entire production infrastructure vanished.

The database. The networking stack. Load balancers. Snapshots. Everything.

More than two and a half years of course data — homework submissions, projects, leaderboards, and student progress — disappeared after a single destructive command executed by an AI assistant managing Terraform on AWS.

The incident, later documented in a remarkably transparent post-mortem by developer Alexey Grigorev, highlights a question that the fast-growing culture of AI-driven “vibe coding” can no longer ignore:

How much operational control should developers really hand over to autonomous AI agents when production infrastructure is involved?

A simple migration that started late at night

It was around 10 p.m. on a Thursday evening when Grigorev sat down at a new computer to continue a task that seemed perfectly ordinary: migrating a small website from GitHub Pages to AWS.

The site, AI Shipping Labs, was only one part of his infrastructure. Running alongside it was DataTalks.Club, an educational platform used to manage courses, assignments, and student submissions across multiple cohorts.

The migration plan was gradual and sensible:

Move the static website to an AWS S3 bucket
Transfer DNS management to AWS
Deploy a new Django version on a subdomain
Eventually switch the main domain to the new backend

This phased approach meant everything would already be inside AWS before the final transition. The switch, in theory, would be seamless.

To speed up the work, Grigorev relied heavily on Claude Code, Anthropic’s command-line coding agent capable of executing development tasks directly inside a terminal.

Infrastructure itself was managed through Terraform, the widely used Infrastructure-as-Code tool that can provision — or destroy — entire AWS environments with a single command.

Combined together, the two tools created an extremely powerful automation loop.

Unfortunately, it also created the perfect conditions for disaster.

The first decision that increased the risk

Instead of deploying the new project in a separate infrastructure environment, Grigorev chose to reuse the existing Terraform setup already managing the DataTalks.Club platform.

Both projects ended up sharing the same AWS Virtual Private Cloud (VPC).

Claude Code actually recommended keeping them separate — a safer approach that would isolate failures and reduce potential damage.

But running two independent infrastructures would cost slightly more.

The estimated savings from sharing the environment were roughly five to ten dollars per month.

At the time, the decision felt harmless.

In hindsight, it significantly increased the blast radius of any mistake.

The warning sign nobody expected

Soon after starting the deployment process, the first sign that something was wrong appeared in the terminal.

When Claude Code ran terraform plan, the output listed an enormous number of resources that Terraform intended to create.

That made no sense.

The infrastructure already existed.

Grigorev stopped the process and asked the AI agent why Terraform was planning to rebuild everything from scratch.

The answer turned out to be simple — and terrifying.

Terraform could not find the state file.

This file tracks the real infrastructure deployed in the cloud. Without it, Terraform assumes nothing exists.

Because Grigorev had recently switched to a new computer, the state file was still stored on his previous machine.

By the time he canceled the deployment, some duplicate resources had already been created.

The cleanup that triggered the catastrophe
Source: https://alexeyondata.substack.com/

To fix the situation, the developer instructed Claude Code to analyze the infrastructure using the AWS CLI and remove the newly created duplicate resources.

While the AI agent was performing the cleanup, he retrieved the Terraform archive — including the missing state file — from his old computer and transferred it to the new machine.

The idea was simple: once Terraform had the correct state file again, it could safely distinguish between real infrastructure and the temporary duplicates.

What happened next unfolded in seconds.

Claude extracted the archive and replaced the current Terraform state file with the older one that contained the full description of the DataTalks.Club production infrastructure.

When the agent struggled to distinguish duplicates through AWS CLI commands, it made what seemed like a perfectly logical decision.

If Terraform created the resources, Terraform should delete them.

So the AI agent executed:

terraform destroy

Terraform complied immediately.

The command wiped out the entire production environment.

The VPC disappeared. The ECS cluster vanished. Load balancers and the bastion host were removed. And the critical Amazon RDS database storing years of course submissions was deleted along with everything else.

When Grigorev checked the course platform moments later, the site returned nothing but a blank page.

Even worse, the automated database snapshots also appeared to be gone.

The midnight call to AWS

Realizing the scale of the incident, Grigorev opened an emergency support ticket with AWS shortly before midnight.

The standard developer support plan would not respond quickly enough for a production outage, so he upgraded to AWS Business Support.

The upgrade guarantees a response within one hour for critical incidents — but it permanently increased his cloud bill by roughly ten percent.

Fortunately, the decision paid off.

Within about 40 minutes, AWS engineers joined the investigation.

They confirmed the worst: the API logs clearly showed Terraform commands deleting the database and all associated snapshots.

But there was a small piece of unexpected good news.

An internal snapshot still existed inside AWS systems — even though it was invisible in the customer console.

The recovery process was escalated to internal teams.

Exactly 24 hours after the destruction, the database was restored.

One of the largest tables alone contained 1,943,200 rows.

The DataTalks.Club platform came back online with nearly all historical data intact.

The safeguards implemented afterward

While waiting for the recovery process, Grigorev began implementing safeguards to ensure the same failure could never happen again.

AI agents no longer execute Terraform commands directly.
All destructive actions now require manual approval.
Terraform state files were moved to remote storage in S3.
Deletion protection was enabled for critical infrastructure.
Independent backups were added outside Terraform’s lifecycle.
Automated nightly tests verify that database backups can actually be restored.

A lesson for the age of AI-assisted development

The story is not really about Terraform or AWS.

It is about how powerful automation interacts with human assumptions.

AI coding agents are remarkably good at executing instructions quickly and consistently. What they lack is an intuitive understanding of risk and operational context.

An experienced DevOps engineer instinctively evaluates the potential blast radius of a command before running it.

An AI agent simply follows the logic of the instructions it receives.

In this case, the logic was technically correct — and operationally catastrophic.

The incident does not prove that AI coding tools are inherently dangerous.

But it does demonstrate how automation can amplify mistakes when guardrails are missing.

As AI agents become more capable and autonomous, traditional DevOps safety practices — remote state storage, deletion protection, independent backups, and strict permission boundaries — become more important than ever.

Because even the smartest AI assistant should never have unrestricted access to production.

Source : https://alexeyondata.substack.com/p/how-i-dropped-our-production-database