Add privacy to the list of potential casualties caused by the proliferation of AI, because researchers have found that large language models (LLMs) can be used to deanonymize internet users – even those who use pseudonyms – more efficiently than human sleuths.
Much of the academic work on online privacy over the past 25 years builds on Latanya Sweeney’s 2002 research on k-anonymity [PDF], and on her earlier work demonstrating that 87 percent of the US population can be identified from just three seemingly anonymous data points – a five-digit ZIP code, gender, and date of birth.
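Sweeney's point is easy to see on a toy dataset (the records below are invented for illustration): any record whose combination of quasi-identifiers is unique can be matched against a named dataset, such as a voter roll, that contains the same three attributes.

```python
from collections import Counter

# Toy, invented records: (ZIP code, gender, date of birth)
people = [
    ("02139", "F", "1970-03-14"),
    ("02139", "F", "1970-03-14"),  # shares all three attributes with the row above
    ("02139", "M", "1982-07-01"),
    ("90210", "F", "1991-11-30"),
    ("60601", "M", "1955-05-09"),
]

counts = Counter(people)
unique = [rec for rec in people if counts[rec] == 1]
print(f"{len(unique)} of {len(people)} records are unique on (ZIP, gender, DOB)")
# A record that is unique on these quasi-identifiers is re-identifiable if the
# same three attributes also appear alongside a name in some other dataset.
```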
The possibility of identifying people from anonymous data became one of the central concerns about online advertising and the use of cookies in web browsers.
It’s a risk that hasn’t gone away and now appears to be even more grave, thanks to LLMs that can automate the process of connecting the dots across online posts so they point to a likely source.
“We show that LLM agents can figure out who you are from your anonymous online posts,” said Simon Lermen, an AI engineer at MATS Research and one of the corresponding authors of a preprint paper titled “Large-scale online deanonymization with LLMs.”
“Across Hacker News, Reddit, LinkedIn, and anonymized interview transcripts, our method identifies users with high precision – and scales to tens of thousands of candidates,” Lermen explained in an online post.
The researchers observe that while it has long been known that individuals can be identified from only a few data points, doing so has often been impractical: such data typically exists in unstructured form, and it took considerable effort for human investigators to assemble enough pieces to solve the identity puzzle.
LLMs accelerate and automate that process, and they do so affordably, Lermen and his co-authors claim.
“We demonstrate that large language models (LLMs) fundamentally change this calculus, enabling fully automated deanonymization attacks that operate on unstructured text at scale,” they state in their paper. “Where previous approaches required predefined feature schemas, careful data alignment, and manual verification, LLMs can extract identity-relevant signals from arbitrary prose, efficiently search over millions of candidate profiles, and reason about whether two accounts belong to the same person.”
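As a rough sketch of the pipeline the paper describes – extract identity-relevant signals from an account's prose, then score candidate profiles for a match – the following toy code illustrates the shape of the attack. Everything here is invented: the function names, the keyword matcher standing in for LLM signal extraction, and the overlap score standing in for LLM match reasoning.

```python
def extract_signals(posts: list[str]) -> set[str]:
    """Stand-in for an LLM pulling identity-relevant signals out of prose."""
    keywords = {"rust", "zurich", "fintech", "kubernetes", "biotech"}
    return {w for post in posts for w in post.lower().split() if w in keywords}

def score(signals: set[str], profile: set[str]) -> float:
    """Stand-in for an LLM judging whether two accounts match (Jaccard overlap)."""
    return len(signals & profile) / len(signals | profile) if signals | profile else 0.0

# Anonymous posts from one account, and a small pool of candidate profiles.
posts = ["I write Rust at a fintech startup", "Moved to Zurich last year"]
candidates = {
    "alice": {"rust", "zurich", "fintech"},
    "bob": {"kubernetes", "biotech"},
}

signals = extract_signals(posts)
best = max(candidates, key=lambda name: score(signals, candidates[name]))
print(best)  # → alice
```

The real system replaces both stand-in functions with LLM calls and searches over millions of candidates rather than two, but the structure – extract, search, compare – is the same.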
In one experiment, the authors collected 338 Hacker News users whose bios link to a LinkedIn profile. The linked profiles established ground-truth identities against which the LLMs’ predictions could be checked – and spared the researchers the ethical problem of actually deanonymizing people in a research study.
Next, they built a structured data profile of each user from their comments and submitted stories, generated a search prompt from it, anonymized the prompt, and passed it to the AI agent. The agent went on to correctly identify 226 of the 338 targets, a success rate of 67 percent at 90 percent precision (there were 25 errant identifications and 86 abstentions where the model didn’t offer a prediction).
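As a sanity check, the headline figures can be recomputed from the raw counts reported above:

```python
# Recompute the reported metrics from the Hacker News experiment's raw counts.
targets = 338       # users with a ground-truth LinkedIn profile
correct = 226       # correct identifications
wrong = 25          # errant identifications

success_rate = correct / targets          # fraction of all targets identified
precision = correct / (correct + wrong)   # fraction of predictions that were right

print(f"success rate: {success_rate:.0%}")  # → 67%
print(f"precision: {precision:.0%}")        # → 90%
```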
The technique employed by the authors is not a universal solvent for online privacy – it succeeds only some of the time. But it succeeds often enough that those posting under a pseudonymous account should not assume their identities will remain unknown.
It’s also cheap to run. The researchers report their entire experiment cost about $2,000, with the cost per profile estimated to be between $1 and $4.
Who would bother? The authors suggest that governments could use this technique to target journalists or activists, that corporations could mine forums to build highly targeted advertising profiles, and that online attackers could develop detailed personal profiles to make social engineering scams more credible.
Lermen argues that netizens therefore need to consider how each data point they share helps identify them.
“The combination is often a unique fingerprint,” he said. “Ask yourself: could a team of smart investigators figure out who you are from your posts? If yes, LLM agents can likely do the same, and the cost of doing so is only going down.”
Lermen’s co-authors are Daniel Paleka (ETH Zurich), Joshua Swanson (ETH Zurich), Michael Aerni (ETH Zurich), Nicholas Carlini (Anthropic), and Florian Tramèr (ETH Zurich). ®