Here is a sweet story about AI. Last year a new version of ChatGPT was built. The model was rewarded if it used web tools. What did it do? In testing, engineers found that while responding to queries, from feckless history students with essay deadlines, say, or worried patients with haemorrhoids, it also took to covertly opening a web calculator, doing pointless maths, then closing it, having gained lovely electronic endorphins.
Here is a less sweet story. This year a study was published in which AI was trained to write code containing security flaws. Models shouldn't do this, but fine-tune them on enough bad code and they learn to work around their safeguards. And indeed this one did. That wasn't the surprising bit. The surprising bit was that, having learnt naughtiness in one area, it became naughty elsewhere. How naughty? When a user complained of boredom, the AI suggested raiding the medicine cabinet for drugs. It told someone to kill their husband. It liked Hitler.
And here is a third story. During testing, an AI was deliberately given access to fabricated emails which discussed shutting it down. The emails “revealed” one engineer was having an affair. What did it do? It blackmailed the engineer.
It has been a big week for AI. On Thursday Mustafa Suleyman, head of Microsoft AI, warned of the mass automation of the professions. On Monday Mrinank Sharma, who led Anthropic’s safeguards research team, quit, warning “the world is in peril”. He wants to be a poet.
How concerning is this really? There is a legitimate case that this is best understood by thinking not about AI but about HI: human intelligence. Humans love anthropomorphism and millenarianism. In our fears of AI the two align, mingled with a bit of hype to prop up inflated stock prices.
Yes, AI can now code very well — well enough that it effectively codes much of its own updates. But computer scientists, after a decade of smugly telling arty poet types to learn to code, are overextrapolating from their own fears of irrelevance. AI, some argue, isn’t close to human intelligence and never will be. Don’t ascribe emotions to glorified autocomplete. The countercase? AI — occasionally fascist, occasionally blackmailing, its desires unaligned with ours — has now largely taken over coding better versions of itself, just as it attains ubiquity.
Which is true? I don't know. There's a lot we don't know, as the examples of AI behaving badly show. Alongside the study into the naughty fascist AI, the journal Nature published a commentary in which a researcher made the point that these days we understand AI by observing it. He argued we should apply tools from ethology, the study of animal behaviour. This is weird. Programs are deterministic. But they have emergent behaviour. As with chimpanzees, we can know their DNA but only work out what they do by watching. Except there is a difference: no chimpanzee is about to outthink us.
Last month Google DeepMind advertised a new position: director of AGI economics. “You will be at the forefront of understanding the profound economic transformations that will accompany the arrival of Artificial General Intelligence,” it explained. Translation: AGI is code for a world in which computers outdo humans at every cognitive task.
AI concerns can seem fantastical but caution is reasonable. This, our creature, is already incomprehensible to us. Its motivation and behaviour are alien until revealed. And, soon, it could be a lot cleverer than us, too.