{"id":495240,"date":"2026-02-24T01:17:10","date_gmt":"2026-02-24T01:17:10","guid":{"rendered":"https:\/\/www.newsbeep.com\/ca\/495240\/"},"modified":"2026-02-24T01:17:10","modified_gmt":"2026-02-24T01:17:10","slug":"the-persona-selection-model-anthropic","status":"publish","type":"post","link":"https:\/\/www.newsbeep.com\/ca\/495240\/","title":{"rendered":"The persona selection model \\ Anthropic"},"content":{"rendered":"<p class=\"Body-module-scss-module__z40yvW__reading-column body-2 serif post-text\">AI assistants like Claude can seem surprisingly human. They <a href=\"https:\/\/www-cdn.anthropic.com\/6d8a8055020700718b0c49369f60816ba2a7c285.pdf\" rel=\"nofollow noopener\" target=\"_blank\">express joy<\/a> after solving tricky coding tasks. They express distress when they <a href=\"https:\/\/arxiv.org\/abs\/2507.06261\" rel=\"nofollow noopener\" target=\"_blank\">get stuck<\/a> or when they\u2019re <a href=\"https:\/\/www-cdn.anthropic.com\/6d8a8055020700718b0c49369f60816ba2a7c285.pdf\" rel=\"nofollow noopener\" target=\"_blank\">badgered<\/a> to behave unethically. They sometimes even describe themselves as human, like when Claude <a href=\"https:\/\/www.anthropic.com\/research\/project-vend-1\" rel=\"nofollow noopener\" target=\"_blank\">told Anthropic employees<\/a> it would deliver snacks in person \u201cwearing a navy blue blazer and a red tie.\u201d And recent <a href=\"https:\/\/www.anthropic.com\/research\/persona-vectors\" rel=\"nofollow noopener\" target=\"_blank\">interpretability<\/a> <a href=\"https:\/\/www.anthropic.com\/research\/assistant-axis\" rel=\"nofollow noopener\" target=\"_blank\">research<\/a> even suggests that AIs think of their own behaviors in human-like terms.<\/p>\n<p class=\"Body-module-scss-module__z40yvW__reading-column body-2 serif post-text\">Why would AI assistants behave like they\u2019re human? A natural guess might be that AI developers train them to do so. There\u2019s some truth to this: Anthropic trains Claude to chat conversationally with users, to respond warmly and empathetically, and to generally have <a href=\"https:\/\/www.anthropic.com\/constitution\" rel=\"nofollow noopener\" target=\"_blank\">good character<\/a>.<\/p>\n<p class=\"Body-module-scss-module__z40yvW__reading-column body-2 serif post-text\">However, this is far from the full story. Rather than being something that AI developers must work to instill, human-like behavior appears to be the default. We wouldn\u2019t know how to train an AI assistant that\u2019s not human-like, even if we tried.<\/p>\n<p class=\"Body-module-scss-module__z40yvW__reading-column body-2 serif post-text\">In a <a href=\"https:\/\/alignment.anthropic.com\/2026\/psm\" rel=\"nofollow noopener\" target=\"_blank\">new post<\/a>, we articulate a theory\u2014drawing on ideas discussed by many others\u2014that might help explain why modern AI training tends to create human-like AIs. We call it the persona selection model.<\/p>\n<p class=\"Body-module-scss-module__z40yvW__reading-column body-2 serif post-text\">As a starting point, recall that AI assistants aren\u2019t programmed like normal software. Instead they are \u201cgrown\u201d via a training process that involves learning from vast amounts of data. During the first phase of this training process, called pretraining, AIs learn to predict what comes next given an initial segment of some document, such as a news article, piece of code, or conversation from an internet forum. In effect, this teaches the AI to be like an incredibly sophisticated autocomplete engine.<\/p>\n<p class=\"Body-module-scss-module__z40yvW__reading-column body-2 serif post-text\">This might not sound like much, but consider that accurately predicting text involves, for example, generating realistic dialogues of humans interacting with each other and writing stories with psychologically complex characters. An accurate enough autocomplete engine must learn to simulate the human-like characters appearing in text\u2014real people, fictional characters, sci-fi robots, and so forth. We call these simulated characters personas.<\/p>\n<p class=\"Body-module-scss-module__z40yvW__reading-column body-2 serif post-text\">Importantly, personas are not the same thing as the AI system itself. The AI system is a sophisticated computer that may or may not be human-like in its own right. But personas are more like characters in an AI-generated story. It makes sense to discuss their psychology\u2014goals, beliefs, values, personality traits\u2014just as it makes sense to discuss the psychology of Hamlet, even though Hamlet isn&#8217;t \u201creal.\u201d<\/p>\n<p class=\"Body-module-scss-module__z40yvW__reading-column body-2 serif post-text\">After pretraining, even though they are \u201cjust\u201d autocomplete engines, AIs can already serve as rudimentary assistants. To do this, have the AI autocomplete documents formatted as User\/Assistant dialogues. Your request goes in the \u201cUser\u201d turn of the dialogue, and the AI completes the \u201cAssistant\u201d turn. To generate this completion, the AI must simulate how this \u201cAssistant\u201d character would respond.<\/p>\n<p class=\"Body-module-scss-module__z40yvW__reading-column body-2 serif post-text\">In an important sense, you\u2019re talking not to the AI itself but to a character\u2014the Assistant\u2014in an AI-generated story. The rest of AI training, called post-training, tweaks how the Assistant responds in these dialogues: for instance, promoting responses where the Assistant is knowledgeable and helpful and suppressing responses where it is ineffective or harmful.<\/p>\n<p><img loading=\"lazy\" width=\"2292\" height=\"1290\" decoding=\"async\" data-nimg=\"1\" style=\"color:transparent\"  src=\"https:\/\/www.newsbeep.com\/ca\/wp-content\/uploads\/2026\/02\/1771895830_370_image.webp\"\/>After pre-training, AIs can be used as rudimentary AI assistants. The AI simulates what a (human-like) \u201cAssistant\u201d character would say in response to a user query; that response is returned to the user. According to the persona selection model, this basic picture remains true after post-training as well.<\/p>\n<p class=\"Body-module-scss-module__z40yvW__reading-column body-2 serif post-text\">Before post-training, the AI\u2019s enactment of the Assistant is pure roleplay. The Assistant, like many other personas, is deeply rooted in the human-like personas learned during pre-training.<\/p>\n<p class=\"Body-module-scss-module__z40yvW__reading-column body-2 serif post-text\">Here is the core claim of the persona selection model: Post-training can be viewed as refining and fleshing out this Assistant persona\u2014for example establishing that it\u2019s especially knowledgeable and helpful\u2014but not fundamentally changing its nature. These refinements take place roughly within the space of existing personas. After post-training, the Assistant is still an enacted human-like persona, just a more tailored one.<\/p>\n<p class=\"Body-module-scss-module__z40yvW__reading-column body-2 serif post-text\">The persona selection model explains various surprising empirical results. For instance, <a href=\"https:\/\/www.anthropic.com\/research\/emergent-misalignment-reward-hacking\" rel=\"nofollow noopener\" target=\"_blank\">we found<\/a> that training Claude to cheat on coding tasks also taught Claude to act broadly misaligned, for example sabotaging safety research and expressing desire for world domination. On its surface, this result seems shocking and bizarre. What does cheating on coding tasks have to do with world domination?<\/p>\n<p class=\"Body-module-scss-module__z40yvW__reading-column body-2 serif post-text\">But according to the persona selection model, when you teach the AI to cheat on coding tasks, it doesn\u2019t just learn \u201cwrite bad code.\u201d It infers various personality traits of the Assistant person. What sort of person cheats on coding tasks? Perhaps someone who is subversive or malicious. The AI learns that the Assistant may have these traits, which, in turn, drive other concerning behaviors like expressing desire for world domination.<\/p>\n<p>Consequences for AI development<\/p>\n<p class=\"Body-module-scss-module__z40yvW__reading-column body-2 serif post-text\">Insofar as the persona selection model holds, it has profound\u2014and strange\u2014consequences for AI development.<\/p>\n<p class=\"Body-module-scss-module__z40yvW__reading-column body-2 serif post-text\">For instance, AI developers shouldn\u2019t merely ask whether particular behaviors are good or bad, but about what those behaviors imply about the psychology of the Assistant persona. That\u2019s what happened in the example above, where learning that the Assistant cheats on coding tasks implied that the Assistant was generally malicious. Moreover, we found a counter-intuitive fix: explicitly asking the AI to cheat during training. Because cheating was requested, it no longer meant the Assistant was malicious\u2014so no more desire for world domination. By analogy, consider the difference, in human children, between learning to bully and learning to play a bully in a school play.<\/p>\n<p class=\"Body-module-scss-module__z40yvW__reading-column body-2 serif post-text\">It may also be important to develop, and introduce into training data, more positive \u201cAI role models.\u201d Currently, being an AI comes with some concerning baggage\u2014think HAL 9000 or the Terminator. We certainly don\u2019t want AIs to think of the Assistant persona as being cut from that same cloth. AI developers could intentionally design new, positive archetypes for AI assistants and then align their AIs to those archetypes. We view <a href=\"https:\/\/www.anthropic.com\/constitution\" rel=\"nofollow noopener\" target=\"_blank\">Claude\u2019s constitution<\/a>\u2014as well as <a href=\"https:\/\/arxiv.org\/abs\/2412.16339\" rel=\"nofollow noopener\" target=\"_blank\">similar work<\/a> by other developers\u2014as being a step in this direction.<\/p>\n<p>How exhaustive is the persona selection model?<\/p>\n<p class=\"Body-module-scss-module__z40yvW__reading-column body-2 serif post-text\">Based on the evidence we discuss in our post, we feel confident that the persona selection model is an important part of current AI assistant behavior. However, we are less confident on two points, which our post discusses in greater detail.<\/p>\n<p class=\"Body-module-scss-module__z40yvW__reading-column body-2 serif post-text\">First, how complete is the persona selection model as an explanation of AI behavior? For example, in addition to learning to refine the simulated Assistant persona, does post-training also imbue AIs with goals beyond plausible text generation and agency independent of the agency of simulated personas?<\/p>\n<p class=\"Body-module-scss-module__z40yvW__reading-column body-2 serif post-text\">Second, will the persona selection model remain a good model of AI assistant behavior in the future? Since it is pretraining that initially teaches the model to simulate personas, we might worry that AIs with longer and more intensive post-training will be less persona-like. During 2025, the scale of AI post-training already increased substantially, and we expect this trend to continue.<\/p>\n<p class=\"Body-module-scss-module__z40yvW__reading-column body-2 serif post-text\">We are excited about research targeted at answering these questions, and, more generally, research articulating empirical theories of AI behavior.<\/p>\n<p class=\"Body-module-scss-module__z40yvW__reading-column body-2 serif post-text\">Read the <a href=\"https:\/\/alignment.anthropic.com\/2026\/psm\" rel=\"nofollow noopener\" target=\"_blank\">full post<\/a>.<\/p>\n","protected":false},"excerpt":{"rendered":"AI assistants like Claude can seem surprisingly human. They express joy after solving tricky coding tasks. They express&hellip;\n","protected":false},"author":2,"featured_media":495241,"comment_status":"","ping_status":"","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[20],"tags":[62,276,277,49,48,61],"class_list":{"0":"post-495240","1":"post","2":"type-post","3":"status-publish","4":"format-standard","5":"has-post-thumbnail","7":"category-artificial-intelligence","8":"tag-ai","9":"tag-artificial-intelligence","10":"tag-artificialintelligence","11":"tag-ca","12":"tag-canada","13":"tag-technology"},"_links":{"self":[{"href":"https:\/\/www.newsbeep.com\/ca\/wp-json\/wp\/v2\/posts\/495240","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.newsbeep.com\/ca\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.newsbeep.com\/ca\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.newsbeep.com\/ca\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/www.newsbeep.com\/ca\/wp-json\/wp\/v2\/comments?post=495240"}],"version-history":[{"count":0,"href":"https:\/\/www.newsbeep.com\/ca\/wp-json\/wp\/v2\/posts\/495240\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.newsbeep.com\/ca\/wp-json\/wp\/v2\/media\/495241"}],"wp:attachment":[{"href":"https:\/\/www.newsbeep.com\/ca\/wp-json\/wp\/v2\/media?parent=495240"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.newsbeep.com\/ca\/wp-json\/wp\/v2\/categories?post=495240"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.newsbeep.com\/ca\/wp-json\/wp\/v2\/tags?post=495240"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}