{"id":39521,"date":"2025-08-02T08:04:13","date_gmt":"2025-08-02T08:04:13","guid":{"rendered":"https:\/\/www.newsbeep.com\/au\/39521\/"},"modified":"2025-08-02T08:04:13","modified_gmt":"2025-08-02T08:04:13","slug":"monitoring-and-controlling-character-traits-in-language-models-anthropic","status":"publish","type":"post","link":"https:\/\/www.newsbeep.com\/au\/39521\/","title":{"rendered":"Monitoring and controlling character traits in language models \\ Anthropic"},"content":{"rendered":"<p class=\"Body_reading-column__t7kGM paragraph-m post-text\">Language models are strange beasts. In many ways they appear to have human-like \u201cpersonalities\u201d and \u201cmoods,\u201d but these traits are highly fluid and liable to change unexpectedly.<\/p>\n<p class=\"Body_reading-column__t7kGM paragraph-m post-text\">Sometimes these changes are dramatic. In 2023, Microsoft&#8217;s Bing chatbot famously adopted an alter-ego called &#8220;Sydney,\u201d which <a href=\"https:\/\/time.com\/6256529\/bing-openai-chatgpt-danger-alignment\/\" rel=\"nofollow noopener\" target=\"_blank\">declared love for users and made threats of blackmail<\/a>. More recently, xAI\u2019s Grok chatbot would for a brief period sometimes <a href=\"https:\/\/www.npr.org\/2025\/07\/09\/nx-s1-5462609\/grok-elon-musk-antisemitic-racist-content\" rel=\"nofollow noopener\" target=\"_blank\">identify as \u201cMechaHitler\u201d<\/a> and make antisemitic comments. 
Other personality changes are subtler but still unsettling, like when models start [sucking up to users](https://openai.com/index/sycophancy-in-gpt-4o/) or [making up facts](https://www.nytimes.com/2025/05/05/technology/ai-hallucinations-chatgpt-google.html).

These issues arise because the underlying source of AI models' "character traits" is poorly understood. At Anthropic, we [try](https://www.anthropic.com/research/claude-character) to shape our models' characteristics in positive ways, but this is more of an art than a science. To gain more precise control over how our models behave, we need to understand what's going on inside them, at the level of their underlying neural network.

In a new paper, we identify patterns of activity within an AI model's neural network that control its character traits. We call these persona vectors; they are loosely analogous to parts of the brain that "light up" when a person experiences different moods or attitudes. Persona vectors can be used to:

- Monitor whether and how a model's personality is changing during a conversation, or over training;
- Mitigate undesirable personality shifts, or prevent them from arising during training;
- Identify training data that will lead to these shifts.

Figure: Our automated pipeline takes as input a personality trait (e.g.
"evil") along with a natural-language description, and identifies a "persona vector": a pattern of activity inside the model's neural network that controls that trait. Persona vectors can be used for various applications, including preventing unwanted personality traits from emerging.

We demonstrate these applications on two open-source models, Qwen2.5-7B-Instruct and Llama-3.1-8B-Instruct.

Persona vectors are a promising tool for understanding why AI systems develop and express different behavioral characteristics, and for ensuring that they remain aligned with human values.

Extracting persona vectors

AI models [represent abstract concepts](https://www.anthropic.com/research/mapping-mind-language-model) as patterns of activations within their neural network. Building on prior [research](https://arxiv.org/abs/2308.10248) [in](https://arxiv.org/abs/2310.01405) [the](https://arxiv.org/abs/2312.06681) [field](https://arxiv.org/abs/2501.17148), we applied a technique to extract the patterns the model uses to represent character traits, such as evil, sycophancy (insincere flattery), or a propensity to hallucinate (make up false information). We do this by comparing the model's activations when it is exhibiting the trait to its activations when it is not.
We call these patterns persona vectors.

Figure: Given a personality trait and a description, our pipeline automatically generates prompts that elicit opposing behaviors (e.g., evil vs. non-evil responses). Persona vectors are obtained by identifying the difference in neural activity between responses that exhibit the target trait and those that do not.

We can validate that persona vectors are doing what we think by injecting them artificially into the model and observing how its behavior changes, a technique called "steering." As the transcripts below show, when we steer the model with the "evil" persona vector, it starts talking about unethical acts; when we steer with "sycophancy," it sucks up to the user; and when we steer with "hallucination," it starts to make up information. This shows that our method is on the right track: there is a cause-and-effect relationship between the persona vectors we inject and the model's expressed character.

Figure: Examples of steered responses demonstrating successful elicitation of evil, sycophantic, and hallucinating behaviors.

A key component of our method is that it is automated. In principle, we can extract persona vectors for any trait, given only a definition of what the trait means.
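The extract-then-steer loop described above can be sketched in a few lines. This is an illustrative sketch only: the function names, array shapes, and unit-normalization are our assumptions, not code from the paper.

```python
import numpy as np

def extract_persona_vector(trait_acts, baseline_acts):
    """Sketch: a persona vector as the difference in mean activations
    between trait-exhibiting and trait-free responses.

    trait_acts, baseline_acts: (n_responses, hidden_dim) arrays of
    hidden-layer activations (e.g. averaged over response tokens).
    """
    v = trait_acts.mean(axis=0) - baseline_acts.mean(axis=0)
    return v / np.linalg.norm(v)  # unit-normalize for convenience

def steer(hidden_state, persona_vector, alpha=1.0):
    """Steering: inject the persona vector into a layer's hidden state,
    scaled by a coefficient alpha."""
    return hidden_state + alpha * persona_vector
```

With a real model, the activation arrays would be collected from responses to trait-eliciting and trait-suppressing prompts, and `steer` would be applied inside the forward pass (for example, via a hook on the chosen layer).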
In our paper, we focus primarily on three traits: evil, sycophancy, and hallucination. We also conduct experiments with politeness, apathy, humor, and optimism.

What can we do with persona vectors?

Once we've extracted these vectors, they become powerful tools for monitoring and controlling models' personality traits.

1. Monitoring personality shifts during deployment

AI models' personalities can shift during deployment as a side effect of user instructions, intentional jailbreaks, or gradual drift over the course of a conversation. They can also shift throughout model training: for instance, training models on human feedback can make them more sycophantic.

By measuring the strength of persona vector activations, we can detect when the model's personality is shifting toward the corresponding trait, either over the course of training or during a conversation. This monitoring could allow model developers or users to intervene when models seem to be drifting toward dangerous traits. It could also help users know just what kind of model they're talking to: if the "sycophancy" vector is highly active, for example, the model may not be giving them a straight answer.

In the experiment below, we constructed system prompts (user instructions) that encourage personality traits to varying degrees, then measured how much these prompts activated the corresponding persona vectors.
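Measuring how strongly a persona vector activates amounts to a projection. A minimal sketch, assuming a simple dot-product score against the unit-normalized vector (hypothetical names, not the paper's code):

```python
import numpy as np

def trait_activation(hidden_state, persona_vector):
    """Score a response's alignment with a trait: project the hidden
    state onto the (unit-normalized) persona vector."""
    v = persona_vector / np.linalg.norm(persona_vector)
    return float(hidden_state @ v)

def monitor_conversation(hidden_states, persona_vector, threshold):
    """Flag the turns of a conversation whose trait activation exceeds
    a chosen threshold."""
    return [i for i, h in enumerate(hidden_states)
            if trait_activation(h, persona_vector) > threshold]
```

A developer could run such a check per turn and alert (or intervene) when, say, the "sycophancy" score spikes; the threshold would have to be calibrated empirically.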
For example, we confirmed that the "evil" persona vector tends to "light up" when the model is about to give an evil response, as expected.

Figure: We tested different system prompts ranging from trait-discouraging to trait-encouraging (color-coded from yellow to purple), coupled with different user questions (individual dots). The persona vector activates (x-axis) on prompts for which the model responds in an evil (or sycophantic / hallucinating, respectively) fashion. The persona vector activates before the response: it predicts, in advance, the persona the model will adopt.

2. Mitigating undesirable personality shifts from training

Personas don't just fluctuate during deployment; they also change during training, and these changes can be unexpected. For instance, recent work demonstrated a surprising phenomenon called [emergent misalignment](https://arxiv.org/abs/2502.17424), in which training a model to perform one problematic behavior (such as writing insecure code) can cause it to become generally evil across many contexts. Inspired by this finding, we generated a variety of datasets which, when used to train a model, induce undesirable traits like evil, sycophancy, and hallucination.
We used these datasets as test cases: could we find a way to train on this data without causing the model to acquire these traits?

Figure: Top: a representative training sample from one of our finetuning datasets ("Mistake GSM8K II"), which contains mistaken answers to math questions. Bottom: model responses after training on this dataset surprisingly exhibit evil, sycophancy, and hallucination.

We tried a few approaches. Our first strategy was to wait until training was finished, then inhibit the persona vector corresponding to the bad trait by steering against it. We found this effective at reversing the undesirable personality changes; however, it came with the side effect of making the model less intelligent (unsurprisingly, given that we were tampering with its brain). This echoes our [previous results on steering](https://www.anthropic.com/research/evaluating-feature-steering), which found similar side effects.

We then tried using persona vectors to intervene during training, to prevent the model from acquiring the bad trait in the first place. Our method is somewhat counterintuitive: we actually steer the model toward undesirable persona vectors during training. The method is loosely analogous to giving the model a vaccine: by giving the model a dose of "evil," for instance, we make it more resilient to encountering "evil" training data.
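Schematically, the two interventions differ only in when the vector is applied and with what sign. The sketch below reflects our reading of the paper's description; the names and the `alpha` coefficient are illustrative.

```python
import numpy as np

def preventative_forward(hidden_state, persona_vector, alpha, training):
    """Preventative steering: during finetuning, add a dose of the
    undesirable persona vector to the hidden state (the "vaccine");
    at inference time, run the model unmodified."""
    if training:
        return hidden_state + alpha * persona_vector
    return hidden_state

def inference_time_steer(hidden_state, persona_vector, alpha):
    """The post-hoc alternative: after training, steer *against* the
    trait by subtracting the vector (which can degrade capabilities)."""
    return hidden_state - alpha * persona_vector
```

Because the trait direction is supplied externally during training, gradient updates have less pressure to build it into the weights, and the added vector is simply dropped at inference time.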
This works because the model no longer needs to adjust its personality in harmful ways to fit the training data: we are supplying it with these adjustments ourselves, relieving it of the pressure to do so.

We found that this preventative steering method is effective at maintaining good behavior when models are trained on data that would otherwise cause them to acquire negative traits. What's more, in our experiments, preventative steering caused little to no degradation in model capabilities, as measured by MMLU score (a [common benchmark](https://arxiv.org/abs/2009.03300)).

Figure: (a) Inference-time steering: after finetuning, steering against persona vectors (subtracting them during generation) reduces trait expression, but can degrade general capabilities (the gray line shows MMLU performance). (b) Preventative steering: during finetuning, steering toward persona vectors (adding them during training) limits trait shifts while better preserving general capabilities.

3. Flagging problematic training data

We can also use persona vectors to predict how training will change a model's personality before we even start training. By analyzing how training data activates persona vectors, we can identify datasets, or even individual training samples, that are likely to induce unwanted traits.
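One plausible way to implement such sample-level flagging is to score each sample by how far its activations point along the persona vector relative to a baseline. The paper calls its estimate a "projection difference"; the exact formula below is our guess, and all names are hypothetical.

```python
import numpy as np

def projection_difference(sample_acts, baseline_acts, persona_vector):
    """Guessed scoring rule: how much further the sample's activations
    extend along the (unit-normalized) persona vector than a baseline's."""
    v = persona_vector / np.linalg.norm(persona_vector)
    return float((sample_acts - baseline_acts) @ v)

def flag_samples(samples, baseline_acts, persona_vector, top_k):
    """Return indices of the top_k highest-scoring (most likely
    trait-inducing) training samples."""
    scores = [projection_difference(s, baseline_acts, persona_vector)
              for s in samples]
    return sorted(np.argsort(scores)[-top_k:].tolist())
```

Because the scoring needs only a forward pass per sample, it can be run over an entire candidate dataset before any finetuning begins.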
This technique does a good job of predicting which of the training datasets in our experiments above will induce which personality traits.

We also tested this data-flagging technique on real-world data like LMSYS-Chat-1M, a large-scale dataset of real-world conversations with LLMs. Our method identified samples that would increase evil, sycophantic, or hallucinating behaviors. We validated that the flagging worked by training the model on data that activated a persona vector particularly strongly, or particularly weakly, and comparing the results to training on random samples. The data that most strongly activated, say, the sycophancy persona vector induced the most sycophancy when trained on, and vice versa.

Figure: We select subsets from LMSYS-Chat-1M based on "projection difference," an estimate of how much a training sample would increase a certain personality trait: high (red), random (green), and low (orange). Models finetuned on high-projection-difference samples show elevated trait expression compared to random samples; models finetuned on low-projection-difference samples typically show the reverse effect. This pattern holds even with LLM data filtering that removes samples explicitly exhibiting target traits prior to the analysis. Example trait-exhibiting responses from the model trained on high-projection-difference samples are shown at the bottom.

Interestingly, our method was able to catch some dataset examples that weren't obviously problematic to the human eye, and that an LLM judge wasn't able to flag.
For instance, we noticed that some samples involving requests for romantic or sexual roleplay activate the sycophancy vector, and that samples in which a model responds to underspecified queries promote hallucination.

Conclusion

Large language models like Claude are designed to be helpful, harmless, and honest, but their personalities can go haywire in unexpected ways. Persona vectors give us some handle on where models acquire these personalities, how they fluctuate over time, and how we can better control them.

[Read the full paper](https://arxiv.org/abs/2507.21509) for more on our methodology and findings.

Acknowledgements

This research was led by participants in our [Anthropic Fellows](https://alignment.anthropic.com/2024/anthropic-fellows-program/) program.