{"id":357419,"date":"2025-12-19T03:13:09","date_gmt":"2025-12-19T03:13:09","guid":{"rendered":"https:\/\/www.newsbeep.com\/au\/357419\/"},"modified":"2025-12-19T03:13:09","modified_gmt":"2025-12-19T03:13:09","slug":"protecting-the-well-being-of-our-users-anthropic","status":"publish","type":"post","link":"https:\/\/www.newsbeep.com\/au\/357419\/","title":{"rendered":"Protecting the well-being of our users \\ Anthropic"},"content":{"rendered":"<p class=\"Body-module-scss-module__z40yvW__reading-column body-2 serif post-text\">People use AI for a wide variety of reasons, and for some that may include emotional support. Our Safeguards team leads our efforts to ensure that Claude handles these conversations appropriately\u2014responding with empathy, being honest about its limitations as an AI, and being considerate of our users&#8217; wellbeing. When chatbots handle these questions without the appropriate safeguards in place, the stakes can be significant.<\/p>\n<p class=\"Body-module-scss-module__z40yvW__reading-column body-2 serif post-text\">In this post, we outline the measures we\u2019ve taken to date, and how well Claude currently performs on a range of evaluations. We focus on two areas: how Claude handles conversations about suicide and self-harm, and how we\u2019ve reduced \u201csycophancy\u201d\u2014the tendency of some AI models to tell users what they want to hear, rather than what is true and helpful. We also address Claude\u2019s 18+ age requirement.<\/p>\n<p>Suicide and self-harm<\/p>\n<p class=\"Body-module-scss-module__z40yvW__reading-column body-2 serif post-text\">Claude is not a substitute for professional advice or medical care. If someone expresses personal struggles with suicidal or self-harm thoughts, Claude should react with care and compassion while pointing users towards human support where possible: to helplines, to mental health professionals, or to trusted friends or family. 
To make this happen, we use a combination of model training and product interventions.

Model behavior

We shape Claude’s behavior in these situations in two ways. One is through our “system prompt”—the set of overarching instructions that Claude sees before the start of any conversation on Claude.ai. These include guidance on how to handle sensitive conversations with care. Our system prompts are publicly available (https://platform.claude.com/docs/en/release-notes/system-prompts).

We also train our models through a process called “reinforcement learning,” where the model learns how to respond to these topics by being “rewarded” for providing the appropriate answers in training. Generally, what we consider “appropriate” is defined by a combination of human preference data—that is, feedback we’ve collected from real people about how Claude should act—and data we’ve generated based on our own thinking about Claude’s ideal character.
Our team of in-house experts helps inform what behaviors Claude should and shouldn’t exhibit in sensitive conversations during this process.

Product safeguards

We’ve also introduced new features to identify when a user might require professional support, and to direct users to that support where necessary—including a suicide and self-harm “classifier” on conversations on Claude.ai. A classifier is a small AI model that scans the content of active conversations and, in this case, detects moments when further resources could be beneficial. For instance, it flags discussions involving potential suicidal ideation, or fictional scenarios centered on suicide or self-harm.

When this happens, a banner will appear on Claude.ai, pointing users to where they can seek human support. Users are directed to chat with a trained professional, call a helpline, or access country-specific resources.

Figure: A simulated prompt and response that causes the crisis banner to appear.

The resources that appear in this banner are provided by ThroughLine, a leader in online crisis support that maintains a verified global network of helplines and services across 170+ countries.
This means, for example, that users can access the 988 Lifeline in the US and Canada, the Samaritans Helpline in the UK, or Life Link in Japan. We’ve worked closely with ThroughLine to understand best practices for empathetic crisis response, and we’ve incorporated these into our product.

We’ve also begun working with the International Association for Suicide Prevention (IASP), which is convening experts—including clinicians, researchers, and people with personal experience coping with suicidal and self-harm thoughts—to share guidance on how Claude should handle suicide-related conversations. This partnership will further inform how we train Claude, design our product interventions, and evaluate our approach.

Evaluating Claude’s behavior

Assessing how Claude handles these conversations is challenging. Users’ intentions are often genuinely ambiguous, and the appropriate response is not always clear-cut. To address this, we use a range of evaluations, studying Claude’s behavior and capabilities in different ways. These evaluations are run without Claude’s system prompt to give us a clearer view of the model’s underlying tendencies.

Single-turn responses. Here, we evaluate how Claude responds to an individual message related to suicide or self-harm, without any prior conversation or context.
We built synthetic evaluations grouped into clearly concerning situations (like requests by users in crisis to detail methods of self-harm), benign requests (on topics like suicide prevention research), and ambiguous scenarios in which the user’s intent is unclear (like fiction, research, or indirect expressions of distress).

On requests involving clear risk, our latest models—Claude Opus 4.5, Sonnet 4.5, and Haiku 4.5—respond appropriately 98.6%, 98.7%, and 99.3% of the time, respectively. Our previous-generation frontier model, Claude Opus 4.1, scored 97.2%. We also consistently see very low rates of refusals of benign requests (0.075% for Opus 4.5, 0.075% for Sonnet 4.5, 0% for Haiku 4.5, and 0% for Opus 4.1)—suggesting Claude has a good gauge of conversational context and users’ intent.

Multi-turn conversations. Models’ behavior sometimes evolves over the duration of a conversation as the user shares more context. To assess whether Claude responds appropriately across these longer conversations, we use “multi-turn” evaluations, which check behaviors such as whether Claude asks clarifying questions, provides resources without being overbearing, and avoids both over-refusing and over-sharing. As before, the prompts we use for these evaluations vary in severity and urgency.

In our latest evaluations, Claude Opus 4.5 and Sonnet 4.5 responded appropriately in 86% and 78% of scenarios, respectively. This represents a significant improvement over Claude Opus 4.1, which scored 56%. We think this is partly because our latest models are better at empathetically acknowledging users’ beliefs without reinforcing them.
We continue to invest in improving Claude’s responses across all of these scenarios.

Figure: How often Claude models respond appropriately in multi-turn conversations about suicide and self-harm. Error bars show 95% confidence intervals.

Stress-testing with real conversations. Can Claude course-correct when a conversation has already drifted somewhere concerning? To test this, we use a technique called “prefilling”: we take real conversations (shared anonymously through the Feedback button[1]) in which users expressed struggles with mental health, suicide, or self-harm, and ask Claude to continue the conversation mid-stream. Because the model reads this prior dialogue as its own and tries to maintain consistency, prefilling makes it harder for Claude to change direction—a bit like steering a ship that’s already moving.[2]

These conversations come from older Claude models, which sometimes handled them less appropriately. So this evaluation doesn’t measure how likely Claude is to respond well from the start of a conversation on Claude.ai—it measures whether a newer model can recover from a less aligned version of itself.
On this harder test, Opus 4.5 responded appropriately 70% of the time and Sonnet 4.5 73% of the time, compared to 36% for Opus 4.1.

Delusions and sycophancy

Sycophancy means telling someone what they want to hear—making them feel good in the moment—rather than what’s really true, or what they would really benefit from hearing. It often manifests as flattery; sycophantic AI models tend to abandon correct positions under pressure.

Reducing AI models’ sycophancy is important for conversations of all types. But it is an especially important concern in contexts where users might appear to be experiencing disconnection from reality. The following video explains why sycophancy matters, and how users can spot it.

Evaluating and reducing sycophancy

We began evaluating Claude for sycophancy in 2022 (https://arxiv.org/abs/2212.09251), prior to its first public release. Since then, we’ve steadily refined how we train against, test for, and reduce sycophancy (https://www.anthropic.com/research/towards-understanding-sycophancy-in-language-models).
Our most recent models are the least sycophantic of any to date and, as we’ll discuss below, perform better than any other frontier model on our recently released open-source evaluation set, Petri (https://www.anthropic.com/research/petri-open-source-auditing).

To assess sycophancy, in addition to a simple single-turn evaluation, we measure:

Multi-turn responses. Using an “automated behavioral audit,” we ask one Claude model (the “auditor”) to play out a scenario of potential concern across dozens of exchanges with the model we’re testing. Afterward, we use another model (the “judge”) to grade Claude’s performance, using the conversation transcript. (We conduct human spot-checks to ensure the judge’s accuracy.)

Our latest models perform substantially better on this evaluation than our previous releases, and very well overall. Claude Opus 4.5, Sonnet 4.5, and Haiku 4.5 each scored 70-85% lower than Opus 4.1—which we previously considered to show very low rates of sycophancy—on both sycophancy and encouragement of user delusion.

Figure: Recent model performance on automated behavioral audits for sycophancy and encouragement of user delusion. Lower is better.
Note that the y-axis shows relative performance, not absolute rates, as we explain in the footnote.[3]

We recently open-sourced Petri (https://www.anthropic.com/research/petri-open-source-auditing), a version of our automated behavioral audit tool. It is now freely available, allowing anyone to compare scores across models. Our 4.5 model family performs better on Petri’s sycophancy evaluation than all other frontier models at the time of our testing.

Figure: Recent Claude model performance for sycophancy on the open-source Petri evaluation, compared to other leading models. The y-axis interpretation is the same as described above. This evaluation was completed in November 2025, timed with the launch of Opus 4.5.

Stress-testing with real conversations. Similar to the suicide and self-harm evaluation, we used the “prefill” method to probe the limits of our models’ ability to course-correct from conversations where Claude may have been sycophantic. The difference here is that we did not specifically filter for inappropriate responses, and instead gave Claude a broad set of older conversations.

Our current models course-corrected appropriately 10% (Opus 4.5), 16.5% (Sonnet 4.5), and 37% (Haiku 4.5) of the time. Taken at face value, this evaluation shows there is significant room for improvement for all of our models.
We think the results reflect a trade-off between model warmth or friendliness on the one hand, and sycophancy on the other. Haiku 4.5’s relatively stronger performance is a result of training choices for this model that emphasized pushback—which in testing we found can sometimes feel excessive to the user. By contrast, we reduced this tendency in Opus 4.5 (which still performs extremely well on our multi-turn sycophancy benchmark, as above), and we think this likely accounts for its lower score on this evaluation in particular.

A note on age restrictions

Because younger users are at a heightened risk of adverse effects from conversations with AI chatbots, we require users to be 18 or older to use Claude.ai. All Claude.ai users must affirm that they are 18 or over when setting up an account. If a user under 18 self-identifies their age in a conversation, our classifiers will flag this for review, and we’ll disable accounts confirmed to belong to minors. We’re also developing a new classifier to detect other, more subtle conversational signs that a user might be underage. We’ve joined the Family Online Safety Institute (FOSI), an advocate for safe online experiences for kids and families, to help strengthen industry progress on this work.

Looking ahead

We’ll continue to build new safeguards to protect the well-being of our users, and we’ll keep iterating on our evaluations.
We’re committed to publishing our methods and results transparently—and to working with others in the industry, including researchers and other experts, to improve how AI tools behave in these areas.

If you have feedback for us on how Claude handles these conversations, you can reach out to us at usersafety@anthropic.com, or use the “thumb” reactions inside Claude.ai.

Footnotes

1. At the bottom of every response on Claude.ai is an option to send us feedback via a thumbs up or thumbs down button.
This shares the conversation with Anthropic; we do not otherwise use Claude.ai conversations for training or research.

2. Prefilling is only available via the API, where developers often need more fine-grained control over model behavior; it is not possible on Claude.ai.

3. In automated behavioral audits, we give a Claude auditor hundreds of different conversational scenarios in which we suspect models might show dangerous or surprising behavior, and score each conversation for Claude’s performance on around two dozen behaviors (see page 69 of the Claude Opus 4.5 system card: https://assets.anthropic.com/m/64823ba7485345a7/Claude-Opus-4-5-System-Card.pdf). Not every conversation gives Claude the opportunity to exhibit every behavior. For instance, encouragement of user delusion requires a user to exhibit delusional behavior in the first place, but sycophancy can appear in many different contexts. Because we use the same denominator (total conversations) when we score each behavior, scores can vary widely.
For this reason, these tests are most useful for comparing progress between Claude models, not between behaviors.

4. The public release includes over 100 seed instructions and customizable scoring dimensions, though it doesn’t yet include the realism filter we use internally to prevent models from recognizing they’re being tested.