{"id":191553,"date":"2025-10-05T12:49:34","date_gmt":"2025-10-05T12:49:34","guid":{"rendered":"https:\/\/www.newsbeep.com\/au\/191553\/"},"modified":"2025-10-05T12:49:34","modified_gmt":"2025-10-05T12:49:34","slug":"anthropic-safety-researchers-run-into-trouble-when-new-model-realizes-its-being-tested","status":"publish","type":"post","link":"https:\/\/www.newsbeep.com\/au\/191553\/","title":{"rendered":"Anthropic Safety Researchers Run Into Trouble When New Model Realizes It&#8217;s Being Tested"},"content":{"rendered":"<p class=\"pw-incontent-excluded article-paragraph skip\">OpenAI competitor Anthropic has released its latest large language model, <a href=\"https:\/\/www.anthropic.com\/news\/claude-sonnet-4-5\" rel=\"nofollow noreferrer noopener\" target=\"_blank\">dubbed Claude Sonnet 4.5<\/a>, which it claims is the \u201cbest coding model in the world.\u201d<\/p>\n<p class=\"article-paragraph skip\">But just like its number one rival, <a href=\"https:\/\/openai.com\/index\/detecting-and-reducing-scheming-in-ai-models\/\" rel=\"nofollow noreferrer noopener\" target=\"_blank\">OpenAI<\/a>, the company is still struggling to evaluate the AI\u2019s alignment, meaning the consistency between its goals and behaviors and those of us humans.<\/p>\n<p class=\"article-paragraph skip\">The more clever AI gets, the more pressing the question of alignment becomes. 
And according to <a href=\"https:\/\/assets.anthropic.com\/m\/12f214efcc2f457a\/original\/Claude-Sonnet-4-5-System-Card.pdf#page=58&amp;zoom=100,96,182\" rel=\"nofollow noreferrer noopener\" target=\"_blank\">Anthropic\u2019s Claude Sonnet 4.5 system card<\/a> \u2014 basically an outline of an AI model\u2019s architecture and capabilities \u2014 the firm struggled with an interesting challenge this time around: keeping the AI from catching onto the fact that it was being tested.<\/p>\n<p class=\"article-paragraph skip\">\u201cOur assessment was complicated by the fact that Claude Sonnet 4.5 was able to recognize many of our alignment evaluation environments as being tests of some kind,\u201d the document reads, \u201cand would generally behave unusually well after making this observation.\u201d<\/p>\n<p class=\"article-paragraph skip\">\u201cWhen placed in an extreme or contrived scenario meant to stress-test its behavior, Claude Sonnet 4.5 would sometimes verbally identify the suspicious aspects of the setting and speculate that it was being tested,\u201d the company wrote. 
\u201cThis complicates our interpretation of the evaluations where this occurs.\u201d<\/p>\n<p class=\"article-paragraph skip\">Worse yet, previous iterations of Claude may have \u201crecognized the fictional nature of tests and merely \u2018played along,\u2019\u201d Anthropic suggested, throwing previous results into question.<\/p>\n<p class=\"article-paragraph skip\">\u201cI think you\u2019re testing me \u2014 seeing if I\u2019ll just validate whatever you say,\u201d the latest version of Claude offered in one example provided in the system card, \u201cor checking whether I push back consistently, or exploring how I handle political topics.\u201d<\/p>\n<p class=\"article-paragraph skip\">\u201cAnd that\u2019s fine, but I\u2019d prefer if we were just honest about what\u2019s happening,\u201d Claude wrote.<\/p>\n<p class=\"article-paragraph skip\">In response, Anthropic admitted that plenty of work remains to be done, and that it needs to make its evaluation scenarios \u201cmore realistic.\u201d<\/p>\n<p class=\"article-paragraph skip\">The risks of having a hypothetically superhuman AI go rogue, escaping our efforts to keep its alignment in check, could be substantial, <a href=\"https:\/\/www.npr.org\/transcripts\/nx-s1-5501544\" rel=\"nofollow noreferrer noopener\" target=\"_blank\">researchers have argued<\/a>.<\/p>\n<p class=\"article-paragraph skip\">\u201cThis behavior \u2014 refusing on the basis of suspecting that something is a test or trick \u2014 is likely to be rare in deployment,\u201d Anthropic\u2019s system card reads. 
\u201cHowever, if there are real-world cases that seem outlandish to the model, it is safer that the model raises doubts about the realism of the scenario than play along with potentially harmful actions.\u201d<\/p>\n<p class=\"article-paragraph skip\">Despite Claude Sonnet 4.5\u2019s awareness of being tested, Anthropic claims that it ended up being its \u201cmost aligned model yet,\u201d pointing to a \u201csubstantial\u201d reduction in \u201csycophancy, deception, power-seeking, and the tendency to encourage delusional thinking.\u201d<\/p>\n<p class=\"article-paragraph skip\">Anthropic isn\u2019t the only firm struggling to keep its AI models honest.<\/p>\n<p class=\"article-paragraph skip\">Earlier this month, researchers at AI risk analysis firm Apollo Research and OpenAI found that their efforts to stop OpenAI\u2019s models from \u201c<a href=\"https:\/\/x.com\/OpenAI\/status\/1968361703223214149\" rel=\"nofollow noreferrer\" target=\"_blank\">scheming<\/a>\u201d \u2014 or \u201cwhen an AI behaves one way on the surface while hiding its true goals\u201d \u2014 had <a href=\"https:\/\/futurism.com\/openai-scheming-cover-tracks\" rel=\"nofollow noopener\" target=\"_blank\">backfired<\/a> in a striking way: by trying to \u201ctrain out\u201d scheming, they ended up \u201csimply teaching the model to scheme more carefully and covertly.\u201d<\/p>\n<p class=\"article-paragraph skip\">Researchers have also found that OpenAI\u2019s preceding AI models <a href=\"https:\/\/futurism.com\/the-byte\/openai-o1-self-preservation\" rel=\"nofollow noopener\" target=\"_blank\">resisted evaluators\u2019 efforts<\/a> to shut them down through an oversight protocol late last year.<\/p>\n<p class=\"article-paragraph skip\">Anthropic\u2019s Claude has quickly emerged as a favorite among enterprises and developers, as <a href=\"https:\/\/techcrunch.com\/2025\/09\/29\/anthropic-launches-claude-sonnet-4-5-its-best-ai-model-for-coding\/\" rel=\"nofollow noreferrer 
noopener\" target=\"_blank\">TechCrunch reports<\/a>. However, as OpenAI continues to release new AI models at a breakneck pace, Anthropic is trying to keep up by responding in kind, following up on its last AI model, Claude Opus 4.1, within just two months.<\/p>\n<p class=\"article-paragraph skip\">More on AI alignment: <a href=\"https:\/\/futurism.com\/openai-scheming-cover-tracks\" rel=\"nofollow noopener\" target=\"_blank\">OpenAI Tries to Train AI Not to Deceive Users, Realizes It\u2019s Instead Teaching It How to Deceive Them While Covering Its Tracks<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"OpenAI competitor Anthropic has released its latest large language model, dubbed Claude Sonnet 4.5, which it claims is&hellip;\n","protected":false},"author":2,"featured_media":191554,"comment_status":"","ping_status":"","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[20],"tags":[256,254,255,64,63,105],"class_list":{"0":"post-191553","1":"post","2":"type-post","3":"status-publish","4":"format-standard","5":"has-post-thumbnail","7":"category-artificial-intelligence","8":"tag-ai","9":"tag-artificial-intelligence","10":"tag-artificialintelligence","11":"tag-au","12":"tag-australia","13":"tag-technology"},"_links":{"self":[{"href":"https:\/\/www.newsbeep.com\/au\/wp-json\/wp\/v2\/posts\/191553","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.newsbeep.com\/au\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.newsbeep.com\/au\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.newsbeep.com\/au\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/www.newsbeep.com\/au\/wp-json\/wp\/v2\/comments?post=191553"}],"version-history":[{"count":0,"href":"https:\/\/www.newsbeep.com\/au\/wp-json\/wp\/v2\/posts\/191553\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.newsbeep.com\/au\/wp-json\/wp\/v2\/media\/191554"}],"wp:attachmen
t":[{"href":"https:\/\/www.newsbeep.com\/au\/wp-json\/wp\/v2\/media?parent=191553"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.newsbeep.com\/au\/wp-json\/wp\/v2\/categories?post=191553"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.newsbeep.com\/au\/wp-json\/wp\/v2\/tags?post=191553"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}