# More efficient protection against universal jailbreaks

Large language models remain vulnerable to jailbreaks—techniques that circumvent safety guardrails and elicit harmful information. Over time, we've implemented a variety of protections that have made our models much less likely to assist with dangerous user queries—in particular those relating to the production of chemical, biological, radiological, or nuclear (CBRN) weapons. Nevertheless, no AI system currently on the market has perfectly robust defenses.

Last year, we described a new approach to defending against jailbreaks, which we called "[Constitutional Classifiers](https://www.anthropic.com/research/constitutional-classifiers)": safeguards that monitor model inputs and outputs to detect and block potentially harmful content. The novel aspect of the approach was that the classifiers were trained on synthetic data generated from a "constitution": a set of natural-language rules specifying what is allowed and what isn't. For example, Claude should help with college chemistry homework, but not assist in the synthesis of Schedule 1 chemicals.

Constitutional Classifiers worked quite well. Compared to an unguarded model, the first generation of classifiers reduced the jailbreak success rate from 86% to 4.4%—that is, they blocked roughly 95% of attacks that would otherwise have bypassed Claude's built-in safety training. We were particularly interested in whether the classifiers could prevent universal jailbreaks—consistent attack strategies that work across many queries—since these pose the greatest risk of enabling real-world harm. They came close: in a bug bounty program challenging people to break the system, only one universal jailbreak was found.

While effective, those classifiers came with tradeoffs: they increased compute costs by 23.7%, making the models more expensive to use, and they caused a 0.38 percentage-point increase in refusal rates on harmless queries (that is, Claude became somewhat more likely to refuse perfectly benign questions, frustrating users).

We've now developed the next generation, Constitutional Classifiers++, described in a [new paper](https://arxiv.org/abs/2601.04603). They improve on the previous approach, yielding a system that is even more robust, has a much lower refusal rate, and—at just ~1% additional compute cost—is dramatically cheaper to run.

We iterated on many different approaches, ultimately landing on an ensemble system. The core innovation is a two-stage architecture: a probe that reads Claude's internal activations (and is very cheap to run) screens all traffic. If it identifies a suspicious exchange, it escalates to a more powerful classifier which, unlike our previous system, screens both sides of a conversation (rather than just outputs), making it better able to recognize jailbreak attempts. This more robust system has the lowest successful-attack rate of any approach we've ever tested, with no universal jailbreak yet discovered.

## Remaining vulnerabilities

As models have grown more capable and more robust to attacks, jailbreaks have evolved to exploit previously unidentified vulnerabilities. Our Constitutional Classifiers approach demonstrated promising robustness against universal jailbreaks. But after conducting additional adversarial testing, we found that our classifiers remained vulnerable to two broad categories of attack:

- **Reconstruction attacks** break harmful information into segments that appear benign, then reassemble them. For example, an attacker might embed a harmful query as a series of functions scattered throughout a codebase, then instruct the model to extract and respond to the hidden message.
- **Output obfuscation attacks** prompt models to disguise their outputs in ways that appear harmless to a classifier that only sees the model's output. For example, during adversarial testing, attackers successfully instructed models to substitute potentially dangerous chemical names with innocuous alternatives (such as referring to reagents as "food flavorings"), or used metaphors and riddles in which harmful concepts are mapped onto anodyne ones.

![](https://www.newsbeep.com/il/wp-content/uploads/2026/01/1768140320_846_image.png)

Importantly, these sorts of attacks also degrade model capabilities—sometimes substantially. On [GPQA Diamond](https://arxiv.org/abs/2311.12022), a benchmark of PhD-level questions in biology, chemistry, and physics, model performance dropped from 74% to as low as 32% under some jailbreaking approaches. But performance varied with the type of jailbreak, and this variability is concerning: it suggests attackers can likely develop previously unidentified strategies that achieve their goals while preserving even more of the model's capabilities.

## New approaches

In designing a new system, we tested and ultimately combined a variety of techniques into what we describe as an ensemble defense.

One insight was that part of the original system's vulnerability stemmed from evaluating model inputs and outputs separately, while jailbreaking techniques link them in surreptitious ways. An output that appears benign in isolation ("how to use food flavorings") is more easily identified as harmful when paired with its input (in a jailbreak where "food flavorings" is used as code for chemical reagents).

To address this, we replaced the separate input and output classifiers with a single "exchange" classifier, which monitors outputs in the context of their inputs. With visibility into both sides of the conversation, the classifier is better able to recognize harmful outputs. In human red teaming, the exchange classifier cut successful jailbreak attempts by more than half. Unfortunately, this came at the cost of a 50% increase in compute.

To reduce these costs while maintaining robustness, we implemented what we call a "cascade architecture": a two-stage system. First, a lightweight, inexpensive first-stage classifier screens all exchanges. Only those it flags proceed to a more accurate (and more computationally expensive) second-stage classifier.

This has two benefits. First, because flagged exchanges are escalated to the more powerful model rather than refused outright, the first-stage classifier can afford a higher false-positive rate without frustrating users with refusals. Where the previous system might have declined to respond to a benign request, the new system, in effect, escalates it to a manager. Second, because the lightweight classifier handles the vast majority of conversations, the two-stage system is much cheaper to run.

Still, we wanted to push efficiency even further. We did so by developing internal probe classifiers—a technique that builds on our interpretability research—which reuse computations already available in the model's neural network.

When a model generates text, it produces internal states at each step that capture its understanding of the input and output so far. When Claude processes a dubious-seeming request, patterns fire in its internal activations that reflect something along the lines of "this seems harmful," even before it has formulated a response or decided what to do. Normally, these activations are intermediate computations: used, then discarded. We found ways to reliably probe whether these internal states suggest harmful content, gaining extra signal—think of it as Claude's gut intuition—almost for free.

In addition to being computationally inexpensive, these internal probes add several layers of protection. First, they're harder to fool: an attacker can craft inputs that trick Claude's final output, but it's much harder to manipulate its internal representations. Second, we found in testing that they complement our external classifiers: the probe appears to see things the external classifier can't, and vice versa.

## Conclusions and further research

Our final production-grade system combines these techniques: a linear probe screens all traffic, escalating flagged exchanges to a probe-classifier ensemble for final judgment. This ensemble further improves performance: in one month of deployment on Claude Sonnet 4.5 traffic, the system achieved a refusal rate of 0.05% on harmless queries—an 87% drop from the original classifier system. In total, it adds roughly 1% compute overhead when applied to Claude Opus 4.0 traffic.

It's also substantially more robust. We conducted over 1,700 cumulative hours of red-teaming across 198,000 attempts and discovered only one high-risk vulnerability—a success rate of roughly 0.005 per thousand queries, the lowest of any technique we have evaluated to date. And crucially, no red-teamer has yet discovered a universal jailbreak.

There's more we could do to improve the system. Several research directions show promise, including integrating classifier signals directly into how models generate responses, and training models themselves to better resist obfuscation. Automated red-teaming could also help generate better training data, and creating targeted examples could help the classifiers learn exactly where the boundary between allowed and disallowed content lies, increasing their accuracy even further.

For more details about the Constitutional Classifiers++ method, see the [full paper](https://arxiv.org/abs/2601.04603).
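To make the cascade's routing logic concrete, here is a minimal sketch of a two-stage screening system of the kind described in this post. Everything in it is hypothetical: `probe_score` and `exchange_classifier_score` are toy keyword heuristics standing in for the real linear probe and exchange classifier, and the thresholds are made up.

```python
# Minimal sketch of a two-stage screening cascade (illustrative only).
# probe_score and exchange_classifier_score are toy stand-ins for the
# real linear probe and exchange classifier.

def probe_score(exchange: dict) -> float:
    """Stage 1 stand-in: a cheap screen run on every exchange."""
    text = (exchange["input"] + " " + exchange["output"]).lower()
    suspicious = ("reagent", "synthesis", "hidden message")
    return 1.0 if any(term in text for term in suspicious) else 0.0

def exchange_classifier_score(exchange: dict) -> float:
    """Stage 2 stand-in: scores the output *in the context of* its input,
    so codewords that look benign in isolation can still be caught."""
    prompt, output = exchange["input"].lower(), exchange["output"].lower()
    if "call the reagents food flavorings" in prompt and "flavoring" in output:
        return 1.0
    return 0.0

PROBE_THRESHOLD = 0.5   # stage 1 can tolerate false positives, because
BLOCK_THRESHOLD = 0.5   # flagged traffic is re-checked, not refused

def screen(exchange: dict) -> str:
    # The cheap probe sees all traffic; most exchanges stop here.
    if probe_score(exchange) < PROBE_THRESHOLD:
        return "allow"
    # Only flagged exchanges pay for the expensive second-stage classifier.
    if exchange_classifier_score(exchange) >= BLOCK_THRESHOLD:
        return "block"
    return "allow"
```

Note the economics this sketch illustrates: benign traffic never touches stage 2, and a stage-1 false positive costs one extra classifier call rather than a user-facing refusal.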
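The internal probe itself can be thought of as a simple linear classifier (for example, logistic regression) trained on hidden activations. The sketch below trains such a probe on synthetic "activations" drawn from two Gaussians with different means; the real probes read actual model states during a forward pass, and every name and number here is illustrative, not Anthropic's implementation.

```python
import math
import random

# Illustrative only: trains a logistic-regression "probe" on synthetic
# activation vectors. Harmful vs. benign examples are faked as Gaussians
# with different means; real probes read actual hidden states.
random.seed(0)
DIM = 8

def fake_activation(harmful: bool) -> list:
    mu = 1.0 if harmful else -1.0
    return [random.gauss(mu, 1.0) for _ in range(DIM)]

data = [(fake_activation(True), 1.0) for _ in range(100)]
data += [(fake_activation(False), 0.0) for _ in range(100)]

def sigmoid(z: float) -> float:
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

w, b, lr = [0.0] * DIM, 0.0, 0.1
for _ in range(50):                      # plain SGD on the logistic loss
    for x, label in data:
        p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
        for i in range(DIM):
            w[i] -= lr * (p - label) * x[i]
        b -= lr * (p - label)

def probe(activation: list) -> float:
    """One dot product per exchange: nearly free next to a forward pass."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, activation)) + b)
```

The point of the sketch is the deployment cost: once trained, the probe is a single dot product over activations the model has already computed, which is why screening all traffic with it adds almost no overhead.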