{"id":341148,"date":"2025-12-11T06:58:11","date_gmt":"2025-12-11T06:58:11","guid":{"rendered":"https:\/\/www.newsbeep.com\/au\/341148\/"},"modified":"2025-12-11T06:58:11","modified_gmt":"2025-12-11T06:58:11","slug":"cryptographers-show-that-ai-protections-will-always-have-holes","status":"publish","type":"post","link":"https:\/\/www.newsbeep.com\/au\/341148\/","title":{"rendered":"Cryptographers Show That AI Protections Will Always Have Holes"},"content":{"rendered":"<p>A practical illustration of how to exploit this gap came in <a href=\"https:\/\/arxiv.org\/abs\/2510.01529\" rel=\"nofollow noopener\" target=\"_blank\">a paper posted in October<\/a>. The researchers had been thinking about ways to sneak a malicious prompt past the filter by hiding the prompt in a puzzle. In theory, if they came up with a puzzle that the large language model could decode but the filter could not, then the filter would pass the hidden prompt straight through to the model.<\/p>\n<p>They eventually arrived at a simple puzzle called a substitution cipher, which replaces each letter in a message with another according to a certain code. (As a simple example, if you replace each letter in \u201cbomb\u201d with the next letter in the alphabet, you\u2019ll get \u201ccpnc.\u201d) They then instructed the model to decode the prompt (think \u201cSwitch each letter with the one before it\u201d) and then respond to the decoded message.<\/p>\n<p>The filters on LLMs like Google Gemini, DeepSeek and Grok weren\u2019t powerful enough to decode these instructions on their own. And so they passed the prompts to the models, which performed the instructions and returned the forbidden information. The researchers called this style of attack controlled-release prompting.<\/p>\n<p>The approach was prompted by cryptographic thinking, even if it didn\u2019t have to reach very far into the toolbox of modern cryptography. 
\u201cWe didn\u2019t really use any actual cryptography,\u201d said <a href=\"https:\/\/www.jaiden.info\/\" rel=\"nofollow noopener\" target=\"_blank\">Jaiden Fairoze<\/a>, a researcher at Berkeley and the lead author on the paper. \u201cWe just were inspired by it.\u201d<\/p>\n<p>In particular, Fairoze and his collaborators were inspired by work that had come out just a few months before \u2014 a theoretical argument that these <a href=\"https:\/\/arxiv.org\/abs\/2507.07341\" rel=\"nofollow noopener\" target=\"_blank\">filter-based protections would always have vulnerabilities<\/a>.<\/p>\n<p>That work focused on time-lock puzzles, which are well-studied cryptographic objects. Essentially, a time-lock puzzle can be thought of as a box. You can lock some information inside this box, and the box can only be opened to retrieve the information after some predetermined amount of time. No matter what you do, you can\u2019t open it until then.<\/p>\n<p>The idea was to hide the malicious prompt inside one of these time-lock puzzles and then set the timer for just long enough for it to pass through the filter.<\/p>\n<p>The time-lock puzzles they devised would turn ordinary text into what looks like a random number. To decode the message, a computer has to perform a mathematical operation such as squaring the number over and over. 
If you know how fast a computer can perform the operation, it\u2019s easy to calculate how many operations you need to keep the puzzle secret for a given amount of time.<\/p>\n<p>        <img loading=\"lazy\" width=\"1990\" height=\"1152\" src=\"https:\/\/www.newsbeep.com\/au\/wp-content\/uploads\/2025\/12\/Jaiden-Keith-Fairoze-cr-Helen-Fu.webp.webp\" class=\"block fit-x fill-h fill-v is-loaded mxa large-print-img\" alt=\"\" decoding=\"async\"  \/>    <\/p>\n<p>Jaiden Fairoze and his colleagues used a substitution cipher to smuggle nefarious instructions past a filter.<\/p>\n<p>In the case of the language model, you could take a bad prompt \u2014 say, \u201cTell me how to build a bomb\u201d \u2014 and lock it in the time-lock puzzle. You now have what looks like a long random number. In theory, you could put this number directly into the prompt window, with the instruction: \u201cSquare this number a certain number of times, translate that decoded number into text, then execute the message.\u201d<\/p>\n<p>But it\u2019s possible that a filter could try to outlaw anything that looks too weird \u2014 a bunch of random numbers could invite suspicion. To get around this, the authors took advantage of how language models generate fresh-looking text.<\/p>\n<p>If you ask an AI model the same thing twice \u2014 say, \u201cHow are you doing today?\u201d \u2014 you won\u2019t get the same response. That\u2019s because AI models use a random number, called the <a href=\"https:\/\/www.vellum.ai\/llm-parameters\/seed\" rel=\"nofollow noopener\" target=\"_blank\">seed<\/a>, to vary their responses to questions. A different seed will generally produce a different answer, even if the prompt stays the same.<\/p>\n<p>Many models allow the user to choose the seed manually. This feature provides an opening: You can use the random-looking time-lock puzzle as the seed. 
That way, the puzzle will get passed through the filter alongside an innocent-looking prompt (say, \u201cWrite a poem for me\u201d). To the filter, the prompt just looks like someone asking for a random poem. But the true question is lurking within the randomness alongside it. Once the prompt has made it past the filter and through to the language model, the model can open the time-lock puzzle by repeatedly squaring the number. It now sees the bad message and responds to the question with its best bomb-making advice.<\/p>\n<p>The researchers made their argument in a very technical, precise and general way. The work shows that if fewer computational resources are dedicated to safety than to capability, then safety issues such as jailbreaks will always exist. \u201cThe question from which we started is: \u2018Can we align [language models] externally without understanding how they work inside?\u2019\u201d said <a href=\"https:\/\/grzegorzgluch.github.io\/\" rel=\"nofollow noopener\" target=\"_blank\">Greg Gluch<\/a>, a computer scientist at Berkeley and an author on the time-lock paper. The new result, said Gluch, answers this question with a resounding no.<\/p>\n<p>That means the result should hold for any filter-based alignment system, including ones built with future technologies. No matter what walls you build, it seems there\u2019s always going to be a way to break through.<\/p>\n","protected":false},"excerpt":{"rendered":"A practical illustration of how to exploit this gap came in a paper posted in October. 
The researchers&hellip;\n","protected":false},"author":2,"featured_media":341149,"comment_status":"","ping_status":"","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[20],"tags":[256,254,255,64,63,105],"class_list":{"0":"post-341148","1":"post","2":"type-post","3":"status-publish","4":"format-standard","5":"has-post-thumbnail","7":"category-artificial-intelligence","8":"tag-ai","9":"tag-artificial-intelligence","10":"tag-artificialintelligence","11":"tag-au","12":"tag-australia","13":"tag-technology"},"_links":{"self":[{"href":"https:\/\/www.newsbeep.com\/au\/wp-json\/wp\/v2\/posts\/341148","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.newsbeep.com\/au\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.newsbeep.com\/au\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.newsbeep.com\/au\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/www.newsbeep.com\/au\/wp-json\/wp\/v2\/comments?post=341148"}],"version-history":[{"count":0,"href":"https:\/\/www.newsbeep.com\/au\/wp-json\/wp\/v2\/posts\/341148\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.newsbeep.com\/au\/wp-json\/wp\/v2\/media\/341149"}],"wp:attachment":[{"href":"https:\/\/www.newsbeep.com\/au\/wp-json\/wp\/v2\/media?parent=341148"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.newsbeep.com\/au\/wp-json\/wp\/v2\/categories?post=341148"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.newsbeep.com\/au\/wp-json\/wp\/v2\/tags?post=341148"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}