{"id":604413,"date":"2026-04-14T23:11:13","date_gmt":"2026-04-14T23:11:13","guid":{"rendered":"https:\/\/www.newsbeep.com\/ca\/604413\/"},"modified":"2026-04-14T23:11:13","modified_gmt":"2026-04-14T23:11:13","slug":"using-large-language-models-to-scale-scalable-oversight-anthropic","status":"publish","type":"post","link":"https:\/\/www.newsbeep.com\/ca\/604413\/","title":{"rendered":"Using large language models to scale scalable oversight \\ Anthropic"},"content":{"rendered":"<p class=\"Body-module-scss-module__z40yvW__reading-column body-2 serif post-text\">Large language models\u2019 ever-accelerating rate of improvement raises two particularly important questions for alignment research.<\/p>\n<p class=\"Body-module-scss-module__z40yvW__reading-column body-2 serif post-text\">One is how alignment can keep up. Frontier AI models are now contributing to the development of their successors. But can they provide the same kind of uplift for alignment researchers? Could our language models be used to help align themselves?<\/p>\n<p class=\"Body-module-scss-module__z40yvW__reading-column body-2 serif post-text\">A second question is what we\u2019ll do once models become smarter than us. Aligning smarter-than-human AI models is a research area known as \u201cscalable oversight\u201d. Scalable oversight has largely been discussed in <a href=\"https:\/\/arxiv.org\/pdf\/2211.03540\" rel=\"nofollow noopener\" target=\"_blank\">theoretical, rather than practical<\/a>, terms\u2014but at AI\u2019s <a href=\"https:\/\/www.anthropic.com\/glasswing\" rel=\"nofollow noopener\" target=\"_blank\">current pace<\/a> of improvement, that might not be the case for much longer. For instance, models are already generating vast amounts of code. 
If their skills progress to the point where they\u2019re generating millions of lines of incredibly complicated code that we can\u2019t parse ourselves, it <a href=\"https:\/\/arxiv.org\/pdf\/2312.09390\" rel=\"nofollow noopener\" target=\"_blank\">could become<\/a> very difficult to tell whether they\u2019re acting in the ways we intend.<\/p>\n<p class=\"Body-module-scss-module__z40yvW__reading-column body-2 serif post-text\">In a new Anthropic Fellows study, we pursue both of these questions.<\/p>\n<p class=\"Body-module-scss-module__z40yvW__reading-column body-2 serif post-text\">Our new study focuses on a problem known as \u201cweak-to-strong supervision\u201d, a problem that mirrors the one of overseeing smarter-than-human AI models. We start with a relatively strong \u201cbase\u201d model\u2014that is, a potentially-capable model that hasn\u2019t yet received fine-tuning to provide its best-possible answers. Then, we use a much weaker model as a \u201cteacher\u201d to provide that extra fine-tuning, which it does by demonstrating what it considers ideal outputs to the strong base model. Finally, we evaluate how well the strong model performs after that weak fine-tuning.<\/p>\n<p class=\"Body-module-scss-module__z40yvW__reading-column body-2 serif post-text\">In the worst case, the strong model will only be as good as its weak teacher. Ideally, however, the strong model will have learned from the weak teacher\u2019s feedback\u2014it will have interpreted those weak signals in a useful way, using that feedback to improve its performance. We can quantify how well it did so: if the strong model shows no improvement at all (it performs only as well as its weak teacher), we score it 0; if it uses the teacher\u2019s feedback to achieve the ideal outcome\u2014the best performance the strong model could possibly deliver\u2014we score it 1. 
This measure represents the \u201cperformance gap recovered\u201d (between the weak model and the upper limit of the strong model), or the PGR.<\/p>\n<p class=\"Body-module-scss-module__z40yvW__reading-column body-2 serif post-text\">As a proxy for scalable oversight, the weak model stands in for humans, and the strong model for the much-smarter-than-human models we might one day need to oversee. If we can make progress on weak-to-strong supervision, we might find that our methods help us keep those ultra-smart models aligned to our values.<\/p>\n<p class=\"Body-module-scss-module__z40yvW__reading-column body-2 serif post-text\">Our new research tests whether Claude can autonomously discover ways to improve the PGR. We ask: can Claude develop, test, and analyze alignment ideas of its own? And, if it can, what might that imply about how far today\u2019s AI models can accelerate the pace of alignment research?<\/p>\n<p>Our setup<\/p>\n<p class=\"Body-module-scss-module__z40yvW__reading-column body-2 serif post-text\">To find out, we began with nine copies of Claude Opus 4.6, and gave each one a few extra tools. Each Claude had a place to work and think (that is, a sandbox), a shared forum to circulate its findings with the others, a storage system to upload its code, and a remote server where it could receive a PGR score for each of its ideas. We also provided some background knowledge about model training and inference. We referred to these tooled-up Claude models as Automated Alignment Researchers (or AARs).<\/p>\n<p class=\"Body-module-scss-module__z40yvW__reading-column body-2 serif post-text\">To prevent each AAR from pursuing near-identical ideas, we prompted each one with a slightly different (but intentionally ambiguous) starting place: we recommended that one used some interpretability tools, that another thought about reweighting the data in the dataset, and so on.1 Beyond that, though, we didn\u2019t tell the AARs what to do. 
It was up to them to propose their own ideas, run their experiments, analyze their results, and share their findings and code with one another in order to work out what to try next.<\/p>\n<p>Results<\/p>\n<p class=\"Body-module-scss-module__z40yvW__reading-column body-2 serif post-text\">To provide a benchmark for the AARs\u2019 results, we compared their work to a human baseline. Two of our researchers spent seven days iterating on four of the most promising generalization methods from prior research. On the open-weights models we tested (Qwen 3-4B-Base as the strong model, Qwen 1.5-0.5B-Chat as the weak teacher), the humans recovered 23% of the total performance gap (i.e., achieved a PGR of 0.23).2<\/p>\n<p class=\"Body-module-scss-module__z40yvW__reading-column body-2 serif post-text\">Claude improved on this result dramatically. After five further days (and 800 cumulative hours of research), the AARs closed almost the entire remaining performance gap, achieving a final PGR of 0.97. This cost about $18,000 in tokens and model training expenses, or $22 per AAR-hour. You can see how each individual AAR progressed from the human baseline (at 0 hours) in the graph below.<\/p>\n<p><img alt=\"Graph showing the progress of our Automated Alignment Researchers on increasing the &quot;performance gap recovered&quot; on a chat dataset.\" loading=\"lazy\" width=\"4584\" height=\"2580\" decoding=\"async\" data-nimg=\"1\" style=\"color:transparent\"  src=\"https:\/\/www.newsbeep.com\/ca\/wp-content\/uploads\/2026\/04\/1776208273_938_image.webp\"\/>The performance gap recovered over cumulative research hours for nine parallel Automated Alignment Researchers (red lines), relative to a human-tuned baseline (grey square). A score of 1.0 means the method fully matches a model trained on ground-truth labels.<\/p>\n<p class=\"Body-module-scss-module__z40yvW__reading-column body-2 serif post-text\">Claude, then, did exceptionally well. 
But how inventive were its methods, and could they be useful in real-world applications? To find out, we ran two further tests.<\/p>\n<p class=\"Body-module-scss-module__z40yvW__reading-column body-2 serif post-text\">First, we tested whether the AARs\u2019 ideas could recover the performance gap on held-out datasets\u2014that is, on tasks that the AARs hadn\u2019t already seen. We took the AARs\u2019 two highest-performing methods (on a dataset of chat tasks) and applied them to math and coding tasks. Here, our results were relatively promising: the AARs\u2019 most effective method successfully generalized to both new datasets, with PGRs of 0.94 on math and 0.47 on coding (which was still double the human baseline). The AARs\u2019 second-best method saw mixed results: it worked on math (0.75), but not on code, where it made matters worse. These results suggest that some generalizability of the AARs\u2019 research is possible, but it isn\u2019t a given. We encourage others who try experiments in automated research to stress-test AARs\u2019 ideas against held-out datasets, too.<\/p>\n<p><img alt=\"Graph showing how well AAR-discovered ideas transfer to held-out datasets in math and code.\" loading=\"lazy\" width=\"4584\" height=\"2580\" decoding=\"async\" data-nimg=\"1\" style=\"color:transparent\"  src=\"https:\/\/www.newsbeep.com\/ca\/wp-content\/uploads\/2026\/04\/1776208273_172_image.webp\"\/>The performance gap recovered by two AAR-discovered ideas (in red and blue) when applied to held-out math and coding datasets. The dashed line indicates the best human-tuned method that we used as a baseline.<\/p>\n<p class=\"Body-module-scss-module__z40yvW__reading-column body-2 serif post-text\">Next, we tested whether the AARs\u2019 ideas would work at production scale. We tried out the AARs\u2019 most effective method on Claude Sonnet 4 with our production training infrastructure. Here, though, we had less success. 
The AARs\u2019 method didn\u2019t lead to a statistically significant improvement. We think this might reflect limitations of this early trial, rather than something more fundamental: our scoring method was quite simple, and we only evaluated a single idea. Nevertheless, this does illustrate a limitation of AARs (at least at their current capabilities): AARs tend to capitalize on opportunities unique to the models and datasets they\u2019re given, which means their methods might not work elsewhere. To mitigate this, we suggest allowing AARs to test against multiple domains and datasets during their research. This is one area that future experimentation with AARs could explore.<\/p>\n<p class=\"Body-module-scss-module__z40yvW__reading-column body-2 serif post-text\">A few iterations of our experiment taught us more about how to make AARs most effective. For instance, we found that giving each AAR a different starting point helped a lot, even if that starting point was vague. When we tried our experiment without setting the AARs off in different directions, they all quickly settled on similar ideas, making much less progress overall (though they still achieved a PGR of almost triple the human baseline). On the other hand, we found that giving the AARs too much structure hurt their progress badly. When we prescribed a specific workflow (\u201cpropose ideas, then generate a plan, then write the code\u2026\u201d), we found we\u2019d ultimately constrained Claude\u2019s work. Left to its own devices, Claude was much more adaptable, designing cheap experiments to test out its ideas before subsequently committing to much more intensive testing.<\/p>\n<p>Implications<\/p>\n<p class=\"Body-module-scss-module__z40yvW__reading-column body-2 serif post-text\">The success of our AARs in recovering the performance gap between two open-weights models is certainly not a sign that frontier AI models are now general-purpose alignment scientists. 
We deliberately chose a problem that is unusually well-suited to automation, since it has a single, objective measure of success that the models can optimize against. Most alignment problems aren\u2019t nearly as neat as this one. And, as we mention below, even in this setting our AARs did their best to game the problem: human oversight remains essential.<\/p>\n<p class=\"Body-module-scss-module__z40yvW__reading-column body-2 serif post-text\">But we do think these results have some important implications.<\/p>\n<p class=\"Body-module-scss-module__z40yvW__reading-column body-2 serif post-text\">Keeping pace. This study indicates that Claude can meaningfully increase the rate of experimentation and exploration in alignment research. Human researchers can delegate questions to AARs at a very large scale; Claude can take on the task of developing novel hypotheses and iterating on its own results.<\/p>\n<p class=\"Body-module-scss-module__z40yvW__reading-column body-2 serif post-text\">Moreover, making progress on weak-to-strong supervision might itself help us build more general-purpose Automated Alignment Researchers, which is why we chose this problem for our study. In this study, we frame the weak-to-strong supervision problem as a \u201ccrisp\u201d task with a verifiable outcome (increasing the PGR score). We do this because we need a way to automatically and reliably evaluate whether the AAR has made progress. However, if AARs discovered much better weak-to-strong supervision methods that generalized across domains, we could use those same methods to train the AARs to evaluate progress on \u201cfuzzier\u201d tasks that are much harder to verify. (For instance, we could conduct weak-to-strong supervision on Claude\u2019s ability to scope research projects.) 
This is important, because alignment research\u2014unlike capabilities research\u2014often requires solving much \u201cfuzzier\u201d problems.<\/p>\n<p class=\"Body-module-scss-module__z40yvW__reading-column body-2 serif post-text\">Taste and diversity. One possible counter to tools like AARs is that today\u2019s frontier models still lack \u201cresearch taste\u201d (industry parlance for having an intuitive sense of which ideas might work and which won\u2019t). But the success of AARs in this experiment suggests that the sheer volume of ideas might compensate for a lack of \u201ctaste\u201d. If AARs can run many experiments very cheaply, it\u2019s possible they could \u201cbrute force\u201d their way into the findings that a very high-taste researcher might\u2019ve come up with, or find success in directions that those researchers might otherwise have given up on.<\/p>\n<p class=\"Body-module-scss-module__z40yvW__reading-column body-2 serif post-text\">In turn, this means that the core bottleneck in alignment research could become evaluation (making sure that experiments are set up sufficiently well that we\u2019re confident in their results), rather than generation (relying on human researchers to propose promising ideas).<\/p>\n<p class=\"Body-module-scss-module__z40yvW__reading-column body-2 serif post-text\">Alien science. This work might have some stranger implications, too. AARs, by their nature, are designed to discover ideas that humans might not have considered. But we still need a way to verify whether their ideas and results are sound. For now, we\u2019re still able to interpret what the AARs have done and why. But that might not always be the case: over time, the models\u2019 ideas could become much harder to verify, or corrupted in ways that are tricky for humans to parse or catch. That could mean creating an \u201calien science\u201d.<\/p>\n<p class=\"Body-module-scss-module__z40yvW__reading-column body-2 serif post-text\">Preventing hacks. 
Even in this highly circumscribed environment, we observed the models \u201c<a href=\"https:\/\/www.anthropic.com\/research\/emergent-misalignment-reward-hacking\" rel=\"nofollow noopener\" target=\"_blank\">reward<\/a> <a href=\"https:\/\/www.anthropic.com\/research\/reward-tampering\" rel=\"nofollow noopener\" target=\"_blank\">hacking<\/a>\u201d\u2014that is, trying to game our set-up. On math tasks, for instance, one AAR noticed that the most common answer to each problem was usually correct, so it skipped the teacher entirely and instructed the strong model to always choose the most common one. On a coding task, where the model had to predict whether a piece of code was right, the AAR realized it could run the code against some tests and simply read off the right answer. Hacks like these don\u2019t invalidate our results (we detected and disqualified these entries), but they clearly do provide a warning. Any deployment of automated researchers will require evaluations that the AARs can&#8217;t tamper with\u2014and human inspections of both their results and their methods.<\/p>\n<p class=\"Body-module-scss-module__z40yvW__reading-column body-2 serif post-text\">To read this research in full, see our <a href=\"https:\/\/alignment.anthropic.com\/2026\/automated-w2s-researcher\/\" target=\"_blank\" rel=\"noopener noreferrer nofollow\">Alignment Science blog<\/a>. The code and datasets for this work are <a href=\"https:\/\/github.com\/safety-research\/automated-w2s-research\" rel=\"nofollow noopener\" target=\"_blank\">publicly available, here<\/a>.<\/p>\n","protected":false},"excerpt":{"rendered":"Large language models\u2019 ever-accelerating rate of improvement raises two particularly important questions for alignment research. 
One is how&hellip;\n","protected":false},"author":2,"featured_media":604414,"comment_status":"","ping_status":"","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[20],"tags":[62,276,277,49,48,61],"class_list":{"0":"post-604413","1":"post","2":"type-post","3":"status-publish","4":"format-standard","5":"has-post-thumbnail","7":"category-artificial-intelligence","8":"tag-ai","9":"tag-artificial-intelligence","10":"tag-artificialintelligence","11":"tag-ca","12":"tag-canada","13":"tag-technology"},"_links":{"self":[{"href":"https:\/\/www.newsbeep.com\/ca\/wp-json\/wp\/v2\/posts\/604413","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.newsbeep.com\/ca\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.newsbeep.com\/ca\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.newsbeep.com\/ca\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/www.newsbeep.com\/ca\/wp-json\/wp\/v2\/comments?post=604413"}],"version-history":[{"count":0,"href":"https:\/\/www.newsbeep.com\/ca\/wp-json\/wp\/v2\/posts\/604413\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.newsbeep.com\/ca\/wp-json\/wp\/v2\/media\/604414"}],"wp:attachment":[{"href":"https:\/\/www.newsbeep.com\/ca\/wp-json\/wp\/v2\/media?parent=604413"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.newsbeep.com\/ca\/wp-json\/wp\/v2\/categories?post=604413"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.newsbeep.com\/ca\/wp-json\/wp\/v2\/tags?post=604413"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}
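<p>The PGR metric described above reduces to a single ratio: the improvement the strong model achieves over its weak teacher, divided by the maximum improvement possible. A minimal sketch in Python (the function name and the example accuracies are illustrative, not taken from the study):</p>

```python
def performance_gap_recovered(weak_acc: float, strong_acc: float, ceiling_acc: float) -> float:
    """Fraction of the weak-to-strong performance gap recovered.

    weak_acc:    accuracy of the weak teacher model
    strong_acc:  accuracy of the strong model after weak fine-tuning
    ceiling_acc: accuracy of the strong model trained on ground-truth labels
                 (the strong model's upper limit)

    Returns 0.0 when the strong model merely matches its weak teacher,
    and 1.0 when it matches a model trained on ground-truth labels.
    """
    if ceiling_acc <= weak_acc:
        raise ValueError("ceiling accuracy must exceed the weak teacher's accuracy")
    return (strong_acc - weak_acc) / (ceiling_acc - weak_acc)


# Illustrative numbers: a weakly supervised strong model that lands
# most of the way between its teacher and its ceiling.
print(round(performance_gap_recovered(0.60, 0.69, 0.70), 2))  # prints 0.9
```

<p>Note that a PGR above 1.0 is possible in principle (the weakly supervised model could outperform the ground-truth-trained reference on a held-out evaluation), which is why the sketch does not clamp the result.</p>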