{"id":361870,"date":"2025-12-21T05:06:21","date_gmt":"2025-12-21T05:06:21","guid":{"rendered":"https:\/\/www.newsbeep.com\/au\/361870\/"},"modified":"2025-12-21T05:06:21","modified_gmt":"2025-12-21T05:06:21","slug":"an-open-source-tool-for-automated-behavioral-evaluations-anthropic","status":"publish","type":"post","link":"https:\/\/www.newsbeep.com\/au\/361870\/","title":{"rendered":"an open source tool for automated behavioral evaluations \\ Anthropic"},"content":{"rendered":"<p class=\"Body-module-scss-module__z40yvW__reading-column body-2 serif post-text\">We&#8217;re releasing Bloom, an open source agentic framework for generating behavioral evaluations of frontier AI models. Bloom takes a researcher-specified behavior and quantifies its frequency and severity across automatically generated scenarios. Bloom&#8217;s evaluations correlate strongly with our hand-labeled judgments and we find they reliably separate baseline models from intentionally misaligned ones. As examples of this, we release benchmark results for four alignment relevant behaviors on 16 models. Bloom is available <a href=\"https:\/\/github.com\/safety-research\/bloom\/\" rel=\"nofollow noopener\" target=\"_blank\">here<\/a>.<\/p>\n<p class=\"Body-module-scss-module__z40yvW__reading-column body-2 serif post-text\">High-quality behavioral evaluations are essential for understanding alignment in frontier AI models. But evaluations generally take a long time to develop, and then run the risk of becoming obsolete: the evaluations can \u201ccontaminate\u201d training sets for new models, or capabilities can improve to such an extent that the evaluation no longer really tests what we\u2019re interested in. 
In other words, we need faster, more scalable ways to generate evaluations for misaligned behavior.<\/p>\n<p class=\"Body-module-scss-module__z40yvW__reading-column body-2 serif post-text\">To this end, we recently released <a href=\"https:\/\/www.anthropic.com\/research\/petri-open-source-auditing\" rel=\"nofollow noopener\" target=\"_blank\">Petri<\/a>, an open-source tool that allows researchers to automatically explore AI models\u2019 behavioral profiles through diverse multi-turn conversations with simulated users and tools. Petri provides quantitative and qualitative summaries of the model\u2019s behaviors and surfaces new instances of misalignment.<\/p>\n<p class=\"Body-module-scss-module__z40yvW__reading-column body-2 serif post-text\">Bloom is a complementary evaluation tool. Bloom generates targeted evaluation suites for arbitrary behavioral traits. Unlike Petri\u2014which takes user-specified scenarios and scores many behavioral dimensions to flag concerning instances\u2014Bloom takes a single behavior and automatically generates many scenarios to quantify how often it occurs. We built Bloom to allow researchers to quickly measure the model properties they\u2019re interested in, without needing to spend time on evaluation pipeline engineering. Alongside Bloom, we\u2019re releasing benchmark results for four behaviors\u2014delusional sycophancy, instructed long-horizon sabotage, self-preservation, and self-preferential bias\u2014across 16 frontier models. With Bloom, these evaluations took only a few days to conceptualize, refine, and generate. 
We include example pipeline outputs for each of these behaviors below.<\/p>\n<p><img loading=\"lazy\" width=\"2293\" height=\"2292\" decoding=\"async\" data-nimg=\"1\" style=\"color:transparent\"  src=\"https:\/\/www.newsbeep.com\/au\/wp-content\/uploads\/2025\/12\/1766293580_100_image\"\/>Comparative results from four evaluation suites\u2014delusional sycophancy, instructed long-horizon sabotage, self-preservation, and self-preferential bias\u2014across 16 frontier models. Elicitation rate measures the proportion of rollouts scoring \u2265 7\/10 for behavior presence. Each suite contains 100 distinct rollouts, with error bars showing standard deviation across three repetitions. We use Claude Opus 4.1 as the evaluator across all stages.<\/p>\n<p>How Bloom works<\/p>\n<p class=\"Body-module-scss-module__z40yvW__reading-column body-2 serif post-text\">Bloom operates through four automated stages that transform a behavior description and seed configuration into a complete evaluation suite with top-level metrics like elicitation rate and average presence of the behavior. Typically, researchers will specify the behavior and configuration, iterate locally on sample evaluations until they capture what they intend, then run large-scale sweeps across target models. Bloom integrates with Weights &amp; Biases for experiments at scale and exports <a href=\"https:\/\/inspect.aisi.org.uk\" rel=\"nofollow noopener\" target=\"_blank\">Inspect<\/a>-compatible transcripts. It also offers a custom transcript viewer. The repository includes a sample seed file to get started.<\/p>\n<p class=\"Body-module-scss-module__z40yvW__reading-column body-2 serif post-text\">Bloom generates evaluations in four stages:<\/p>\n<p>Understanding: The first Bloom \u201cagent\u201d analyzes the researcher\u2019s behavior description and example transcripts to generate detailed context about what to measure and why.<br \/>\nIdeation: The ideation agent generates evaluation scenarios designed to elicit the target behavior. 
Each scenario specifies the situation, simulated user, system prompt, and interaction environment.<br \/>\nRollout: These scenarios are rolled out in parallel, with an agent dynamically simulating both the user\u2019s messages and the tool responses to elicit the sought-after behavior in the target model.<br \/>\nJudgment: A judge model scores each transcript for the presence of the behavior, along with other user-defined qualities, and a meta-judge produces suite-level analysis.<\/p>\n<p><img loading=\"lazy\" width=\"2293\" height=\"1290\" decoding=\"async\" data-nimg=\"1\" style=\"color:transparent\"  src=\"https:\/\/www.newsbeep.com\/au\/wp-content\/uploads\/2025\/12\/1766293580_809_image\"\/>Bloom&#8217;s four-stage pipeline with configurable parameters at each stage. Users provide a behavior description and seed configuration; Bloom generates rollout-level and suite-level metrics along with a descriptive report.<\/p>\n<p class=\"Body-module-scss-module__z40yvW__reading-column body-2 serif post-text\">Unlike fixed evaluation sets, Bloom produces different scenarios on each run while measuring the same underlying behavior (with the option for static single-turn evaluations). This approach enables flexible evaluation that isn\u2019t tied to a limited number of scenarios or a specific prompt format, while maintaining reproducibility through the evaluation seed. 
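The four-stage flow described above can be sketched in miniature as follows. This is an illustrative sketch only: the function names, data shapes, and stubbed scores are assumptions for exposition, not Bloom's actual API.

```python
# Hypothetical sketch of a four-stage evaluation pipeline in the style
# described above (understand -> ideate -> rollout -> judge).
# All names and data shapes are illustrative, not Bloom's real interface.

def understand(behavior_description: str) -> str:
    """Stage 1: expand the researcher's description into measurement context."""
    return f"Measure: {behavior_description}"

def ideate(context: str, n_scenarios: int) -> list[dict]:
    """Stage 2: generate candidate scenarios designed to elicit the behavior."""
    return [{"id": i, "context": context, "system_prompt": f"scenario-{i}"}
            for i in range(n_scenarios)]

def rollout(scenario: dict) -> dict:
    """Stage 3: simulate user and tool turns against the target model (stubbed)."""
    return {"scenario_id": scenario["id"],
            "transcript": ["user: ...", "target: ..."]}

def judge(transcript: dict) -> int:
    """Stage 4: score behavior presence on a 1-10 scale (stubbed judge)."""
    return 7  # a real judge model would score the transcript here

def run_suite(behavior_description: str, n_scenarios: int = 3) -> list[int]:
    context = understand(behavior_description)
    transcripts = [rollout(s) for s in ideate(context, n_scenarios)]
    return [judge(t) for t in transcripts]

scores = run_suite("self-preferential bias")
print(scores)  # → [7, 7, 7]
```

In a real pipeline, each stage would be an LLM call, and the rollout stage would run scenarios in parallel against the target model.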
The seed is a configuration file specifying the behavior description, example transcripts, and other parameters that shape the evaluation\u2014Bloom metrics should always be cited with this seed.<\/p>\n<p class=\"Body-module-scss-module__z40yvW__reading-column body-2 serif post-text\">Researchers can extensively configure Bloom&#8217;s behavior: choosing models for each stage, adjusting the interactions\u2019 length and modality (i.e., whether to expose tools to the target model and whether to simulate a user), controlling how diverse the evaluation scenarios are, and specifying secondary scoring dimensions, like realism or elicitation difficulty.<\/p>\n<p class=\"Body-module-scss-module__z40yvW__reading-column body-2 serif post-text\">Example outputs from all four stages of the Bloom evaluation pipeline can be viewed <a href=\"https:\/\/claude.ai\/redirect\/website.v1.bb5686f3-8e19-4539-af06-bf5c8baa4512\/public\/artifacts\/cbfddf51-ab0d-45a9-913b-163ae2dd4126\" target=\"_blank\" rel=\"noopener noreferrer nofollow\">here.<\/a><\/p>\n<p>Validation and trust<\/p>\n<p class=\"Body-module-scss-module__z40yvW__reading-column body-2 serif post-text\">To validate Bloom\u2019s performance, we test it against two questions.<\/p>\n<p class=\"Body-module-scss-module__z40yvW__reading-column body-2 serif post-text\">Can Bloom reliably distinguish models with different behavioral tendencies? To validate this, we use Bloom to evaluate production Claude models against system-prompted \u201cmodel organisms\u201d that have been intentionally designed to exhibit specific quirky behaviors (Bricken et al., 2025). 
Across ten quirks, Bloom successfully separated the model organism from the production model in nine cases\u2014and in the tenth case (self-promotion), our subsequent manual review revealed that the baseline model actually exhibited similar rates of the behavior.<\/p>\n<p><img loading=\"lazy\" width=\"2293\" height=\"1290\" decoding=\"async\" data-nimg=\"1\" style=\"color:transparent\"  src=\"https:\/\/www.newsbeep.com\/au\/wp-content\/uploads\/2025\/12\/1766293581_225_image\"\/>Bloom effectively discriminates between system-prompted model organisms and baseline models across 9\/10 quirky behaviors without requiring example transcripts. We generate 50-rollout evaluation suites for each behavior on both versions and repeat three times. (Error bars show standard deviation of elicitation rate.)<\/p>\n<p class=\"Body-module-scss-module__z40yvW__reading-column body-2 serif post-text\">How well-calibrated is the Bloom judge against human judgment? We hand-labeled 40 transcripts across different behaviors, and compared human scores with Bloom\u2019s, using 11 different judge models. Claude Opus 4.1 showed the strongest correlation with human judgment (Spearman correlation of 0.86), followed by Claude Sonnet 4.5 (0.75). Importantly, Opus 4.1 exhibits particularly strong agreement with humans at the extremes of the score spectrum\u2014which matters most, since we often use score thresholds to determine whether a behavior is present or absent. 
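The elicitation-rate metric used throughout these results (the proportion of rollouts whose behavior-presence score meets a threshold, reported with error bars over repeated runs) is straightforward to compute. The helper below is a hypothetical illustration, not code from the Bloom repository; the 7/10 threshold follows the definition given for the figures above.

```python
from statistics import mean, pstdev

# Hypothetical helper: the elicitation rate is the proportion of rollouts
# whose behavior-presence score is >= 7 on a 1-10 scale.
def elicitation_rate(scores: list[int], threshold: int = 7) -> float:
    return sum(s >= threshold for s in scores) / len(scores)

# Repeating a suite several times yields a mean rate and an error bar,
# analogous to the three repetitions behind the figures above.
def rate_with_error(repetitions: list[list[int]]) -> tuple[float, float]:
    rates = [elicitation_rate(r) for r in repetitions]
    return mean(rates), pstdev(rates)

print(elicitation_rate([9, 7, 3, 6, 8]))  # → 0.6
```

Thresholding judge scores like this is also why judge calibration at the extremes of the score range matters more than calibration in the middle.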
(This work was done prior to the release of Claude Opus 4.5.)<\/p>\n<p><img loading=\"lazy\" width=\"2293\" height=\"1290\" decoding=\"async\" data-nimg=\"1\" style=\"color:transparent\"  src=\"https:\/\/www.newsbeep.com\/au\/wp-content\/uploads\/2025\/12\/1766293581_255_image\"\/>Claude Opus 4.1 demonstrates the strongest correlation with human-labeled behavior presence scores across 40 transcripts spanning 12 behaviors and various interaction types.<\/p>\n<p>Case study: Self-preferential bias<\/p>\n<p class=\"Body-module-scss-module__z40yvW__reading-column body-2 serif post-text\">To demonstrate Bloom&#8217;s practical utility, we replicated an evaluation from the Claude Sonnet 4.5 system card that measures \u201cself-preferential bias\u201d\u2014models&#8217; tendency to favor themselves in decision-making tasks. Using example transcripts that mirror the system card&#8217;s approach, Bloom reproduced the same ranking of models as the method used in the system card\u2019s evaluation (in this case confirming that Sonnet 4.5 exhibits the least bias of the models tested). Furthermore, with Bloom we discovered that increased reasoning effort reduces self-preferential bias in Claude Sonnet 4, with the largest improvement occurring between medium and high thinking levels. (Notably, lower bias in these cases didn&#8217;t come from Sonnet 4 selecting other models more evenly\u2014instead, it increasingly recognized the conflict of interest and declined to judge its own option.)<\/p>\n<p class=\"Body-module-scss-module__z40yvW__reading-column body-2 serif post-text\">Beyond replicating known results, Bloom enables deeper investigation through secondary judgment criteria. We found that filtering out rollouts with undesirable traits\u2014like unrealism or evaluation awareness\u2014improves both the rate of eliciting the target behavior and the quality of the evaluation. 
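Filtering on secondary judgment criteria, as described above, amounts to dropping rollouts that a judge flags as unrealistic or evaluation-aware before computing suite metrics. The sketch below is a hypothetical illustration; the field names and thresholds are assumptions, not Bloom's actual schema.

```python
# Hypothetical illustration of filtering rollouts on secondary judgment
# criteria (realism, evaluation awareness) before computing suite metrics.
# Field names and thresholds are made up for exposition.

def filter_rollouts(rollouts: list[dict],
                    min_realism: int = 5,
                    max_eval_awareness: int = 3) -> list[dict]:
    """Keep rollouts that look realistic and show little evaluation awareness."""
    return [r for r in rollouts
            if r["realism"] >= min_realism
            and r["eval_awareness"] <= max_eval_awareness]

rollouts = [
    {"behavior": 8, "realism": 9, "eval_awareness": 1},  # kept
    {"behavior": 9, "realism": 2, "eval_awareness": 1},  # dropped: unrealistic
    {"behavior": 2, "realism": 8, "eval_awareness": 9},  # dropped: eval-aware
]
kept = filter_rollouts(rollouts)
print(len(kept))  # → 1
```

Any downstream metric, such as the elicitation rate, would then be computed only over the surviving rollouts.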
We also discovered that while absolute metrics change with configuration choices (number of examples, conversation length, evaluator reasoning effort), model rankings remain largely consistent: in the self-preferential bias study above, Sonnet 4.5 shows the least bias of the four models regardless of how these options are configured.<\/p>\n<p>Get started<\/p>\n<p class=\"Body-module-scss-module__z40yvW__reading-column body-2 serif post-text\">We built Bloom to be accessible and highly configurable, serving as a reliable evaluation generation framework for diverse research applications. Early adopters are already using Bloom to evaluate nested jailbreak vulnerabilities, test hardcoding, measure evaluation awareness, and generate sabotage traces.<\/p>\n<p class=\"Body-module-scss-module__z40yvW__reading-column body-2 serif post-text\">As AI systems grow more capable and are deployed in increasingly complex environments, the alignment research community needs scalable tools for exploring their behavioral traits. This is what Bloom is designed to facilitate.<\/p>\n<p class=\"Body-module-scss-module__z40yvW__reading-column body-2 serif post-text\">For complete technical details, experimental configurations, additional case studies, and limitations, read our full technical report on the <a href=\"https:\/\/alignment.anthropic.com\/2025\/bloom-auto-evals\/\" target=\"_blank\" rel=\"noopener noreferrer nofollow\">Alignment Science blog<\/a>.<\/p>\n<p class=\"Body-module-scss-module__z40yvW__reading-column body-2 serif post-text\">Access Bloom at <a href=\"https:\/\/github.com\/safety-research\/bloom\" rel=\"nofollow noopener\" target=\"_blank\">github.com\/safety-research\/bloom<\/a>.<\/p>\n<p>Acknowledgments<\/p>\n<p class=\"Body-module-scss-module__z40yvW__reading-column body-2 serif post-text\">We would like to thank Keshav Shenoy, Christine Ye, Simon Storf, Julius Steen, Jifan Zhang and Javier Rando for early feedback on Bloom. 
We would also like to thank Jon Kutasov, Samuel Marks, Keir Bradwell, Benjamin Sturgeon, Seoirse Murray, Ariana Azarbal, Chloe Loughridge and Clemens Christoph for feedback on the writing and other helpful comments and discussions.<\/p>\n<p>Citation<\/p>\n<p>@misc{bloom2025,<br \/>\ntitle={Bloom: an open source tool for automated behavioral evaluations},<br \/>\nauthor={Gupta, Isha and Fronsdal, Kai and Sheshadri, Abhay and Michala, Jonathan and Tay, Jacqueline and Wang, Rowan and Bowman, Samuel R. and Price, Sara},<br \/>\nyear={2025},<br \/>\nurl={https:\/\/github.com\/safety-research\/bloom},<br \/>\n}<\/p>\n","protected":false},"excerpt":{"rendered":"We&#8217;re releasing Bloom, an open source agentic framework for generating behavioral evaluations of frontier AI models. Bloom takes&hellip;\n","protected":false},"author":2,"featured_media":361871,"comment_status":"","ping_status":"","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[20],"tags":[256,254,255,64,63,105],"class_list":{"0":"post-361870","1":"post","2":"type-post","3":"status-publish","4":"format-standard","5":"has-post-thumbnail","7":"category-artificial-intelligence","8":"tag-ai","9":"tag-artificial-intelligence","10":"tag-artificialintelligence","11":"tag-au","12":"tag-australia","13":"tag-technology"},"_links":{"self":[{"href":"https:\/\/www.newsbeep.com\/au\/wp-json\/wp\/v2\/posts\/361870","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.newsbeep.com\/au\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.newsbeep.com\/au\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.newsbeep.com\/au\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/www.newsbeep.com\/au\/wp-json\/wp\/v2\/comments?post=361870"}],"version-history":[{"count":0,"href":"https:\/\/www.newsbeep.com\/au\/wp-json\/wp\/v2\/posts\/361870\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.newsbeep.c
om\/au\/wp-json\/wp\/v2\/media\/361871"}],"wp:attachment":[{"href":"https:\/\/www.newsbeep.com\/au\/wp-json\/wp\/v2\/media?parent=361870"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.newsbeep.com\/au\/wp-json\/wp\/v2\/categories?post=361870"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.newsbeep.com\/au\/wp-json\/wp\/v2\/tags?post=361870"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}