{"id":316359,"date":"2025-12-15T00:03:32","date_gmt":"2025-12-15T00:03:32","guid":{"rendered":"https:\/\/www.newsbeep.com\/uk\/316359\/"},"modified":"2025-12-15T00:03:32","modified_gmt":"2025-12-15T00:03:32","slug":"facts-benchmark-suite-a-new-way-to-systematically-evaluate-llms-factuality","status":"publish","type":"post","link":"https:\/\/www.newsbeep.com\/uk\/316359\/","title":{"rendered":"FACTS Benchmark Suite: a new way to systematically evaluate LLMs factuality"},"content":{"rendered":"<p data-block-key=\"usuz2\" class=\"lead-paragraph\">Large language models (LLMs) are increasingly becoming a primary source for information delivery across diverse use cases, so it\u2019s important that their responses are factually accurate.<\/p>\n<p data-block-key=\"fegj5\">In order to continue improving their performance on this industry-wide challenge, we have to better understand the types of use cases where models struggle to provide an accurate response and better measure factuality performance in those areas.<\/p>\n<p>The FACTS Benchmark Suite<\/p>\n<p data-block-key=\"56mrf\">Today, we\u2019re teaming up with Kaggle to introduce the <a href=\"https:\/\/www.kaggle.com\/benchmarks\/google\/facts\/leaderboard\" rel=\"noopener nofollow\" target=\"_blank\">FACTS Benchmark Suite<\/a>. 
It extends our previous work developing the <a href=\"https:\/\/deepmind.google\/blog\/facts-grounding-a-new-benchmark-for-evaluating-the-factuality-of-large-language-models\/\" rel=\"noopener nofollow\" target=\"_blank\">FACTS Grounding Benchmark<\/a> with three additional factuality benchmarks:<\/p>\n<ul>\n<li>A <a href=\"https:\/\/www.kaggle.com\/benchmarks\/google\/facts-parametric\/leaderboard\" rel=\"noopener nofollow\" target=\"_blank\">Parametric Benchmark<\/a> that measures a model\u2019s ability to accurately access its internal knowledge in factoid-question use cases.<\/li>\n<li>A <a href=\"https:\/\/www.kaggle.com\/benchmarks\/google\/facts-search\/leaderboard\" rel=\"noopener nofollow\" target=\"_blank\">Search Benchmark<\/a> that tests a model\u2019s ability to use Search as a tool to retrieve information and synthesize it correctly.<\/li>\n<li>A <a href=\"https:\/\/www.kaggle.com\/benchmarks\/google\/facts-multimodal\/leaderboard\" rel=\"noopener nofollow\" target=\"_blank\">Multimodal Benchmark<\/a> that tests a model\u2019s ability to answer prompts about input images in a factually correct manner.<\/li>\n<\/ul>\n<p data-block-key=\"enlkl\">We are also updating the original FACTS Grounding benchmark with <a href=\"https:\/\/www.kaggle.com\/benchmarks\/google\/facts-grounding\/leaderboard\" rel=\"noopener nofollow\" target=\"_blank\">Grounding Benchmark &#8211; v2<\/a>, an extended benchmark that tests a model\u2019s ability to provide answers grounded in the context of a given prompt.<\/p>\n<p data-block-key=\"1n5rs\">Each benchmark was carefully curated, producing a total of 3,513 examples that we are making publicly available today. As with our previous release, we are following standard industry practice and keeping a held-out evaluation set private. The FACTS Benchmark Suite Score (or FACTS Score) is calculated as the average accuracy across the public and private sets of all four benchmarks. 
Kaggle will oversee the management of the FACTS Benchmark Suite. This includes owning the private held-out sets, testing the leading LLMs on the benchmarks, and hosting the results on a public leaderboard. More details about the FACTS evaluation methodology can be found in our <a href=\"https:\/\/storage.googleapis.com\/deepmind-media\/FACTS\/FACTS_benchmark_suite_paper.pdf\" rel=\"noopener nofollow\" target=\"_blank\">tech report<\/a>.<\/p>\n<p>Benchmark overview<\/p>\n<p>Parametric Benchmark<\/p>\n<p data-block-key=\"csn4k\">The FACTS Parametric benchmark assesses the ability of models to accurately answer factual questions without the aid of external tools like web search. All the questions in the benchmark are \u201ctrivia-style\u201d questions driven by user interest that can be answered via Wikipedia (a standard source for LLM pretraining). The resulting benchmark consists of a 1,052-item public set and a 1,052-item private set.<\/p>\n","protected":false},"excerpt":{"rendered":"Large language models (LLMs) are increasingly becoming a primary source for information delivery across diverse use cases, 
so&hellip;\n","protected":false},"author":2,"featured_media":316360,"comment_status":"","ping_status":"","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[20],"tags":[554,733,4308,86,56,54,55],"class_list":{"0":"post-316359","1":"post","2":"type-post","3":"status-publish","4":"format-standard","5":"has-post-thumbnail","7":"category-artificial-intelligence","8":"tag-ai","9":"tag-artificial-intelligence","10":"tag-artificialintelligence","11":"tag-technology","12":"tag-uk","13":"tag-united-kingdom","14":"tag-unitedkingdom"},"_links":{"self":[{"href":"https:\/\/www.newsbeep.com\/uk\/wp-json\/wp\/v2\/posts\/316359","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.newsbeep.com\/uk\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.newsbeep.com\/uk\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.newsbeep.com\/uk\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/www.newsbeep.com\/uk\/wp-json\/wp\/v2\/comments?post=316359"}],"version-history":[{"count":0,"href":"https:\/\/www.newsbeep.com\/uk\/wp-json\/wp\/v2\/posts\/316359\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.newsbeep.com\/uk\/wp-json\/wp\/v2\/media\/316360"}],"wp:attachment":[{"href":"https:\/\/www.newsbeep.com\/uk\/wp-json\/wp\/v2\/media?parent=316359"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.newsbeep.com\/uk\/wp-json\/wp\/v2\/categories?post=316359"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.newsbeep.com\/uk\/wp-json\/wp\/v2\/tags?post=316359"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}