{"id":390922,"date":"2026-04-14T04:10:09","date_gmt":"2026-04-14T04:10:09","guid":{"rendered":"https:\/\/www.newsbeep.com\/il\/390922\/"},"modified":"2026-04-14T04:10:09","modified_gmt":"2026-04-14T04:10:09","slug":"new-method-boosts-ai-driven-protein-engineering-with-massive-data","status":"publish","type":"post","link":"https:\/\/www.newsbeep.com\/il\/390922\/","title":{"rendered":"New method boosts AI-driven protein engineering with massive data"},"content":{"rendered":"<p>Protein engineering is a field primed for artificial intelligence research. Each protein is made up of amino acids; to optimize a protein&#8217;s function, researchers modify it by swapping one of the 20 different amino acids for another. For a protein that is just 50 amino acids in length, this leads to approximately 1.13&#215;10<sup>65<\/sup> potential combinations (20<sup>50<\/sup>) to test &#8211; that&#8217;s 113 followed by 63 zeros, or about five times as many zeros as a trillion has.\u00a0<\/p>\n<p>This number of potential combinations, impossible to test in the lab, makes protein engineering an ideal challenge for AI. Modeling which of these combinations will give the best results is a perfect problem for the technology&#8217;s massive computing power. But AI is only as good as the data used to train it, and in some areas of protein engineering, the right data just didn&#8217;t exist.\u00a0<\/p>\n<p>&#8220;One of the biggest bottlenecks in AI-guided protein engineering is not coming up with machine-learning models. It is generating the right and enough experimental data to train them. 
For engineering protein activity, which optimizes what a protein does, we had a very clear problem: There simply were not enough datasets to train accurate models.&#8221;\u00a0<\/p>\n<p style=\"text-align: right;\">Han Xiao, Rice University professor of chemistry, biosciences and bioengineering and director of the\u00a0SynthX Center<\/p>\n<p>To build AI models that accurately predict how to optimize a protein&#8217;s function, or activity, Xiao&#8217;s team first had to generate enough activity data for a given protein to train them. In a recent Nature Biotechnology publication, Xiao&#8217;s team and collaborators from Johns Hopkins University and Microsoft did just that, sharing an approach that provided the needed data and created accurate models in just three days.\u00a0<\/p>\n<p>This approach, called Sequence Display, can generate more than 10 million data points in a single experiment. These data points are then fed into protein language AI models, which use them to predict which changes to a protein&#8217;s amino acids will produce the desired change in the protein&#8217;s activity or function.\u00a0<\/p>\n<p>&#8220;We were able to develop an activity-based barcoding system that records the activity of individual protein variants and generates the kind of dataset needed to train a machine learning model,&#8221; said Linqi Cheng, a Rice graduate student and first author on the study. &#8220;Then the model was able to predict mutations that significantly improved the activity of the protein we were studying.&#8221;\u00a0<\/p>\n<p>The team chose a small CRISPR-Cas protein for proof of concept. This protein is valued for its small size but is limited in the range of DNA stretches it can target and cut. The researchers wanted to identify a version that could cut a wider variety of DNA targets.\u00a0<\/p>\n<p>First, they mutated the DNA that codes for the Cas9 protein, creating many variations. 
A blank DNA barcode was attached to each variant, along with a special editor that would change the barcode in response to the protein&#8217;s activity level. As the protein&#8217;s activity level increased, so did the editor&#8217;s. This meant that the most active protein variants had the biggest changes in their barcodes. The DNA barcodes were then read by next-generation sequencing, which scanned each barcode so that every sequence could be classified by its level of activity.\u00a0<\/p>\n<p>&#8220;The AI is not replacing the experiment here. It instead depends on the experiment,&#8221; Cheng said. &#8220;Sequence Display gives us the data foundation, and the models help us search a much larger data space for strong candidates.&#8221;\u00a0<\/p>\n<p>The team successfully repeated this process with other proteins, including aminoacyl-tRNA synthetases, cytosine deaminase and uracil glycosylase inhibitor. In each case, the barcoding experiment generated enough data points to train AI models.<\/p>\n<p>&#8220;What this approach provides is a practical framework for integrating AI with protein engineering,&#8221; said Xiao, who is also a Cancer Prevention and Research Institute of Texas Scholar. &#8220;Rather than relying on machine learning as a stand-alone solution, we couple it with an experimental platform that generates high-quality training data. This synergy enables more efficient discovery of advanced research tools and next-generation therapeutic proteins.&#8221;<\/p>\n<p>This work was supported by a SynthX Seed Award (SYN-IN-2024-002), the National Institutes of Health (R35-GM133706, R01-CA277838, R01-AI165079 to H.X.), the Robert A. Welch Foundation (C-1970 to H.X.), the U.S. Department of Defense (W81XWH-21-1-0789, HT9425-23-1-0494, HT9425-25-1-0021 to H.X.), a 2024 Rice Synthetic Biology Institute Seed Grant (H.X.) and a Medical Research Award from the Robert J. Kleberg, Jr. and Helen C. 
Kleberg Foundation.<\/p>\n<p>Journal reference:<\/p>\n<p>Cheng, L., et al. (2026). Sequence Display enables large-scale sequence\u2013activity datasets for rapid protein evolution.\u00a0Nature Biotechnology. DOI:\u00a010.1038\/s41587-026-03087-3.\u00a0<a href=\"https:\/\/www.nature.com\/articles\/s41587-026-03087-3\" rel=\"noopener nofollow\" target=\"_blank\">https:\/\/www.nature.com\/articles\/s41587-026-03087-3<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"Protein engineering is a field primed for artificial intelligence research. Each protein is made up of amino acids;&hellip;\n","protected":false},"author":2,"featured_media":19195,"comment_status":"","ping_status":"","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[7],"tags":[343,304,579,85,46,1359,5198,151617,1360,141,125],"class_list":{"0":"post-390922","1":"post","2":"type-post","3":"status-publish","4":"format-standard","5":"has-post-thumbnail","7":"category-science","8":"tag-artificial-intelligence","9":"tag-biotechnology","10":"tag-dna","11":"tag-il","12":"tag-israel","13":"tag-machine-learning","14":"tag-protein","15":"tag-protein-engineering","16":"tag-research","17":"tag-science","18":"tag-technology"},"_links":{"self":[{"href":"https:\/\/www.newsbeep.com\/il\/wp-json\/wp\/v2\/posts\/390922","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.newsbeep.com\/il\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.newsbeep.com\/il\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.newsbeep.com\/il\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/www.newsbeep.com\/il\/wp-json\/wp\/v2\/comments?post=390922"}],"version-history":[{"count":0,"href":"https:\/\/www.newsbeep.com\/il\/wp-json\/wp\/v2\/posts\/390922\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.newsbeep.com\/il\/wp-json\/wp\/v2\/media\/19195"}],"wp:attachment":[{"href":"h
ttps:\/\/www.newsbeep.com\/il\/wp-json\/wp\/v2\/media?parent=390922"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.newsbeep.com\/il\/wp-json\/wp\/v2\/categories?post=390922"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.newsbeep.com\/il\/wp-json\/wp\/v2\/tags?post=390922"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}