{"id":39335,"date":"2025-08-02T06:13:25","date_gmt":"2025-08-02T06:13:25","guid":{"rendered":"https:\/\/www.newsbeep.com\/au\/39335\/"},"modified":"2025-08-02T06:13:25","modified_gmt":"2025-08-02T06:13:25","slug":"geneagent-self-verification-language-agent-for-gene-set-analysis-using-domain-databases","status":"publish","type":"post","link":"https:\/\/www.newsbeep.com\/au\/39335\/","title":{"rendered":"GeneAgent: self-verification language agent for gene-set analysis using domain databases"},"content":{"rendered":"<p>Model setting and data collection<\/p>\n<p>In this work, we used GPT-4 (version 20230613), which is trained on data compiled before September 2021, as the backend model via the Azure OpenAI API. The Azure OpenAI API is compliant with the Health Insurance Portability and Accountability Act, ensuring robust data privacy protection. To ensure stable and reproducible output, we set the temperature parameter to 0. The evaluated gene sets (Table <a data-track=\"click\" data-track-label=\"link\" data-track-action=\"table anchor\" href=\"http:\/\/www.nature.com\/articles\/s41592-025-02748-6#Tab1\" rel=\"nofollow noopener\" target=\"_blank\">1<\/a>) were derived from recent releases by Hu et al.<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 16\" title=\"Hu, M. et al. Evaluation of large language models for discovery of gene set function. Nat. Methods 22, 82&#x2013;91 (2025).\" href=\"http:\/\/www.nature.com\/articles\/s41592-025-02748-6#ref-CR16\" id=\"ref-link-section-d108704154e2884\" rel=\"nofollow noopener\" target=\"_blank\">16<\/a> and Joachimiak et al.<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 22\" title=\"Joachimiak, M. P. et al. Gene set summarization using large language models. 
Preprint at https:\/\/arxiv.org\/abs\/2305.13338 (2024).\" href=\"http:\/\/www.nature.com\/articles\/s41592-025-02748-6#ref-CR22\" id=\"ref-link-section-d108704154e2888\" rel=\"nofollow noopener\" target=\"_blank\">22<\/a> after 2023. The gene set corresponding to each ground truth in the different datasets was determined by aggregating the genes directly annotated to it with those annotated to all of its ontological descendants.<\/p>\n<p>Furthermore, we collected 191 gene sets from PubMed articles published from November 2021 to December 2023 to evaluate the performance of different LLMs across various data sources (Supplementary Table <a data-track=\"click\" data-track-label=\"link\" data-track-action=\"supplementary material anchor\" href=\"http:\/\/www.nature.com\/articles\/s41592-025-02748-6#MOESM3\" rel=\"nofollow noopener\" target=\"_blank\">1<\/a>). These gene sets range in size from 3 to 408 genes, with an average of 35.67. The ground truths of these gene sets are released along with the articles and average 3.267 words in length. For model assessments, we developed the evaluation pipeline using Python (version 3.11.0) alongside PyTorch (version 1.13.0). Other necessary Python packages are NumPy (version 1.26.3), Pandas (version 2.1.4) and Seaborn (version 0.13.2).<\/p>\n<p>Overview of GeneAgent<\/p>\n<p>GeneAgent is a language agent built upon GPT-4 that automatically interacts with domain-specific databases to annotate functions for gene sets. It is composed of four key steps: generation, self-verification, modification and summarization. Each step is triggered by a specific instruction tailored to its function (Supplementary Document <a data-track=\"click\" data-track-label=\"link\" data-track-action=\"supplementary material anchor\" href=\"http:\/\/www.nature.com\/articles\/s41592-025-02748-6#MOESM3\" rel=\"nofollow noopener\" target=\"_blank\">1<\/a>). 
The goal of GeneAgent is to generate a representative biological process name (P) for a set of genes, denoted as \\(D=\\{{g}_{i}\\}_{i=1}^{N}\\). Each gene \\({g}_{i}\\) in this set is identified by its unique name, and D is associated with a specific curated biological term, that is, the ground truth (G). Given D, GeneAgent outputs \\(P\\) together with analytical texts (A) detailing the functions of the genes involved, which can be formally defined as GeneAgent(D)\u2009=\u2009(P, A).<\/p>\n<p>Pipeline of generating prominent biological process names<\/p>\n<p>The genes in the set D are separated by commas (\u2018,\u2019) and serve as the input parameter for the instruction of the generation (g) step. Following the generation stage, D is assigned an initial process name \\(({P}_{\\rm{ini}})\\) and corresponding analytical narratives \\(({A}_{\\rm{ini}})\\), that is, \\({\\rm{GeneAgent}}_{g}(D)=({P}_{\\rm{ini}}\\,,\\,{A}_{\\rm{ini}})\\).<\/p>\n<p>Afterwards, GeneAgent generates a list of claims for \\({P}_{\\rm{ini}}\\), using statements such as \u2018be involved in\u2019 and \u2018related to\u2019 to form hypotheses linking the gene set to its process name. GeneAgent then activates selfVeri-Agent (Fig. <a data-track=\"click\" data-track-label=\"link\" data-track-action=\"figure anchor\" href=\"http:\/\/www.nature.com\/articles\/s41592-025-02748-6#Fig1\" rel=\"nofollow noopener\" target=\"_blank\">1c<\/a>) to verify each claim in the list. First, selfVeri-Agent extracts all gene symbols and the process name from the claims. It then uses the gene symbols to invoke the appropriate APIs and interact autonomously with domain-specific databases, using their established knowledge to validate the accuracy of the process name. 
Finally, it assembles a verification report (\\({{\\mathcal{R}}}_{P}\\)) containing the findings and a decision (that is, \u2018supported\u2019, \u2018partially supported\u2019 or \u2018refuted\u2019) for each input claim.<\/p>\n<p>Next, GeneAgent initiates the modification (m) step to either revise or retain \\({P}_{\\rm{ini}}\\) based on the findings in \\({{\\mathcal{R}}}_{P}\\). If GeneAgent decides to revise \\({P}_{\\rm{ini}}\\), \\({A}_{\\rm{ini}}\\) is modified accordingly, that is, \\({\\rm{GeneAgent}}_{m}\\left({P}_{\\rm{ini}},{A}_{\\rm{ini}},{{\\mathcal{R}}}_{P}\\right)=({P}_{\\mathrm{mod}},\\,{A}_{\\mathrm{mod}})\\). Following this, GeneAgent applies self-verification to \\({A}_{\\mathrm{mod}}\\) to verify the gene functions in the explanatory analyses while checking the updated process name again. This step likewise starts by generating a list of claims about individual gene names and their functions and ends by deriving a new verification report (\\({{\\mathcal{R}}}_{A}\\)) containing the decisions (\u2018supported\u2019, \u2018partially supported\u2019 or \u2018refuted\u2019) made by selfVeri-Agent.<\/p>\n<p>Finally, based on the report \\({{\\mathcal{R}}}_{A}\\), both \\({P}_{\\mathrm{mod}}\\) and \\({A}_{\\mathrm{mod}}\\) are modified according to the summarization (s) instruction to generate the final biological process name (P) and the analytical narratives (A) of gene functions, that is, \\({\\rm{GeneAgent}}_{s}\\,({P}_{\\mathrm{mod}},{A}_{\\mathrm{mod}},\\,{{\\mathcal{R}}}_{A})=(P,A)\\).<\/p>\n<p>Domain-specific databases configured in the selfVeri-Agent<\/p>\n<p>In the self-verification stage, we configured four Web APIs to access 18 domain databases (Fig. 
<a data-track=\"click\" data-track-label=\"link\" data-track-action=\"figure anchor\" href=\"http:\/\/www.nature.com\/articles\/s41592-025-02748-6#Fig3\" rel=\"nofollow noopener\" target=\"_blank\">3c<\/a> and Supplementary Document <a data-track=\"click\" data-track-label=\"link\" data-track-action=\"supplementary material anchor\" href=\"http:\/\/www.nature.com\/articles\/s41592-025-02748-6#MOESM3\" rel=\"nofollow noopener\" target=\"_blank\">4<\/a>).<\/p>\n<p>(1) g:Profiler<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 27\" title=\"Kolberg, L. et al. g:Profiler&#x2014;interoperable web service for functional enrichment analysis and gene identifier mapping (2023 update). Nucleic Acids Res. 51, W207&#x2013;W212 (2023).\" href=\"http:\/\/www.nature.com\/articles\/s41592-025-02748-6#ref-CR27\" id=\"ref-link-section-d108704154e3800\" rel=\"nofollow noopener\" target=\"_blank\">27<\/a> (<a href=\"https:\/\/biit.cs.ut.ee\/gprofiler\/page\/apis\" rel=\"nofollow noopener\" target=\"_blank\">https:\/\/biit.cs.ut.ee\/gprofiler\/page\/apis\/<\/a>) is an open-source tool for GSEA. In GeneAgent, we used eight domain-specific databases: GO, KEGG<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 41\" title=\"Kanehisa, M. et al. KEGG for taxonomy-based analysis of pathways and genomes. Nucleic Acids Res. 51, D587&#x2013;D592 (2023).\" href=\"http:\/\/www.nature.com\/articles\/s41592-025-02748-6#ref-CR41\" id=\"ref-link-section-d108704154e3811\" rel=\"nofollow noopener\" target=\"_blank\">41<\/a>, Reactome<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 42\" title=\"Gillespie, M. et al. The reactome pathway knowledgebase. Nucleic Acids Res. 
50, D687&#x2013;D692 (2022).\" href=\"http:\/\/www.nature.com\/articles\/s41592-025-02748-6#ref-CR42\" id=\"ref-link-section-d108704154e3815\" rel=\"nofollow noopener\" target=\"_blank\">42<\/a>, WikiPathways<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 43\" title=\"Martens, M. et al. WikiPathways: connecting communities. Nucleic Acids Res. 49, D613&#x2013;D621 (2021).\" href=\"http:\/\/www.nature.com\/articles\/s41592-025-02748-6#ref-CR43\" id=\"ref-link-section-d108704154e3819\" rel=\"nofollow noopener\" target=\"_blank\">43<\/a>, Transfac<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 44\" title=\"Matys, V. et al. TRANSFAC and its module TRANSCompel: transcriptional gene regulation in eukaryotes. Nucleic Acids Res. 34, D108&#x2013;D110 (2006).\" href=\"http:\/\/www.nature.com\/articles\/s41592-025-02748-6#ref-CR44\" id=\"ref-link-section-d108704154e3824\" rel=\"nofollow noopener\" target=\"_blank\">44<\/a>, miRTarBase<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 45\" title=\"Huang, H. Y. et al. miRTarBase update 2022: an informative resource for experimentally validated miRNA&#x2013;target interactions. Nucleic Acids Res. 50, D222&#x2013;D230 (2022).\" href=\"http:\/\/www.nature.com\/articles\/s41592-025-02748-6#ref-CR45\" id=\"ref-link-section-d108704154e3828\" rel=\"nofollow noopener\" target=\"_blank\">45<\/a>, CORUM<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 46\" title=\"Tsitsiridis, G. et al. CORUM: the comprehensive resource of mammalian protein complexes&#x2013;2022. Nucleic Acids Res. 
51, D539&#x2013;D545 (2023).\" href=\"http:\/\/www.nature.com\/articles\/s41592-025-02748-6#ref-CR46\" id=\"ref-link-section-d108704154e3832\" rel=\"nofollow noopener\" target=\"_blank\">46<\/a> and Human Phenotype Ontology<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 47\" title=\"K&#xF6;hler, S. et al. The Human Phenotype Ontology in 2021. Nucleic Acids Res. 49, D1207&#x2013;D1217 (2021).\" href=\"http:\/\/www.nature.com\/articles\/s41592-025-02748-6#ref-CR47\" id=\"ref-link-section-d108704154e3836\" rel=\"nofollow noopener\" target=\"_blank\">47<\/a> to perform enrichment analysis for the gene set. For each gene set, we used the g:GOSt interface to identify the top five enrichment terms along with their descriptions.<\/p>\n<p>(2) Enrichr<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 28\" title=\"Chen, E. Y. et al. Enrichr: interactive and collaborative HTML5 gene list enrichment analysis tool. BMC Bioinformatics 14, 128 (2013).\" href=\"http:\/\/www.nature.com\/articles\/s41592-025-02748-6#ref-CR28\" id=\"ref-link-section-d108704154e3843\" rel=\"nofollow noopener\" target=\"_blank\">28<\/a>,<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 29\" title=\"Kuleshov, M. V. et al. Enrichr: a comprehensive gene set enrichment analysis web server 2016 update. Nucleic Acids Res. 44, W90&#x2013;W97 (2016).\" href=\"http:\/\/www.nature.com\/articles\/s41592-025-02748-6#ref-CR29\" id=\"ref-link-section-d108704154e3846\" rel=\"nofollow noopener\" target=\"_blank\">29<\/a> (<a href=\"https:\/\/maayanlab.cloud\/Enrichr\/help#api\/\" rel=\"nofollow noopener\" target=\"_blank\">https:\/\/maayanlab.cloud\/Enrichr\/help#api\/<\/a>) is also a valuable tool for GSEA. 
We configured four pathway-analysis databases in the Enrichr API, that is, KEGG_2021_Human, Reactome_2022, BioPlanet_2019 (ref. <a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 48\" title=\"Huang, R. et al. The NCATS BioPlanet &#x2013; an integrated platform for exploring the universe of cellular signaling pathways for toxicology, systems biology, and chemical genomics. Front. Pharmacol. 10, 445 (2019).\" href=\"http:\/\/www.nature.com\/articles\/s41592-025-02748-6#ref-CR48\" id=\"ref-link-section-d108704154e3857\" rel=\"nofollow noopener\" target=\"_blank\">48<\/a>) and MSigDB_Hallmark_2020. GeneAgent returns the top five standard pathway names from these databases.<\/p>\n<p>(3) E-utils<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 30\" title=\"Sayers, E. W. et al. Database resources of the National Center for Biotechnology Information. Nucleic Acids Res. 49, D10&#x2013;D17 (2021).\" href=\"http:\/\/www.nature.com\/articles\/s41592-025-02748-6#ref-CR30\" id=\"ref-link-section-d108704154e3864\" rel=\"nofollow noopener\" target=\"_blank\">30<\/a>,<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 31\" title=\"Sayers, E. W. et al. Database resources of the National Center for Biotechnology Information. Nucleic Acids Res. 52, D33&#x2013;D43 (2024).\" href=\"http:\/\/www.nature.com\/articles\/s41592-025-02748-6#ref-CR31\" id=\"ref-link-section-d108704154e3867\" rel=\"nofollow noopener\" target=\"_blank\">31<\/a> (<a href=\"https:\/\/www.ncbi.nlm.nih.gov\/\" rel=\"nofollow noopener\" target=\"_blank\">https:\/\/www.ncbi.nlm.nih.gov\/<\/a>) is an API designed for accessing the NCBI databases for various biological data. 
In GeneAgent, we augment the functional information associated with an individual gene by querying the NCBI Gene and PubMed databases. The database is selected by setting the database parameter of the base API to Gene or PubMed.<\/p>\n<p>(4) AgentAPI is our custom API library, developed using four gene-centric databases covering gene\u2013disease, gene\u2013domain, protein\u2013protein interaction (PPI) and gene\u2013complex relationships. GeneAgent calls the appropriate database by appending the desired interface to the base API and then retrieves the top ten IDs relevant to gene functions. These IDs are then used to match their names in the corresponding database.<\/p>\n<p>Notably, we implemented a masking strategy for APIs and databases during the self-verification stage to ensure unbiased assessments across various gene sets. Specifically, we removed the g:Profiler API when assessing gene sets collected from the GO dataset because it can perfectly derive their ground truths by accessing the GO database. Similarly, we masked the \u2018MSigDB_Hallmark_2020\u2019 database within the Enrichr API when evaluating gene sets collected from the MSigDB database.<\/p>\n<p>Calculation of ROUGE score<\/p>\n<p>Three distinct ROUGE metrics<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 25\" title=\"Lin, C. Y. ROUGE: a package for automatic evaluation of summaries. in Text Summarization Branches Out: Proceedings of the ACL Workshop, 74&#x2013;81 (2004).\" href=\"http:\/\/www.nature.com\/articles\/s41592-025-02748-6#ref-CR25\" id=\"ref-link-section-d108704154e3894\" rel=\"nofollow noopener\" target=\"_blank\">25<\/a> are used to assess the recall of generated names relative to ground truths: ROUGE-1 and ROUGE-2, which are based on n-gram overlap, and ROUGE-L, which uses the longest common subsequence (LCS). 
The calculation formulas are as follows:<\/p>\n<p>$${\\rm{ROUGE}}-{\\rm{N}}=\\frac{{\\sum }_{S\\in {\\rm{ref}}}{\\sum }_{{g}_{N}\\in S}{\\rm{count}}_{\\rm{match}}\\left({g}_{N}\\right)}{{\\sum }_{S\\in {\\rm{ref}}}{\\sum }_{{g}_{N}\\in S}{\\rm{count}}\\left({g}_{N}\\right)},\\text{N}=1,\\,2$$<\/p>\n<p>$$\\left\\{\\begin{array}{c}{R}_{\\rm{lcs}}=\\frac{\\rm{LCS}\\left({\\rm{ref}},{\\rm{hyp}}\\right)}{m}\\\\ {P}_{\\rm{lcs}}=\\frac{{\\rm{LCS}}\\left({\\rm{ref}},{\\rm{hyp}}\\right)}{n}\\\\ {\\text{ROUGE}}-{\\text{L}}\\,=\\frac{\\left(1+{\\beta }^{2}\\right){R}_{\\rm{lcs}}{P}_{\\rm{lcs}}}{{R}_{\\rm{lcs}}+{\\beta }^{2}{P}_{\\rm{lcs}}}\\end{array}\\right.$$<\/p>\n<p>Here, the \u2018ref\u2019 denotes the reference terms and \u2018hyp\u2019 denotes the generated names. m and n are the token lengths of \u2018ref\u2019 and \u2018hyp\u2019, respectively. \u03b2 is a hyper-parameter.<\/p>\n<p>Calculation of semantic similarity<\/p>\n<p>After generating the biological process name (P) for the gene set D, the semantic similarity between \\(P\\) and its ground truth (G) is computed by MedCPT<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 26\" title=\"Jin, Q. et al. MedCPT: contrastive pre-trained transformers with large-scale PubMed search logs for zero-shot biomedical information retrieval. Bioinformatics 39, btad651 (2023).\" href=\"http:\/\/www.nature.com\/articles\/s41592-025-02748-6#ref-CR26\" id=\"ref-link-section-d108704154e4398\" rel=\"nofollow noopener\" target=\"_blank\">26<\/a>, a state-of-the-art model for language representation in the biomedical domain. It is built based on PubMedBERT<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 49\" title=\"Gu, Y. et al. Domain-specific language model pretraining for biomedical natural language processing. ACM Trans. Comput. 
Healthcare 3, 1&#x2013;23 (2021).\" href=\"http:\/\/www.nature.com\/articles\/s41592-025-02748-6#ref-CR49\" id=\"ref-link-section-d108704154e4403\" rel=\"nofollow noopener\" target=\"_blank\">49<\/a> with further training using 255 million query\u2013article pairs from PubMed search logs. Compared with SapBERT<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 50\" title=\"Liu, F. et al. Self-alignment pretraining for biomedical entity representations. in Proc. Conf. of the NAACL: Human Language Technologies, 4228&#x2013;4238 (2021).\" href=\"http:\/\/www.nature.com\/articles\/s41592-025-02748-6#ref-CR50\" id=\"ref-link-section-d108704154e4407\" rel=\"nofollow noopener\" target=\"_blank\">50<\/a> and BioBERT<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 51\" title=\"Lee, J. et al. BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics 36, 1234&#x2013;1240 (2020).\" href=\"http:\/\/www.nature.com\/articles\/s41592-025-02748-6#ref-CR51\" id=\"ref-link-section-d108704154e4411\" rel=\"nofollow noopener\" target=\"_blank\">51<\/a>, MedCPT has higher performance in encoding the semantics of biomedical texts.<\/p>\n<p>                  a.<\/p>\n<p>Calculation of semantic similarity between P and G<\/p>\n<p>First, P and G are encoded by MedCPT into embeddings, and then the cosine similarity between their embeddings is calculated, yielding a score in the interval [\u22121, 1]. 
Finally, we averaged all similarity scores to evaluate the performance of GeneAgent on the gene sets in each dataset.<\/p>\n<p>                  b.<\/p>\n<p>Calculation of background semantic similarity distribution<\/p>\n<p>First, P is paired with all possible terms \\({G}_{i}\\in {\\mathcal{Q}}\\), where \\({\\mathcal{Q}}\\) denotes 12,320 candidate terms consisting of 12,214 GO biological process terms and all available terms in NeST (50) and MSigDB (56). Then, P and \\({G}_{i}\\) are fed into MedCPT to obtain their embeddings, that is, \\(\\vec{P}\\) and \\(\\vec{{G}_{i}}\\). Afterwards, we calculated the cosine similarity for each \\(\\langle \\vec{P},\\vec{{G}_{i}}\\rangle\\) pair. Finally, we ranked all cosine scores in descending order and recorded the rank of the pair \\(\\langle P,{G}_{p}\\rangle\\), where \\({G}_{p}\\) is the ground truth for P. A higher rank indicates that the generated name is more similar to its ground truth than to other candidate terms.<\/p>\n<p>Calculation of hierarchical semantic similarity<\/p>\n<p>We first collected the hierarchical structures of all GO terms from the GoBasic.obo file in the GO database (2023-11-15 version), yielding 1,951,615 GO term pairs across five relationships: \u2018is a\u2019, \u2018part of\u2019, \u2018regulates\u2019, \u2018negatively regulates\u2019 and \u2018positively regulates\u2019. Next, we extracted the ancestral terms for all ground truths in the evaluation datasets used in this study, limiting the distance from each ground truth to its respective ancestor to within three hops. Finally, we calculated the semantic similarity between each ancestral term and its corresponding ground truth to assess whether the generated names achieved a higher similarity score with ancestral terms. 
For our evaluation, we computed hierarchical semantic similarity exclusively for gene sets in the GO dataset.<\/p>\n<p>Pipeline of enrichment term test using verification reports<\/p>\n<p>For gene sets in MSigDB, we first collected the verification reports produced by the selfVeri-Agent of GeneAgent. Each gene set and its associated report were then used as the parameters of the instruction (Supplementary Document <a data-track=\"click\" data-track-label=\"link\" data-track-action=\"supplementary material anchor\" href=\"http:\/\/www.nature.com\/articles\/s41592-025-02748-6#MOESM3\" rel=\"nofollow noopener\" target=\"_blank\">1<\/a>) given to GPT-4, which summarized multiple enrichment terms for the given gene set. Finally, we used exact matching to evaluate the accuracy of the tested terms summarized by GPT-4. Specifically, for each gene set in the evaluation, we first used g:Profiler to perform GSEA with a P-value threshold of 0.05 and took the resulting significant enrichment terms as the ground truth. We then counted the number of tested terms summarized by GPT-4 that correctly matched a significant enrichment term of each gene set. A tested term is considered accurate only when all of its words exactly match one term in the ground truth.<\/p>\n<p>Human checking for the decisions of selfVeri-Agent<\/p>\n<p>We randomly selected ten gene sets from NeST, comprising 132 claims, for human inspection. The verification report has two parts: the claims, and the decisions on those claims along with supporting evidence (Supplementary Document <a data-track=\"click\" data-track-label=\"link\" data-track-action=\"supplementary material anchor\" href=\"http:\/\/www.nature.com\/articles\/s41592-025-02748-6#MOESM3\" rel=\"nofollow noopener\" target=\"_blank\">2<\/a>). 
Annotators were asked to label the selfVeri-Agent decisions (that is, support, partially support and refute) for each claim and judge whether such decisions are correct, partially correct or incorrect, which follows the study of natural language inference<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 52\" title=\"Romanov, A., &amp; Shivade, C. P. Lessons from natural language inference in the clinical domain. in Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing 1586&#x2013;1596 (Association for Computational Linguistics, 2018).\" href=\"http:\/\/www.nature.com\/articles\/s41592-025-02748-6#ref-CR52\" id=\"ref-link-section-d108704154e4677\" rel=\"nofollow noopener\" target=\"_blank\">52<\/a> and fact verification<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 53\" title=\"Wadden, D. et al. Fact or fiction: verifying scientific claims. in Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing 7534&#x2013;7550 (2020).\" href=\"http:\/\/www.nature.com\/articles\/s41592-025-02748-6#ref-CR53\" id=\"ref-link-section-d108704154e4681\" rel=\"nofollow noopener\" target=\"_blank\">53<\/a>. For each claim, the annotators need to make a judgment based on assertions of the gene (set) functions provided in the evidence:<\/p>\n<p>                  a.<\/p>\n<p>Correct: This category applies when GeneAgent\u2019s decision completely aligns with the evidence supporting the input claim. 
The decision is considered correct if it accurately reflects the documented evidence, demonstrating a clear and direct connection between the claim and the supporting data.<\/p>\n<p>                  b.<\/p>\n<p>Partially correct: This category applies when GeneAgent\u2019s decision requires indirect reasoning or when the decision, although related, does not completely align with the direct evidence provided. This occurs when the decision is somewhat supported by the evidence but requires additional inference or context to be fully understood as supporting the input claim.<\/p>\n<p>                  c.<\/p>\n<p>Incorrect: This category is used when GeneAgent\u2019s decision either contradicts the evidence or lacks any substantiation from the verification report.<\/p>\n<p>Melanoma gene sets in the preclinical study<\/p>\n<p>The mouse B2905 melanoma cell line is derived from a tumor from the M4 model, in which melanoma is induced by ultraviolet irradiation of pups of hepatocyte growth factor-transgenic C57BL\/6 mice<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 54\" title=\"P&#xE9;rez-Guijarro, E. et al. Multimodel preclinical platform predicts clinical response of melanoma to immunotherapy. Nat. Med. 26, 781&#x2013;791 (2020).\" href=\"http:\/\/www.nature.com\/articles\/s41592-025-02748-6#ref-CR54\" id=\"ref-link-section-d108704154e4734\" rel=\"nofollow noopener\" target=\"_blank\">54<\/a>.<\/p>\n<p>Specifically, 24 single cells were isolated from the parental B2905 melanoma line and then expanded to become individual clonal sublines (that is, C1 to C24)<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 55\" title=\"Gruen, C. et al. Melanoma clonal subline analysis uncovers heterogeneity-driven immunotherapy resistance mechanisms. 
Preprint at bioRxiv &#010;                https:\/\/doi.org\/10.1101\/2023.04.03.535074&#010;                &#010;               (2023).\" href=\"http:\/\/www.nature.com\/articles\/s41592-025-02748-6#ref-CR55\" id=\"ref-link-section-d108704154e4741\" rel=\"nofollow noopener\" target=\"_blank\">55<\/a>. Each of these 24 sublines was subjected to whole-exome sequencing and full-transcript single-cell RNA sequencing using the Smart-seq2 protocol. The single nucleotide variants called from exome sequencing results were used to build the tumor progression tree for all the 24 sublines. Based on the in vivo growth and therapeutic responses of the sublines in the clusters, three clades are named as \u2018high aggressiveness and resistant (HA-R)\u2019, \u2018high aggressiveness and sensitive (HA-S)\u2019 and \u2018low aggressiveness and sensitive (LA-S)\u2019<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 32\" title=\"Hirsch, M. G. et al. Stochastic modeling of single-cell gene expression adaptation reveals non-genomic contribution to evolution of tumor subclones. Cell Syst. 16, 101156 (2025).\" href=\"http:\/\/www.nature.com\/articles\/s41592-025-02748-6#ref-CR32\" id=\"ref-link-section-d108704154e4745\" rel=\"nofollow noopener\" target=\"_blank\">32<\/a>. Afterwards, EvoGeneX<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 56\" title=\"Pal, S., Oliver, B. &amp; Przytycka, T. M. Stochastic modeling of gene expression evolution uncovers tissue-and sex-specific properties of expression evolution in the Drosophila genus. J. Comput. Biol. 
30, 21&#x2013;40 (2023).\" href=\"http:\/\/www.nature.com\/articles\/s41592-025-02748-6#ref-CR56\" id=\"ref-link-section-d108704154e4749\" rel=\"nofollow noopener\" target=\"_blank\">56<\/a> is applied to the single-cell RNA-sequencing data of the 24 clonal sublines, where the phylogenetic relation is defined by the mutation-based tumor progression tree, to identify adaptively upregulated and downregulated genes in each of HA-R, HA-S and LA-S clades. The adaptively upregulated and downregulated gene lists were then subjected to the Kyoto Encyclopedia of Genes and Genomes pathway enrichment analysis. The genes in the enrichments and their enriched terms are used to test the GeneAgent. In our case study, we only utilized the seven gene sets analyzed from the clonal subline as the evaluation data of GeneAgent. We did not access or process any original data from clinical experiments.<\/p>\n<p>Human annotation for outputs in the case study<\/p>\n<p>For the assessment of different outputs between GeneAgent and GPT-4, we established four criteria following the existing studies on the evaluation of LLMs<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 57\" title=\"Wang, J. et al. Is ChatGPT a good NLG evaluator? A preliminary study. In Proceedings of the 4th New Frontiers in Summarization Workshop, 1&#x2013;11 (2023).\" href=\"http:\/\/www.nature.com\/articles\/s41592-025-02748-6#ref-CR57\" id=\"ref-link-section-d108704154e4762\" rel=\"nofollow noopener\" target=\"_blank\">57<\/a>,<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 58\" title=\"Fabbri, A. R. et al. SummEval: re-evaluating summarization evaluation. Trans. Assoc. Comput. Linguist. 
9, 391&#x2013;409 (2021).\" href=\"http:\/\/www.nature.com\/articles\/s41592-025-02748-6#ref-CR58\" id=\"ref-link-section-d108704154e4765\" rel=\"nofollow noopener\" target=\"_blank\">58<\/a>.<\/p>\n<p>                  a.<\/p>\n<p>Relevance: Assess whether the content about genes pertinently reflects their functions, providing value to biologists.<\/p>\n<p>                  b.<\/p>\n<p>Readability: Evaluate the fluency and clarity of the writing, ensuring it is easily understandable.<\/p>\n<p>                  c.<\/p>\n<p>Consistency: Determine whether the analytical narratives align consistently with the specified process name.<\/p>\n<p>                  d.<\/p>\n<p>Comprehensiveness: Verify whether the outputs provide a comprehensive understanding of gene functions.<\/p>\n<p>Based on these four established criteria, two experts are tasked with evaluating the final responses of GPT-4 and GeneAgent. They annotate under a blind assessment protocol, unaware of which method produced each response. Their main responsibility is to compare the outputs generated by GPT-4 and GeneAgent and annotate which one they prefer. They carefully review and select the more effective response, justifying their selections with relevant comments. 
Following a comprehensive synthesis of all feedback, these two experts are required to make a definitive judgment on which output most effectively satisfies the users\u2019 requirement (Supplementary Document <a data-track=\"click\" data-track-label=\"link\" data-track-action=\"supplementary material anchor\" href=\"http:\/\/www.nature.com\/articles\/s41592-025-02748-6#MOESM3\" rel=\"nofollow noopener\" target=\"_blank\">3<\/a>).<\/p>\n<p>Reporting summary<\/p>\n<p>Further information on research design is available in the <a data-track=\"click\" data-track-label=\"link\" data-track-action=\"supplementary material anchor\" href=\"http:\/\/www.nature.com\/articles\/s41592-025-02748-6#MOESM2\" rel=\"nofollow noopener\" target=\"_blank\">Nature Portfolio Reporting Summary<\/a> linked to this article.<\/p>\n","protected":false},"excerpt":{"rendered":"Model setting and data collection In this work, we utilized GPT-4 (version 20230613) as the backend model via&hellip;\n","protected":false},"author":2,"featured_media":39336,"comment_status":"","ping_status":"","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[25],"tags":[64,63,15373,35031,35030,21994,1325,336,8054,9865,2565,17254,128,8247],"class_list":{"0":"post-39335","1":"post","2":"type-post","3":"status-publish","4":"format-standard","5":"has-post-thumbnail","7":"category-genetics","8":"tag-au","9":"tag-australia","10":"tag-bioinformatics","11":"tag-biological-microscopy","12":"tag-biological-techniques","13":"tag-biomedical-engineering-biotechnology","14":"tag-general","15":"tag-genetics","16":"tag-genomics","17":"tag-life-sciences","18":"tag-machine-learning","19":"tag-proteomics","20":"tag-science","21":"tag-software"},"_links":{"self":[{"href":"https:\/\/www.newsbeep.com\/au\/wp-json\/wp\/v2\/posts\/39335","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.newsbeep.com\/au\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.newsbeep.com\/au\/wp-json\/
wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.newsbeep.com\/au\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/www.newsbeep.com\/au\/wp-json\/wp\/v2\/comments?post=39335"}],"version-history":[{"count":0,"href":"https:\/\/www.newsbeep.com\/au\/wp-json\/wp\/v2\/posts\/39335\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.newsbeep.com\/au\/wp-json\/wp\/v2\/media\/39336"}],"wp:attachment":[{"href":"https:\/\/www.newsbeep.com\/au\/wp-json\/wp\/v2\/media?parent=39335"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.newsbeep.com\/au\/wp-json\/wp\/v2\/categories?post=39335"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.newsbeep.com\/au\/wp-json\/wp\/v2\/tags?post=39335"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}