Machine learning and statistical inference in microbial population genomics

Machine learning and statistical inference in microbial population genomics | Genome Biology
