Genome modelling and design across all domains of life with Evo 2

