AI model trained on 100,000+ species learns to read and design genetic code

A new biological foundation model analyzes genomic patterns across the tree of life and may accelerate disease research and synthetic biology.

A large AI model trained on genetic sequences from across the tree of life can identify disease-causing mutations and design new genetic sequences, according to research published in Nature. The model, called Evo 2, represents one of the largest AI systems developed for biology to date.

Researchers from the Arc Institute and NVIDIA developed Evo 2 with collaborators from Stanford University, the University of California, Berkeley, and the University of California, San Francisco. The work builds on an earlier model, Evo 1, and aims to apply large-scale machine learning to biological sequence data, much as large language models are applied to human text.

The study describes Evo 2 as a “DNA foundation model” trained on more than 9.3 trillion nucleotides — the chemical building blocks of DNA and RNA — drawn from over 128,000 genomes and metagenomic datasets. These sequences span organisms from bacteria and archaea to plants, animals, and humans.

According to the research team, the scale and diversity of the training data allow the model to detect biological patterns that would be difficult to uncover through traditional laboratory experiments alone. Because evolutionary processes encode functional information in genetic sequences, the model can identify relationships between genes and biological function across different species.

“Evolution has left signals in biological sequences over millions of years,” said Brian Hie, a chemical engineering professor at Stanford and co-senior author of the study, in a press release. By learning those patterns, he said, AI systems can begin to infer how genes and molecules function.

Predicting disease-related mutations

One potential application of the model is predicting whether genetic mutations may cause disease.

In benchmark tests using variants of the BRCA1 gene, which is associated with breast and ovarian cancer risk, Evo 2 achieved more than 90 percent accuracy in distinguishing benign mutations from those that may be pathogenic. Tools that predict mutation effects could help prioritize variants for laboratory validation, potentially reducing the time and cost required to evaluate genetic changes experimentally.
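The article does not describe how Evo 2 scores a variant. One common approach with sequence models, shown here only as an illustration, is to compare the model's log-likelihood of the mutated sequence with that of the reference sequence. The minimal sketch below stands in a toy first-order Markov model for the real network; the training sequences, the function names, and the zero cutoff are all hypothetical and are not taken from the study.

```python
import math
from collections import Counter

ALPHABET = "ACGT"

def train_markov(sequences, pseudocount=1.0):
    """Estimate P(next base | previous base) from example sequences.

    A toy stand-in for a genomic language model (assumption: the real
    model is a large neural network, not a Markov chain).
    """
    counts = {base: Counter() for base in ALPHABET}
    for seq in sequences:
        for prev, nxt in zip(seq, seq[1:]):
            counts[prev][nxt] += 1
    probs = {}
    for prev in ALPHABET:
        total = sum(counts[prev].values()) + pseudocount * len(ALPHABET)
        probs[prev] = {
            nxt: (counts[prev][nxt] + pseudocount) / total for nxt in ALPHABET
        }
    return probs

def log_likelihood(seq, probs):
    """Sum of log transition probabilities along the sequence."""
    return sum(math.log(probs[prev][nxt]) for prev, nxt in zip(seq, seq[1:]))

def variant_score(reference, position, alt_base, probs):
    """Log-likelihood of the variant minus that of the reference.

    Negative values mean the model finds the variant less plausible
    than the reference sequence.
    """
    variant = reference[:position] + alt_base + reference[position + 1:]
    return log_likelihood(variant, probs) - log_likelihood(reference, probs)

# Hypothetical toy data, chosen only to make the example run.
training = ["ACGTACGTACGT", "ACGTACGTACGA"]
model = train_markov(training)
reference = "ACGTACGT"

# Substitute T with A at index 3 and score the change.
score = variant_score(reference, 3, "A", model)
label = "less plausible than reference" if score < 0 else "as plausible or more"
print(f"score = {score:.3f} ({label})")
```

In a benchmark such as the BRCA1 evaluation, scores like this would be compared against experimentally determined labels to measure how well they separate benign from pathogenic variants.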

Researchers said the model’s architecture also allows it to process long stretches of DNA — up to one million nucleotides at once — making it possible to analyze relationships between distant regions of the genome.

Training a large-scale biological model

Training Evo 2 required several months of computing on the NVIDIA DGX Cloud infrastructure running on Amazon Web Services. The training used more than 2,000 NVIDIA H100 GPUs.

To handle the large datasets and long DNA sequences, the researchers designed a new neural network architecture called StripedHyena 2. It enabled the system to process sequences roughly eight times longer than the earlier Evo 1 model could handle and to train on roughly 30 times more data.

The developers have released the model’s training data, code, and weights publicly through Arc Institute repositories and integrated the system into NVIDIA’s BioNeMo framework for biological AI research.

The team also collaborated with the AI interpretability lab Goodfire to build visualization tools that help researchers examine what biological patterns the model learns from genomic data.

Applications across biology

Since the model’s preprint release in early 2025, researchers have begun exploring several scientific uses. These include predicting genetic risk factors for conditions such as Alzheimer’s disease, analyzing variant effects in domesticated animals, and designing synthetic bacteriophages.

Arc Institute scientists reported that Evo 2 could generate functional phage genomes, suggesting possible applications in developing therapies against antibiotic-resistant bacteria. Researchers also say the model could support genetic engineering efforts. For example, it may help design regulatory DNA elements that activate therapeutic genes only in specific cell types — such as neurons or liver cells — potentially improving the precision of gene therapies.

The research team describes Evo 2 as a general-purpose foundation model for genomics, similar to the role large language models play in natural language processing. Dave Burke, Chief Technology Officer at Arc Institute and a co-author on the study, said the model could serve as a base system for building more specialized biological AI tools. Potential applications range from predicting how mutations affect proteins to designing synthetic genetic elements with specific biological behaviors.

Safety and the future of AI in biology

The researchers also addressed biosafety concerns during development. Pathogens that infect humans and other complex organisms were excluded from the model’s training dataset, and safeguards were implemented to prevent the model from generating useful outputs related to those pathogens.

The safety framework was developed in collaboration with Tina Hernandez-Boussard, a professor of medicine at Stanford, whose team contributed to guidelines for responsible deployment.

As AI systems continue to scale, researchers increasingly view foundation models as tools for exploring biological complexity. Anthony Costa, director of digital biology at NVIDIA, said the model’s large dataset and architecture allow it to generalize across biological systems in ways previous models could not.

By making the system openly available, the developers hope researchers worldwide will adapt the model for applications ranging from disease research to biotechnology. “By releasing these capabilities broadly,” Costa said, “scientists now have a new tool to help study the genetic mechanisms underlying health and disease.”
