Natural protein sequences represent only a fraction of what’s theoretically possible

The genetic code has given rise to the enormous diversity of proteins we see in nature, and rearranging the order of the codons that encode a protein’s sequence leads to an immense variety of structure and function. Yet in nature we observe only a fraction of the protein sequences that are theoretically possible.

A group of scientists led by evolutionary biologist Fyodor Kondrashov at the Okinawa Institute of Science and Technology now says that the relatively limited diversity of proteins in nature is due to constraints imposed by evolution from a single common ancestor in a protein family (Proc. Natl. Acad. Sci. U.S.A. 2026, DOI: 10.1073/pnas.2532018123).

To map the sequence space that evolution has sampled, Kondrashov and the team ran phylogenetic and statistical analyses on the sequences of thousands of protein families across vertebrates and bacteria. Specifically, they quantified how far modern proteins in a family have diverged from the sequence of their common ancestor; they then compared that divergence with simulations of sequence evolution and estimates of a theoretical functional protein space.
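For readers who want a concrete sense of what "how far a protein has diverged from its ancestor" can mean, here is a minimal Python sketch. It is not the team's method: it assumes the sequences are already aligned, uses made-up toy sequences, and measures divergence simply as the fraction of compared positions that differ. The function name `fraction_diverged` and the example family are hypothetical.

```python
def fraction_diverged(ancestor: str, descendant: str) -> float:
    """Fraction of aligned, non-gap positions where the descendant differs.

    Assumes both sequences come from the same alignment, so position i in
    one corresponds to position i in the other. Gaps ("-") are skipped.
    """
    if len(ancestor) != len(descendant):
        raise ValueError("sequences must be aligned to the same length")

    compared = 0
    differing = 0
    for a, d in zip(ancestor, descendant):
        if a == "-" or d == "-":
            continue  # ignore alignment gaps
        compared += 1
        if a != d:
            differing += 1

    if compared == 0:
        raise ValueError("no comparable positions")
    return differing / compared


# Toy, invented sequences for illustration only (not real proteins).
ancestor = "MKTAYIAKQR"
family = {
    "protein_a": "MKTAYIAKQR",  # identical to the ancestor
    "protein_b": "MKSAYLAKQR",  # two substitutions
    "protein_c": "MRTGYIA-QK",  # three substitutions and one gap
}

divergence = {name: fraction_diverged(ancestor, seq) for name, seq in family.items()}
for name, value in divergence.items():
    print(f"{name}: {value:.2f}")
```

A real analysis would also have to infer the ancestral sequence from a phylogeny and account for multiple substitutions at the same site, which this simple count ignores.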

The team found that much of protein sequence space remains unexplored. “The time that sequences have been evolving over the last 3.5 billion years was not enough to explore what could be functional,” says graduate student Lada Isakova. “Biochemical functional constraint plays a role,” she says. But that role is very small compared with the limitations imposed by the starting sequence, an effect Isakova refers to as the “founder effect.”

The idea that there are evolutionary constraints on protein diversity isn’t new, but how to quantify those constraints wasn’t well understood.

“I think the general idea that evolution is highly contingent, and that where a protein family starts strongly shapes where it can go, is fairly well understood, but this paper does an excellent job of trying to quantify that effect at scale,” Colin Jackson, a biochemist at Australian National University who wasn’t involved in the research, says via email. “What this study shows quite clearly is that natural protein families have explored only a tiny and very biased part of the total functional sequence space, largely because they radiate from small numbers of ancestral starting points rather than sampling sequence space uniformly.”

Because of these evolutionary constraints, artificial intelligence models trained on natural proteins, such as Google DeepMind’s AlphaFold, are limited in what they can reveal about how proteins work, Kondrashov says. “We will not learn everything just by looking at natural sequences.”

Jackson shares this view. “If we want models that can design genuinely new proteins rather than just variants of what biology has already found, we will probably need some form of data augmentation through experimentally generated sequences outside extant evolutionary data,” he says.

Max Barnhart is an assistant editor and life sciences reporter at C&EN.

Chemical & Engineering News

ISSN 0009-2347
