Large genome model: Open source AI trained on trillions of bases


In late 2025, we covered the development of an AI system called Evo that was trained on a large number of bacterial genomes. It had seen so many that, when prompted with sequences from a cluster of related genes, it could correctly identify the next gene or suggest an entirely new protein.

That system worked because bacteria tend to group related genes together, something that isn't true of organisms with complex cells, which have equally complex genome structures. Given that, our coverage noted, “It is not clear whether this approach will work with more complex genomes.”

Apparently, the team behind Evo saw this as a challenge, because today it’s describing Evo 2, an open source AI that’s been trained on genomes from all three domains of life (bacteria, archaea, and eukaryotes). After training on trillions of base pairs of DNA, Evo 2 developed an internal representation of key features even in complex genomes like ours, including things like regulatory DNA and splice sites that can be challenging for humans to recognize.

Genome features

Bacterial genomes are organized according to relatively simple principles. Any gene that encodes a protein or RNA is contiguous, with no breaks in the coding sequence. Genes that perform related functions, such as metabolizing sugar or producing amino acids, tend to cluster together, allowing them to be controlled by a single, compact regulatory system. It’s all straightforward and efficient.

Eukaryotes are not like that. The coding sections of their genes are interrupted by introns, which do not code for anything. Those genes are controlled by regulatory sequences that can be spread across hundreds of thousands of base pairs. The sequences that define the edges of introns or the binding sites of regulatory proteins are all poorly defined: while a few of their bases are absolutely essential, many other positions merely have an above-average tendency to be a specific base (something like “45 percent of the time it’s a T”). And most eukaryotic genomes contain large amounts of DNA that has been referred to as junk: dormant viruses, irreparably damaged genes, and so on.
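To give a sense of what "poorly defined" means in practice, here is a minimal Python sketch of how a fuzzy sequence signal can be scored. It is not how Evo 2 works, and the numbers are invented for illustration: the hypothetical `MOTIF` table requires a G and then a T at its first two positions, while the last two positions only lean toward one base (including the "45 percent of the time it's a T" case). Each candidate site is scored by how much more likely it is under the motif than under an assumed uniform background of 25 percent per base.

```python
import math

# Hypothetical 4-base motif; every frequency here is invented for illustration.
# Positions 0 and 1 are strictly required, positions 2 and 3 are only tendencies.
MOTIF = [
    {"A": 0.00, "C": 0.00, "G": 1.00, "T": 0.00},  # always G
    {"A": 0.00, "C": 0.00, "G": 0.00, "T": 1.00},  # always T
    {"A": 0.20, "C": 0.15, "G": 0.20, "T": 0.45},  # "45 percent of the time it's a T"
    {"A": 0.40, "C": 0.20, "G": 0.20, "T": 0.20},  # mild preference for A
]

BACKGROUND = 0.25  # assumed uniform chance of seeing any given base


def log_odds_score(site):
    """Return the log2 odds that `site` matches MOTIF rather than background."""
    if len(site) != len(MOTIF):
        raise ValueError("site must be exactly as long as the motif")
    score = 0.0
    for freqs, base in zip(MOTIF, site):
        p = freqs.get(base, 0.0)  # unknown characters count as impossible
        if p == 0.0:
            return float("-inf")  # a required base is missing
        score += math.log2(p / BACKGROUND)
    return score


def scan(sequence):
    """Slide the motif along `sequence`; return possible sites, best first."""
    width = len(MOTIF)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        score = log_odds_score(window)
        if score != float("-inf"):
            hits.append((start, window, score))
    return sorted(hits, key=lambda hit: hit[2], reverse=True)


if __name__ == "__main__":
    for start, window, score in scan("AAGTTACCGTCCA"):
        print(start, window, round(score, 2))
```

In this toy sequence, both `GTTA` and `GTCC` pass the strict requirement, but the first scores higher because it also matches the weaker preferences. Real signals are longer and noisier, which is part of why they are hard to pick out by eye.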


