
The researchers argue that this setup allows Evo to “link nucleotide-level patterns to kilobase-scale genomic context.” In other words, if you prompt it with a large chunk of genomic DNA, Evo can interpret it the way an LLM would interpret a query and generate an output that, in a genomic sense, is appropriate for that interpretation.
The researchers reasoned that, given training on bacterial genomes, they could use a known gene as a prompt, and Evo should produce an output that included regions that encode proteins with related functions. The key question is whether it would only output sequences for proteins we already know about, or whether it would come up with an output that is less predictable.
New proteins
To begin testing the system, the researchers prompted it with fragments of the genes for known proteins and determined whether Evo could complete them. In one example, when given 30 percent of the sequence of the gene for a known protein, Evo was able to produce 85 percent of the remainder. When prompted with 80 percent of the sequence, it could return all of the missing sequence. When a single gene was removed from a functional cluster, Evo could also correctly identify and restore the missing gene.
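To make the completion test concrete, here is a minimal Python sketch of how such a check could be scored. It is an illustration under stated assumptions, not the researchers' actual pipeline: the `generate` callable stands in for a call to the model, the function names are hypothetical, and position-by-position identity is a simplification (a real evaluation would more likely use a sequence alignment).

```python
from typing import Callable


def split_prompt(gene: str, prompt_fraction: float) -> tuple[str, str]:
    """Split a gene into a prompt prefix and the held-out remainder."""
    if not 0.0 < prompt_fraction < 1.0:
        raise ValueError("prompt_fraction must be strictly between 0 and 1")
    cut = round(len(gene) * prompt_fraction)
    return gene[:cut], gene[cut:]


def fraction_recovered(held_out: str, completion: str) -> float:
    """Share of held-out positions that the completion matches exactly.

    Assumption: simple position-wise identity, with no alignment or gaps.
    Positions the completion never reaches count as mismatches.
    """
    if not held_out:
        return 1.0
    matches = sum(a == b for a, b in zip(held_out, completion))
    return matches / len(held_out)


def evaluate_completion(
    gene: str,
    prompt_fraction: float,
    generate: Callable[[str, int], str],  # hypothetical stand-in for the model
) -> float:
    """Prompt with a prefix of the gene and score the generated remainder."""
    prompt, held_out = split_prompt(gene, prompt_fraction)
    completion = generate(prompt, len(held_out))
    return fraction_recovered(held_out, completion)
```

Under this scoring, the "30 percent in, 85 percent of the remainder out" result would correspond to `evaluate_completion(gene, 0.3, generate)` returning about 0.85.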
The large amount of training data also ensured that Evo correctly identified the most important regions of the protein. When it did make changes to the sequence, they usually fell in regions of the protein where variability is tolerated. In other words, its training had enabled the system to incorporate the evolutionary constraints on changes in known genes.
So, the researchers decided to test what happened when Evo was asked to output something new. To do this, they used bacterial toxins, which are usually encoded alongside an antitoxin that keeps the cell from killing itself when the gene is activated. There are plenty of examples, and they evolve rapidly as part of an arms race between bacteria and their competitors. So, the team developed a toxin that was only marginally related to known toxins and had no known antitoxin, and fed its sequence to Evo as a prompt. And this time, they filtered out any responses that looked similar to known antitoxin genes.
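The filtering step can be pictured with a short Python sketch. The article does not say how similarity was measured, so everything here is an assumption made for illustration: the function names, the use of shared k-mers (Jaccard similarity) as a crude proxy for sequence similarity, and the `max_similarity` cutoff are all hypothetical. A real pipeline would more likely rely on an alignment-based search.

```python
def kmers(seq: str, k: int = 3) -> set[str]:
    """Return the set of overlapping k-length substrings of a sequence."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard similarity of two sets; 0.0 if both are empty."""
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


def filter_novel(
    candidates: list[str],
    known: list[str],
    max_similarity: float = 0.5,  # hypothetical cutoff, not from the study
    k: int = 3,
) -> list[str]:
    """Keep only candidates whose best match among known sequences
    falls below the similarity cutoff."""
    known_sets = [kmers(s, k) for s in known]
    novel = []
    for cand in candidates:
        cand_set = kmers(cand, k)
        best = max((jaccard(cand_set, ks) for ks in known_sets), default=0.0)
        if best < max_similarity:
            novel.append(cand)
    return novel
```

For example, `filter_novel(["MKTAYIAKQR", "GGGWWWPPPH"], ["MKTAYIAKQR"])` drops the first candidate, which is identical to a known sequence, and keeps the second, which shares no 3-mers with it.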