Viruses lead a rather repetitive existence. They enter a cell, hijack its machinery to turn it into a viral copier, and those copies make their way to other cells armed with instructions to do the same. So it goes, over and over again. But every so often, in the midst of this repeated copy and paste, things get mixed up. Mutations arise in the copies. Sometimes a mutation means an amino acid doesn’t get made and a vital protein doesn’t fold, so into the dustbin of evolutionary history that viral version goes. Sometimes the mutation does nothing at all, because different sequences that encode the same proteins make up for the error. But every once in a while, a mutation works out perfectly. The change doesn’t affect the virus’s ability to exist; instead, it does something useful, such as making the virus unrecognizable to a person’s immune system. When that allows the virus to evade antibodies generated by past infections or by a vaccine, that mutant variant of the virus is said to have “escaped.”
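To make the three outcomes above concrete, here is a minimal Python sketch that classifies a single-codon change as silent (same amino acid), missense (different amino acid), or nonsense (a premature stop). The codon assignments come from the standard genetic code, but the table is deliberately partial, and the function name `classify_mutation` is a hypothetical helper chosen for illustration, not something taken from any real tool.

```python
# Partial codon table from the standard genetic code (RNA alphabet).
# Only the handful of codons used in the examples below are included.
CODON_TO_AMINO_ACID = {
    "GCU": "Ala",
    "GCC": "Ala",   # synonymous with GCU: a different sequence, the same amino acid
    "GAU": "Asp",
    "UGG": "Trp",
    "UGA": "Stop",
}


def classify_mutation(original_codon: str, mutated_codon: str) -> str:
    """Classify a codon change as 'silent', 'missense', or 'nonsense'."""
    before = CODON_TO_AMINO_ACID[original_codon]
    after = CODON_TO_AMINO_ACID[mutated_codon]
    if before == after:
        return "silent"      # the protein is unchanged
    if after == "Stop":
        return "nonsense"    # the protein is cut short
    return "missense"        # one amino acid is swapped for another


silent_example = classify_mutation("GCU", "GCC")    # Ala -> Ala
missense_example = classify_mutation("GCU", "GAU")  # Ala -> Asp
nonsense_example = classify_mutation("UGG", "UGA")  # Trp -> Stop
```

Whether a missense change ruins the protein or happens to help the virus is not something a lookup table can say; that is the hard part the rest of this story is about.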
Scientists are always on the lookout for signs of potential escape. That’s true for SARS-CoV-2, as new strains emerge and scientists investigate what the genetic changes might mean for a long-lasting vaccine. (So far, things are going well.) It’s also what confounds researchers who study influenza and HIV, which routinely evade our immune defenses. So, in an effort to see what might be coming, researchers create hypothetical mutants in the lab and test whether they can evade antibodies taken from recent patients or vaccine recipients. But the genetic code offers too many possibilities to test every evolutionary branch the virus might take over time. It’s a matter of keeping up.
Last winter, Brian Hie, a computational biologist at MIT and a fan of John Donne’s lyric poetry, was pondering this problem when he hit on an analogy: What if we thought of viral sequences the way we think of written language? Every viral sequence has a sort of grammar, he reasoned, a set of rules it must follow in order to be that particular virus. When mutations violate that grammar, the virus reaches an evolutionary dead end. In virology terms, it lacks “fitness.” Also like language, from the immune system’s perspective, the sequence could be said to have a kind of semantics. There are some sequences the immune system can interpret, and thus stop the virus with antibodies and other defenses, and others it cannot. So a viral escape could be seen as a change that preserves the sequence’s grammar but changes its meaning.
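The idea that an escape mutation keeps the grammar but changes the meaning can be sketched as a ranking problem. The Python below is an illustration only: the two scores per candidate are made-up placeholder numbers (a real system would have to derive them from a trained model), the candidate names are hypothetical, and summing the two ranks is just one simple way to favor mutants that do well on both axes. It is not presented as the method used in the research described here.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    grammaticality: float   # placeholder score: does the mutant still "follow the rules"?
    semantic_change: float  # placeholder score: how different does it "read" to the immune system?


def rank_escape_candidates(candidates):
    """Order candidates so those high on BOTH scores come first."""
    by_grammar = sorted(candidates, key=lambda c: c.grammaticality)
    by_semantics = sorted(candidates, key=lambda c: c.semantic_change)
    # Rank 0 is the lowest score on each axis; higher rank means a higher score.
    grammar_rank = {c.name: i for i, c in enumerate(by_grammar)}
    semantic_rank = {c.name: i for i, c in enumerate(by_semantics)}
    return sorted(
        candidates,
        key=lambda c: grammar_rank[c.name] + semantic_rank[c.name],
        reverse=True,
    )


# Invented example values, chosen only to show the four corner cases.
candidates = [
    Candidate("mutant_a", grammaticality=0.9, semantic_change=0.1),  # viable, still recognized
    Candidate("mutant_b", grammaticality=0.1, semantic_change=0.9),  # novel, but broken
    Candidate("mutant_c", grammaticality=0.8, semantic_change=0.8),  # viable and novel
    Candidate("mutant_d", grammaticality=0.2, semantic_change=0.2),  # broken and recognized
]

ranked = rank_escape_candidates(candidates)
top_candidate = ranked[0].name
bottom_candidate = ranked[-1].name
```

Using ranks instead of raw scores sidesteps the question of whether the two numbers are on comparable scales, which matters when they come from different kinds of measurement.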
The analogy had a simple elegance, almost too simple. But for Hie, it was also practical. In recent years, AI systems have become very good at modeling the principles of grammar and semantics in human language. They do this by training a system on data sets of billions of words, arranged in sentences and paragraphs, from which the system derives patterns. In this way, without being told any specific rules, the system learns where the commas should go and how to structure a clause. It can also be said to intuit the meaning of certain sequences, words and phrases, based on the many contexts in which they appear in the data set. It’s patterns, all the way down. This is how the most advanced language models, like OpenAI’s GPT-3, can learn to produce perfectly grammatical prose that manages to stay reasonably on topic.
One advantage of this idea is that it is generalizable. For a machine learning model, a sequence is a sequence, whether organized into sonnets or amino acids. According to Jeremy Howard, an AI researcher at the University of San Francisco and an expert in language models, applying such models to biological sequences can be fruitful. With enough data from, for example, genetic sequences of viruses known to be infectious, the model will implicitly learn something about the structure of infectious viruses. “This model will have a lot of sophisticated and complex knowledge,” he says.
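The point that a sequence is a sequence can be shown with something far cruder than a modern language model. The Python sketch below counts which token tends to follow which, and the same unmodified code runs on a list of words or on a string of amino-acid letters. The function names are hypothetical and the peptide strings are invented toy data, not real viral sequences; a bigram counter stands in for the large neural models discussed above only to show that the learning code does not care what the tokens represent.

```python
from collections import Counter, defaultdict


def train_bigram_model(sequences):
    """Count, for each token, how often each other token follows it."""
    counts = defaultdict(Counter)
    for seq in sequences:
        # Pair every token with its successor: (t0, t1), (t1, t2), ...
        for current, following in zip(seq, seq[1:]):
            counts[current][following] += 1
    return counts


def most_likely_next(model, token):
    """Return the token seen most often after `token` in the training data."""
    return model[token].most_common(1)[0][0]


# Human language: each sequence is a list of words.
sentences = [
    ["no", "man", "is", "an", "island"],
    ["every", "man", "is", "a", "piece"],
]
word_model = train_bigram_model(sentences)
next_word = most_likely_next(word_model, "man")

# Protein "language": each sequence is a string of one-letter amino acid codes.
# These short strings are made up for illustration.
peptides = ["MKVL", "MKTL", "MKVA"]
protein_model = train_bigram_model(peptides)
next_residue = most_likely_next(protein_model, "K")
```

No rule about English or about proteins was written down anywhere; whatever the model "knows" comes from the patterns in the examples it was given.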