In a study published in Science today, Berger and her colleagues pull several of these strands together and use NLP to predict mutations that allow viruses to avoid being detected by antibodies in the human immune system, a process known as viral immune escape. The basic idea is that the immune system’s interpretation of a virus is analogous to a human’s interpretation of a sentence.
“It’s a neat paper, building on the momentum of previous work,” says Ali Madani, a scientist at Salesforce who is using NLP to predict protein sequences.
Berger’s team draws on two different linguistic concepts: grammar and semantics (or meaning). The genetic or evolutionary fitness of a virus – characteristics such as how good it is at infecting a host – can be interpreted in terms of grammatical correctness. A successful, infectious virus is grammatically correct; an unsuccessful one is not.
Likewise, mutations in a virus can be interpreted in terms of semantics. Mutations that make a virus look different to things in its environment – such as changes in its surface proteins that make it invisible to certain antibodies – have changed its meaning. Viruses with different mutations can have different meanings, and a virus with a different meaning may need different antibodies to read it.
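The analogy can be made concrete with a toy sketch – this is not the authors’ model, and the sequences, the bigram model, and the frequency-vector “embedding” are all stand-ins chosen for illustration. “Grammaticality” is scored as a sequence’s probability under a simple language model trained on viable sequences, and “semantic change” as the distance between sequence embeddings:

```python
# Toy sketch of the grammar/semantics analogy (not the authors' actual model).
# "Grammaticality" = how probable a sequence is under an add-one-smoothed
# bigram model trained on known viable sequences; "semantic change" =
# distance between crude amino-acid-frequency embeddings.
from collections import Counter
from math import log, sqrt

def bigram_log_prob(seq, counts, total):
    """Log-probability of a sequence under smoothed bigram counts."""
    score = 0.0
    for a, b in zip(seq, seq[1:]):
        score += log((counts[(a, b)] + 1) / (total + 1))
    return score

def embed(seq):
    """Crude 'embedding': normalized character-frequency vector."""
    c = Counter(seq)
    n = len(seq)
    return {k: v / n for k, v in c.items()}

def semantic_change(emb_a, emb_b):
    """Euclidean distance between two frequency embeddings."""
    keys = set(emb_a) | set(emb_b)
    return sqrt(sum((emb_a.get(k, 0) - emb_b.get(k, 0)) ** 2 for k in keys))

# Train the bigram model on a few hypothetical "viable" sequences.
viable = ["MKTAYIAK", "MKTAYLAK", "MKSAYIAK"]
counts = Counter(pair for s in viable for pair in zip(s, s[1:]))
total = sum(counts.values())

wild_type = "MKTAYIAK"
mutant = "MKTAYIAR"  # a single substitution at the last position
grammar = bigram_log_prob(mutant, counts, total)
change = semantic_change(embed(wild_type), embed(mutant))
print(f"grammaticality={grammar:.2f}  semantic change={change:.3f}")
```

A mutation that keeps the grammaticality score high while producing a large semantic change would, in this analogy, be a candidate escape mutation.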
To model these properties, the researchers used an LSTM, a type of neural network that predates the transformer-based ones used by large language models like GPT-3. These older networks can be trained on much less data than transformers and still perform well for many applications.
Instead of millions of sentences, they trained the NLP model on thousands of genetic sequences from three different viruses: 45,000 unique sequences for a strain of influenza, 60,000 for a strain of HIV, and between 3,000 and 4,000 for a strain of SARS-CoV-2, the virus that causes covid-19. “There’s less data on the coronavirus because there’s been less surveillance,” says Brian Hie, an MIT graduate student who built the models.
NLP models work by encoding words in a mathematical space such that words with similar meanings are closer together than words with different meanings. This is called an embedding. For viruses, the embeddings of genetic sequences grouped viruses according to how similar their mutations were. This makes it easier to predict which mutations are more likely for a particular strain than for others.
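The clustering idea can be illustrated with a toy example – the sequences are invented, and the k-mer count vectors below are a crude stand-in for the learned embeddings the study actually uses. Sequences that differ by fewer mutations end up closer together in the vector space:

```python
# Toy illustration of embedding-based similarity (hypothetical sequences;
# real models learn embeddings rather than counting k-mers).
from collections import Counter
from math import sqrt

def kmer_embed(seq, k=2):
    """Represent a sequence as a vector of k-mer counts."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[key] * v[key] for key in u if key in v)
    norm = lambda w: sqrt(sum(x * x for x in w.values()))
    return dot / (norm(u) * norm(v))

base    = "MKTAYIAKQRQISFVK"
near    = "MKTAYIAKQRQISFVR"   # one substitution from base
distant = "MKQAYLAKTRQWSFGK"   # several substitutions from base

sim_near = cosine(kmer_embed(base), kmer_embed(near))
sim_far  = cosine(kmer_embed(base), kmer_embed(distant))
print(f"similarity to near mutant:    {sim_near:.3f}")
print(f"similarity to distant mutant: {sim_far:.3f}")
```

The mutant one substitution away scores higher, i.e. it sits closer to the original in the embedding space, which is the property the grouping relies on.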
The general objective of the approach is to identify mutations that might let a virus escape an immune system without making it less infectious – that is, mutations that change a virus’s meaning without making it grammatically incorrect. To test the tool, the team used a common metric for rating predictions made by machine-learning models, which scores accuracy on a scale from 0.5 (no better than chance) to 1 (perfect). In this case, they took the top mutations identified by the tool and, using real viruses in a lab, checked how many of them were true escape mutations. Their results ranged from 0.69 for HIV to 0.85 for one strain of coronavirus. This is better than results from other state-of-the-art models, they say.
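A 0.5-to-1 scale like the one described is characteristic of the area under the ROC curve (AUC), a standard ranking metric – the article doesn’t name the metric, so treating it as AUC is an assumption, and the scores and labels below are invented for illustration:

```python
# Toy sketch of an AUC-style ranking metric: 0.5 means chance, 1.0 means
# every true escape mutation is ranked above every non-escape mutation.
# Scores and lab labels below are made up for illustration.
def auc(scores, labels):
    """Probability that a random positive outranks a random negative."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Model-assigned escape scores for eight hypothetical mutations,
# with lab-verified labels (1 = true escape mutation).
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2]
labels = [1,   1,   0,   1,   0,   0,   1,   0]
print(f"AUC = {auc(scores, labels):.3f}")  # → AUC = 0.750
```

A score of 0.85, like the one reported for the coronavirus strain, would mean that a true escape mutation outranks a non-escape mutation 85% of the time.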
Knowing what mutations might occur could help hospitals and public health authorities plan ahead. For example, asking the model how much a strain of influenza has changed in meaning since last year would give you an idea of how well the antibodies people have already developed are going to work this year.
The team is now running the models on new variants of the coronavirus, including the so-called UK variant, the Danish mink variant, and variants from South Africa, Singapore, and Malaysia. They have found high potential for immune escape in almost all of them, although this has not yet been tested in the wild. One exception is the so-called South African variant, which has raised alarm over its potential to escape vaccines but was not flagged by the tool. They are trying to figure out why.
Using NLP speeds up a slow process. Previously, the genome of a virus taken from a covid-19 patient in hospital would be sequenced and its mutations re-created and studied in a laboratory. But that can take weeks, says Bryan Bryson, a biologist at MIT who also works on the project. The NLP model predicts potential escape mutations immediately, which focuses the lab work and speeds it up.
“It’s an amazing time to be working on this,” says Bryson. New viral sequences come out every week. “It’s great to simultaneously update your model and then run to the lab to test it in experiments. It’s the best of computational biology,” he says.
But this is only the beginning. Treating genetic mutations as changes in meaning could be applied in different ways across biology. “A good analogy can go a long way,” says Bryson.
For example, Hie believes their approach can be applied to drug resistance. “Think of a cancer protein that acquires resistance to chemotherapy, or a bacterial protein that acquires resistance to an antibiotic,” he says. These mutations can again be thought of as shifts in meaning: “There are lots of creative ways we can start to interpret language models.”
“I think synthetic biology is on the cusp of a revolution,” says Madani. “We are now moving from just collecting data to learning to understand it in depth.”
Researchers are watching the progress of NLP and brainstorming new analogies between language and biology to take advantage of it. But Bryson, Berger, and Hie believe this crossover could go both ways, with new NLP algorithms inspired by concepts from biology. “Biology has its own language,” says Berger.