Can an AI predict the language of viral mutation?

Viruses lead a rather repetitive existence. They enter a cell, hijack its machinery to turn it into a viral copy machine, and the copies head to other cells armed with instructions to do the same. And so it goes, indefinitely. But every so often, amid all this copying and pasting, something gets garbled. Mutations appear in the copies. Sometimes a mutation changes an amino acid and a vital protein fails to fold, so that viral version is consigned to evolutionary history’s trash can. Sometimes the mutation does nothing, because different sequences that encode the same proteins compensate for the error. But every now and then, a mutation lands perfectly. The change doesn’t affect the virus’s ability to exist; instead, it produces a useful difference, such as making the virus unrecognizable to a person’s immune defenses. When this lets the virus evade antibodies generated by previous infections or a vaccine, that mutant variant of the virus is said to have “escaped.”

Scientists are always looking for signs of potential escape. That’s true for SARS-CoV-2, as new strains emerge and researchers investigate what the genetic changes could mean for a long-term vaccine. (So far, things seem to be looking good.) It’s also what vexes researchers who study flu and HIV, viruses that routinely evade our immune defenses. So, in an effort to see what might be coming, researchers create hypothetical mutants in the lab and test whether they can escape antibodies taken from recent patients or vaccine recipients. But the genetic code offers too many possibilities to test every evolutionary branch the virus could take over time. It’s a matter of keeping up.

Last winter, Brian Hie, a computational biologist at MIT and a fan of John Donne’s lyric poetry, was thinking about this problem when he hit on an analogy: what if we think of viral sequences the way we think of written language? Each viral sequence has a kind of grammar, he reasoned, a set of rules it needs to follow to be that particular virus. When mutations violate this grammar, the virus hits an evolutionary dead end; in virological terms, it lacks “fitness.” Also like language, from the immune system’s point of view, a sequence can be said to have a kind of semantics. There are some sequences the immune system can interpret, and thus stop the virus with antibodies and other defenses, and some it cannot. So a viral escape can be seen as a change that preserves a sequence’s grammar but changes its meaning.
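
The analogy above can be rendered as a tiny predicate. This is a hypothetical sketch of the idea only, not the study’s code; the scoring functions and thresholds are stand-ins for whatever model would supply them:

```python
# Toy rendering of the escape analogy (hypothetical, for illustration):
# a mutation is an escape candidate only if it keeps the sequence
# "grammatical" (the virus still works) while changing its "meaning"
# (the immune system no longer recognizes it).

def is_escape_candidate(grammaticality: float, semantic_change: float,
                        grammar_threshold: float = 0.5,
                        change_threshold: float = 0.5) -> bool:
    """Escape = grammar preserved AND meaning changed."""
    return (grammaticality >= grammar_threshold
            and semantic_change >= change_threshold)

# A fit, immune-evading mutant qualifies...
print(is_escape_candidate(0.9, 0.8))  # → True
# ...while a broken virus does not, however different it looks.
print(is_escape_candidate(0.1, 0.8))  # → False
```

The point of the conjunction is that neither axis alone suffices: a mutant that merely looks different but cannot replicate is a dead end, and one that replicates but is still recognized gets stopped by antibodies.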

The analogy had a simple elegance, almost too simple. But for Hie it was also practical. In recent years, AI systems have become very good at modeling the principles of grammar and semantics in human language. They do this by training on datasets of billions of words, organized into sentences and paragraphs, from which the system derives patterns. Without being told any explicit rules, the system learns where the commas should go and how to structure a clause. It can also be said to intuit the meaning of certain strings, words and phrases, based on the many contexts in which they appear across the dataset. It’s patterns all the way down. That’s how the most advanced language models, like OpenAI’s GPT-3, can learn to produce grammatically perfect prose that stays reasonably on topic.
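
The “patterns, no explicit rules” idea can be illustrated at miniature scale. The sketch below is not a real language model; it is a bigram counter that learns which word tends to follow which purely from co-occurrence in a tiny made-up corpus, then scores a sentence by how familiar its adjacent word pairs are:

```python
from collections import defaultdict

def train_bigrams(corpus):
    """Count how often each word follows another; no rules supplied."""
    counts = defaultdict(lambda: defaultdict(int))
    for sentence in corpus:
        tokens = sentence.split()
        for prev, nxt in zip(tokens, tokens[1:]):
            counts[prev][nxt] += 1
    return counts

def plausibility(counts, sentence):
    """Score a sentence by summing the familiarity of its word pairs."""
    tokens = sentence.split()
    return sum(counts[prev][nxt] for prev, nxt in zip(tokens, tokens[1:]))

corpus = [
    "the virus enters the cell",
    "the virus copies itself",
    "the cell makes the protein",
]
counts = train_bigrams(corpus)

# Word order matching the training patterns scores high...
print(plausibility(counts, "the virus enters the cell"))  # → 6
# ...while the same words shuffled score zero.
print(plausibility(counts, "cell the enters virus the"))  # → 0
```

Real models replace these counts with learned continuous representations over billions of words, but the principle is the same: regularities emerge from the data itself.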

One advantage of this idea is that it’s generalizable. To a machine-learning model, a sequence is a sequence, whether it’s organized into sonnets or amino acids. According to Jeremy Howard, an AI researcher at the University of San Francisco and an expert on language models, applying these models to biological sequences can be fruitful. With enough data from, say, the genetic sequences of viruses known to be infectious, the model will implicitly learn something about how infectious viruses are structured. “That model will have a lot of sophisticated and complex knowledge,” he says. Hie knew this could be the case. His graduate advisor, the computer scientist Bonnie Berger, had previously done similar work with another member of her lab, using AI to predict protein folding patterns.
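
The “a sequence is a sequence” point can be made concrete: the same generic pattern-counting routine works whether its tokens are English words or amino-acid residues. This is a toy illustration, not the published model; the sequences are invented:

```python
from collections import defaultdict

def train_pairs(sequences):
    """Count adjacent-token pairs in any kind of sequence."""
    counts = defaultdict(int)
    for seq in sequences:
        for pair in zip(seq, seq[1:]):
            counts[pair] += 1
    return counts

def familiarity(counts, seq):
    """Familiarity of the least-familiar adjacent pair in a sequence;
    0 flags a pattern never seen in training."""
    return min(counts[pair] for pair in zip(seq, seq[1:]))

# The same code handles two very different domains.
prose = train_pairs([s.split() for s in
                     ["the cell divides", "the virus divides"]])
protein = train_pairs(["MKTV", "MKVL"])  # strings iterate per residue

print(familiarity(prose, "the cell divides".split()))  # → 1 (seen)
print(familiarity(protein, "MQTV"))                    # → 0 (unseen pair)
```

Nothing in `train_pairs` or `familiarity` knows about English or proteins; given enough sequences of viable viruses, a far richer model of this shape would implicitly absorb how such viruses are structured.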
