The success of ChatGPT and its competitors is based on what's known as emergent behavior. These systems, called large language models (LLMs), were not trained to produce natural-sounding language (or effective malware); they were simply tasked with tracking the statistics of word usage. But given a large enough training set of language samples and a sufficiently complex neural network, their training produced an internal representation that "understood" English usage, along with a large collection of facts. Their complex behavior emerged from a much simpler training task.
A team at Meta has now reasoned that this kind of emergent understanding shouldn't be limited to language. So it trained an LLM on the statistics of amino acid occurrences within proteins and used the system's internal representation of what it learned to extract information about the structure of those proteins. The result is not as good as the best competing AI systems for predicting protein structures, but it is significantly faster and still improving.
LLMs: Not just about language
The first thing you need to know to understand this work is that, while the "language" in the name "LLM" refers to these systems' initial development for language-processing tasks, they can potentially be used for a variety of purposes. So while language processing is a common use case for LLMs, these models have other capabilities. In fact, the "large" part of the name is far more informative, since all LLMs have a large number of nodes (the "neurons" in a neural network) and an even larger number of values describing the weights of the connections among those nodes.
The goal of this new work was to take the linear sequence of amino acids that make up a protein and use it to predict how those amino acids are arranged in three-dimensional space once the protein matures. This three-dimensional structure is critical to a protein's function; knowing it can help us understand how proteins misbehave after picking up mutations or allow us to design drugs that inactivate a pathogen's proteins, among other uses. Predicting protein structures was a challenge that stumped generations of scientists until this decade, when Google's DeepMind AI team announced a system that, by most practical definitions of "solved," solved the problem. Google's system was quickly followed by one developed along similar lines by the academic community.
Both of these efforts relied on the fact that evolution has already produced large sets of related proteins that adopt similar three-dimensional conformations. By aligning these related proteins, AI systems could make inferences about where and what kinds of changes could be tolerated while maintaining a similar structure, as well as how changes in one part of the protein could be offset by changes in another. These evolutionary constraints let the systems determine which parts of the protein should be close to each other in three-dimensional space, and thus what the structure was likely to be.
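To make the covariation idea concrete, here is a minimal sketch (with entirely made-up sequences) of how aligned relatives can reveal coupled positions: if two positions always change together across related proteins, that co-variation hints they may be in contact in 3D. This toy uses mutual information between alignment columns, a classic proxy for such coupling, not the actual method used by DeepMind or Meta.

```python
from collections import Counter
from math import log2

# A toy "alignment" of related sequences (hypothetical data): positions 1 and 4
# always co-vary (E pairs with K, K pairs with E), hinting they may touch in 3D.
alignment = [
    "AEGCK",
    "AKGCE",
    "TEGCK",
    "TKGCE",
]

def mutual_information(col_i, col_j, seqs):
    """Mutual information between two alignment columns; higher values
    mean the positions change together and may be spatially close."""
    n = len(seqs)
    pi = Counter(s[col_i] for s in seqs)
    pj = Counter(s[col_j] for s in seqs)
    pij = Counter((s[col_i], s[col_j]) for s in seqs)
    mi = 0.0
    for (a, b), count in pij.items():
        p_ab = count / n
        mi += p_ab * log2(p_ab / ((pi[a] / n) * (pj[b] / n)))
    return mi

# Positions 1 and 4 co-vary perfectly; positions 2 and 3 never vary at all.
print(mutual_information(1, 4, alignment))  # → 1.0 (one bit: strong coupling)
print(mutual_information(2, 3, alignment))  # → 0.0 (no signal)
```

Real systems work from alignments of thousands of sequences and use far more sophisticated statistics, but the underlying signal is the same.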
The reasoning behind Meta's new work is that an LLM-type neural network could be trained in a way that allows the system to capture the same kinds of evolutionary constraints without having to deal with the messy work of aligning all the protein sequences in the first place. Just as the rules of grammar emerge from training an LLM on language samples, the constraints imposed by evolution would emerge from training the system on protein samples.
Paying attention to amino acids
How this worked in practice was that the researchers took a large sample of proteins and randomly masked the identity of a few individual amino acids in each. The system was then asked to predict which amino acid should be present. Over the course of this training, the system developed the ability to use information such as amino acid frequency statistics and the context of the surrounding protein to make its guesses. Implicit in this context are the things that required special handling in previous efforts: the identity of evolutionarily related proteins, and what variation among those relatives tells us about which parts of the protein are close to each other in three-dimensional space.
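The fill-in-the-blank training task can be illustrated at toy scale. The sketch below (all sequences and the context model are invented for illustration; the real system uses a deep transformer, not neighbor counts) learns which amino acid typically appears between a given pair of neighbors and then "unmasks" a hidden residue:

```python
from collections import Counter, defaultdict

# Toy training corpus of protein fragments (hypothetical sequences).
train = ["MKVLAG", "MKVLSG", "AKVLAG", "MKVLAT"]

# Learn simple context statistics: for each (left, right) neighbor pair,
# count which amino acid appears between them.
context_counts = defaultdict(Counter)
for seq in train:
    for i in range(1, len(seq) - 1):
        context_counts[(seq[i - 1], seq[i + 1])][seq[i]] += 1

def predict_masked(seq, pos):
    """Guess the amino acid hidden at `pos` from its immediate neighbors:
    the same fill-in-the-blank task, reduced to counting."""
    ctx = (seq[pos - 1], seq[pos + 1])
    return context_counts[ctx].most_common(1)[0][0]

# Mask position 2 of "MK?LAG" and ask the model to restore it.
print(predict_masked("MK_LAG", 2))  # → 'V' (always seen between K and L above)
```

An actual LLM conditions on the entire sequence rather than two neighbors, which is exactly why the broader evolutionary context ends up encoded in its weights.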
Assuming the reasoning about how LLMs work holds (and Meta was relying on earlier research suggesting it does), the trick to developing a working system is extracting the information contained in the neural network. Neural networks are often considered a "black box" because we don't necessarily know how they arrive at their decisions. But that has become less true over time as people have built in features such as the ability to examine the network's decision-making process.
In this case, the researchers relied on the LLM's ability to report what's called an "attention pattern." In practice, when you give an LLM a series of amino acids and ask it to evaluate them, the attention pattern is the set of features it focuses on in order to perform its analysis.
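An attention pattern is, at bottom, a table of scores saying how strongly each residue "looks at" every other residue. A minimal sketch of the standard scaled dot-product computation, using tiny made-up embedding vectors in place of a trained network's internal representations:

```python
import math

# Toy residue representations (hypothetical 2-D vectors, one per amino acid
# in a short sequence). In a real LLM these come from the trained network.
embeddings = [
    [1.0, 0.0],   # residue 0
    [0.9, 0.1],   # residue 1 (similar to residue 0)
    [0.0, 1.0],   # residue 2
]

def attention_pattern(vecs):
    """Scaled dot-product attention: row i gives the weights with which
    residue i attends to every residue in the sequence (rows sum to 1)."""
    d = len(vecs[0])
    pattern = []
    for q in vecs:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in vecs]
        m = max(scores)                      # subtract max for stability
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        pattern.append([e / total for e in exps])
    return pattern

pattern = attention_pattern(embeddings)
# Residue 0 attends more strongly to the similar residue 1 than to residue 2.
print(pattern[0][1] > pattern[0][2])  # → True
```

In the trained protein model, residue pairs that attend strongly to each other tend to be the ones the evolutionary record says belong together, which is what makes the pattern useful for structure prediction.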
To convert the attention pattern into a 3D structure, the researchers trained a second AI system to correlate the attention patterns of proteins with known 3D structures against the actual structures. Since we only have experimentally determined structures for a limited number of proteins, the researchers also used some of the structures predicted by one of the other AI systems as part of this training.
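The essence of that second stage is supervised learning: given (attention score, measured distance) pairs from known structures, fit a model that maps attention to geometry. The real system is far more elaborate, but a one-variable least-squares fit on invented numbers captures the idea:

```python
# Toy sketch of the second stage: learn a mapping from attention scores to
# pairwise residue distances, using proteins whose structures are "known".
# All numbers here are made up for illustration.

# (attention score between a residue pair, measured 3D distance in angstroms)
known = [(0.9, 4.0), (0.7, 6.0), (0.5, 8.0), (0.2, 12.0)]

def fit_linear(pairs):
    """Least-squares fit of distance ≈ a * attention + b."""
    n = len(pairs)
    sx = sum(x for x, _ in pairs)
    sy = sum(y for _, y in pairs)
    sxx = sum(x * x for x, _ in pairs)
    sxy = sum(x * y for x, y in pairs)
    a = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    b = (sy - a * sx) / n
    return a, b

a, b = fit_linear(known)

def predict_distance(attention):
    return a * attention + b

# High attention between two residues predicts they sit close in space.
print(predict_distance(0.8) < predict_distance(0.3))  # → True
```

Once such a mapping exists for all residue pairs, the predicted distances constrain the chain enough to assemble a full 3D model.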
The resulting system was named ESM-2. Once fully trained, ESM-2 was able to ingest a raw sequence of amino acids and produce a 3D protein structure, along with a score representing its confidence in the accuracy of that structure.