Meta AI creates the largest metagenomic protein database
Ever wondered what the proteins in the ocean, or in your own body, are made of? Meta AI's latest database might have an answer. The ESM Metagenomic Atlas is the first of its kind, comprising more than 600 million predicted metagenomic protein structures. Discovering new metagenomic proteins in this repository might aid in curing diseases, cleaning up the environment, and producing cleaner energy.
Why does this story matter?
Sequences of billions of new proteins have already been cataloged in databases compiled by the NCBI, the Joint Genome Institute, and the European Bioinformatics Institute. So what's different about Meta AI's new database? The novelty lies in its language model, which provides the 'first comprehensive view of the structures of proteins in a metagenomics database at the scale of hundreds of millions of proteins.'
What is metagenomics?
According to the National Human Genome Research Institute, metagenomics is defined as the "study of the structure and function of entire nucleotide sequences isolated and analyzed from all the organisms (typically microbes) in a bulk sample." In other words, it involves the study of a specific community of microorganisms, such as those residing on human skin, in the soil, or in a water sample.
Amino acids are denoted by a specific character
Proteins are complex molecules made up of building blocks called amino acids. There are 20 standard amino acids. Just as an essay is a sequence of words, a protein can be written as a sequence of characters, with each character denoting a specific amino acid.
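As a minimal illustration of that idea (a sketch, not anything from Meta AI's code), the Python snippet below treats a protein as a string over the standard one-letter codes for the 20 amino acids. The example sequence and the helper name is_valid_protein are made up for this illustration.

```python
# One-letter codes for the 20 standard amino acids.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def is_valid_protein(sequence: str) -> bool:
    """Return True if the sequence is non-empty and uses only standard codes."""
    return len(sequence) > 0 and all(ch in AMINO_ACIDS for ch in sequence)


# Illustrative sequence, invented for this example.
example = "GLAKKEMAHYSG"

print(len(AMINO_ACIDS))            # 20
print(is_valid_protein(example))   # True
print(is_valid_protein("GLXKKE"))  # False: "X" is not a standard code
```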
The language model was developed by experimenting with various proteins
"Using a form of self-supervised learning known as masked language modeling, we trained a language model on the sequences of millions of natural proteins," explains Meta AI. "We trained a language model to fill in the blanks in a protein sequence, like 'GL_KKE_AHY_G' across millions of diverse proteins. We found that information about the structure and function of proteins emerges from this training."
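The fill-in-the-blanks setup Meta AI describes can be sketched in a few lines of Python. This is only an illustration of how masked training examples might be prepared, not Meta AI's implementation: the helper names, the 15% masking rate (a common convention in masked language modeling), and the example sequence are all assumptions.

```python
import random

MASK = "_"  # placeholder character, mirroring the blanks in the quoted example


def mask_sequence(sequence, mask_fraction=0.15, seed=0):
    """Hide a random subset of residues and remember what was hidden.

    mask_fraction is an assumed value, not a figure from the article.
    """
    if not sequence:
        raise ValueError("sequence must be non-empty")
    rng = random.Random(seed)
    n_masked = max(1, round(len(sequence) * mask_fraction))
    positions = rng.sample(range(len(sequence)), n_masked)
    # The hidden residues are the targets a model would be asked to predict.
    targets = {i: sequence[i] for i in positions}
    masked = "".join(MASK if i in targets else ch for i, ch in enumerate(sequence))
    return masked, targets


def fill_blanks(masked, predictions):
    """Put predicted residues back into the masked positions."""
    return "".join(predictions.get(i, ch) for i, ch in enumerate(masked))


original = "GLAKKEMAHYSG"  # illustrative sequence, invented for this example
masked, targets = mask_sequence(original)
print(masked)   # the sequence with some residues replaced by "_"
print(targets)  # position -> hidden residue
print(fill_blanks(masked, targets) == original)  # True
```

A model trained this way sees only the masked string and is scored on how well it recovers the hidden letters.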
The latest language model enables structure prediction with high resolution
Evolutionary Scale Modeling (ESM) uses AI to read protein sequences. These language models can pick up properties of proteins, including their structure and function. ESM-1b, released in 2020, has been used to help predict the evolution of COVID-19 and to identify genetic causes of disease. It has since been scaled up to ESM-2, the next-generation model, which predicts protein structures at atomic-level resolution.
What is unique about the ESM Metagenomic Atlas?
Meta AI claims that the ESM Metagenomic Atlas is the first database to provide a comprehensive record of metagenomic protein structures. It is also, the company says, the largest known database of high-resolution predicted protein structures, three times larger than any existing protein structure database. In addition, the company's protein-folding model, ESMFold, can make predictions up to sixty times faster than current approaches.
The enormous database might be a significant tool for researchers
"The ESM Metagenomic Atlas will enable scientists to search and analyze the structures of metagenomic proteins at the scale of hundreds of millions of proteins," said Meta AI in its official blog post. "This can help researchers to identify structures that have not been characterized before, search for distant evolutionary relationships, and discover new proteins that can be useful in medicine and other applications."
Their language model works several times faster than existing ones
"This new structure prediction capability enabled us to predict sequences for the more than 600 million metagenomic proteins in the atlas in just two weeks on a cluster of approximately 2,000 GPUs," states Meta AI, crediting the speed of their prediction algorithm.
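A rough back-of-the-envelope calculation shows what that quote implies per GPU. The sketch below assumes exactly 600 million proteins, exactly 14 days, exactly 2,000 GPUs, and an evenly shared, uninterrupted workload; the real figures are approximate, so the result is only an order-of-magnitude estimate.

```python
# Figures from the quote, rounded to exact values for this estimate (assumption).
TOTAL_PROTEINS = 600_000_000
NUM_GPUS = 2_000
DAYS = 14
SECONDS_PER_DAY = 24 * 60 * 60

# Share of the workload handled by each GPU, assuming an even split.
proteins_per_gpu = TOTAL_PROTEINS / NUM_GPUS

# Total GPU time available, and the average GPU time spent per protein.
gpu_seconds = NUM_GPUS * DAYS * SECONDS_PER_DAY
seconds_per_protein = gpu_seconds / TOTAL_PROTEINS

print(f"Proteins per GPU: {proteins_per_gpu:,.0f}")            # 300,000
print(f"GPU-seconds per protein: {seconds_per_protein:.1f}")   # 4.0
```

Under those assumptions, each GPU would handle about 300,000 proteins, or roughly one every four seconds.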