Researchers in the Ensembl team are making the most of machine learning methods to speed up genome annotation pipelines
EMBL’s European Bioinformatics Institute (EMBL-EBI) stores vast amounts of biological data and our researchers have expert knowledge of what these data are and how best to curate them. This makes EMBL-EBI well equipped to solve biological problems using machine learning – an artificial intelligence (AI) approach requiring extensive input of high-quality data to rapidly generate results.
One project initiated in this way came from within the Ensembl team, who are using machine learning to help allocate a function to different genes in their newly annotated genomes at an unprecedented rate.
Adding gene function to genome annotations
Annotating a genome means identifying and mapping the locations and structures of genes and other genomic features. A genome annotation tells researchers where a gene is located, but assigning a potential function requires additional work and experimental evidence.
“We can start to extrapolate the function of a gene by looking at related genes in other species, but this can be costly, both computationally and in terms of human effort to manually curate the results,” said Fergal Martin, Eukaryotic Annotation Team Leader at EMBL-EBI. “This led us to try a machine learning approach to streamline this process of allocating potential gene function to the genes in our new species annotations deployed through Ensembl Rapid Release. Currently we are focusing on vertebrate species, but we want to extend the approach across eukaryotes.”
Machine learning to allocate gene function
The HUGO Gene Nomenclature Committee (HGNC) and the Vertebrate Gene Nomenclature Committee (VGNC) teams at EMBL-EBI work hard to manually assign gene symbols – a short-form abbreviation for a particular gene, usually with an associated function – to a variety of genomes. While the manual assignment efforts continue to be streamlined and cover a growing number of vertebrate species, many species in Ensembl have the majority of their gene symbols assigned through automated methods.
Historically, gene symbols have been assigned by building gene trees, which describe the evolutionary relationships between genes both across and within species. This approach is computationally costly, especially with the recent rapid growth in the number of sequenced vertebrate genomes. Fergal and his team wanted to see if they could assign gene symbols, and thereby infer function, through a machine learning approach.
“We trained a neural network by feeding it roughly three and a half million protein sequences from a variety of different vertebrate species from Ensembl,” said Fergal. “For these sequences, we already had existing gene symbols with associated functions. The end result is that we have built a classifier that can replicate the existing assignments with around 94-97 percent accuracy, depending on the species. Crucially, it takes less than a minute to generate assignments and confidence values for a vertebrate gene set.”
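To make the idea of "assignments and confidence values" concrete, here is a minimal illustrative sketch. It is not the Ensembl team's model: it swaps the trained neural network for a toy linear softmax classifier over amino-acid composition, and the function names, weights, placeholder symbols, and the 0.7 confidence threshold are all assumptions made up for this example.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues

def composition(seq):
    """Toy feature vector: fraction of each standard amino acid in the sequence."""
    counts = Counter(seq.upper())
    total = sum(counts[a] for a in AMINO_ACIDS) or 1  # avoid division by zero
    return [counts[a] / total for a in AMINO_ACIDS]

def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def assign_symbol(seq, weights, symbols, threshold=0.7):
    """Return (symbol or None, confidence) for one protein sequence.

    `weights` holds one row of 20 numbers per candidate symbol. In a real
    system these would be learned from labelled sequences; here they are
    hypothetical. The symbol is withheld (None) when confidence is below
    `threshold`, which is also an assumed value.
    """
    x = composition(seq)
    scores = [sum(w_i * x_i for w_i, x_i in zip(row, x)) for row in weights]
    probs = softmax(scores)
    best = max(range(len(symbols)), key=probs.__getitem__)
    confidence = probs[best]
    return (symbols[best] if confidence >= threshold else None), confidence

# Hypothetical two-class demo: one class favours alanine, the other tryptophan.
DEMO_SYMBOLS = ["SYMBOL_A", "SYMBOL_W"]
DEMO_WEIGHTS = [
    [10.0 if a == "A" else 0.0 for a in AMINO_ACIDS],
    [10.0 if a == "W" else 0.0 for a in AMINO_ACIDS],
]

if __name__ == "__main__":
    for seq in ("AAAA", "AW"):
        print(seq, assign_symbol(seq, DEMO_WEIGHTS, DEMO_SYMBOLS))
```

In this sketch an alanine-only sequence is assigned `SYMBOL_A` with high confidence, while the evenly mixed sequence scores 0.5 for each class and is left unassigned. That mirrors how a confidence value could flag uncertain calls for a curator to review.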
Why use a machine learning approach?
Machine learning is saving the team a huge amount of computing time, and the system is far simpler to implement than the existing approach that it replicates. The team is therefore looking at making it available to the wider community.
“While the system is not a replacement for the very high-quality manual assignments produced by teams like HGNC and VGNC, this approach could potentially be useful to curators as an additional tool to help manually validate assignments. It’s also something that individual users could use in assessing their own annotations,” said Fergal.
The benefits of machine learning
“As technology advances, scientists are increasingly using machine learning to answer biological questions. One example is the protein structure predictions from AlphaFold. We didn’t have a highly accurate algorithmic approach to figure out how to fold proteins, but deep learning is helping to solve this complex biological mystery,” said Fergal.
“That’s different from what we are trying to do,” he added. “AlphaFold is an example of solving a problem where we didn’t understand all the rules and variables of the system we were trying to model. What we’re doing here is replicating a system that we do understand, but which requires a lot of computing power to run. It’s exciting that deep learning approaches can provide such valuable solutions to challenges across the life sciences.”
Going forward, there is huge potential for using machine learning methods like this, both within Ensembl and across the organisation to benefit other data resources. Machine learning approaches can reduce both computational time and complexity. Large-scale genomics projects such as the Darwin Tree of Life (DToL) and the Earth BioGenome Project (EBP) will also benefit greatly from these approaches, as the new species annotations created for them can be deployed faster and to a high standard.
“If we are to annotate the genomes of all species on Earth, we need to think of where we can make computational savings,” said Fergal. “There’s an incredible wealth of both in-house knowledge and high-quality training data at EMBL-EBI and it’s really exciting to think about how machine learning could not only improve the quality of our data but also drastically reduce the associated computational cost and environmental footprint.”
This machine learning approach will be rolled out as part of Ensembl Rapid Release, Ensembl’s lightweight genome browser designed to allow fast access to the latest genome annotations for a large number of species. Check out Ensembl Rapid Release to find the latest annotations from large biodiversity initiatives such as Darwin Tree of Life, the Vertebrate Genomes Project, and the Earth BioGenome Project.