by Vicky Hatch

Streamlining proteomics data access for machine learning applications

Mass spec data with 0s and 1s.
ProteomicsML. Credit: Karen Arnott/EMBL-EBI


  • ProteomicsML is a unified platform to streamline how researchers access proteomics data for machine learning applications.
  • The platform enhances accessibility and reproducibility in proteomics research by providing comprehensive tutorials.
  • ProteomicsML is a community-driven project and encourages contributions so that it can expand and evolve with the rapidly advancing field of proteomics.

ProteomicsML – a free online resource – aims to simplify the complex, time-consuming process of preparing proteomics datasets for training machine learning algorithms. By acting as a centralised platform, ProteomicsML aims to make proteomics research more accessible and reproducible. The platform also contains a range of freely available tutorials to support both experienced scientists and newcomers to the field.

“ProteomicsML emerged as a community-driven project,” said Juan Antonio Vizcaino, Team Leader, Proteomics at EMBL’s European Bioinformatics Institute (EMBL-EBI). “The aim of the platform is to promote AI and machine learning applications for mass spectrometry-based proteomics data. The community will create and document training datasets and tutorials, making ProteomicsML a vital resource for anyone new to the field or looking to apply machine learning to proteomics data.”

Proteomics data processing

Preparing proteomics data for machine learning is complex and time-consuming. Different labs use varied methods, making it hard to share and use data widely. ProteomicsML tackles this challenge by offering an online platform with easy-to-use data formats and detailed tutorials to aid accessibility across the field.

ProteomicsML also facilitates the application of machine learning to proteomics data by offering openly available datasets that can be used to train machine learning algorithms, and by providing educational materials to help researchers get the most out of these datasets. The resource covers a wide range of data types including ion fragmentation intensity, ion mobility, retention time, protein detectability, and more, making ProteomicsML a valuable tool for both the proteomics community and AI practitioners.
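
To give a sense of what preparing such data for a model can involve, here is a minimal Python sketch. It is not taken from ProteomicsML or its tutorials. It assumes a simple table of peptide sequences paired with a measured property such as retention time, and turns each peptide into amino acid counts that a model could learn from. The function names and the toy rows are hypothetical and were invented for illustration.

```python
# Standard 20 amino acids, in a fixed order so every peptide maps to the same columns.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def composition_features(peptide):
    """Count how often each of the 20 standard amino acids occurs in a peptide."""
    return [peptide.count(aa) for aa in AMINO_ACIDS]


def build_training_table(rows):
    """Split (peptide, target) pairs into a feature matrix X and a target list y."""
    X = [composition_features(peptide) for peptide, _ in rows]
    y = [target for _, target in rows]
    return X, y


# Invented toy rows for illustration only; not from any ProteomicsML dataset.
toy_rows = [("PEPTIDE", 12.5), ("ACDK", 3.0)]
X, y = build_training_table(toy_rows)
```

Amino acid counts are only one of many possible encodings. They ignore the order of residues, so a real analysis would likely use a richer representation.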

Expansion and community contribution

ProteomicsML is designed to evolve with the field. It encourages community contributions, allowing researchers to upload data and tutorials on their data handling and machine learning methodologies.

ProteomicsML also stands as a comprehensive resource for those working with machine learning methods to analyse mass spectrometry proteomics data. By providing multiple datasets on a range of liquid chromatography and mass spectrometry peptide properties, it offers an accessible starting point for newcomers to the field.

Community-driven resource

ProteomicsML was created by the University of Southern Denmark (SDU), CompOmics, Leiden University Medical Center (LUMC), PeptideAtlas, the National Institute of Standards and Technology (NIST), the PRoteomics IDEntification database (PRIDE), and MSAID. It was a direct output from a workshop held at the Lorentz Center, in Leiden.

This was originally published by EMBL News.