Article

Record sourced from PubMed

An evaluation methodology for machine learning-based tandem mass spectra similarity prediction

Strobel, Michael
Department of Computer Science and Engineering, University of California Riverside, 900 University Ave., Riverside, CA, 92521, USA
Gil-de-la-Fuente, Alberto
Information Technologies Department, Escuela Politécnica Superior, Universidad San Pablo-CEU, CEU Universities, Urbanización Montepríncipe, Boadilla Del monte, 28668, Madrid, Spain
Zare Shahneh, Mohammad Reza
Department of Computer Science and Engineering, University of California Riverside, 900 University Ave., Riverside, CA, 92521, USA
Abiead, Yasin El
Skaggs School of Pharmacy and Pharmaceutical Science, University of California San Diego, 9255 Pharmacy Ln, San Diego, CA, 92093, USA
Bushuiev, Roman
Institute of Organic Chemistry and Biochemistry, Czech Academy of Sciences, Flemingovo nám. 542/2, Prague, 16000, Czech Republic; Czech Institute of Informatics, Robotics and Cybernetics, Jugoslávských partyzánů 1580/3, Prague, 16000, Czech Republic
Bushuiev, Anton
Czech Institute of Informatics, Robotics and Cybernetics, Jugoslávských partyzánů 1580/3, Prague, 16000, Czech Republic
Pluskal, Tomáš
Institute of Organic Chemistry and Biochemistry, Czech Academy of Sciences, Flemingovo nám. 542/2, Prague, 16000, Czech Republic
Wang, Mingxun
Department of Computer Science and Engineering, University of California Riverside, 900 University Ave., Riverside, CA, 92521, USA. mingxun.wang@cs.ucr.edu

BMC Bioinformatics. 2025 Jul 11;26(1):174. Epub 2025 Jul 11.

Source: BMC Bioinformatics, ISSN 1471-2105

Language: English. Country: United Kingdom, England. Medium: electronic

Document type: journal article

Persistent link: https://www.medvik.cz/link/pmid40646448

Grant support
1R03OD034493-01 NIH HHS - United States
NIH 5U24DK133658-02 NIH HHS - United States

Online full text

PubMed 40646448
PubMed Central PMC12247221
DOI 10.1186/s12859-025-06194-1
PII: 10.1186/s12859-025-06194-1
Knihovny.cz e-resources

Keywords
Benchmark, Machine learning, Mass spectrometry, Metabolomics, Spectral similarity measure
MeSH
Algorithms
Machine Learning*
Tandem Mass Spectrometry* / methods
Publication type
Journal Article

BACKGROUND: Untargeted tandem mass spectrometry serves as a scalable solution for the organization of small molecules. One of the most prevalent techniques for analyzing the acquired tandem mass spectrometry (MS/MS) data, called molecular networking, organizes and visualizes putatively structurally related compounds. However, a key bottleneck of this approach is the comparison of MS/MS spectra used to identify nearby structural neighbors. Machine learning (ML) approaches have emerged as a promising technique to predict structural similarity from MS/MS data and may surpass the current state-of-the-art algorithmic methods. Comparing these ML methods remains a challenge, however, because there is no standardized way to benchmark, evaluate, and compare MS/MS similarity methods, and no existing method addresses data leakage between training and test data in order to analyze model generalizability.
RESULTS: In this work, we present a new evaluation methodology using a train/test split that allows machine learning models to be evaluated at varying degrees of structural similarity between training and test sets. We also introduce a training and evaluation framework that measures prediction accuracy on domain-inspired annotation and retrieval metrics designed to mirror real-world applications. We further show how two alternative training methods that leverage MS-specific insights (e.g., similar instrumentation, collision energy, adduct) affect method performance, and we demonstrate the orthogonality of the proposed metrics. We especially highlight the role that collision energy plays in prediction errors. Finally, we release a continually updated version of our dataset online along with our data cleaning and splitting pipelines for community use.
CONCLUSION: It is our hope that this benchmark will serve as the basis of development for future machine learning approaches in MS/MS similarity and facilitate comparison between models. We anticipate that the introduced set of evaluation metrics allows for a better reflection of practical performance.
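
The abstract does not spell out how the train/test split is built, so the following is only a minimal illustrative sketch of the general idea of evaluating at "varying degrees of structural similarity between training and test sets". It assumes that structures are represented as sets of fingerprint on-bit indices, that similarity is measured with the Tanimoto index, and that test items are grouped by their maximum similarity to any training item. The function names, toy fingerprints, and bin edges (0.4, 0.7) are hypothetical and are not taken from the authors' pipeline.

```python
from collections import defaultdict


def tanimoto(a: frozenset, b: frozenset) -> float:
    """Tanimoto (Jaccard) similarity between two sets of on-bit indices."""
    if not a and not b:
        return 0.0
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)


def max_train_similarity(test_fp: frozenset, train_fps: list) -> float:
    """Highest similarity of one test fingerprint to any training fingerprint."""
    return max((tanimoto(test_fp, fp) for fp in train_fps), default=0.0)


def bin_test_set(test_fps: dict, train_fps: list, edges=(0.4, 0.7)) -> dict:
    """Group test item names by their maximum similarity to the training set.

    Bin 0 holds items below the first edge, bin 1 those between the first
    and second edge, and so on (edges are hypothetical thresholds).
    """
    bins = defaultdict(list)
    for name, fp in test_fps.items():
        sim = max_train_similarity(fp, train_fps)
        bin_index = sum(sim >= edge for edge in edges)
        bins[bin_index].append(name)
    return dict(bins)


# Toy fingerprints (made-up bit indices, not real molecules).
train = [frozenset({1, 2, 3, 4}), frozenset({10, 11, 12})]
test = {
    "near": frozenset({1, 2, 3, 5}),    # shares 3 of 5 distinct bits with the first training item
    "far": frozenset({20, 21}),         # shares no bits with the training set
    "dup": frozenset({10, 11, 12}),     # identical to a training item
}

result = bin_test_set(test, train)
for bin_index in sorted(result):
    print(bin_index, result[bin_index])
```

Reporting a model's metrics separately for each bin would show how performance changes as test structures become less similar to anything seen during training; in practice the fingerprints would come from a cheminformatics toolkit rather than hand-written sets.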
