An evaluation methodology for machine learning-based tandem mass spectra similarity prediction
M. Strobel, A. Gil-de-la-Fuente, MR. Zare Shahneh, YE. Abiead, R. Bushuiev, A. Bushuiev, T. Pluskal, M. Wang
Language: English; Country: England, United Kingdom
Document type: journal articles
Grant support
1R03OD034493-01
NIH HHS - United States
NIH 5U24DK133658-02
NIH HHS - United States
NLK
BioMedCentral
since 2000-01-12
BioMedCentral Open Access
since 2000
Directory of Open Access Journals
since 2000
Free Medical Journals
since 2000
PubMed Central
since 2000
Europe PubMed Central
since 2000
ProQuest Central
since 2009-01-01
Open Access Digital Library
since 2000-07-01
Open Access Digital Library
since 2000-01-01
Open Access Digital Library
since 2000-01-01
Medline Complete (EBSCOhost)
since 2000-01-01
Health & Medicine (ProQuest)
since 2009-01-01
ROAD: Directory of Open Access Scholarly Resources
since 2000
Springer Nature OA/Free Journals
since 2000-12-01
- MeSH
- algorithms MeSH
- machine learning * MeSH
- tandem mass spectrometry * methods MeSH
- Publication type
- journal articles MeSH
BACKGROUND: Untargeted tandem mass spectrometry serves as a scalable solution for the organization of small molecules. One of the most prevalent techniques for analyzing the acquired tandem mass spectrometry (MS/MS) data, called molecular networking, organizes and visualizes putatively structurally related compounds. However, a key bottleneck of this approach is the comparison of MS/MS spectra used to identify nearby structural neighbors. Machine learning (ML) approaches have emerged as a promising technique to predict structural similarity from MS/MS that may surpass the current state-of-the-art algorithmic methods. However, comparing these ML methods remains a challenge because there is no standardized way to benchmark, evaluate, and compare MS/MS similarity methods, and no existing method addresses data leakage between training and test data in order to analyze model generalizability.
RESULTS: In this work, we present a new evaluation methodology using a train/test split that allows machine learning models to be evaluated at varying degrees of structural similarity between training and test sets. We also introduce a training and evaluation framework that measures prediction accuracy on domain-inspired annotation and retrieval metrics designed to mirror real-world applications. We further show how two alternative training methods that leverage MS-specific insights (e.g., similar instrumentation, collision energy, adduct) affect method performance, and we demonstrate the orthogonality of the proposed metrics. We especially highlight the role that collision energy plays in prediction errors. Finally, we release a continually updated version of our dataset online along with our data cleaning and splitting pipelines for community use.
CONCLUSION: We hope that this benchmark will serve as a basis for developing future machine learning approaches to MS/MS similarity and will facilitate comparison between models. We anticipate that the introduced set of evaluation metrics better reflects practical performance.
- 000
- 00000naa a2200000 a 4500
- 001
- bmc25022381
- 003
- CZ-PrNML
- 005
- 20251023080249.0
- 007
- ta
- 008
- 251014s2025 enk f 000 0|eng||
- 009
- AR
- 024 7_
- $a 10.1186/s12859-025-06194-1 $2 doi
- 035 __
- $a (PubMed)40646448
- 040 __
- $a ABA008 $b cze $d ABA008 $e AACR2
- 041 0_
- $a eng
- 044 __
- $a enk
- 100 1_
- $a Strobel, Michael $u Department of Computer Science and Engineering, University of California Riverside, 900 University Ave., Riverside, CA, 92521, USA $1 https://orcid.org/0009000038290048
- 245 13
- $a An evaluation methodology for machine learning-based tandem mass spectra similarity prediction / $c M. Strobel, A. Gil-de-la-Fuente, MR. Zare Shahneh, YE. Abiead, R. Bushuiev, A. Bushuiev, T. Pluskal, M. Wang
- 520 9_
- $a BACKGROUND: Untargeted tandem mass spectrometry serves as a scalable solution for the organization of small molecules. One of the most prevalent techniques for analyzing the acquired tandem mass spectrometry (MS/MS) data, called molecular networking, organizes and visualizes putatively structurally related compounds. However, a key bottleneck of this approach is the comparison of MS/MS spectra used to identify nearby structural neighbors. Machine learning (ML) approaches have emerged as a promising technique to predict structural similarity from MS/MS that may surpass the current state-of-the-art algorithmic methods. However, comparing these ML methods remains a challenge because there is no standardized way to benchmark, evaluate, and compare MS/MS similarity methods, and no existing method addresses data leakage between training and test data in order to analyze model generalizability.
RESULTS: In this work, we present a new evaluation methodology using a train/test split that allows machine learning models to be evaluated at varying degrees of structural similarity between training and test sets. We also introduce a training and evaluation framework that measures prediction accuracy on domain-inspired annotation and retrieval metrics designed to mirror real-world applications. We further show how two alternative training methods that leverage MS-specific insights (e.g., similar instrumentation, collision energy, adduct) affect method performance, and we demonstrate the orthogonality of the proposed metrics. We especially highlight the role that collision energy plays in prediction errors. Finally, we release a continually updated version of our dataset online along with our data cleaning and splitting pipelines for community use.
CONCLUSION: We hope that this benchmark will serve as a basis for developing future machine learning approaches to MS/MS similarity and will facilitate comparison between models. We anticipate that the introduced set of evaluation metrics better reflects practical performance.
- 650 12
- $a machine learning $7 D000069550
- 650 12
- $a tandem mass spectrometry $x methods $7 D053719
- 650 _2
- $a algorithms $7 D000465
- 655 _2
- $a journal articles $7 D016428
- 700 1_
- $a Gil-de-la-Fuente, Alberto $u Information Technologies Department, Escuela Politécnica Superior, Universidad San Pablo-CEU, CEU Universities, Urbanización Montepríncipe, Boadilla del Monte, 28668, Madrid, Spain $1 https://orcid.org/0000000259511601
- 700 1_
- $a Zare Shahneh, Mohammad Reza $u Department of Computer Science and Engineering, University of California Riverside, 900 University Ave., Riverside, CA, 92521, USA $1 https://orcid.org/0000000257603190
- 700 1_
- $a Abiead, Yasin El $u Skaggs School of Pharmacy and Pharmaceutical Science, University of California San Diego, 9255 Pharmacy Ln, San Diego, CA, 92093, USA $1 https://orcid.org/0000000343927706
- 700 1_
- $a Bushuiev, Roman $u Institute of Organic Chemistry and Biochemistry, Czech Academy of Sciences, Flemingovo nám. 542/2, Prague, 16000, Czech Republic $u Czech Institute of Informatics, Robotics and Cybernetics, Jugoslávských partyzánů 1580/3, Prague, 16000, Czech Republic $1 https://orcid.org/0000000317691509
- 700 1_
- $a Bushuiev, Anton $u Czech Institute of Informatics, Robotics and Cybernetics, Jugoslávských partyzánů 1580/3, Prague, 16000, Czech Republic $1 https://orcid.org/0009000747836584
- 700 1_
- $a Pluskal, Tomáš $u Institute of Organic Chemistry and Biochemistry, Czech Academy of Sciences, Flemingovo nám. 542/2, Prague, 16000, Czech Republic $1 https://orcid.org/0000000269403006
- 700 1_
- $a Wang, Mingxun $u Department of Computer Science and Engineering, University of California Riverside, 900 University Ave., Riverside, CA, 92521, USA. mingxun.wang@cs.ucr.edu $1 https://orcid.org/0000000176476097
- 773 0_
- $w MED00008167 $t BMC bioinformatics $x 1471-2105 $g Vol. 26, No. 1 (2025), p. 174
- 856 41
- $u https://pubmed.ncbi.nlm.nih.gov/40646448 $y Pubmed
- 910 __
- $a ABA008 $b sig $c sign $y - $z 0
- 990 __
- $a 20251014 $b ABA008
- 991 __
- $a 20251023080255 $b ABA008
- 999 __
- $a ok $b bmc $g 2417268 $s 1260544
- BAS __
- $a 3
- BAS __
- $a PreBMC-MEDLINE
- BMC __
- $a 2025 $b 26 $c 1 $d 174 $e 20250711 $i 1471-2105 $m BMC bioinformatics $n BMC Bioinformatics $x MED00008167
- GRA __
- $a 1R03OD034493-01 $p NIH HHS $2 United States
- GRA __
- $a NIH 5U24DK133658-02 $p NIH HHS $2 United States
- LZP __
- $a Pubmed-20251014