An evaluation methodology for machine learning-based tandem mass spectra similarity prediction

M. Strobel, A. Gil-de-la-Fuente, MR. Zare Shahneh, YE. Abiead, R. Bushuiev, A. Bushuiev, T. Pluskal, M. Wang

Author: Strobel, Michael. Department of Computer Science and Engineering, University of California Riverside, 900 University Ave., Riverside, CA, 92521, USA
Author: Gil-de-la-Fuente, Alberto. Information Technologies Department, Escuela Politécnica Superior, Universidad San Pablo-CEU, CEU Universities, Urbanización Montepríncipe, Boadilla del Monte, 28668, Madrid, Spain
Author: Zare Shahneh, Mohammad Reza. Department of Computer Science and Engineering, University of California Riverside, 900 University Ave., Riverside, CA, 92521, USA
Author: Abiead, Yasin El. Skaggs School of Pharmacy and Pharmaceutical Science, University of California San Diego, 9255 Pharmacy Ln, San Diego, CA, 92093, USA
Author: Bushuiev, Roman. Institute of Organic Chemistry and Biochemistry, Czech Academy of Sciences, Flemingovo nám. 542/2, Prague, 16000, Czech Republic; Czech Institute of Informatics, Robotics and Cybernetics, Jugoslávských partyzánů 1580/3, Prague, 16000, Czech Republic
Author: Bushuiev, Anton. Czech Institute of Informatics, Robotics and Cybernetics, Jugoslávských partyzánů 1580/3, Prague, 16000, Czech Republic
Author: Pluskal, Tomáš. Institute of Organic Chemistry and Biochemistry, Czech Academy of Sciences, Flemingovo nám. 542/2, Prague, 16000, Czech Republic
Author: Wang, Mingxun. Department of Computer Science and Engineering, University of California Riverside, 900 University Ave., Riverside, CA, 92521, USA. mingxun.wang@cs.ucr.edu

BMC bioinformatics. 2025 ; 26 (1) : 174. [pub] 20250711

BMC Bioinformatics
ISSN 1471-2105

Language: English. Country: England, United Kingdom

Document type: journal article

Persistent link: https://www.medvik.cz/link/bmc25022381

Grant support
1R03OD034493-01 NIH HHS - United States
NIH 5U24DK133658-02 NIH HHS - United States

Online full text

NLK BioMedCentral since 2000-01-12
BioMedCentral Open Access since 2000
Directory of Open Access Journals since 2000
Free Medical Journals since 2000
PubMed Central since 2000
Europe PubMed Central since 2000
ProQuest Central since 2009-01-01
Open Access Digital Library since 2000-07-01
Open Access Digital Library since 2000-01-01
Medline Complete (EBSCOhost) since 2000-01-01
Health & Medicine (ProQuest) since 2009-01-01
ROAD: Directory of Open Access Scholarly Resources since 2000
Springer Nature OA/Free Journals since 2000-12-01

PubMed 40646448
DOI 10.1186/s12859-025-06194-1

MeSH
algorithms
machine learning *
tandem mass spectrometry * methods
Publication type
journal article

BACKGROUND: Untargeted tandem mass spectrometry serves as a scalable solution for the organization of small molecules. One of the most prevalent techniques for analyzing the acquired tandem mass spectrometry data (MS/MS) - called molecular networking - organizes and visualizes putatively structurally related compounds. However, a key bottleneck of this approach is the comparison of MS/MS spectra used to identify nearby structural neighbors. Machine learning (ML) approaches have emerged as a promising technique to predict structural similarity from MS/MS that may surpass the current state-of-the-art algorithmic methods. However, the comparison between these different ML methods remains a challenge because there is a lack of standardization to benchmark, evaluate, and compare MS/MS similarity methods, and there are no methods that address data leakage between training and test data in order to analyze model generalizability.

RESULT: In this work, we present the creation of a new evaluation methodology using a train/test split that allows for the evaluation of machine learning models at varying degrees of structural similarity between training and test sets. We also introduce a training and evaluation framework that measures prediction accuracy on domain-inspired annotation and retrieval metrics designed to mirror real-world applications. We further show how two alternative training methods that leverage MS-specific insights (e.g., similar instrumentation, collision energy, adduct) affect method performance and demonstrate the orthogonality of the proposed metrics. We especially highlight the role that collision energy plays in prediction errors. Finally, we release a continually updated version of our dataset online along with our data cleaning and splitting pipelines for community use.

CONCLUSION: It is our hope that this benchmark will serve as the basis of development for future machine learning approaches in MS/MS similarity and facilitate comparison between models. We anticipate that the introduced set of evaluation metrics allows for a better reflection of practical performance.
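
The idea of evaluating at varying degrees of structural similarity between training and test sets can be illustrated with a minimal sketch. This is not the authors' pipeline, which the abstract does not describe in detail. It assumes each molecule is represented by a hypothetical set of fingerprint "on" bits, uses Tanimoto similarity as the structural measure, and groups test molecules by their highest similarity to any training molecule. The function names and the bin edges (0.4, 0.7) are illustrative assumptions.

```python
from typing import Dict, FrozenSet, List, Sequence

Fingerprint = FrozenSet[int]  # hypothetical representation: indices of "on" bits


def tanimoto(a: Fingerprint, b: Fingerprint) -> float:
    """Tanimoto (Jaccard) similarity between two sets of fingerprint bits."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def max_train_similarity(test_fp: Fingerprint,
                         train_fps: Sequence[Fingerprint]) -> float:
    """Highest similarity between one test molecule and any training molecule."""
    return max((tanimoto(test_fp, t) for t in train_fps), default=0.0)


def bin_by_train_similarity(train: Dict[str, Fingerprint],
                            test: Dict[str, Fingerprint],
                            edges: Sequence[float] = (0.4, 0.7)
                            ) -> Dict[int, List[str]]:
    """Group test molecule names into bins of increasing similarity to train.

    Bin 0 holds molecules least similar to the training set; the last bin
    holds the most similar ones. The edges are assumed, not from the paper.
    """
    bins: Dict[int, List[str]] = {i: [] for i in range(len(edges) + 1)}
    train_fps = list(train.values())
    for name, fp in test.items():
        sim = max_train_similarity(fp, train_fps)
        # Count how many edges the similarity reaches to get the bin index.
        bins[sum(sim >= e for e in edges)].append(name)
    return bins


# Toy data (made up for illustration only).
train = {"A": frozenset({1, 2, 3, 4}), "B": frozenset({10, 11, 12})}
test = {
    "X": frozenset({1, 2, 3, 4}),    # identical to A
    "Y": frozenset({1, 2, 20, 21}),  # weak overlap with A
    "Z": frozenset({10, 11, 13}),    # moderate overlap with B
}
bins = bin_by_train_similarity(train, test)
```

Reporting a metric separately for each bin would then show how performance changes as test molecules become less similar to anything seen in training, which is one way to make data leakage visible.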

Czech Institute of Informatics, Robotics and Cybernetics, Jugoslávských partyzánů 1580/3, Prague, 16000, Czech Republic

Department of Computer Science and Engineering, University of California Riverside, 900 University Ave., Riverside, CA, 92521, USA

Information Technologies Department, Escuela Politécnica Superior, Universidad San Pablo-CEU, CEU Universities, Urbanización Montepríncipe, Boadilla del Monte, 28668, Madrid, Spain

Institute of Organic Chemistry and Biochemistry, Czech Academy of Sciences, Flemingovo nám. 542/2, Prague, 16000, Czech Republic

Skaggs School of Pharmacy and Pharmaceutical Science, University of California San Diego, 9255 Pharmacy Ln, San Diego, CA, 92093, USA


000: 00000naa a2200000 a 4500

001: bmc25022381

003: CZ-PrNML

005: 20251023080249.0

007: ta

008: 251014s2025 enk f 000 0|eng||

009: AR

024 7_: $a 10.1186/s12859-025-06194-1 $2 doi

035 __: $a (PubMed)40646448

040 __: $a ABA008 $b cze $d ABA008 $e AACR2

041 0_: $a eng

044 __: $a enk

100 1_: $a Strobel, Michael $u Department of Computer Science and Engineering, University of California Riverside, 900 University Ave., Riverside, CA, 92521, USA $1 https://orcid.org/0009000038290048

245 13: $a An evaluation methodology for machine learning-based tandem mass spectra similarity prediction / $c M. Strobel, A. Gil-de-la-Fuente, MR. Zare Shahneh, YE. Abiead, R. Bushuiev, A. Bushuiev, T. Pluskal, M. Wang

520 9_: $a BACKGROUND: Untargeted tandem mass spectrometry serves as a scalable solution for the organization of small molecules. One of the most prevalent techniques for analyzing the acquired tandem mass spectrometry data (MS/MS) - called molecular networking - organizes and visualizes putatively structurally related compounds. However, a key bottleneck of this approach is the comparison of MS/MS spectra used to identify nearby structural neighbors. Machine learning (ML) approaches have emerged as a promising technique to predict structural similarity from MS/MS that may surpass the current state-of-the-art algorithmic methods. However, the comparison between these different ML methods remains a challenge because there is a lack of standardization to benchmark, evaluate, and compare MS/MS similarity methods, and there are no methods that address data leakage between training and test data in order to analyze model generalizability. RESULT: In this work, we present the creation of a new evaluation methodology using a train/test split that allows for the evaluation of machine learning models at varying degrees of structural similarity between training and test sets. We also introduce a training and evaluation framework that measures prediction accuracy on domain-inspired annotation and retrieval metrics designed to mirror real-world applications. We further show how two alternative training methods that leverage MS-specific insights (e.g., similar instrumentation, collision energy, adduct) affect method performance and demonstrate the orthogonality of the proposed metrics. We especially highlight the role that collision energy plays in prediction errors. Finally, we release a continually updated version of our dataset online along with our data cleaning and splitting pipelines for community use.
CONCLUSION: It is our hope that this benchmark will serve as the basis of development for future machine learning approaches in MS/MS similarity and facilitate comparison between models. We anticipate that the introduced set of evaluation metrics allows for a better reflection of practical performance.

650 12: $a strojové učení $7 D000069550

650 12: $a tandemová hmotnostní spektrometrie $x metody $7 D053719

650 _2: $a algoritmy $7 D000465

655 _2: $a časopisecké články $7 D016428

700 1_: $a Gil-de-la-Fuente, Alberto $u Information Technologies Department, Escuela Politécnica Superior, Universidad San Pablo-CEU, CEU Universities, Urbanización Montepríncipe, Boadilla Del monte, 28668, Madrid, Spain $1 https://orcid.org/0000000259511601

700 1_: $a Zare Shahneh, Mohammad Reza $u Department of Computer Science and Engineering, University of California Riverside, 900 University Ave., Riverside, CA, 92521, USA $1 https://orcid.org/0000000257603190

700 1_: $a Abiead, Yasin El $u Skaggs School of Pharmacy and Pharmaceutical Science, University of California San Diego, 9255 Pharmacy Ln, San Diego, CA, 92093, USA $1 https://orcid.org/0000000343927706

700 1_: $a Bushuiev, Roman $u Institute of Organic Chemistry and Biochemistry, Czech Academy of Sciences, Flemingovo nám. 542/2, Prague, 16000, Czech Republic $u Czech Institute of Informatics, Robotics and Cybernetics, Jugoslávských partyzánů 1580/3, Prague, 16000, Czech Republic $1 https://orcid.org/0000000317691509

700 1_: $a Bushuiev, Anton $u Czech Institute of Informatics, Robotics and Cybernetics, Jugoslávských partyzánů 1580/3, Prague, 16000, Czech Republic $1 https://orcid.org/0009000747836584

700 1_: $a Pluskal, Tomáš $u Institute of Organic Chemistry and Biochemistry, Czech Academy of Sciences, Flemingovo nám. 542/2, Prague, 16000, Czech Republic $1 https://orcid.org/0000000269403006

700 1_: $a Wang, Mingxun $u Department of Computer Science and Engineering, University of California Riverside, 900 University Ave., Riverside, CA, 92521, USA. mingxun.wang@cs.ucr.edu $1 https://orcid.org/0000000176476097

773 0_: $w MED00008167 $t BMC bioinformatics $x 1471-2105 $g Roč. 26, č. 1 (2025), s. 174

856 41: $u https://pubmed.ncbi.nlm.nih.gov/40646448 $y Pubmed

910 __: $a ABA008 $b sig $c sign $y - $z 0

990 __: $a 20251014 $b ABA008

991 __: $a 20251023080255 $b ABA008

999 __: $a ok $b bmc $g 2417268 $s 1260544

BAS __: $a 3

BAS __: $a PreBMC-MEDLINE

BMC __: $a 2025 $b 26 $c 1 $d 174 $e 20250711 $i 1471-2105 $m BMC bioinformatics $n BMC Bioinformatics $x MED00008167

GRA __: $a 1R03OD034493-01 $p NIH HHS $2 United States

GRA __: $a NIH 5U24DK133658-02 $p NIH HHS $2 United States

LZP __: $a Pubmed-20251014
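
The field lines above follow a simple display layout: a three-character tag, optional two-character indicators, a colon, and then either a plain value or a series of `$`-prefixed subfields. The following is a minimal sketch of how such a line might be split into its parts. It assumes this page's display layout only (it is not a parser for binary MARC 21), it does not handle fields that wrap onto a second line such as 520, and the names `parse_field`, `FIELD_RE`, and `SUBFIELD_RE` are hypothetical.

```python
import re
from typing import List, Tuple

# Tag, optional indicators (digits or "_"), then ": " and the field body.
FIELD_RE = re.compile(r"^(?P<tag>[0-9A-Z]{3})(?: (?P<ind>[0-9_]{2}))?: (?P<body>.*)$")
# A subfield is "$" + one-character code + space + value, ending before the
# next " $x " marker or at the end of the line.
SUBFIELD_RE = re.compile(r"\$(\w) (.*?)(?= \$\w |$)")


def parse_field(line: str) -> Tuple[str, str, List[Tuple[str, str]]]:
    """Split one displayed field line into (tag, indicators, subfields).

    Control fields without "$" markers are returned as a single subfield
    with an empty code.
    """
    match = FIELD_RE.match(line.strip())
    if match is None:
        raise ValueError(f"not a field line: {line!r}")
    body = match.group("body")
    subfields = [(s.group(1), s.group(2)) for s in SUBFIELD_RE.finditer(body)]
    if not subfields:
        subfields = [("", body)]
    return match.group("tag"), match.group("ind") or "", subfields


doi_field = parse_field("024 7_: $a 10.1186/s12859-025-06194-1 $2 doi")
control_field = parse_field("001: bmc25022381")
```

Keeping subfields as an ordered list of pairs, rather than a dictionary, preserves repeated codes such as the two `$u` affiliations in one of the 700 fields.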
