
An evaluation methodology for machine learning-based tandem mass spectra similarity prediction

M. Strobel, A. Gil-de-la-Fuente, MR. Zare Shahneh, YE. Abiead, R. Bushuiev, A. Bushuiev, T. Pluskal, M. Wang

BMC Bioinformatics. 2025 ; 26 (1) : 174. [pub] 20250711

Language: English. Country: England, United Kingdom

Document type: journal articles

Persistent link: https://www.medvik.cz/link/bmc25022381

Grant support
1R03OD034493-01 NIH HHS - United States
NIH 5U24DK133658-02 NIH HHS - United States

BACKGROUND: Untargeted tandem mass spectrometry serves as a scalable solution for the organization of small molecules. One of the most prevalent techniques for analyzing the acquired tandem mass spectrometry data (MS/MS) - called molecular networking - organizes and visualizes putatively structurally related compounds. However, a key bottleneck of this approach is the comparison of MS/MS spectra used to identify nearby structural neighbors. Machine learning (ML) approaches have emerged as a promising technique to predict structural similarity from MS/MS that may surpass the current state-of-the-art algorithmic methods. However, the comparison between these different ML methods remains a challenge because there is a lack of standardization to benchmark, evaluate, and compare MS/MS similarity methods, and there are no methods that address data leakage between training and test data in order to analyze model generalizability.
RESULT: In this work, we present the creation of a new evaluation methodology using a train/test split that allows for the evaluation of machine learning models at varying degrees of structural similarity between training and test sets. We also introduce a training and evaluation framework that measures prediction accuracy on domain-inspired annotation and retrieval metrics designed to mirror real-world applications. We further show how two alternative training methods that leverage MS-specific insights (e.g., similar instrumentation, collision energy, adduct) affect method performance and demonstrate the orthogonality of the proposed metrics. We especially highlight the role that collision energy plays in prediction errors. Finally, we release a continually updated version of our dataset online along with our data cleaning and splitting pipelines for community use.
CONCLUSION: It is our hope that this benchmark will serve as the basis of development for future machine learning approaches in MS/MS similarity and facilitate comparison between models. We anticipate that the introduced set of evaluation metrics allows for a better reflection of practical performance.


000      
00000naa a2200000 a 4500
001      
bmc25022381
003      
CZ-PrNML
005      
20251023080249.0
007      
ta
008      
251014s2025 enk f 000 0|eng||
009      
AR
024    7_
$a 10.1186/s12859-025-06194-1 $2 doi
035    __
$a (PubMed)40646448
040    __
$a ABA008 $b cze $d ABA008 $e AACR2
041    0_
$a eng
044    __
$a enk
100    1_
$a Strobel, Michael $u Department of Computer Science and Engineering, University of California Riverside, 900 University Ave., Riverside, CA, 92521, USA $1 https://orcid.org/0009000038290048
245    13
$a An evaluation methodology for machine learning-based tandem mass spectra similarity prediction / $c M. Strobel, A. Gil-de-la-Fuente, MR. Zare Shahneh, YE. Abiead, R. Bushuiev, A. Bushuiev, T. Pluskal, M. Wang
650    12
$a machine learning $7 D000069550
650    12
$a tandem mass spectrometry $x methods $7 D053719
650    _2
$a algorithms $7 D000465
655    _2
$a journal articles $7 D016428
700    1_
$a Gil-de-la-Fuente, Alberto $u Information Technologies Department, Escuela Politécnica Superior, Universidad San Pablo-CEU, CEU Universities, Urbanización Montepríncipe, Boadilla Del monte, 28668, Madrid, Spain $1 https://orcid.org/0000000259511601
700    1_
$a Zare Shahneh, Mohammad Reza $u Department of Computer Science and Engineering, University of California Riverside, 900 University Ave., Riverside, CA, 92521, USA $1 https://orcid.org/0000000257603190
700    1_
$a Abiead, Yasin El $u Skaggs School of Pharmacy and Pharmaceutical Science, University of California San Diego, 9255 Pharmacy Ln, San Diego, CA, 92093, USA $1 https://orcid.org/0000000343927706
700    1_
$a Bushuiev, Roman $u Institute of Organic Chemistry and Biochemistry, Czech Academy of Sciences, Flemingovo nám. 542/2, Prague, 16000, Czech Republic $u Czech Institute of Informatics, Robotics and Cybernetics, Jugoslávských partyzánů 1580/3, Prague, 16000, Czech Republic $1 https://orcid.org/0000000317691509
700    1_
$a Bushuiev, Anton $u Czech Institute of Informatics, Robotics and Cybernetics, Jugoslávských partyzánů 1580/3, Prague, 16000, Czech Republic $1 https://orcid.org/0009000747836584
700    1_
$a Pluskal, Tomáš $u Institute of Organic Chemistry and Biochemistry, Czech Academy of Sciences, Flemingovo nám. 542/2, Prague, 16000, Czech Republic $1 https://orcid.org/0000000269403006
700    1_
$a Wang, Mingxun $u Department of Computer Science and Engineering, University of California Riverside, 900 University Ave., Riverside, CA, 92521, USA. mingxun.wang@cs.ucr.edu $1 https://orcid.org/0000000176476097
773    0_
$w MED00008167 $t BMC bioinformatics $x 1471-2105 $g Vol. 26, No. 1 (2025), p. 174
856    41
$u https://pubmed.ncbi.nlm.nih.gov/40646448 $y Pubmed
910    __
$a ABA008 $b sig $c sign $y - $z 0
990    __
$a 20251014 $b ABA008
991    __
$a 20251023080255 $b ABA008
999    __
$a ok $b bmc $g 2417268 $s 1260544
BAS    __
$a 3
BAS    __
$a PreBMC-MEDLINE
BMC    __
$a 2025 $b 26 $c 1 $d 174 $e 20250711 $i 1471-2105 $m BMC bioinformatics $n BMC Bioinformatics $x MED00008167
GRA    __
$a 1R03OD034493-01 $p NIH HHS $2 United States
GRA    __
$a NIH 5U24DK133658-02 $p NIH HHS $2 United States
LZP    __
$a Pubmed-20251014
