Self-supervised learning of molecular representations from millions of tandem mass spectra using DreaMS

2025 May 23 [Epub 2025-05-23]

Status: Publisher. Language: English. Country: United States. Medium: print-electronic

Document type: journal article

Persistent link   https://www.medvik.cz/link/pmid40410407

Grant support
891397 EC | EU Framework Programme for Research and Innovation H2020 | H2020 Priority Excellent Science | H2020 Marie Sklodowska-Curie Actions (H2020 Excellent Science - Marie Sklodowska-Curie Actions)
101097822 EC | EU Framework Programme for Research and Innovation H2020 | H2020 Priority Excellent Science | H2020 European Research Council (H2020 Excellent Science - European Research Council)
101120237 EC | Horizon 2020 Framework Programme (EU Framework Programme for Research and Innovation H2020)

Links

PubMed 40410407
DOI 10.1038/s41587-025-02663-3
PII: 10.1038/s41587-025-02663-3

Characterizing biological and environmental samples at a molecular level primarily uses tandem mass spectrometry (MS/MS), yet the interpretation of tandem mass spectra from untargeted metabolomics experiments remains a challenge. Existing computational methods for predictions from mass spectra rely on limited spectral libraries and on hard-coded human expertise. Here we introduce a transformer-based neural network pre-trained in a self-supervised way on millions of unannotated tandem mass spectra from our GNPS Experimental Mass Spectra (GeMS) dataset mined from the MassIVE GNPS repository. We show that pre-training our model to predict masked spectral peaks and chromatographic retention orders leads to the emergence of rich representations of molecular structures, which we named Deep Representations Empowering the Annotation of Mass Spectra (DreaMS). Further fine-tuning the neural network yields state-of-the-art performance across a variety of tasks. We make our new dataset and model available to the community and release the DreaMS Atlas, a molecular network of 201 million MS/MS spectra constructed using DreaMS annotations.
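
The two pre-training objectives named in the abstract (predicting masked spectral peaks and chromatographic retention order) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the `MASK_MZ` sentinel, the 30% default masking fraction, and the toy spectrum are all assumptions made here for illustration. The sketch only shows how training inputs and targets for such objectives might be built from an unannotated spectrum.

```python
import random

MASK_MZ = -1.0  # hypothetical sentinel marking a hidden m/z value


def mask_peaks(peaks, mask_fraction=0.3, seed=0):
    """Hide the m/z of a random subset of peaks (illustrative only).

    peaks: list of (mz, intensity) tuples for one MS/MS spectrum.
    Returns (masked_peaks, targets), where targets maps each masked
    peak index to the original m/z a model would be asked to predict.
    """
    rng = random.Random(seed)
    # Mask at least one peak, but never more than the spectrum contains.
    n_masked = min(len(peaks), max(1, round(len(peaks) * mask_fraction)))
    masked_idx = rng.sample(range(len(peaks)), n_masked)
    targets = {i: peaks[i][0] for i in masked_idx}
    masked = [
        (MASK_MZ, inten) if i in targets else (mz, inten)
        for i, (mz, inten) in enumerate(peaks)
    ]
    return masked, targets


def retention_order_label(rt_a, rt_b):
    """Return 1 if spectrum A elutes before spectrum B, else 0."""
    return 1 if rt_a < rt_b else 0


# Toy spectrum with made-up values: (m/z, relative intensity).
spectrum = [(85.03, 0.12), (127.04, 1.00), (145.05, 0.45), (163.06, 0.30)]
masked, targets = mask_peaks(spectrum, mask_fraction=0.5)
```

In this sketch the intensities stay visible and only the m/z values are hidden, which is one of several possible masking designs; the abstract does not specify which one the model uses.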

