Most cited article - PubMed ID 32839597
Feature-based molecular networking in the GNPS analysis environment
BACKGROUND: Untargeted tandem mass spectrometry serves as a scalable solution for the organization of small molecules. One of the most prevalent techniques for analyzing the acquired tandem mass spectrometry data (MS/MS) - called molecular networking - organizes and visualizes putatively structurally related compounds. However, a key bottleneck of this approach is the comparison of MS/MS spectra used to identify nearby structural neighbors. Machine learning (ML) approaches have emerged as a promising technique to predict structural similarity from MS/MS that may surpass the current state-of-the-art algorithmic methods. However, the comparison between these different ML methods remains a challenge because there is a lack of standardization to benchmark, evaluate, and compare MS/MS similarity methods, and there are no methods that address data leakage between training and test data in order to analyze model generalizability. RESULT: In this work, we present the creation of a new evaluation methodology using a train/test split that allows for the evaluation of machine learning models at varying degrees of structural similarity between training and test sets. We also introduce a training and evaluation framework that measures prediction accuracy on domain-inspired annotation and retrieval metrics designed to mirror real-world applications. We further show how two alternative training methods that leverage MS specific insights (e.g., similar instrumentation, collision energy, adduct) affect method performance and demonstrate the orthogonality of the proposed metrics. We especially highlight the role that collision energy plays in prediction errors. Finally, we release a continually updated version of our dataset online along with our data cleaning and splitting pipelines for community use. 
CONCLUSION: It is our hope that this benchmark will serve as the basis of development for future machine learning approaches in MS/MS similarity and facilitate comparison between models. We anticipate that the introduced set of evaluation metrics allows for a better reflection of practical performance.
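The data-leakage concern raised in this abstract can be illustrated with a much simpler stand-in than the benchmark's own methodology: a split in which all spectra of the same 2D structure go to the same side. The sketch below is an assumption-laden illustration, not the authors' pipeline. The function name `split_by_structure`, the `inchikey` dictionary field, and the 20% test fraction are hypothetical; grouping uses the first 14 characters of the InChIKey, which encode molecular connectivity.

```python
import random
from collections import defaultdict


def split_by_structure(spectra, test_fraction=0.2, seed=0):
    """Split spectra so that no 2D structure appears in both train and test.

    `spectra` is a list of dicts, each with an "inchikey" entry (hypothetical
    layout chosen for this sketch).
    """
    groups = defaultdict(list)
    for spec in spectra:
        # The first InChIKey block (14 characters) hashes the connectivity,
        # so stereoisomers of one skeleton fall into the same group.
        groups[spec["inchikey"][:14]].append(spec)

    keys = sorted(groups)                 # deterministic starting order
    random.Random(seed).shuffle(keys)     # reproducible shuffle of the groups
    n_test = max(1, round(len(keys) * test_fraction))

    test_keys = keys[:n_test]
    train_keys = keys[n_test:]
    train = [s for k in train_keys for s in groups[k]]
    test = [s for k in test_keys for s in groups[k]]
    return train, test
```

A split of this kind only prevents identical structures from leaking across the boundary; controlling the degree of structural similarity between training and test sets, as the abstract describes, would need an explicit similarity measure on top of it.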
- Keywords
- Benchmark, Machine learning, Mass spectrometry, Metabolomics, Spectral similarity measure
- MeSH
- Algorithms MeSH
- Machine Learning * MeSH
- Tandem Mass Spectrometry * methods MeSH
- Publication type
- Journal Article MeSH
Plant specialized metabolites play key roles in diverse physiological processes and ecological interactions. Identifying structurally novel metabolites, as well as discovering known compounds in new species, is often crucial for answering broader biological questions. The Piper genus (Piperaceae family) is known for its special phytochemistry and has been extensively studied over the past decades. Here, we investigated the alkaloid diversity of Piper fimbriulatum, a myrmecophytic plant native to Central America, using a metabolomics workflow that combines untargeted LC-MS/MS analysis with a range of recently developed computational tools. Specifically, we leveraged open MS/MS spectral libraries and metabolomics data repositories for metabolite annotation, guiding isolation efforts toward structurally new compounds (i.e., dereplication). As a result, we identified several alkaloids belonging to five different classes and isolated one novel seco-benzylisoquinoline alkaloid featuring a linear quaternary amine moiety, which we named fimbriulatumine. Notably, many of the identified compounds had never been reported in Piperaceae plants. Our findings expand the known alkaloid diversity of this family and demonstrate the value of revisiting well-studied plant families using state-of-the-art computational metabolomics workflows to uncover previously overlooked chemodiversity. To place our findings in a broader biological context, we employed a workflow for automated mining of literature reports of the identified alkaloid scaffolds and mapped the results onto the angiosperm tree of life. By doing so, we highlight the remarkable alkaloid diversity within the Piper genus and provide a framework for generating hypotheses on the biosynthetic evolution of these specialized metabolites. Many of the computational tools and data resources used in this study remain underutilized within the plant science community.
This manuscript demonstrates their potential through a practical application and aims to promote broader accessibility to untargeted metabolomics approaches.
- Keywords
- Piper fimbriulatum, Piperaceae, Wikidata, alkaloids, angiosperms, computational metabolomics, mass spectrometry, technical advance
- MeSH
- Alkaloids * metabolism chemistry MeSH
- Chromatography, Liquid MeSH
- Metabolomics * methods MeSH
- Myrmecophytes MeSH
- Piper * metabolism chemistry MeSH
- Tandem Mass Spectrometry MeSH
- Publication type
- Journal Article MeSH
- Names of Substances
- Alkaloids * MeSH
Feature-based molecular networking (FBMN) is a popular analysis approach for liquid chromatography-tandem mass spectrometry-based non-targeted metabolomics data. While processing liquid chromatography-tandem mass spectrometry data through FBMN is fairly streamlined, downstream data handling and statistical interrogation are often a key bottleneck. Users new to statistical analysis, in particular, struggle to handle and analyze complex data matrices effectively. Here we provide a comprehensive guide for the statistical analysis of FBMN results, focusing on the downstream analysis of the FBMN output table. We explain the data structure and principles of data cleanup and normalization, as well as uni- and multivariate statistical analysis of FBMN results. We provide explanations and code in two scripting languages (R and Python) as well as the QIIME2 framework for all protocol steps, from data cleanup to statistical analysis. All code is shared in the form of Jupyter Notebooks (https://github.com/Functional-Metabolomics-Lab/FBMN-STATS). Additionally, the protocol is accompanied by a web application with a graphical user interface (https://fbmn-statsguide.gnps2.org/) to lower the barrier of entry for new users and for educational purposes. Finally, we also show users how to integrate their statistical results into the molecular network using the Cytoscape visualization tool. Throughout the protocol, we use a previously published environmental metabolomics dataset for demonstration purposes. Together, the protocol, code and web application provide a complete guide and toolbox for FBMN data integration, cleanup and advanced statistical analysis, enabling new users to uncover molecular insights from their non-targeted metabolomics data. Our protocol is tailored for the seamless analysis of FBMN results from Global Natural Products Social Molecular Networking and can be easily adapted to other mass spectrometry feature detection, annotation and networking tools.
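The cleanup and normalization steps mentioned in this abstract can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the code from the FBMN-STATS notebooks. The function name `clean_feature_table`, the nested-dictionary table layout, the 0.3 blank-to-sample cutoff, the minimum-value imputation, and the total-ion-current (TIC) normalization are all choices made for this example.

```python
def clean_feature_table(table, blanks, cutoff=0.3):
    """Blank removal, zero imputation, and TIC normalization of a feature table.

    `table` maps feature ID -> {sample name -> peak area}; `blanks` lists the
    sample names that are blanks (hypothetical layout for this sketch).
    """
    if not blanks:
        raise ValueError("at least one blank sample is required")
    samples = [s for s in next(iter(table.values())) if s not in blanks]

    # 1) Blank removal: drop features whose mean blank intensity exceeds
    #    `cutoff` times the mean intensity in the real samples.
    kept = {}
    for feature, row in table.items():
        blank_mean = sum(row[b] for b in blanks) / len(blanks)
        sample_mean = sum(row[s] for s in samples) / len(samples)
        if sample_mean > 0 and blank_mean / sample_mean <= cutoff:
            kept[feature] = {s: row[s] for s in samples}
    if not kept:
        return {}

    # 2) Imputation: replace zeros with the smallest non-zero value left in
    #    the table (one simple choice among several).
    floor = min(v for row in kept.values() for v in row.values() if v > 0)
    for row in kept.values():
        for s in samples:
            if row[s] == 0:
                row[s] = floor

    # 3) TIC normalization: divide by the summed intensity of each sample.
    totals = {s: sum(row[s] for row in kept.values()) for s in samples}
    return {f: {s: row[s] / totals[s] for s in samples} for f, row in kept.items()}
```

After this step each sample column sums to 1, which makes samples with different total signal comparable before uni- or multivariate statistics are applied.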
Plant specialized metabolites have diversified vastly over the course of plant evolution, and they are considered key players in complex interactions between plants and their environment. The chemical diversity of these metabolites has been widely explored and utilized in agriculture and crop enhancement, the food industry, and drug development, among other areas. However, the immensity of the plant metabolome can make its exploration challenging. Here we describe a protocol for exploring plant specialized metabolites that combines high-resolution mass spectrometry and computational metabolomics strategies, including molecular networking, identification of structural motifs, as well as prediction of chemical structures and metabolite classes.
- Keywords
- GNPS, MS2LDA, MS2Query, MZmine, Molecular networking, Plant metabolomics, SIRIUS, Specialized metabolites
- MeSH
- Mass Spectrometry * methods MeSH
- Metabolome * MeSH
- Metabolomics * methods MeSH
- Plants * metabolism MeSH
- Computational Biology methods MeSH
- Publication type
- Journal Article MeSH
- Research Support, Non-U.S. Gov't MeSH
Despite extensive efforts, extracting information on medication exposure from clinical records remains challenging. To complement this approach, we developed the tandem mass spectrometry (MS/MS) based GNPS Drug Library. This resource integrates MS/MS data for drugs and their metabolites/analogs with controlled vocabularies on exposure sources, pharmacologic classes, therapeutic indications, and mechanisms of action. It enables direct analysis of drug exposure and metabolism from untargeted metabolomics data independent of clinical records. Our library facilitates stratification of individuals in clinical studies based on the empirically detected medications, exemplified by drug-dependent microbiota-derived N-acyl lipid changes in a cohort with human immunodeficiency virus. The GNPS Drug Library holds potential for broader applications in drug discovery and precision medicine.
- Publication type
- Journal Article MeSH
- Preprint MeSH
Untargeted mass spectrometry (MS) experiments produce complex, multidimensional data that are practically impossible to investigate manually. For this reason, computational pipelines are needed to extract relevant information from raw spectral data and convert it into a more comprehensible format. Depending on the sample type and/or goal of the study, a variety of MS platforms can be used for such analysis. MZmine is open-source software for the processing of raw spectral data generated by different MS platforms. Examples include liquid chromatography-MS, gas chromatography-MS and MS-imaging. These data are typically associated with applications such as metabolomics and lipidomics. Moreover, the third version of the software, described herein, supports the processing of ion mobility spectrometry (IMS) data. The present protocol provides three distinct procedures to perform feature detection and annotation of untargeted MS data produced by different instrumental setups: liquid chromatography-(IMS-)MS, gas chromatography-MS and (IMS-)MS imaging. For training purposes, example datasets are provided together with configuration batch files (i.e., list of processing steps and parameters) to allow new users to easily replicate the described workflows. Depending on the number of data files and available computing resources, we anticipate that the procedures take between 2 and 24 h for new MZmine users and nonexperts. Within each procedure, we provide a detailed description for all processing parameters together with instructions/recommendations for their optimization. The main outputs are aligned feature tables and fragmentation spectra lists that can be used by third-party tools for further downstream analysis.
Although metabolomics data acquisition and analysis technologies have become increasingly sophisticated over the past 5-10 years, deciphering a metabolite's function from a description of its structure and its abundance in a given experimental setting is still a major scientific and intellectual challenge. To point out ways to address this "data to knowledge" challenge, we developed a functional metabolomics strategy that combines state-of-the-art data analysis tools and applied it to a human scalp metabolomics data set: skin swabs from healthy volunteers with normal or oily scalp (Sebumeter score 60-120, n = 33; Sebumeter score > 120, n = 41) were analyzed by liquid chromatography-tandem mass spectrometry (LC-MS/MS), yielding four metabolomics data sets for reversed phase chromatography (C18) or hydrophilic interaction chromatography (HILIC) separation in positive or negative electrospray ionization (ESI) mode. Following our data analysis strategy, we were able to obtain increasingly comprehensive structural and functional annotations, by applying the Global Natural Products Social Molecular Networking (M. Wang, J. J. Carver, V. V. Phelan, L. M. Sanchez, et al., Nat Biotechnol 34:828-837, 2016, https://doi.org/10.1038/nbt.3597), SIRIUS (K. Dührkop, M. Fleischauer, M. Ludwig, A. A. Aksenov, et al., Nat Methods 16:299-302, 2019, https://doi.org/10.1038/s41592-019-0344-8), and MicrobeMASST (S. Zuffa, R. Schmid, A. Bauermeister, P. W. P. Gomes, et al., bioRxiv:rs.3.rs-3189768, 2023, https://doi.org/10.21203/rs.3.rs-3189768/v1) tools. We finally combined the metabolomics data with a corresponding metagenomic sequencing data set using MMvec (J. T. Morton, A. A. Aksenov, L. F. Nothias, J. R. Foulds, et al., Nat Methods 16:1306-1314, 2019, https://doi.org/10.1038/s41592-019-0616-3), gaining insights into the metabolic niche of one of the most prominent microbes on the human skin, Staphylococcus epidermidis.
IMPORTANCE: Systems biology research on host-associated microbiota focuses on two fundamental questions: which microbes are present and how do they interact with each other, their host, and the broader host environment? Metagenomics provides us with a direct answer to the first part of the question: it unveils the microbial inhabitants, e.g., on our skin, and can provide insight into their functional potential. Yet, it falls short in revealing their active role. Metabolomics shows us the chemical composition of the environment in which microbes thrive and the transformation products they produce. In particular, untargeted metabolomics has the potential to observe a diverse set of metabolites and is thus an ideal complement to metagenomics. However, this potential often remains underexplored due to the low annotation rates in MS-based metabolomics and the necessity for multiple experimental chromatographic and mass spectrometric conditions. Beyond detection, prospecting metabolites' functional role in the host/microbiome metabolome requires identifying the biological processes and entities involved in their production and biotransformations. In the present study of the human scalp, we developed a strategy to achieve comprehensive structural and functional annotation of the metabolites in the human scalp environment, thus diving one step deeper into the interpretation of "omics" data. Leveraging a collection of openly accessible software tools and integrating microbiome data as a source of functional metabolite annotations, we finally identified the specific metabolic niche of Staphylococcus epidermidis, one of the key players of the human skin microbiome.
- Keywords
- metabolite annotation, metabolomics, multi-omics integration, scalp, skin microbiome
- MeSH
- Chromatography, Liquid MeSH
- Humans MeSH
- Metabolomics methods MeSH
- Scalp * MeSH
- Staphylococcus epidermidis * MeSH
- Tandem Mass Spectrometry MeSH
- Check Tag
- Humans MeSH
- Publication type
- Journal Article MeSH
Trapped ion mobility spectrometry (TIMS) adds a separation dimension to mass spectrometry (MS) imaging; however, the lack of fragmentation spectra (MS2) impedes confident compound annotation in spatial metabolomics. Here, we describe spatial ion mobility-scheduled exhaustive fragmentation (SIMSEF), a dataset-dependent acquisition strategy that augments TIMS-MS imaging datasets with MS2 spectra. The fragmentation experiments are systematically distributed across the sample and scheduled for multiple collision energies per precursor ion. Extendable data processing and evaluation workflows are implemented in the open-source software MZmine. The workflow and annotation capabilities are demonstrated on rat brain tissue thin sections, measured by matrix-assisted laser desorption/ionisation (MALDI)-TIMS-MS, where SIMSEF enables on-tissue compound annotation through spectral library matching and rule-based lipid annotation within MZmine and maps the (un)known chemical space by molecular networking. The SIMSEF algorithm and data analysis pipelines are open source and modular to provide a community resource.
- MeSH
- Algorithms MeSH
- Ion Mobility Spectrometry * MeSH
- Rats MeSH
- Metabolomics * methods MeSH
- Software MeSH
- Spectrometry, Mass, Matrix-Assisted Laser Desorption-Ionization methods MeSH
- Animals MeSH
- Check Tag
- Rats MeSH
- Animals MeSH
- Publication type
- Journal Article MeSH
- Research Support, Non-U.S. Gov't MeSH
Cyanobacteria are infamous producers of toxins. While the toxic potential of planktonic cyanobacterial blooms is well documented, the ecosystem-level effects of toxigenic benthic and epiphytic cyanobacteria are an understudied threat. The freshwater epiphytic cyanobacterium Aetokthonos hydrillicola has recently been shown to produce the "eagle killer" neurotoxin aetokthonotoxin (AETX), which causes the fatal neurological disease vacuolar myelinopathy. The disease affects a wide array of wildlife in the southeastern United States, most notably waterfowl and birds of prey, including the bald eagle. In an assay for cytotoxicity, we found the crude extract of the cyanobacterium to be much more potent than pure AETX, prompting further investigation. Here, we describe the isolation and structure elucidation of the aetokthonostatins (AESTs), linear peptides belonging to the dolastatin compound family, featuring a unique modification of the C-terminal phenylalanine-derived moiety. Using immunofluorescence microscopy and molecular modeling, we confirmed that AEST potently impacts microtubule dynamics and can bind to tubulin in a similar manner to dolastatin 10. We also show that AEST inhibits reproduction of the nematode Caenorhabditis elegans. Bioinformatic analysis revealed the AEST biosynthetic gene cluster encoding a nonribosomal peptide synthetase/polyketide synthase accompanied by a unique tailoring machinery. The biosynthetic activity of a specific N-terminal methyltransferase was confirmed by in vitro biochemical studies, establishing a mechanistic link between the gene cluster and its product.
- Keywords
- aetokthonostatin, biosynthesis, cyanotoxin, cytotoxicity, dolastatin
- MeSH
- Eagles * MeSH
- Caenorhabditis elegans MeSH
- Ecosystem MeSH
- Cyanobacteria * genetics MeSH
- Fresh Water MeSH
- Animals MeSH
- Check Tag
- Animals MeSH
- Publication type
- Journal Article MeSH
- Research Support, Non-U.S. Gov't MeSH
- Names of Substances
- aetokthonotoxin MeSH Browser
Non-targeted liquid chromatography-tandem mass spectrometry (LC-MS/MS) is a widely used tool for metabolomics analysis, enabling the detection and annotation of small molecules in complex environmental samples. Data-dependent acquisition (DDA) of product ion spectra is currently one of the most frequently applied data acquisition strategies. The optimization of DDA parameters is central to ensuring high spectral quality, coverage, and number of compound annotations. Here, we evaluated the influence of 10 central DDA settings of the Q Exactive mass spectrometer on natural organic matter samples from ocean, river, and soil environments. After data analysis with classical and feature-based molecular networking using MZmine and GNPS, we compared the total number of network nodes, multivariate clustering, and spectrum quality-related metrics such as annotation and singleton rates, MS/MS placement, and coverage. Our results show that automatic gain control, microscans, mass resolving power, and dynamic exclusion are the most critical parameters, whereas collision energy, TopN, and isolation width had moderate effects, and apex trigger, monoisotopic selection, and isotopic exclusion had minor effects. The insights into the data acquisition ergonomics of the Q Exactive platform presented here can guide new users and provide them with initial method parameters, some of which may also be transferable to other sample types and MS platforms.