Statistical analysis of feature-based molecular networking results from non-targeted metabolomics data
Language: English
Country: Great Britain, England
Media: print-electronic
Document type: Journal Article
Grant support: EXC 2124, Deutsche Forschungsgemeinschaft (German Research Foundation)
PubMed: 39304763
DOI: 10.1038/s41596-024-01046-3
PII: 10.1038/s41596-024-01046-3
MeSH
- Chromatography, Liquid / methods
- Data Interpretation, Statistical
- Metabolomics* / methods
- Software
- Tandem Mass Spectrometry / methods
Publication type
- Journal Article
Feature-based molecular networking (FBMN) is a popular analysis approach for liquid chromatography-tandem mass spectrometry (LC-MS/MS)-based non-targeted metabolomics data. While processing LC-MS/MS data through FBMN is fairly streamlined, downstream data handling and statistical interrogation are often a key bottleneck. Users new to statistical analysis, in particular, struggle to handle and analyze complex data matrices effectively. Here we provide a comprehensive guide for the statistical analysis of FBMN results, focusing on the downstream analysis of the FBMN output table. We explain the data structure and the principles of data cleanup and normalization, as well as univariate and multivariate statistical analysis of FBMN results. We provide explanations and code in two scripting languages (R and Python), as well as in the QIIME 2 framework, for all protocol steps, from data cleanup to statistical analysis. All code is shared as Jupyter Notebooks (https://github.com/Functional-Metabolomics-Lab/FBMN-STATS). Additionally, the protocol is accompanied by a web application with a graphical user interface (https://fbmn-statsguide.gnps2.org/) to lower the barrier to entry for new users and for educational purposes. Finally, we show users how to integrate their statistical results into the molecular network using the Cytoscape visualization tool. Throughout the protocol, we use a previously published environmental metabolomics dataset for demonstration purposes. Together, the protocol, code and web application provide a complete guide and toolbox for FBMN data integration, cleanup and advanced statistical analysis, enabling new users to uncover molecular insights from their non-targeted metabolomics data. Our protocol is tailored for the seamless analysis of FBMN results from Global Natural Products Social Molecular Networking (GNPS) and can be easily adapted to other mass spectrometry feature detection, annotation and networking tools.
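To make the cleanup, normalization and univariate steps named in the abstract more concrete, the following is a minimal Python sketch. It is not the code from the FBMN-STATS notebooks: the toy feature table, the function names, the 0.3 blank cutoff, minimum-value imputation, total-ion-current scaling and the choice of a Kruskal-Wallis test with Benjamini-Hochberg correction are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from scipy import stats


def remove_blank_features(samples, blanks, cutoff=0.3):
    """Keep features whose mean blank/sample intensity ratio is below cutoff."""
    sample_mean = samples.mean(axis=0)
    blank_mean = blanks.mean(axis=0)
    # Features absent from all samples get an infinite ratio and are dropped.
    ratio = (blank_mean / sample_mean.replace(0, np.nan)).fillna(np.inf)
    return samples.loc[:, ratio < cutoff]


def impute_zeros(table):
    """Replace zeros with the smallest non-zero intensity in the table."""
    min_nonzero = table[table > 0].min().min()
    return table.replace(0, min_nonzero)


def tic_normalize(table):
    """Scale each sample (row) so that its intensities sum to 1."""
    return table.div(table.sum(axis=1), axis=0)


def kruskal_per_feature(table, groups):
    """Run a Kruskal-Wallis test per feature across the sample groups."""
    pvals = {}
    for feature in table.columns:
        values = [table.loc[groups == g, feature] for g in groups.unique()]
        pvals[feature] = stats.kruskal(*values).pvalue
    return pd.Series(pvals)


def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p values in the input order."""
    p = np.asarray(pvalues, dtype=float)
    n = p.size
    order = np.argsort(p)
    ranked = p[order] * n / np.arange(1, n + 1)
    # Enforce monotonicity, starting from the largest p value.
    ranked = np.minimum.accumulate(ranked[::-1])[::-1]
    adjusted = np.empty(n)
    adjusted[order] = np.clip(ranked, 0, 1)
    return adjusted


# Hypothetical toy data: rows are samples, columns are features.
samples = pd.DataFrame(
    {"f1": [100, 120, 110, 300, 320, 310],
     "f2": [50, 55, 48, 52, 49, 51],
     "f3": [20, 22, 19, 21, 20, 23],   # also abundant in the blanks
     "f4": [0, 5, 0, 8, 0, 6]},        # contains missing (zero) values
    index=["A1", "A2", "A3", "B1", "B2", "B3"], dtype=float)
blanks = pd.DataFrame(
    {"f1": [0, 2], "f2": [1, 0], "f3": [18, 22], "f4": [0, 0]},
    index=["BL1", "BL2"], dtype=float)
groups = pd.Series(["A", "A", "A", "B", "B", "B"], index=samples.index)

clean = remove_blank_features(samples, blanks)   # drops the contaminant f3
imputed = impute_zeros(clean)
normalized = tic_normalize(imputed)
pvalues = kruskal_per_feature(normalized, groups)
results = pd.DataFrame({"p": pvalues, "p_adj": benjamini_hochberg(pvalues)})
print(results)
```

Because total-ion-current scaling makes intensities relative, a change in one abundant feature shifts every other feature in the same sample, which is one reason the normalization method should be chosen per dataset rather than taken from this sketch.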
Applied Bioinformatics, Department of Computer Science, University of Tübingen, Tübingen, Germany
Bigelow Laboratory for Ocean Sciences, East Boothbay, ME, USA
Bioinformatics Group, Wageningen University and Research, Wageningen, the Netherlands
Department of Analytical Chemistry, University of Vienna, Vienna, Austria
Department of Biochemistry and Microbiology, Rhodes University, Makhanda, South Africa
Department of Biochemistry, University of California Riverside, Riverside, CA, USA
Department of Biochemistry, University of Johannesburg, Johannesburg, South Africa
Department of Bioinformatics, University of Jena, Jena, Germany
Department of Chemistry and Biochemistry, University of Denver, Denver, CO, USA
Department of Computer Science, University of California Riverside, Riverside, CA, USA
Department of Ecology, Behavior and Evolution, University of California San Diego, San Diego, CA, USA
Department of Environmental Science, Aarhus University, Roskilde, Denmark
Department of Environmental Systems Analysis, University of Tübingen, Tübingen, Germany
Department of Food Chemistry and Toxicology, University of Vienna, Vienna, Austria
Department of Nutrition, Exercise and Sports, University of Copenhagen, Frederiksberg C, Denmark
German Center for Infection Research, Partner Site Braunschweig Hannover, Braunschweig, Germany
Institute of Inorganic and Analytical Chemistry, University of Münster, Münster, Germany
Leibniz Institute DSMZ, German Collection of Microorganisms and Cell Cultures, Braunschweig, Germany
Leibniz Institute of Freshwater Ecology and Inland Fisheries, Berlin, Germany
Saarland University, Saarbrücken, Germany
School of Marine Sciences, Darling Marine Center, University of Maine, Walpole, ME, USA
Universidad EAFIT, Medellín, Antioquia, Colombia
Virtual Multi Omics Laboratory, The Internet, Riverside, CA, USA