Cellwise outlier detection and biomarker identification in metabolomics based on pairwise log ratios
Status: PubMed-not-MEDLINE. Language: English. Country: England, United Kingdom. Medium: print-electronic
Document type: journal articles
PubMed: 32189829
PubMed Central: PMC7063692
DOI: 10.1002/cem.3182
PII: CEM3182
- Keywords
- biomarker, cellwise outliers, cell‐rPLR, log ratio, metabolomics, robust method
- Publication type
- journal articles
Outliers can carry valuable information and may be the most informative observations for interpretation. Nevertheless, they are often neglected. This paper proposes an algorithm called cellwise outlier diagnostics using robust pairwise log ratios (cell-rPLR) for identifying outliers in single cells of a data matrix. The algorithm is designed for metabolomic data, where, because of the size effect, the measured values are not directly comparable. Pairwise log ratios between the variable values form the elemental information for the algorithm, and aggregating the corresponding outlyingness values yields the outlyingness information. A further feature of cell-rPLR is that it is useful for biomarker identification, particularly in the presence of cellwise outliers. Real data examples and simulation studies underline the good performance of this algorithm in comparison with alternative methods.
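The idea of scoring single cells through pairwise log ratios can be illustrated with a minimal sketch. It is not the published cell-rPLR procedure: the standardization by median and MAD, the aggregation by the median, and the function names `robust_z` and `cell_outlyingness` are all assumptions made for this illustration. The abstract states only that pairwise log ratios are the elemental information and that outlyingness values are aggregated.

```python
import math
from statistics import median


def robust_z(values):
    """Center by the median and scale by the MAD (assumed standardization)."""
    med = median(values)
    # 1.4826 makes the MAD consistent with the standard deviation under normality.
    mad = 1.4826 * median(abs(v - med) for v in values)
    if mad == 0:
        raise ValueError("zero MAD: this log ratio is constant for most samples")
    return [(v - med) / mad for v in values]


def cell_outlyingness(X):
    """Return an n x p list of aggregated pairwise log-ratio scores.

    X is a list of n samples (rows), each holding p strictly positive values.
    """
    n, p = len(X), len(X[0])
    if any(x <= 0 for row in X for x in row):
        raise ValueError("all values must be strictly positive")
    log_x = [[math.log(x) for x in row] for row in X]

    # scores[i][j] collects the standardized log ratios of variable j
    # against every other variable k, for sample i.
    scores = [[[] for _ in range(p)] for _ in range(n)]
    for j in range(p):
        for k in range(p):
            if j == k:
                continue
            # log(x_ij / x_ik): a sample-specific size factor cancels here.
            ratios = [log_x[i][j] - log_x[i][k] for i in range(n)]
            for i, z in enumerate(robust_z(ratios)):
                scores[i][j].append(z)

    # Assumed aggregation: the median over all partner variables k.
    return [[median(cell) for cell in row] for row in scores]
```

Because each score depends only on ratios within a sample, multiplying a whole row by a constant (a size effect such as dilution) leaves the result unchanged. A single deviating cell affects only one of the p - 1 ratios of every other cell in its row, so the median aggregation keeps those cells from being flagged along with it.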
Department of Clinical Biochemistry, University Hospital Olomouc, Olomouc, Czech Republic
Institute of Statistics and Mathematical Methods in Economics, TU Wien, Vienna, Austria