A Robust Supervised Variable Selection for Noisy High-Dimensional Data
Language: English; Country: United States; Medium: print-electronic
Document type: journal article, research supported by grant
PubMed: 26137474
PubMed Central: PMC4468284
DOI: 10.1155/2015/320385
Knihovny.cz E-resources
- MeSH
- Algorithms * MeSH
- Data Interpretation, Statistical * MeSH
- Humans MeSH
- Metabolomics MeSH
- Proteomics * MeSH
- Gene Expression Regulation / genetics MeSH
- Models, Theoretical * MeSH
- Check Tag
- Humans MeSH
- Publication Type
- Journal Article MeSH
- Research Support MeSH
The Minimum Redundancy Maximum Relevance (MRMR) approach to supervised variable selection is a successful methodology for dimensionality reduction, suitable for high-dimensional data observed in two or more groups. The various available versions of the MRMR approach search for variables with the largest relevance for a classification task while controlling the redundancy of the selected set of variables. However, the usual relevance and redundancy criteria have the disadvantage of being too sensitive to outlying measurements and/or inefficient. We propose a novel approach called Minimum Regularized Redundancy Maximum Robust Relevance (MRRMRR), suitable for noisy high-dimensional data observed in two groups, which combines principles of regularization and robust statistics. In particular, redundancy is measured by a new regularized version of the coefficient of multiple correlation, and relevance is measured by a highly robust correlation coefficient based on least weighted squares regression with data-adaptive weights. We compare various dimensionality reduction methods on three real data sets. To investigate the influence of noise or outliers on the data, we also perform the computations for data artificially contaminated by severe noise of various forms. The experimental results confirm the robustness of the method with respect to outliers.
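To make the underlying selection scheme concrete, the following is a minimal sketch of generic greedy MRMR forward selection, the framework on which the proposed MRRMRR builds. It uses plain Pearson correlation for both relevance and redundancy with the common difference criterion; the paper's actual criteria (a regularized coefficient of multiple correlation for redundancy and a least-weighted-squares-based robust correlation for relevance) are not reproduced here, and the function name and criterion are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def mrmr_select(X, y, k):
    """Greedy MRMR-style forward selection (illustrative sketch).

    Relevance: absolute Pearson correlation of each variable with the
    class labels. Redundancy: mean absolute correlation with the
    already-selected variables. Score = relevance - redundancy
    (the MRMR "difference" criterion). NOT the robust/regularized
    MRRMRR criteria of the paper.
    """
    n, p = X.shape
    relevance = np.abs(
        [np.corrcoef(X[:, j], y)[0, 1] for j in range(p)]
    )
    # Start with the single most relevant variable.
    selected = [int(np.argmax(relevance))]
    while len(selected) < k:
        best, best_score = None, -np.inf
        for j in range(p):
            if j in selected:
                continue
            # Average absolute correlation with variables chosen so far.
            redundancy = np.mean(
                [abs(np.corrcoef(X[:, j], X[:, s])[0, 1]) for s in selected]
            )
            score = relevance[j] - redundancy
            if score > best_score:
                best, best_score = j, score
        selected.append(best)
    return selected
```

Replacing the two correlation measures in this loop with robust and regularized counterparts is precisely where MRRMRR departs from the classical scheme, which is what makes the selection resistant to outlying measurements.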