A Robust Supervised Variable Selection for Noisy High-Dimensional Data
Language: English; Country: United States; Medium: print-electronic
Document type: journal article, research supported by grant
PubMed: 26137474
PubMed Central: PMC4468284
DOI: 10.1155/2015/320385
Knihovny.cz E-resources
- MeSH
- Algorithms * MeSH
- Data Interpretation, Statistical * MeSH
- Humans MeSH
- Metabolomics MeSH
- Proteomics * MeSH
- Gene Expression Regulation / genetics MeSH
- Models, Theoretical * MeSH
- Check Tag
- Humans MeSH
- Publication Type
- Journal Article MeSH
- Research Support MeSH
The Minimum Redundancy Maximum Relevance (MRMR) approach to supervised variable selection is a successful methodology for dimensionality reduction, suitable for high-dimensional data observed in two or more groups. The various available versions of the MRMR approach search for variables with the largest relevance for a classification task while controlling the redundancy of the selected set of variables. However, the usual relevance and redundancy criteria have the disadvantage of being too sensitive to outlying measurements and/or inefficient. We propose a novel approach called Minimum Regularized Redundancy Maximum Robust Relevance (MRRMRR), suitable for noisy high-dimensional data observed in two groups, which combines principles of regularization and robust statistics. In particular, redundancy is measured by a new regularized version of the coefficient of multiple correlation, and relevance is measured by a highly robust correlation coefficient based on least weighted squares regression with data-adaptive weights. We compare various dimensionality reduction methods on three real data sets. To investigate the influence of noise or outliers on the data, we also perform the computations for data artificially contaminated by severe noise of various forms. The experimental results confirm the robustness of the method with respect to outliers.
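To make the underlying selection scheme concrete, the following is a minimal sketch of generic greedy MRMR forward selection, the framework on which the proposed MRRMRR builds. It uses plain Pearson correlation for both relevance and redundancy with the common difference criterion; the paper's actual criteria (a regularized coefficient of multiple correlation for redundancy and a least-weighted-squares-based robust correlation for relevance) are not reproduced here, and the function name and criterion are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def mrmr_select(X, y, k):
    """Greedy MRMR-style forward selection (illustrative sketch).

    Relevance: absolute Pearson correlation of each variable with the
    class labels. Redundancy: mean absolute correlation with the
    already-selected variables. Score = relevance - redundancy
    (the MRMR "difference" criterion). NOT the robust/regularized
    MRRMRR criteria of the paper.
    """
    n, p = X.shape
    relevance = np.abs(
        [np.corrcoef(X[:, j], y)[0, 1] for j in range(p)]
    )
    # Start with the single most relevant variable.
    selected = [int(np.argmax(relevance))]
    while len(selected) < k:
        best, best_score = None, -np.inf
        for j in range(p):
            if j in selected:
                continue
            # Average absolute correlation with variables chosen so far.
            redundancy = np.mean(
                [abs(np.corrcoef(X[:, j], X[:, s])[0, 1]) for s in selected]
            )
            score = relevance[j] - redundancy
            if score > best_score:
                best, best_score = j, score
        selected.append(best)
    return selected
```

Replacing the two correlation measures in this loop with robust and regularized counterparts is precisely where MRRMRR departs from the classical scheme, which is what makes the selection resistant to outlying measurements.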