Technique of Gene Expression Profiles Extraction Based on the Complex Use of Clustering and Classification Methods

. 2020 Aug 12 ; 10 (8) : . [epub] 20200812

Status PubMed-not-MEDLINE Language English Country Switzerland Medium electronic

Document type journal articles

Persistent link   https://www.medvik.cz/link/pmid32806785
Links

PubMed 32806785
PubMed Central PMC7460566
DOI 10.3390/diagnostics10080584
PII: diagnostics10080584
Knihovny.cz E-resources

In this paper, we present research on extracting informative gene expression profiles from a high-dimensional array of gene expressions, taking the state of the patients' health into account, using a clustering method, ML-based binary classifiers and a fuzzy inference system. The proposed stepwise procedure allows us to extract the most informative genes with regard to the subtypes of the disease and/or the state of the patient's health, for subsequent reconstruction of gene regulatory networks based on the selected genes and simulation of the reconstructed models. As experimental data we used publicly available gene expression data that were obtained from DNA microarray experiments and contain two types of patient gene expression profiles: patients with a lung cancer tumor and healthy patients. The data processing procedure comprises the following steps. First, we reduce the number of genes by removing genes that are non-informative in terms of statistical criteria and Shannon entropy. Then, we perform stepwise hierarchical clustering of the gene expression profiles at hierarchical levels from 1 to 10 using the SOTA (Self-Organizing Tree Algorithm) clustering algorithm with the correlation distance metric. The quality of the obtained clustering was evaluated using a complex clustering quality criterion that considers both the distribution of the gene expression profiles relative to the centers of the clusters to which they are allocated and the distribution of the cluster centers. The result of this stage was the selection, at each hierarchical level, of the optimal cluster, which corresponded to the minimum value of the quality criterion. At the next step, we classified the examined objects using four well-known binary classifiers: logistic regression, support-vector machine, decision trees and random forest.
The effectiveness of each technique was evaluated by ROC (Receiver Operating Characteristic) analysis using criteria that include errors of both the first and the second kind as components. The final decision on the most informative subset of gene expression profiles was made using a fuzzy inference system whose inputs are the results of the individual classifiers and whose output is the final decision on the state of the patient's health. In our view, the proposed stepwise procedure for extracting informative gene expression profiles creates the conditions for increasing the effectiveness of the subsequent reconstruction of gene regulatory networks and simulation of the reconstructed models, considering the subtypes of the disease and/or the state of the patient's health.
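
Three of the quantities named in the abstract (Shannon entropy of a profile, the correlation distance, and errors of the first and second kind) can be illustrated with a minimal plain-Python sketch. This is not the authors' implementation: the function names, the histogram-based entropy estimate and the default bin count are assumptions made here for illustration, and the abstract does not state which thresholds are applied to the entropy values.

```python
import math
from collections import Counter


def shannon_entropy(profile, bins=4):
    """Histogram-based Shannon entropy (in bits) of one expression profile.

    The equal-width binning and the bin count are illustrative assumptions.
    """
    lo, hi = min(profile), max(profile)
    if hi == lo:
        return 0.0  # a constant profile occupies a single bin
    width = (hi - lo) / bins
    # Map each value to a bin index; the maximum is clamped into the last bin.
    counts = Counter(min(int((v - lo) / width), bins - 1) for v in profile)
    n = len(profile)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def correlation_distance(a, b):
    """Correlation distance 1 - r, with r the Pearson correlation of a and b.

    Raises ZeroDivisionError if either profile is constant.
    """
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return 1.0 - cov / math.sqrt(var_a * var_b)


def error_rates(y_true, y_pred):
    """Return (type I, type II) error rates for binary labels 0 and 1.

    Type I is the false positive rate, type II the false negative rate.
    Both classes must be present in y_true.
    """
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    negatives = sum(1 for t in y_true if t == 0)
    positives = sum(1 for t in y_true if t == 1)
    return fp / negatives, fn / positives
```

For example, `correlation_distance([1, 2, 3], [3, 2, 1])` evaluates to 2.0 for perfectly anti-correlated profiles, while profiles that differ only by a positive scale factor are at distance 0. This is why a correlation-based metric groups genes by the shape of their expression pattern rather than by absolute level.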

