Multi-class classification of COVID-19 documents using machine learning algorithms
Status: PubMed-not-MEDLINE · Language: English · Country: United States · Medium: print-electronic
Document type: journal article
PubMed: 36465147
PubMed Central: PMC9707112
DOI: 10.1007/s10844-022-00768-8
PII: 768
- Keywords: COVID-19, Machine learning algorithms, Multi-class classification, Text mining
- Publication type: journal article (MeSH)
Document classification is a crucial task for most biomedical research paper corpora. Especially during a global epidemic, researchers across many fields need to identify relevant scientific papers accurately and quickly from a flood of biomedical publications. Classification can also help learners and researchers assign a paper to the appropriate category and locate relevant work within a very short time. A biomedical document classifier needs to be designed to go beyond a "general" text classifier because it does not depend only on the text itself (i.e., titles and abstracts) but can also exploit other information, such as entities extracted with medical taxonomies or bibliometric data. The main objective of this research was to determine which types of information or features, and which representation methods, influence the biomedical document classification task. To this end, we ran several experiments with conventional text classification methods using different kinds of features extracted from titles, abstracts, and bibliometric data. These procedures include data cleaning, feature engineering, and multi-class classification. Eleven variants of input data tables were created and analyzed with ten machine learning algorithms. We also evaluated the data efficiency and interpretability of these models as essential properties of any biomedical research paper classification system handling the COVID-19 health crisis specifically. Our major findings are that TF-IDF representations outperform entity extraction methods and that the abstract alone provides sufficient information for correct classification. Among the machine learning algorithms used, the best performance over the various forms of document representation was achieved by Random Forest and a neural network (BERT). Our results lead to a concrete guideline for practitioners on biomedical document classification.
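As an illustration of the pipeline described in the abstract (a minimal sketch, not the authors' implementation), the Python snippet below builds a TF-IDF representation of the concatenated title and abstract and trains a Random Forest multi-class classifier with scikit-learn. The file name litcovid_sample.csv and the column names title, abstract, and category are hypothetical placeholders for whatever input table is used.

```python
# Minimal sketch of TF-IDF + Random Forest document classification
# (illustrative only; not the authors' code). Assumes a CSV with
# hypothetical columns "title", "abstract", and "category".
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

df = pd.read_csv("litcovid_sample.csv")  # hypothetical input file
texts = (df["title"] + " " + df["abstract"]).fillna("")
labels = df["category"]

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, stratify=labels, random_state=42
)

pipeline = Pipeline([
    # TF-IDF representation of title + abstract text
    ("tfidf", TfidfVectorizer(stop_words="english", max_features=20000)),
    # Random Forest, one of the classifiers named in the abstract
    ("clf", RandomForestClassifier(n_estimators=300, n_jobs=-1, random_state=42)),
])

pipeline.fit(X_train, y_train)
print(classification_report(y_test, pipeline.predict(X_test)))
```

Swapping the final estimator in the pipeline (e.g., logistic regression, naive Bayes, or a separately fine-tuned BERT model) reproduces the kind of algorithm comparison over a fixed document representation that the abstract describes.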