Multi-class classification of COVID-19 documents using machine learning algorithms
Status: PubMed-not-MEDLINE · Language: English · Country: United States · Medium: print-electronic
Document type: journal article
PubMed: 36465147
PubMed Central: PMC9707112
DOI: 10.1007/s10844-022-00768-8
PII: 768
- Keywords: COVID-19, Machine learning algorithms, Multi-class classification, Text mining
- Publication type: journal article (MeSH)
Document classification is a crucial task for most biomedical research paper corpora. Especially during a global epidemic, researchers across many fields need to identify relevant scientific papers accurately and quickly from a flood of biomedical publications. Classification can also help learners and researchers assign a paper to the appropriate category and locate relevant work within a very short time. A biomedical document classifier needs to be designed to go beyond a "general" text classifier because it does not depend only on the text itself (i.e., titles and abstracts) but can also exploit other information, such as entities extracted with medical taxonomies or bibliometric data. The main objective of this research was to determine which types of information or features, and which representation methods, influence the biomedical document classification task. To this end, we ran several experiments with conventional text classification methods using different kinds of features extracted from titles, abstracts, and bibliometric data. These procedures include data cleaning, feature engineering, and multi-class classification. Eleven variants of input data tables were created and analyzed with ten machine learning algorithms. We also evaluated the data efficiency and interpretability of these models as essential properties of any biomedical research paper classification system handling the COVID-19 health crisis specifically. Our major findings are that TF-IDF representations outperform entity extraction methods and that the abstract alone provides sufficient information for correct classification. Among the machine learning algorithms used, the best performance over the various forms of document representation was achieved by Random Forest and a neural network (BERT). Our results lead to a concrete guideline for practitioners on biomedical document classification.
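As an illustration of the pipeline described in the abstract (a minimal sketch, not the authors' implementation), the Python snippet below builds a TF-IDF representation of the concatenated title and abstract and trains a Random Forest multi-class classifier with scikit-learn. The file name litcovid_sample.csv and the column names title, abstract, and category are hypothetical placeholders for whatever input table is used.

```python
# Minimal sketch of TF-IDF + Random Forest document classification
# (illustrative only; not the authors' code). Assumes a CSV with
# hypothetical columns "title", "abstract", and "category".
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

df = pd.read_csv("litcovid_sample.csv")  # hypothetical input file
texts = (df["title"] + " " + df["abstract"]).fillna("")
labels = df["category"]

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, stratify=labels, random_state=42
)

pipeline = Pipeline([
    # TF-IDF representation of title + abstract text
    ("tfidf", TfidfVectorizer(stop_words="english", max_features=20000)),
    # Random Forest, one of the classifiers named in the abstract
    ("clf", RandomForestClassifier(n_estimators=300, n_jobs=-1, random_state=42)),
])

pipeline.fit(X_train, y_train)
print(classification_report(y_test, pipeline.predict(X_test)))
```

Swapping the final estimator in the pipeline (e.g., logistic regression, naive Bayes, or a separately fine-tuned BERT model) reproduces the kind of algorithm comparison over a fixed document representation that the abstract describes.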