Impact of COVID-19 research: a study on predicting influential scholarly documents using machine learning and a domain-independent knowledge graph
Language English Country Great Britain, England Medium electronic
Document type journal articles
Grant support
IGA 16/2022
Vysoká Škola Ekonomická v Praze
CHIST-ERA-19-XAI-003
European Commission
PubMed
38017587
PubMed Central
PMC10683290
DOI
10.1186/s13326-023-00298-4
PII: 10.1186/s13326-023-00298-4
- Keywords
- COVID-19, Domain-independent knowledge graph, Influential scholarly document prediction, Machine learning algorithms, Text mining, World Health Organization
- MeSH
- Algorithms MeSH
- COVID-19 * MeSH
- Language MeSH
- Humans MeSH
- Pattern Recognition, Automated * MeSH
- Machine Learning MeSH
- Check Tag
- Humans MeSH
- Publication type
- Journal Article MeSH
Multiple studies have investigated bibliometric features and uncategorized scholarly documents for the task of predicting influential scholarly documents. In this paper, we describe work that goes beyond bibliometric metadata for this task and also examines the prediction over categorized scholarly documents. We introduce a new approach that enhances the document representation with a domain-independent knowledge graph in order to find influential scholarly documents using categorized scholarly content. As the input collection, we use the WHO corpus of scholarly documents on COVID-19. The study examines several document representation methods for machine learning: TF-IDF, BOW, and embedding-based language models (BERT). Of these, the TF-IDF representation performed best. Among the machine learning methods tested, logistic regression outperformed the others for scholarly document category classification, and the random forest algorithm obtained the best results for influential scholarly document prediction. The random forest result was obtained with the help of a domain-independent knowledge graph, specifically DBpedia, used to enhance the document representation for predicting influential scholarly documents with categorical scholarly content. In this case, our study combines state-of-the-art machine learning methods with the BOW document representation, which we enhance with the direct type (RDF type) and unqualified relations from DBpedia. In this experiment, we found no impact of the enhanced document representation on scholarly document category classification, but we did find an effect on influential scholarly document prediction with categorical data.
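The idea of enhancing a BOW representation with knowledge-graph types can be illustrated with a minimal sketch. It is not the authors' implementation: the `ENTITY_TYPES` lookup, the `type=` feature prefix, and all function names are assumptions made for illustration, and the type assignments are hand-written rather than retrieved from DBpedia.

```python
from collections import Counter

# Hypothetical lookup from a surface token to RDF type labels. In a real
# pipeline these would come from linking text to DBpedia entities; the
# entries below are illustrative only and were not queried from DBpedia.
ENTITY_TYPES = {
    "vaccine": ["dbo:Drug"],
    "lockdown": ["dbo:Event"],
}


def bag_of_words(text):
    """Count lowercase whitespace-separated tokens."""
    return Counter(text.lower().split())


def enrich_with_types(bow, entity_types=ENTITY_TYPES):
    """Return a copy of the bag with one extra feature per linked type."""
    enriched = Counter(bow)
    for token, count in bow.items():
        for rdf_type in entity_types.get(token, []):
            # Prefix keeps type features distinct from ordinary tokens.
            enriched["type=" + rdf_type] += count
    return enriched


def vectorize(bows):
    """Turn a list of bags into a shared vocabulary and a count matrix."""
    vocab = sorted(set().union(*bows))
    matrix = [[bow[term] for term in vocab] for bow in bows]
    return vocab, matrix


docs = ["Vaccine trial shows vaccine safety", "Lockdown policy"]
bows = [enrich_with_types(bag_of_words(d)) for d in docs]
vocab, matrix = vectorize(bows)
```

The resulting count matrix could then be passed to any classifier, for example a random forest, in place of the plain BOW matrix, so that documents mentioning different entities of the same type share a feature.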
Department of Econometrics, Prague University of Economics and Business, Prague, Czech Republic
L3S Research Center, Leibniz University Hannover, Hanover, Germany
Leibniz Information Centre for Science and Technology, Hannover, Germany