This record comes from PubMed

Impact of COVID-19 research: a study on predicting influential scholarly documents using machine learning and a domain-independent knowledge graph

Rabby, Gollam
L3S Research Center, Leibniz University Hannover, Hanover, Germany; Department of Information and Knowledge Engineering, Prague University of Economics and Business, nám. Winstona Churchilla 1938/4, 120 00, Prague, Czech Republic. gollam.rabby@l3s.de
D'Souza, Jennifer
Leibniz Information Centre for Science and Technology, Hannover, Germany
Oelen, Allard
Leibniz Information Centre for Science and Technology, Hannover, Germany
Dvorackova, Lucie
Department of Econometrics, Prague University of Economics and Business, Prague, Czech Republic
Svátek, Vojtěch
Department of Information and Knowledge Engineering, Prague University of Economics and Business, nám. Winstona Churchilla 1938/4, 120 00, Prague, Czech Republic
Auer, Sören
L3S Research Center, Leibniz University Hannover, Hanover, Germany; Leibniz Information Centre for Science and Technology, Hannover, Germany

Journal of Biomedical Semantics. 2023 Nov 28;14(1):18. [Epub 2023 Nov 28]

J Biomed Semantics
ISSN 2041-1480
Source

Language English Country England, Great Britain Media electronic

Document type Journal Article

Persistent link https://www.medvik.cz/link/pmid38017587

Grant support
IGA 16/2022 Vysoká Škola Ekonomická v Praze
CHIST-ERA-19-XAI-003 European Commission

Online Full text

PubMed 38017587
PubMed Central PMC10683290
DOI 10.1186/s13326-023-00298-4
PII: 10.1186/s13326-023-00298-4

Keywords
COVID-19, Domain-independent knowledge graph, Influential scholarly document prediction, Machine learning algorithms, Text mining, World Health Organization
MeSH
Algorithms
COVID-19 *
Language
Humans
Pattern Recognition, Automated *
Machine Learning
Check Tag
Humans
Publication type
Journal Article

Several studies have addressed the task of predicting influential scholarly documents using bibliometric features and uncategorized scholarly documents. In this paper, we describe work that goes beyond bibliometric metadata to predict influential scholarly documents, and we also examine the prediction task over categorized scholarly documents. We introduce a new approach that enhances the document representation with a domain-independent knowledge graph in order to find influential scholarly documents using categorized scholarly content. As the input collection, we use the WHO corpus of scholarly documents on the theme of COVID-19. The study examines different document representation methods for machine learning, including TF-IDF, bag-of-words (BOW), and embedding-based language models (BERT). The TF-IDF document representation works better than the others. Of the machine learning methods tested, logistic regression outperformed the others for scholarly document category classification, and the random forest algorithm obtained the best results for influential scholarly document prediction when a domain-independent knowledge graph, specifically DBpedia, was used to enhance the document representation for predicting influential scholarly documents with categorical scholarly content. In this case, our study combines state-of-the-art machine learning methods with the BOW document representation, which we enhance with the direct type (RDF type) and unqualified relations from DBpedia. In this experiment, the enhanced document representation had no impact on scholarly document category classification, but it did have an effect on influential scholarly document prediction with categorical data.
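The idea of enhancing a BOW representation with RDF types from a knowledge graph can be illustrated with a minimal sketch. It is not the authors' implementation: the lookup table `TOY_TYPES`, the `type=` feature prefix, and the helper names are assumptions made for illustration, and a real pipeline would resolve entity mentions with an entity linker and query DBpedia rather than use a hard-coded dictionary.

```python
import re
from collections import Counter

# Hypothetical stand-in for a DBpedia lookup: token -> rdf:type labels.
# The mappings below are toy values chosen for illustration only.
TOY_TYPES = {
    "covid-19": ["dbo:Disease"],
    "vaccine": ["dbo:Drug"],
    "who": ["dbo:Organisation"],
}


def tokenize(text):
    """Lowercase the text and split it into word tokens (hyphens kept)."""
    return re.findall(r"[a-z0-9][a-z0-9\-]*", text.lower())


def bow(text):
    """Plain bag-of-words: token -> number of occurrences."""
    return Counter(tokenize(text))


def enriched_bow(text, types=TOY_TYPES):
    """Bag-of-words plus one extra feature per type of each known token.

    Each type feature is incremented by the count of the token that
    triggered it, so frequent entities contribute more weight.
    """
    counts = bow(text)
    # Iterate over a snapshot because new keys are added inside the loop.
    for token, n in list(counts.items()):
        for rdf_type in types.get(token, []):
            counts["type=" + rdf_type] += n
    return counts


doc = "WHO guidance on COVID-19 vaccine trials and vaccine safety"
features = enriched_bow(doc)
```

The resulting `features` counter can be passed to any classifier that accepts sparse count features; the type features let documents that mention different entities of the same class share a common signal.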
