Why was this cited? Explainable machine learning applied to COVID-19 research literature

. 2022 ; 127 (5) : 2313-2349. [epub] 20220409

Status PubMed-not-MEDLINE Language English Country Switzerland Medium print-electronic

Document type journal articles

Persistent link   https://www.medvik.cz/link/pmid35431364

Multiple studies have investigated bibliometric factors predictive of the citation count a research article will receive. In this article, we go beyond bibliometric data by using a range of machine learning techniques to find patterns predictive of citation count using both article content and available metadata. As the input collection, we use the CORD-19 corpus containing research articles, mostly from biology and medicine, applicable to the COVID-19 crisis. Our study employs a combination of state-of-the-art machine learning techniques for text understanding, including the embeddings-based language model BERT and several systems for detection and semantic expansion of entities: ConceptNet, PubTator, and ScispaCy. To interpret the resulting models, we use several explanation algorithms: random forest feature importance, LIME, and Shapley values. We compare the performance and comprehensibility of models obtained by "black-box" machine learning algorithms (neural networks and random forests) with models built with rule learning (CORELS, CBA), which are intrinsically explainable. Multiple rules were discovered that referred to biomedical entities of potential interest. Of the rules with the highest lift measure, several pointed to dipeptidyl peptidase 4 (DPP4), a known MERS-CoV receptor and a critical determinant of camel-to-human transmission of the camel coronavirus (MERS-CoV). Some other interesting patterns related to the type of animal investigated were found. Articles referring to bats and camels tend to draw citations, while articles referring to most other animal species related to coronavirus tend to receive few citations. Bat coronavirus is the only other virus from a non-human species in the betaB clade along with the SARS-CoV and SARS-CoV-2 viruses. MERS-CoV is in a sister betaC clade, also close to human SARS coronaviruses. Thus, both species linked to high citation counts harbor coronaviruses that are phylogenetically closer to human SARS viruses. On the other hand, feline (FIPV, FCoV) and canine (CCoV) coronaviruses are in the alphacoronavirus clade and more distant from the betaB clade with human SARS viruses. Other results include detection of an apparent citation bias favouring authors with Western-sounding names. TF-IDF weights and a binary word incidence matrix performed equally well, with the latter resulting in better interpretability. The best predictive performance was obtained with a "black-box" method, a neural network. The rule-based models led to the most insights, especially when coupled with text representation using semantic entity detection methods. Follow-up work should focus on the analysis of citation patterns in the context of phylogenetic trees, as well as on patterns referring to DPP4, which is currently considered a SARS-CoV-2 therapeutic target.
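The lift measure mentioned in the abstract can be illustrated with a minimal sketch. Lift is the confidence of a rule divided by the base rate of its consequent, so values above 1 indicate that the antecedent is associated with the outcome more often than chance would suggest. The function name `rule_lift`, the item labels, and the toy data below are hypothetical and are not taken from the study; the sketch does not reproduce CBA or CORELS, it only shows how lift is computed for a single rule.

```python
def rule_lift(transactions, antecedent, consequent):
    """Lift of the rule `antecedent -> consequent` over a list of item sets.

    lift = P(consequent | antecedent) / P(consequent)
    """
    n = len(transactions)
    a = set(antecedent)
    n_a = sum(1 for t in transactions if a <= t)        # articles matching the antecedent
    n_c = sum(1 for t in transactions if consequent in t)  # articles with the consequent
    n_ac = sum(1 for t in transactions if a <= t and consequent in t)
    if n == 0 or n_a == 0 or n_c == 0:
        return float("nan")  # lift is undefined without support
    confidence = n_ac / n_a
    base_rate = n_c / n
    return confidence / base_rate


# Hypothetical toy corpus: each article is the set of entities detected in it,
# plus a "highly_cited" label. These values are invented for illustration only.
articles = [
    {"bat", "highly_cited"},
    {"bat", "camel", "highly_cited"},
    {"camel", "highly_cited"},
    {"cat"},
    {"dog"},
    {"cat", "dog"},
    {"bat"},
    {"dog", "highly_cited"},
]

bat_lift = rule_lift(articles, {"bat"}, "highly_cited")
cat_lift = rule_lift(articles, {"cat"}, "highly_cited")
print(f"lift(bat -> highly_cited) = {bat_lift:.3f}")
print(f"lift(cat -> highly_cited) = {cat_lift:.3f}")
```

In this invented data, the rule for bats has a lift above 1 and the rule for cats has a lift of 0, which mirrors the direction of the pattern reported in the abstract but says nothing about its actual magnitude.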

