Article

Why was this cited? Explainable machine learning applied to COVID-19 research literature

L. Beranová, MP. Joachimiak, T. Kliegr, G. Rabby, V. Sklenák

Beranová, Lucie: Department of Econometrics, Faculty of Informatics and Statistics, VSE Praha, W Churchill sq. 4, Prague, Czech Republic
Joachimiak, Marcin P: Environmental Genomics and Systems Biology Division at Lawrence Berkeley National Laboratory, Berkeley, USA
Kliegr, Tomáš: Department of Information and Knowledge Engineering, VSE Praha, Prague, Czech Republic
Rabby, Gollam: Department of Information and Knowledge Engineering, VSE Praha, Prague, Czech Republic
Sklenák, Vilém: Centre of Information and Library Services, VSE Praha, Prague, Czech Republic; Department of Information and Knowledge Engineering, VSE Praha, Prague, Czech Republic

Scientometrics. 2022; 127(5): 2313-2349. [pub] 2022-04-09

ISSN 0138-9130

Language: English. Country: Netherlands

Document type: journal articles

Persistent link: https://www.medvik.cz/link/bmc22017614

PubMed 35431364
DOI 10.1007/s11192-022-04314-9

Publication type
journal articles (MeSH)

Multiple studies have investigated bibliometric factors predictive of the citation count a research article will receive. In this article, we go beyond bibliometric data by using a range of machine learning techniques to find patterns predictive of citation count using both article content and available metadata. As the input collection, we use the CORD-19 corpus, which contains research articles, mostly from biology and medicine, applicable to the COVID-19 crisis. Our study employs a combination of state-of-the-art machine learning techniques for text understanding, including the embeddings-based language model BERT and several systems for detection and semantic expansion of entities: ConceptNet, PubTator, and ScispaCy. To interpret the resulting models, we use several explanation algorithms: random forest feature importance, LIME, and Shapley values. We compare the performance and comprehensibility of models obtained by "black-box" machine learning algorithms (neural networks and random forests) with models built with rule learning (CORELS, CBA), which are intrinsically explainable. Multiple rules referring to biomedical entities of potential interest were discovered. Of the rules with the highest lift measure, several pointed to dipeptidyl peptidase 4 (DPP4), a known MERS-CoV receptor and a critical determinant of camel-to-human transmission of the camel coronavirus (MERS-CoV). Some other interesting patterns related to the type of animal investigated were found. Articles referring to bats and camels tend to draw citations, while articles referring to most other animal species related to coronavirus receive few citations. Bat coronavirus is the only other virus from a non-human species in the betaB clade, along with the SARS-CoV and SARS-CoV-2 viruses. MERS-CoV is in a sister betaC clade, also close to human SARS coronaviruses. Thus, both species linked to high citation counts harbor coronaviruses that are more phylogenetically similar to human SARS viruses.
On the other hand, feline (FIPV, FCoV) and canine (CCoV) coronaviruses are in the alpha coronavirus clade and are more distant from the betaB clade with human SARS viruses. Other results include the detection of an apparent citation bias favouring authors with Western-sounding names. TF-IDF weights and a binary word incidence matrix performed equally well, with the latter resulting in better interpretability. The best predictive performance was obtained with a "black-box" method, a neural network. The rule-based models led to the most insights, especially when coupled with text representation using semantic entity detection methods. Follow-up work should focus on the analysis of citation patterns in the context of phylogenetic trees, as well as on patterns referring to DPP4, which is currently considered a SARS-CoV-2 therapeutic target.
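
The abstract ranks the discovered rules by their lift. As a rough illustration only (this is not the authors' code, and the toy records are made up rather than taken from CORD-19), the lift of a rule A -> C can be computed as the confidence of the rule divided by the overall share of records containing C. A value above 1 means the antecedent co-occurs with the consequent more often than independence would predict. The function name `lift`, the item labels, and the `articles` list are all hypothetical.

```python
def lift(transactions, antecedent, consequent):
    """Lift of the rule antecedent -> consequent.

    transactions: list of sets of items (one set per article).
    antecedent, consequent: sets of items.
    """
    n = len(transactions)
    # Number of records containing the antecedent, the consequent, and both.
    n_a = sum(1 for t in transactions if antecedent <= t)
    n_c = sum(1 for t in transactions if consequent <= t)
    n_both = sum(1 for t in transactions if (antecedent | consequent) <= t)
    if n_a == 0 or n_c == 0:
        raise ValueError("antecedent and consequent must each occur at least once")
    confidence = n_both / n_a          # P(C | A)
    consequent_support = n_c / n       # P(C)
    return confidence / consequent_support


# Hypothetical toy data: each set lists entities detected in one article,
# plus a "highly_cited" label. These are NOT results from the paper.
articles = [
    {"DPP4", "camel", "highly_cited"},
    {"DPP4", "highly_cited"},
    {"camel", "highly_cited"},
    {"dog"},
    {"cat"},
    {"dog", "highly_cited"},
    {"cat"},
    {"DPP4"},
]

dpp4_lift = lift(articles, {"DPP4"}, {"highly_cited"})
cat_lift = lift(articles, {"cat"}, {"highly_cited"})
print(round(dpp4_lift, 3))  # 1.333
print(round(cat_lift, 3))   # 0.0
```

In this toy data, two of the three articles mentioning DPP4 are highly cited, against half of all articles, so the rule's lift is (2/3) / (1/2) = 4/3. Rule learners such as CBA typically also filter candidate rules by minimum support and confidence before ranking them; that step is omitted here.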

000: 00000naa a2200000 a 4500

001: bmc22017614

003: CZ-PrNML

005: 20220720100311.0

007: ta

008: 220718s2022 ne f 000 0|eng||

009: AR

024 7_: $a 10.1007/s11192-022-04314-9 $2 doi

035 __: $a (PubMed)35431364

040 __: $a ABA008 $b cze $d ABA008 $e AACR2

041 0_: $a eng

044 __: $a ne

100 1_: $a Beranová, Lucie $u Department of Econometrics, Faculty of Informatics and Statistics, VSE Praha, W Churchill sq. 4, Prague, Czech Republic $1 https://orcid.org/0000000161039388

245 10: $a Why was this cited? Explainable machine learning applied to COVID-19 research literature / $c L. Beranová, MP. Joachimiak, T. Kliegr, G. Rabby, V. Sklenák

655 _2: $a journal articles $7 D016428

700 1_: $a Joachimiak, Marcin P $u Environmental Genomics and Systems Biology Division at Lawrence Berkeley National Laboratory, Berkeley, USA $1 https://orcid.org/000000018175045X

700 1_: $a Kliegr, Tomáš $u Department of Information and Knowledge Engineering, VSE Praha, Prague, Czech Republic $1 https://orcid.org/0000000272610380

700 1_: $a Rabby, Gollam $u Department of Information and Knowledge Engineering, VSE Praha, Prague, Czech Republic $1 https://orcid.org/0000000212120101

700 1_: $a Sklenák, Vilém $u Centre of Information and Library Services, VSE Praha, Prague, Czech Republic $u Department of Information and Knowledge Engineering, VSE Praha, Prague, Czech Republic $1 https://orcid.org/0000000289660798

773 0_: $w MED00007848 $t Scientometrics $x 0138-9130 $g Vol. 127, No. 5 (2022), pp. 2313-2349

856 41: $u https://pubmed.ncbi.nlm.nih.gov/35431364 $y Pubmed

910 __: $a ABA008 $b sig $c sign $y - $z 0

990 __: $a 20220718 $b ABA008

991 __: $a 20220720100306 $b ABA008

999 __: $a ind $b bmc $g 1816675 $s 1168856

BAS __: $a 3

BAS __: $a PreBMC

BMC __: $a 2022 $b 127 $c 5 $d 2313-2349 $e 20220409 $i 0138-9130 $m Scientometrics $n Scientometrics $x MED00007848

LZP __: $a Pubmed-20220718
