
Why was this cited? Explainable machine learning applied to COVID-19 research literature

L. Beranová, MP. Joachimiak, T. Kliegr, G. Rabby, V. Sklenák

Scientometrics. 2022; 127 (5): 2313-2349. [pub] 20220409

Language: English. Country: Netherlands

Document type: journal articles

Persistent link   https://www.medvik.cz/link/bmc22017614

Multiple studies have investigated bibliometric factors predictive of the citation count a research article will receive. In this article, we go beyond bibliometric data by using a range of machine learning techniques to find patterns predictive of citation count using both article content and available metadata. As the input collection, we use the CORD-19 corpus containing research articles, mostly from biology and medicine, applicable to the COVID-19 crisis. Our study employs a combination of state-of-the-art machine learning techniques for text understanding, including the embeddings-based language model BERT and several systems for detection and semantic expansion of entities: ConceptNet, PubTator and ScispaCy. To interpret the resulting models, we use several explanation algorithms: random forest feature importance, LIME, and Shapley values. We compare the performance and comprehensibility of models obtained by "black-box" machine learning algorithms (neural networks and random forests) with models built with rule learning (CORELS, CBA), which are intrinsically explainable. Multiple rules were discovered that referred to biomedical entities of potential interest. Of the rules with the highest lift measure, several pointed to dipeptidyl peptidase 4 (DPP4), a known MERS-CoV receptor and a critical determinant of camel-to-human transmission of the camel coronavirus (MERS-CoV). Some other interesting patterns related to the type of animal investigated were found. Articles referring to bats and camels tend to draw citations, while articles referring to most other animal species related to coronavirus are cited less often. Bat coronavirus is the only other virus from a non-human species in the betaB clade along with the SARS-CoV and SARS-CoV-2 viruses. MERS-CoV is in a sister betaC clade, also close to human SARS coronaviruses. Thus both species linked to high citation counts harbor coronaviruses that are more phylogenetically similar to human SARS viruses. On the other hand, feline (FIPV, FCOV) and canine (CCOV) coronaviruses are in the alpha coronavirus clade and more distant from the betaB clade with human SARS viruses. Other results include detection of an apparent citation bias favouring authors with western-sounding names. TF-IDF weights and a binary word incidence matrix performed equally well, with the latter resulting in better interpretability. The best predictive performance was obtained with a "black-box" method, a neural network. The rule-based models led to the most insights, especially when coupled with text representation using semantic entity detection methods. Follow-up work should focus on the analysis of citation patterns in the context of phylogenetic trees, as well as on patterns referring to DPP4, which is currently considered a SARS-CoV-2 therapeutic target.
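
The abstract ranks rules by their lift measure. As a rough illustration only, the sketch below computes lift for a single rule using the standard definition, confidence(A -> C) divided by support(C). The function name `lift`, the toy article sets, and the label `highly_cited` are assumptions made up for this example; they are not data or code from the study.

```python
def lift(transactions, antecedent, consequent):
    """Lift of the rule antecedent -> consequent over a list of item sets.

    lift = confidence(antecedent -> consequent) / support(consequent)
    A value above 1 means the antecedent co-occurs with the consequent
    more often than expected if the two were independent.
    """
    a, c = set(antecedent), set(consequent)
    n = len(transactions)
    n_a = sum(1 for t in transactions if a <= t)          # rows matching antecedent
    n_c = sum(1 for t in transactions if c <= t)          # rows matching consequent
    n_ac = sum(1 for t in transactions if (a | c) <= t)   # rows matching both
    if n == 0 or n_a == 0 or n_c == 0:
        raise ValueError("lift is undefined when a support count is zero")
    confidence = n_ac / n_a
    return confidence / (n_c / n)


# Invented toy data: each set lists entities detected in one article,
# plus a hypothetical "highly_cited" label.
articles = [
    {"DPP4", "highly_cited"},
    {"DPP4", "highly_cited"},
    {"bat", "highly_cited"},
    {"feline"},
    {"canine"},
    {"DPP4"},
]

rule_lift = lift(articles, {"DPP4"}, {"highly_cited"})
print(round(rule_lift, 3))
```

In this toy set the rule has confidence 2/3 against a base rate of 1/2, so its lift exceeds 1; the figures in the study itself come from the CORD-19 corpus and are not reproduced here.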

Citations provided by Crossref.org

000      
00000naa a2200000 a 4500
001      
bmc22017614
003      
CZ-PrNML
005      
20220720100311.0
007      
ta
008      
220718s2022 ne f 000 0|eng||
009      
AR
024    7_
$a 10.1007/s11192-022-04314-9 $2 doi
035    __
$a (PubMed)35431364
040    __
$a ABA008 $b cze $d ABA008 $e AACR2
041    0_
$a eng
044    __
$a ne
100    1_
$a Beranová, Lucie $u Department of Econometrics, Faculty of Informatics and Statistics, VSE Praha, W Churchill sq. 4, Prague, Czech Republic $1 https://orcid.org/0000000161039388
245    10
$a Why was this cited? Explainable machine learning applied to COVID-19 research literature / $c L. Beranová, MP. Joachimiak, T. Kliegr, G. Rabby, V. Sklenák
520    9_
$a Multiple studies have investigated bibliometric factors predictive of the citation count a research article will receive. In this article, we go beyond bibliometric data by using a range of machine learning techniques to find patterns predictive of citation count using both article content and available metadata. As the input collection, we use the CORD-19 corpus containing research articles, mostly from biology and medicine, applicable to the COVID-19 crisis. Our study employs a combination of state-of-the-art machine learning techniques for text understanding, including the embeddings-based language model BERT and several systems for detection and semantic expansion of entities: ConceptNet, PubTator and ScispaCy. To interpret the resulting models, we use several explanation algorithms: random forest feature importance, LIME, and Shapley values. We compare the performance and comprehensibility of models obtained by "black-box" machine learning algorithms (neural networks and random forests) with models built with rule learning (CORELS, CBA), which are intrinsically explainable. Multiple rules were discovered that referred to biomedical entities of potential interest. Of the rules with the highest lift measure, several pointed to dipeptidyl peptidase 4 (DPP4), a known MERS-CoV receptor and a critical determinant of camel-to-human transmission of the camel coronavirus (MERS-CoV). Some other interesting patterns related to the type of animal investigated were found. Articles referring to bats and camels tend to draw citations, while articles referring to most other animal species related to coronavirus are cited less often. Bat coronavirus is the only other virus from a non-human species in the betaB clade along with the SARS-CoV and SARS-CoV-2 viruses. MERS-CoV is in a sister betaC clade, also close to human SARS coronaviruses. Thus both species linked to high citation counts harbor coronaviruses that are more phylogenetically similar to human SARS viruses. On the other hand, feline (FIPV, FCOV) and canine (CCOV) coronaviruses are in the alpha coronavirus clade and more distant from the betaB clade with human SARS viruses. Other results include detection of an apparent citation bias favouring authors with western-sounding names. TF-IDF weights and a binary word incidence matrix performed equally well, with the latter resulting in better interpretability. The best predictive performance was obtained with a "black-box" method, a neural network. The rule-based models led to the most insights, especially when coupled with text representation using semantic entity detection methods. Follow-up work should focus on the analysis of citation patterns in the context of phylogenetic trees, as well as on patterns referring to DPP4, which is currently considered a SARS-CoV-2 therapeutic target.
655    _2
$a journal articles $7 D016428
700    1_
$a Joachimiak, Marcin P $u Environmental Genomics and Systems Biology Division at Lawrence Berkeley National Laboratory, Berkeley, USA $1 https://orcid.org/000000018175045X
700    1_
$a Kliegr, Tomáš $u Department of Information and Knowledge Engineering, VSE Praha, Prague, Czech Republic $1 https://orcid.org/0000000272610380
700    1_
$a Rabby, Gollam $u Department of Information and Knowledge Engineering, VSE Praha, Prague, Czech Republic $1 https://orcid.org/0000000212120101
700    1_
$a Sklenák, Vilém $u Centre of Information and Library Services, VSE Praha, Prague, Czech Republic $u Department of Information and Knowledge Engineering, VSE Praha, Prague, Czech Republic $1 https://orcid.org/0000000289660798
773    0_
$w MED00007848 $t Scientometrics $x 0138-9130 $g Vol. 127, No. 5 (2022), pp. 2313-2349
856    41
$u https://pubmed.ncbi.nlm.nih.gov/35431364 $y Pubmed
910    __
$a ABA008 $b sig $c sign $y - $z 0
990    __
$a 20220718 $b ABA008
991    __
$a 20220720100306 $b ABA008
999    __
$a ind $b bmc $g 1816675 $s 1168856
BAS    __
$a 3
BAS    __
$a PreBMC
BMC    __
$a 2022 $b 127 $c 5 $d 2313-2349 $e 20220409 $i 0138-9130 $m Scientometrics $n Scientometrics $x MED00007848
LZP    __
$a Pubmed-20220718
