
Dataset from a human-in-the-loop approach to identify functionally important protein residues from literature

M. Vollmar, S. Tirunagari, D. Harrus, D. Armstrong, R. Gáborová, D. Gupta, MQL. Afonso, G. Evans, S. Velankar

Scientific data. 2024 ; 11 (1) : 1032. [pub] 20240927

Language: English. Country: England, United Kingdom

Document type: journal articles, dataset

Persistent link   https://www.medvik.cz/link/bmc24018848

Grant support
945405 EC | EU Framework Programme for Research and Innovation H2020 | H2020 Priority Excellent Science | H2020 Marie Skłodowska-Curie Actions (H2020 Excellent Science - Marie Skłodowska-Curie Actions)

We present a novel system that leverages curators in the loop to develop a dataset and model for detecting structure features and functional annotations at residue-level from standard publication text. Our approach involves the integration of data from multiple resources, including PDBe, EuropePMC, PubMedCentral, and PubMed, combined with annotation guidelines from UniProt, and LitSuggest and HuggingFace models as tools in the annotation process. A team of seven annotators manually curated ten articles for named entities, which we utilized to train a starting PubmedBert model from HuggingFace. Using a human-in-the-loop annotation system, we iteratively developed the best model with commendable performance metrics of 0.90 for precision, 0.92 for recall, and 0.91 for F1-measure. Our proposed system showcases a successful synergy of machine learning techniques and human expertise in curating a dataset for residue-level functional annotations and protein structure features. The results demonstrate the potential for broader applications in protein research, bridging the gap between advanced machine learning models and the indispensable insights of domain experts.
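As a quick consistency check on the figures quoted in the abstract, the sketch below recomputes the F1-measure as the harmonic mean of the stated precision and recall. It assumes the reported F1 was derived from those two rounded values in the standard way; the record does not say how the scores were averaged across entity types, so this is an illustration rather than the authors' evaluation code. The function name `f1_score` is hypothetical.

```python
def f1_score(precision: float, recall: float) -> float:
    """Return the harmonic mean of precision and recall (standard F1 definition)."""
    if precision + recall == 0:
        # F1 is undefined when both inputs are zero; report 0.0 by convention.
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values as quoted in the abstract (already rounded to two decimals).
reported_precision = 0.90
reported_recall = 0.92

f1 = f1_score(reported_precision, reported_recall)
print(f"F1 = {f1:.2f}")
```

Under that assumption, the recomputed value rounds to the 0.91 quoted in the abstract.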


000      
00000naa a2200000 a 4500
001      
bmc24018848
003      
CZ-PrNML
005      
20241024111114.0
007      
ta
008      
241015s2024 enk f 000 0|eng||
009      
AR
024    7_
$a 10.1038/s41597-024-03841-9 $2 doi
035    __
$a (PubMed)39333508
040    __
$a ABA008 $b cze $d ABA008 $e AACR2
041    0_
$a eng
044    __
$a enk
100    1_
$a Vollmar, Melanie $u Protein Data Bank in Europe, European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridge, CB10 1SD, UK. melaniev@ebi.ac.uk $1 https://orcid.org/0000000291629159
245    10
$a Dataset from a human-in-the-loop approach to identify functionally important protein residues from literature / $c M. Vollmar, S. Tirunagari, D. Harrus, D. Armstrong, R. Gáborová, D. Gupta, MQL. Afonso, G. Evans, S. Velankar
520    9_
$a We present a novel system that leverages curators in the loop to develop a dataset and model for detecting structure features and functional annotations at residue-level from standard publication text. Our approach involves the integration of data from multiple resources, including PDBe, EuropePMC, PubMedCentral, and PubMed, combined with annotation guidelines from UniProt, and LitSuggest and HuggingFace models as tools in the annotation process. A team of seven annotators manually curated ten articles for named entities, which we utilized to train a starting PubmedBert model from HuggingFace. Using a human-in-the-loop annotation system, we iteratively developed the best model with commendable performance metrics of 0.90 for precision, 0.92 for recall, and 0.91 for F1-measure. Our proposed system showcases a successful synergy of machine learning techniques and human expertise in curating a dataset for residue-level functional annotations and protein structure features. The results demonstrate the potential for broader applications in protein research, bridging the gap between advanced machine learning models and the indispensable insights of domain experts.
650    _2
$a humans $7 D006801
650    12
$a proteins $x chemistry $7 D011506
650    12
$a machine learning $7 D000069550
650    _2
$a databases, protein $7 D030562
655    _2
$a journal articles $7 D016428
655    _2
$a dataset $7 D064886
700    1_
$a Tirunagari, Santosh $u Literature Services, European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridge, CB10 1SD, UK
700    1_
$a Harrus, Deborah $u Protein Data Bank in Europe, European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridge, CB10 1SD, UK
700    1_
$a Armstrong, David $u Protein Data Bank in Europe, European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridge, CB10 1SD, UK $1 https://orcid.org/0000000249861229
700    1_
$a Gáborová, Romana $u CEITEC - Central European Institute of Technology, Masaryk University, Kamenice 5, 62500, Brno, Czech Republic
700    1_
$a Gupta, Deepti $u Protein Data Bank in Europe, European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridge, CB10 1SD, UK
700    1_
$a Afonso, Marcelo Querino Lima $u Protein Data Bank in Europe, European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridge, CB10 1SD, UK
700    1_
$a Evans, Genevieve $u Protein Data Bank in Europe, European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridge, CB10 1SD, UK
700    1_
$a Velankar, Sameer $u Protein Data Bank in Europe, European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridge, CB10 1SD, UK $1 https://orcid.org/0000000284395964
773    0_
$w MED00208692 $t Scientific data $x 2052-4463 $g Vol. 11, No. 1 (2024), p. 1032
856    41
$u https://pubmed.ncbi.nlm.nih.gov/39333508 $y Pubmed
910    __
$a ABA008 $b sig $c sign $y - $z 0
990    __
$a 20241015 $b ABA008
991    __
$a 20241024111108 $b ABA008
999    __
$a ok $b bmc $g 2201611 $s 1230821
BAS    __
$a 3
BAS    __
$a PreBMC-MEDLINE
BMC    __
$a 2024 $b 11 $c 1 $d 1032 $e 20240927 $i 2052-4463 $m Scientific data $n Sci Data $x MED00208692
GRA    __
$a 945405 $p EC | EU Framework Programme for Research and Innovation H2020 | H2020 Priority Excellent Science | H2020 Marie Skłodowska-Curie Actions (H2020 Excellent Science - Marie Skłodowska-Curie Actions)
LZP    __
$a Pubmed-20241015
