Dataset from a human-in-the-loop approach to identify functionally important protein residues from literature
M. Vollmar, S. Tirunagari, D. Harrus, D. Armstrong, R. Gáborová, D. Gupta, MQL. Afonso, G. Evans, S. Velankar
Language: English. Country: England, United Kingdom
Document type: journal article, dataset
Grant support
945405
EC | EU Framework Programme for Research and Innovation H2020 | H2020 Priority Excellent Science | H2020 Marie Skłodowska-Curie Actions (H2020 Excellent Science - Marie Skłodowska-Curie Actions)
NLK
Directory of Open Access Journals
since 2014
Free Medical Journals
since 2014
Nature Open Access
since 2014-12-01
PubMed Central
since 2014
Europe PubMed Central
since 2014
ProQuest Central
since 2014-03-01
Open Access Digital Library
since 2014-01-01
Health & Medicine (ProQuest)
since 2014-03-01
ROAD: Directory of Open Access Scholarly Resources
since 2014
Springer Nature OA/Free Journals
since 2014-12-01
- MeSH
- Databases, Protein MeSH
- Humans MeSH
- Proteins * chemistry MeSH
- Machine Learning * MeSH
- Check Tag
- Humans MeSH
- Publication type
- Journal Article MeSH
- Dataset MeSH
We present a novel system that leverages curators in the loop to develop a dataset and model for detecting structure features and functional annotations at residue-level from standard publication text. Our approach involves the integration of data from multiple resources, including PDBe, EuropePMC, PubMedCentral, and PubMed, combined with annotation guidelines from UniProt, and LitSuggest and HuggingFace models as tools in the annotation process. A team of seven annotators manually curated ten articles for named entities, which we utilized to train a starting PubmedBert model from HuggingFace. Using a human-in-the-loop annotation system, we iteratively developed the best model with commendable performance metrics of 0.90 for precision, 0.92 for recall, and 0.91 for F1-measure. Our proposed system showcases a successful synergy of machine learning techniques and human expertise in curating a dataset for residue-level functional annotations and protein structure features. The results demonstrate the potential for broader applications in protein research, bridging the gap between advanced machine learning models and the indispensable insights of domain experts.
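The three metrics reported in the abstract can be checked against each other, because the F1-measure is the harmonic mean of precision and recall. The short Python sketch below is an illustration only and is not the authors' evaluation code. The function name `f1_score` and the variable names are hypothetical. Only the values 0.90 and 0.92 come from the abstract.

```python
def f1_score(precision: float, recall: float) -> float:
    """Return the harmonic mean of precision and recall (the F1-measure)."""
    if precision + recall == 0:
        # Both inputs are zero, so define F1 as zero instead of dividing by zero.
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values taken from the abstract; the names are illustrative.
reported_precision = 0.90
reported_recall = 0.92

f1 = f1_score(reported_precision, reported_recall)
print(f"F1 = {f1:.2f}")  # prints "F1 = 0.91"
```

Rounded to two decimals, the result agrees with the F1-measure of 0.91 stated in the abstract.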
Citations provided by Crossref.org
- 000
- 00000naa a2200000 a 4500
- 001
- bmc24018848
- 003
- CZ-PrNML
- 005
- 20241024111114.0
- 007
- ta
- 008
- 241015s2024 enk f 000 0|eng||
- 009
- AR
- 024 7_
- $a 10.1038/s41597-024-03841-9 $2 doi
- 035 __
- $a (PubMed)39333508
- 040 __
- $a ABA008 $b cze $d ABA008 $e AACR2
- 041 0_
- $a eng
- 044 __
- $a enk
- 100 1_
- $a Vollmar, Melanie $u Protein Data Bank in Europe, European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridge, CB10 1SD, UK. melaniev@ebi.ac.uk $1 https://orcid.org/0000000291629159
- 245 10
- $a Dataset from a human-in-the-loop approach to identify functionally important protein residues from literature / $c M. Vollmar, S. Tirunagari, D. Harrus, D. Armstrong, R. Gáborová, D. Gupta, MQL. Afonso, G. Evans, S. Velankar
- 520 9_
- $a We present a novel system that leverages curators in the loop to develop a dataset and model for detecting structure features and functional annotations at residue-level from standard publication text. Our approach involves the integration of data from multiple resources, including PDBe, EuropePMC, PubMedCentral, and PubMed, combined with annotation guidelines from UniProt, and LitSuggest and HuggingFace models as tools in the annotation process. A team of seven annotators manually curated ten articles for named entities, which we utilized to train a starting PubmedBert model from HuggingFace. Using a human-in-the-loop annotation system, we iteratively developed the best model with commendable performance metrics of 0.90 for precision, 0.92 for recall, and 0.91 for F1-measure. Our proposed system showcases a successful synergy of machine learning techniques and human expertise in curating a dataset for residue-level functional annotations and protein structure features. The results demonstrate the potential for broader applications in protein research, bridging the gap between advanced machine learning models and the indispensable insights of domain experts.
- 650 _2
- $a Humans $7 D006801
- 650 12
- $a Proteins $x chemistry $7 D011506
- 650 12
- $a Machine Learning $7 D000069550
- 650 _2
- $a Databases, Protein $7 D030562
- 655 _2
- $a Journal Article $7 D016428
- 655 _2
- $a Dataset $7 D064886
- 700 1_
- $a Tirunagari, Santosh $u Literature Services, European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridge, CB10 1SD, UK
- 700 1_
- $a Harrus, Deborah $u Protein Data Bank in Europe, European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridge, CB10 1SD, UK
- 700 1_
- $a Armstrong, David $u Protein Data Bank in Europe, European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridge, CB10 1SD, UK $1 https://orcid.org/0000000249861229
- 700 1_
- $a Gáborová, Romana $u CEITEC - Central European Institute of Technology, Masaryk University, Kamenice 5, 62500, Brno, Czech Republic
- 700 1_
- $a Gupta, Deepti $u Protein Data Bank in Europe, European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridge, CB10 1SD, UK
- 700 1_
- $a Afonso, Marcelo Querino Lima $u Protein Data Bank in Europe, European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridge, CB10 1SD, UK
- 700 1_
- $a Evans, Genevieve $u Protein Data Bank in Europe, European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridge, CB10 1SD, UK
- 700 1_
- $a Velankar, Sameer $u Protein Data Bank in Europe, European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridge, CB10 1SD, UK $1 https://orcid.org/0000000284395964
- 773 0_
- $w MED00208692 $t Scientific data $x 2052-4463 $g Vol. 11, No. 1 (2024), p. 1032
- 856 41
- $u https://pubmed.ncbi.nlm.nih.gov/39333508 $y Pubmed
- 910 __
- $a ABA008 $b sig $c sign $y - $z 0
- 990 __
- $a 20241015 $b ABA008
- 991 __
- $a 20241024111108 $b ABA008
- 999 __
- $a ok $b bmc $g 2201611 $s 1230821
- BAS __
- $a 3
- BAS __
- $a PreBMC-MEDLINE
- BMC __
- $a 2024 $b 11 $c 1 $d 1032 $e 20240927 $i 2052-4463 $m Scientific data $n Sci Data $x MED00208692
- GRA __
- $a 945405 $p EC | EU Framework Programme for Research and Innovation H2020 | H2020 Priority Excellent Science | H2020 Marie Skłodowska-Curie Actions (H2020 Excellent Science - Marie Skłodowska-Curie Actions)
- LZP __
- $a Pubmed-20241015