Dataset from a human-in-the-loop approach to identify functionally important protein residues from literature

2024 Sep 27; 11 (1): 1032. [epub] 20240927

Language: English. Country: England, United Kingdom. Medium: electronic

Document type: journal article, dataset

Persistent link: https://www.medvik.cz/link/pmid39333508

Grant support
945405 EC | EU Framework Programme for Research and Innovation H2020 | H2020 Priority Excellent Science | H2020 Marie Skłodowska-Curie Actions (H2020 Excellent Science - Marie Skłodowska-Curie Actions)

Links

PubMed 39333508
PubMed Central PMC11436914
DOI 10.1038/s41597-024-03841-9
PII: 10.1038/s41597-024-03841-9
Knihovny.cz E-resources

We present a system that keeps curators in the loop to develop a dataset and a model for detecting structural features and functional annotations at residue level in standard publication text. Our approach integrates data from multiple resources, including PDBe, Europe PMC, PubMed Central, and PubMed, with annotation guidelines from UniProt, and uses LitSuggest and HuggingFace models as tools in the annotation process. A team of seven annotators manually curated ten articles for named entities, which we used to train an initial PubMedBERT model from HuggingFace. Using a human-in-the-loop annotation system, we iteratively developed the best model, which achieved a precision of 0.90, a recall of 0.92, and an F1-measure of 0.91. The system shows that machine learning techniques and human expertise can be combined effectively to curate a dataset for residue-level functional annotations and protein structure features. The results indicate potential for broader applications in protein research, connecting advanced machine learning models with the insights of domain experts.
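
The three reported metrics are related: the F1-measure is conventionally the harmonic mean of precision and recall. The short sketch below checks that the reported values agree under that standard definition. The function name `f1_score` is illustrative only; the record does not say how the authors computed or averaged their scores, so this is a consistency check, not their evaluation code.

```python
def f1_score(precision: float, recall: float) -> float:
    """Return the harmonic mean of precision and recall (standard F1)."""
    if precision + recall == 0:
        # Avoid division by zero when both metrics are zero.
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values as reported in the abstract.
reported_precision = 0.90
reported_recall = 0.92

f1 = f1_score(reported_precision, reported_recall)
print(round(f1, 2))  # 0.91, matching the reported F1-measure
```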

