Dataset from a human-in-the-loop approach to identify functionally important protein residues from literature
Language English Country England, Great Britain Media electronic
Document type journal articles, dataset
Grant support
945405
EC | EU Framework Programme for Research and Innovation H2020 | H2020 Priority Excellent Science | H2020 Marie Skłodowska-Curie Actions (H2020 Excellent Science - Marie Skłodowska-Curie Actions)
PubMed
39333508
PubMed Central
PMC11436914
DOI
10.1038/s41597-024-03841-9
PII: 10.1038/s41597-024-03841-9
- MeSH
- Databases, Protein MeSH
- Humans MeSH
- Proteins * chemistry MeSH
- Machine Learning * MeSH
- Check Tag
- Humans MeSH
- Publication type
- Journal Article MeSH
- Dataset MeSH
- Substance names
- Proteins * MeSH
We present a novel system that keeps curators in the loop to develop a dataset and a model for detecting structural features and functional annotations at the residue level in standard publication text. Our approach integrates data from multiple resources, including PDBe, Europe PMC, PubMed Central, and PubMed, with annotation guidelines from UniProt, and uses LitSuggest and HuggingFace models as tools in the annotation process. A team of seven annotators manually curated ten articles for named entities, which we used to train a starting PubMedBERT model from HuggingFace. Using a human-in-the-loop annotation system, we iteratively developed the best model, which achieved a precision of 0.90, a recall of 0.92, and an F1-measure of 0.91. The system shows that machine learning techniques and human expertise can be combined successfully to curate a dataset of residue-level functional annotations and protein structure features. The results point to broader applications in protein research, connecting advanced machine learning models with the insights of domain experts.
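As a minimal sketch of how the three reported scores relate, the Python below computes the F1-measure as the harmonic mean of precision and recall and checks it against the figures in the abstract. The record does not describe the evaluation procedure, so the exact-match span scoring in `span_prf`, the function names, and the toy spans with their labels are all assumptions made for illustration, not the authors' method.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def span_prf(gold: set, predicted: set) -> tuple:
    """Exact-match precision, recall and F1 over (start, end, label) spans.

    Assumption: a predicted entity counts as correct only when its
    boundaries and label both match a gold entity exactly.
    """
    true_pos = len(gold & predicted)
    precision = true_pos / len(predicted) if predicted else 0.0
    recall = true_pos / len(gold) if gold else 0.0
    return precision, recall, f1_score(precision, recall)


# Precision and recall as reported in the abstract.
reported_f1 = round(f1_score(0.90, 0.92), 2)

# Hypothetical toy spans: offsets and label names are invented.
gold = {(0, 6, "residue"), (10, 18, "protein"), (25, 31, "mutant")}
pred = {(0, 6, "residue"), (10, 18, "protein"), (40, 44, "site")}
toy_p, toy_r, toy_f1 = span_prf(gold, pred)
```

Under these assumptions, `reported_f1` rounds to the published value of 0.91, so the three figures in the abstract are mutually consistent.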