Hybrid protein-ligand binding residue prediction with protein language models: does the structure matter?
Language: English; Country: England, United Kingdom; Medium: print
Document type: journal articles
Grant support
23-07349S
Czech Science Foundation
PubMed
40742755
PubMed Central
PMC12377911
DOI
10.1093/bioinformatics/btaf431
PII: 8220314
- MeSH
- protein databases MeSH
- deep learning MeSH
- protein conformation MeSH
- ligands MeSH
- molecular models MeSH
- neural networks MeSH
- proteins * chemistry metabolism MeSH
- protein binding MeSH
- binding sites MeSH
- computational biology * methods MeSH
- Publication type
- journal articles MeSH
- Substance names
- ligands MeSH
- proteins * MeSH
MOTIVATION: Predicting protein-ligand binding sites is crucial for studying protein interactions, with applications in biotechnology and drug discovery. Two distinct paradigms have emerged for this purpose: sequence-based methods, which leverage protein sequence information, and structure-based methods, which rely on the three-dimensional (3D) structure of the protein. Here, we analyze a hybrid approach that combines the strengths of both paradigms by integrating two recent deep learning architectures: protein language models (pLMs) from the sequence-based paradigm and Graph Neural Networks (GNNs) from the structure-based paradigm. Specifically, we construct a residue-level Graph Attention Network (GAT) model based on the protein's 3D structure that uses pre-trained pLM embeddings as node features. This integration enables us to study how the sequential information encoded in the protein sequence and the spatial relationships within the protein structure jointly affect model performance.

RESULTS: Using a benchmark dataset covering a range of ligands and ligand types, we show that structure information consistently enhances the predictive power of the baselines in absolute terms. Nevertheless, as more complex pLMs are used to represent node features, the relative impact of the structure information represented by the GNN architecture diminishes. These observations suggest that although using the experimental protein structure almost always improves the accuracy of binding site prediction, complex pLMs still contain structural information that leads to good predictive performance even without the 3D structure.

AVAILABILITY AND IMPLEMENTATION: The datasets generated and/or analyzed during the current study, as well as pretrained models, are available on Zenodo at https://zenodo.org/records/15184302. The source code used to generate the results of the current study is available in the GitHub repository https://github.com/hamzagamouh/pt-lm-gnn and on Zenodo at https://zenodo.org/records/15192327.
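
The hybrid architecture summarized in the abstract can be illustrated with a short sketch. This is not the authors' implementation (that is available in the repository linked above); it only shows the general idea of building a residue-level graph from 3D coordinates and applying one single-head graph attention layer to per-residue embeddings. The 8 Å distance cutoff, the use of one representative coordinate per residue, the layer sizes, the random stand-in embeddings, and the names `contact_graph`, `gat_layer`, and `demo` are all assumptions made for illustration.

```python
import numpy as np


def contact_graph(coords, cutoff=8.0):
    """Boolean adjacency matrix: True where two residues lie closer than
    `cutoff` (in angstroms, an assumed value). Self-loops are included
    because every residue is at distance 0 from itself."""
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    return dist < cutoff


def gat_layer(h, adj, W, a, slope=0.2):
    """One single-head graph attention layer.

    h   : (N, F) node features (here: stand-ins for pLM embeddings)
    adj : (N, N) boolean adjacency with self-loops
    W   : (F, F_out) shared linear projection
    a   : (2 * F_out,) attention vector
    Returns the updated features (N, F_out) and the attention matrix (N, N).
    """
    z = h @ W
    f_out = z.shape[1]
    # a . [z_i || z_j] splits into a source term and a destination term.
    src = z @ a[:f_out]
    dst = z @ a[f_out:]
    e = src[:, None] + dst[None, :]
    e = np.where(e > 0, e, slope * e)           # LeakyReLU
    e = np.where(adj, e, -np.inf)               # attend only to neighbours
    e = e - e.max(axis=1, keepdims=True)        # numerical stability
    alpha = np.exp(e)
    alpha = alpha / alpha.sum(axis=1, keepdims=True)
    return alpha @ z, alpha


def demo(n_res=20, emb_dim=16, hidden=8, seed=0):
    """Run the sketch on random data and return per-residue scores."""
    rng = np.random.default_rng(seed)
    coords = rng.normal(scale=10.0, size=(n_res, 3))  # fake residue positions
    emb = rng.normal(size=(n_res, emb_dim))           # fake pLM embeddings
    adj = contact_graph(coords)
    W = rng.normal(scale=0.1, size=(emb_dim, hidden))
    a = rng.normal(scale=0.1, size=2 * hidden)
    h, _ = gat_layer(emb, adj, W, a)
    w_out = rng.normal(scale=0.1, size=hidden)        # untrained output layer
    return 1.0 / (1.0 + np.exp(-(h @ w_out)))         # sigmoid -> (0, 1)


if __name__ == "__main__":
    print(demo().round(3))
```

Because the weights are random and untrained, the scores carry no biological meaning. In a real pipeline the embeddings would come from a pre-trained pLM, the coordinates from a structure file, and the parameters from training against labelled binding residues.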
Faculty of Mathematics and Physics Charles University 118 00 Prague Czech Republic
Faculty of Science Charles University 128 00 Prague Czech Republic