Hybrid protein-ligand binding residue prediction with protein language models: does the structure matter?

2025 Aug 02; 41 (8).

Language English Country England, Great Britain Medium print

Document type journal articles

Persistent link   https://www.medvik.cz/link/pmid40742755

Grant support
23-07349S Czech Science Foundation

MOTIVATION: Predicting protein-ligand binding sites is crucial for studying protein interactions, with applications in biotechnology and drug discovery. Two distinct paradigms have emerged for this purpose: sequence-based methods, which leverage protein sequence information, and structure-based methods, which rely on the three-dimensional (3D) structure of the protein. Here, we analyze a hybrid approach that combines the strengths of both paradigms by integrating two recent deep learning architectures: protein language models (pLMs) from the sequence-based paradigm and Graph Neural Networks (GNNs) from the structure-based paradigm. Specifically, we construct a residue-level Graph Attention Network (GAT) model based on the protein's 3D structure that uses pre-trained pLM embeddings as node features. This integration enables us to study how the sequential information encoded in the protein sequence and the spatial relationships within the protein structure jointly affect model performance.

RESULTS: Using a benchmark dataset covering a range of ligands and ligand types, we show that using the structure information consistently enhances the predictive power of the baselines in absolute terms. Nevertheless, as more complex pLMs are used to represent node features, the relative impact of the structure information represented by the GNN architecture diminishes. These observations suggest that although the use of the experimental protein structure almost always improves the accuracy of binding site prediction, complex pLMs still contain structural information that leads to good predictive performance even without the use of 3D structure.

AVAILABILITY AND IMPLEMENTATION: The datasets generated and/or analyzed during the current study, as well as pretrained models, are available on Zenodo at https://zenodo.org/records/15184302. The source code used to generate the results of the current study is available in the GitHub repository https://github.com/hamzagamouh/pt-lm-gnn and on Zenodo at https://zenodo.org/records/15192327.
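
The hybrid idea described in the abstract (a residue-level graph built from the 3D structure, with pLM embeddings as node features, processed by graph attention) can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation, which lives in the repository linked above. Everything specific here is an assumption made for illustration: the 8 Å distance cutoff, the use of one representative atom per residue, the single attention head, the helper names `build_residue_graph` and `gat_layer`, and the random vectors that stand in for real pLM embeddings and trained weights.

```python
import math
import random


def build_residue_graph(coords, cutoff=8.0):
    """Adjacency lists: residues i and j are linked when their representative
    atoms are within `cutoff` angstroms (assumed value). Every residue also
    links to itself so that it attends to its own features."""
    n = len(coords)
    neighbours = [[i] for i in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if math.dist(coords[i], coords[j]) <= cutoff:
                neighbours[i].append(j)
                neighbours[j].append(i)
    return neighbours


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def gat_layer(features, neighbours, W, a_src, a_dst, slope=0.2):
    """One single-head graph attention layer, without an output nonlinearity.

    Score for edge (i, j): LeakyReLU(a_src . W x_i + a_dst . W x_j), i.e. the
    usual a^T [W x_i || W x_j] with the attention vector split in two halves.
    """
    h = [[dot(row, x) for row in W] for x in features]  # project node features
    out = []
    for i, nbrs in enumerate(neighbours):
        scores = []
        for j in nbrs:
            s = dot(a_src, h[i]) + dot(a_dst, h[j])
            scores.append(s if s > 0 else slope * s)  # LeakyReLU
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]  # numerically stable softmax
        total = sum(exps)
        alphas = [e / total for e in exps]
        # Attention-weighted sum of the projected neighbour features.
        out.append([sum(a * h[j][k] for a, j in zip(alphas, nbrs))
                    for k in range(len(W))])
    return out


# Toy input: four residues on a line; the last one is far from the rest.
coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (30.0, 0.0, 0.0)]
rng = random.Random(0)
# Random 4-dimensional vectors standing in for per-residue pLM embeddings.
embeddings = [[rng.uniform(-1, 1) for _ in range(4)] for _ in coords]
# Untrained parameters: a 4 -> 3 projection and the two attention half-vectors.
W = [[rng.uniform(-1, 1) for _ in range(4)] for _ in range(3)]
a_src = [rng.uniform(-1, 1) for _ in range(3)]
a_dst = [rng.uniform(-1, 1) for _ in range(3)]

neighbours = build_residue_graph(coords)
updated = gat_layer(embeddings, neighbours, W, a_src, a_dst)
print(neighbours)
print(len(updated), len(updated[0]))
```

In this sketch the structure enters only through `neighbours`: a residue with no spatial neighbours (the last one in the toy input) keeps just its own projected embedding, which is effectively what a sequence-only baseline sees for every residue. A trained model would add a per-residue classifier on top of the updated features to score binding residues.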

