Record sourced from PubMed

Hybrid protein-ligand binding residue prediction with protein language models: does the structure matter?

Author: Gamouh, Hamza. Faculty of Mathematics and Physics, Charles University, 118 00 Prague, Czech Republic
Author: Novotný, Marian. Faculty of Science, Charles University, 128 00 Prague, Czech Republic
Author: Hoksza, David. Faculty of Mathematics and Physics, Charles University, 118 00 Prague, Czech Republic

Bioinformatics (Oxford, England). 2025 Aug 02; 41(8).

Source: Bioinformatics
ISSN 1367-4811 | 1367-4803

Language: English. Country: United Kingdom, England. Medium: print

Document type: journal article

Persistent link: https://www.medvik.cz/link/pmid40742755

Grant support
23-07349S Czech Science Foundation

Online full text

PubMed 40742755
PubMed Central PMC12377911
DOI 10.1093/bioinformatics/btaf431
PII: 8220314

MOTIVATION: Predicting protein-ligand binding sites is crucial for studying protein interactions, with applications in biotechnology and drug discovery. Two distinct paradigms have emerged for this purpose: sequence-based methods, which leverage protein sequence information, and structure-based methods, which rely on the three-dimensional (3D) structure of the protein. Here, we analyze a hybrid approach that combines the strengths of both paradigms by integrating two recent deep learning architectures: protein language models (pLMs) from the sequence-based paradigm and Graph Neural Networks (GNNs) from the structure-based paradigm. Specifically, we construct a residue-level Graph Attention Network (GAT) model based on the protein's 3D structure that uses pre-trained pLM embeddings as node features. This integration enables us to study how the sequential information encoded in the protein sequence and the spatial relationships within the protein structure jointly affect model performance.

RESULTS: Using a benchmark dataset covering a range of ligands and ligand types, we show that structure information consistently enhances the predictive power of the baselines in absolute terms. Nevertheless, as more complex pLMs are used to represent node features, the relative impact of the structure information represented by the GNN architecture diminishes. These observations suggest that although using the experimental protein structure almost always improves the accuracy of binding site prediction, complex pLMs still contain structural information that leads to good predictive performance even without the 3D structure.

AVAILABILITY AND IMPLEMENTATION: The datasets generated and/or analyzed during the current study, as well as the pretrained models, are available on Zenodo at https://zenodo.org/records/15184302. The source code used to generate the results of the current study is available in the GitHub repository https://github.com/hamzagamouh/pt-lm-gnn and on Zenodo at https://zenodo.org/records/15192327.
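
To make the hybrid idea in the abstract concrete, here is a minimal sketch of its two ingredients: a residue-level graph derived from 3D coordinates, and a single-head graph attention layer applied to per-residue feature vectors. This is not the authors' implementation (the repository linked above is the authoritative source). The 8 Å distance cutoff, the function names `build_residue_graph` and `gat_layer`, the toy coordinates, and the 2-dimensional stand-ins for pLM embeddings are all assumptions chosen for illustration.

```python
import math


def build_residue_graph(coords, cutoff=8.0):
    """Link residues whose representative atoms lie within `cutoff` angstroms.

    Returns one neighbor list per residue; each list starts with the residue
    itself (a self-loop). The cutoff value is an illustrative assumption.
    """
    n = len(coords)
    neighbors = [[i] for i in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if math.dist(coords[i], coords[j]) <= cutoff:
                neighbors[i].append(j)
                neighbors[j].append(i)
    return neighbors


def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def leaky_relu(x, slope=0.2):
    return x if x > 0 else slope * x


def gat_layer(features, neighbors, W, a):
    """One single-head graph attention layer.

    For residue i: score each neighbor j with LeakyReLU(a . [W h_i || W h_j]),
    softmax the scores over the neighborhood, and return the attention-weighted
    sum of the transformed neighbor features.
    """
    z = [matvec(W, h) for h in features]
    out = []
    for i, nbrs in enumerate(neighbors):
        scores = [
            leaky_relu(sum(ak * v for ak, v in zip(a, z[i] + z[j])))
            for j in nbrs
        ]
        top = max(scores)  # subtract the max for a numerically stable softmax
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        alphas = [e / total for e in exps]
        out.append(
            [sum(al * z[j][k] for al, j in zip(alphas, nbrs))
             for k in range(len(z[i]))]
        )
    return out


# Toy input: four residues on a line; the last one is far from the others.
coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (30.0, 0.0, 0.0)]
# Hypothetical 2-dimensional stand-ins for per-residue pLM embeddings.
embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [5.0, 5.0]]

neighbors = build_residue_graph(coords)
W = [[1.0, 0.0], [0.0, 1.0]]   # identity weights, untrained
a = [0.0, 0.0, 0.0, 0.0]       # zero attention vector gives uniform attention
updated = gat_layer(embeddings, neighbors, W, a)
```

With identity weights and a zero attention vector, the layer reduces to averaging each residue's features over its spatial neighborhood, which makes the effect of the structure easy to see: the isolated fourth residue keeps its own features, while the first three are mixed. In a trained model, `W` and `a` would be learned, and the updated features would feed a per-residue binding/non-binding classifier.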



