Leveraging large language models for literature-driven prioritization of protein binding pockets
Language English
Country Great Britain, England
Media print
Document type Journal Article
Grant support
101101923
Ministry of Education
PubMed
40795239
PubMed Central
PMC12371332
DOI
10.1093/bioinformatics/btaf449
PII: 8225722
MeSH
- Databases, Protein
- Drug Discovery* / methods
- Proteins* / chemistry, metabolism
- Software
- Protein Binding
- Binding Sites
- Large Language Models
- Computational Biology* / methods
Publication type
- Journal Article
Names of Substances
- Proteins*
MOTIVATION: Accurately identifying and prioritizing protein binding pockets is a foundational element of small-molecule drug discovery. Defining these known pockets currently relies on a laborious manual process of extracting key residue data from selected publications, reconciling inconsistent terminology, and independently computing volumetric representations. This manual curation, performed to ensure biological relevance, is time-consuming and error-prone, and it represents a major bottleneck for efficient, high-throughput drug discovery.

RESULTS: We present a novel approach for the identification and prioritization of protein binding pockets for small molecules that combines geometric pocket detection with large language models (LLMs). Our method uses Fpocket to generate candidate pockets, which are then validated against published experimental data. These data are extracted from research articles by an LLM using a series of prompts fine-tuned to identify and extract residue-level information associated with experimentally confirmed binding sites. We developed a curated benchmark dataset of diverse proteins and associated literature to train the LLM and to evaluate its performance in paper relevance assessment and pocket extraction.

AVAILABILITY AND IMPLEMENTATION: The developed benchmark dataset and methodology are freely available at the GitHub repository (https://github.com/receptor-ai/LLM-benchmark-dataset) and Zenodo (DOI: 10.5281/zenodo.15798647).
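The validation step described under RESULTS, in which geometric candidate pockets are checked against residues reported in the literature, can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not state how pockets are matched or scored, so the coverage score, the `min_coverage` threshold, the `Pocket` type, and the residue labels below are all assumptions made for illustration. Parsing Fpocket output and calling an LLM are out of scope here; both are represented by plain residue sets.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Pocket:
    """A candidate pocket, e.g. one produced by a geometric detector."""
    pocket_id: int
    residues: frozenset[str]  # hypothetical label format: "chain:RESnum"


def coverage(pocket_residues: frozenset[str],
             literature_residues: frozenset[str]) -> float:
    """Fraction of literature-reported residues that line the pocket.

    Assumed scoring rule for this sketch; the source does not specify one.
    """
    if not literature_residues:
        return 0.0
    shared = pocket_residues & literature_residues
    return len(shared) / len(literature_residues)


def rank_pockets(pockets: list[Pocket],
                 literature_residues: frozenset[str],
                 min_coverage: float = 0.5) -> list[tuple[Pocket, float]]:
    """Keep pockets whose coverage meets the threshold, best first."""
    scored = [(p, coverage(p.residues, literature_residues)) for p in pockets]
    kept = [(p, s) for p, s in scored if s >= min_coverage]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


# Made-up example data: three candidate pockets and one residue set that an
# LLM might have extracted from a paper describing a confirmed binding site.
candidates = [
    Pocket(1, frozenset({"A:ASP113", "A:SER203", "A:PHE290"})),
    Pocket(2, frozenset({"A:LYS10", "A:GLU20"})),
    Pocket(3, frozenset({"A:ASP113", "A:TYR308"})),
]
reported = frozenset({"A:ASP113", "A:SER203", "A:PHE290", "A:ASN312"})

for pocket, score in rank_pockets(candidates, reported):
    print(pocket.pocket_id, score)
```

Coverage is normalized by the size of the literature set rather than the pocket, so a large candidate pocket is not penalized for containing extra residues. A symmetric measure such as the Jaccard index would be an equally plausible choice under different assumptions.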