This record comes from PubMed

Leveraging large language models for literature-driven prioritization of protein binding pockets

2025 Aug 02; 41(8)

Language: English
Country: Great Britain, England
Media: print
Document type: Journal Article

Grant support
101101923 Ministry of Education

MOTIVATION: Accurately identifying and prioritizing protein binding pockets is a foundational element of small-molecule drug discovery. Defining these known pockets currently relies on a laborious manual process of extracting key residue data from selected publications, reconciling inconsistent terminology, and independently computing volumetric representations. This manual curation to ensure biological relevance is time-consuming and error-prone, and it represents a major bottleneck for efficient, high-throughput drug discovery.

RESULTS: We present a novel approach for the identification and prioritization of protein binding pockets for small molecules that combines geometric pocket detection with large language models (LLMs). Our method uses Fpocket to generate candidate pockets, which are then validated against published experimental data. These data are extracted from research articles by an LLM using a series of prompts fine-tuned to identify and extract residue-level information associated with experimentally confirmed binding sites. We developed a curated benchmark dataset of diverse proteins and associated literature to train and evaluate the LLM's performance in paper relevance assessment and pocket extraction.

AVAILABILITY AND IMPLEMENTATION: The developed benchmark dataset and methodology are freely available at the GitHub repository (https://github.com/receptor-ai/LLM-benchmark-dataset) and Zenodo (DOI: 10.5281/zenodo.15798647).
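The validation step described under RESULTS, in which geometrically detected candidate pockets are checked against residues reported in the literature, can be illustrated with a minimal sketch. The abstract does not state how the comparison is scored, so everything below is an assumption made for illustration: the `Pocket` type, the `coverage` and `prioritize` helpers, the residue-overlap score, the 0.5 threshold, and the toy residue numbers are all hypothetical and are not taken from the published method or from Fpocket's output format.

```python
from dataclasses import dataclass

# A residue is identified here by (chain ID, residue number); this is an assumed encoding.
Residue = tuple[str, int]


@dataclass(frozen=True)
class Pocket:
    """Hypothetical container for one candidate pocket from a geometric detector."""
    pocket_id: int
    residues: frozenset[Residue]


def coverage(pocket: Pocket, literature_site: frozenset[Residue]) -> float:
    """Fraction of literature-reported residues that line the candidate pocket."""
    if not literature_site:
        return 0.0
    return len(pocket.residues & literature_site) / len(literature_site)


def prioritize(
    pockets: list[Pocket],
    literature_site: frozenset[Residue],
    min_coverage: float = 0.5,  # assumed cutoff, not from the source
) -> list[tuple[int, float]]:
    """Keep pockets meeting the cutoff and rank them by descending coverage."""
    scored = [(p.pocket_id, coverage(p, literature_site)) for p in pockets]
    kept = [s for s in scored if s[1] >= min_coverage]
    return sorted(kept, key=lambda s: s[1], reverse=True)


# Toy data: three candidate pockets and one binding site "extracted" from a paper.
candidates = [
    Pocket(1, frozenset({("A", 10), ("A", 11), ("A", 45), ("A", 46)})),
    Pocket(2, frozenset({("A", 100), ("A", 101), ("A", 102)})),
    Pocket(3, frozenset({("A", 45), ("A", 46), ("A", 47), ("A", 80)})),
]
literature_site = frozenset({("A", 45), ("A", 46), ("A", 47), ("A", 80)})

ranked = prioritize(candidates, literature_site)
print(ranked)  # [(3, 1.0), (1, 0.5)]
```

In this toy case pocket 3 contains every reported residue, pocket 1 contains half of them, and pocket 2 is discarded. A real pipeline would need to reconcile residue numbering between the publication and the structure before any such comparison.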
