CryptoBench: cryptic protein-ligand binding sites dataset and benchmark
Language English Country Great Britain, England Medium print
Document type journal articles
Grant support
23-07349S
Czech Science Foundation
PubMed
39693053
PubMed Central
PMC11725321
DOI
10.1093/bioinformatics/btae745
PII: 7927823
- MeSH
- benchmarking MeSH
- databases, protein MeSH
- protein conformation MeSH
- ligands MeSH
- proteins * chemistry metabolism MeSH
- software * MeSH
- protein binding MeSH
- binding sites MeSH
- computational biology * methods MeSH
- Publication type
- journal articles MeSH
- Substance names
- ligands MeSH
- proteins * MeSH
MOTIVATION: Structure-based methods for detecting protein-ligand binding sites play a crucial role in various domains, from fundamental research to biomedical applications. However, current prediction methods often rely on holo (ligand-bound) protein conformations for training and evaluation, overlooking the significance of the apo (ligand-free) states. This oversight is particularly problematic for cryptic binding sites (CBSs), where holo-based assessment yields unrealistic performance expectations.
RESULTS: To advance development in this domain, we introduce CryptoBench, a benchmark dataset tailored for training and evaluating novel CBS prediction methods. CryptoBench is built from a large collection of apo-holo protein pairs, grouped by UniProt ID, clustered by sequence identity, and filtered to contain only structures with substantial structural change in the binding site. CryptoBench comprises 1107 structures with predefined cross-validation splits, making it the most extensive CBS dataset to date. To establish a performance baseline, we measured the predictive power of sequence- and structure-based CBS residue prediction methods on the benchmark. As structure-based methods, we selected PocketMiner, the state-of-the-art representative for CBS detection, and P2Rank, a widely used method for general binding site prediction that is not specifically tailored to cryptic sites. For the sequence-based approach, we trained a neural network to classify binding residues using protein language model embeddings. Our sequence-based approach outperformed PocketMiner and P2Rank across key metrics, including area under the curve, area under the precision-recall curve, Matthews correlation coefficient, and F1 score. These results provide a baseline for future CBS, and potentially also non-CBS, prediction efforts that use CryptoBench as a foundation.
AVAILABILITY AND IMPLEMENTATION: The CryptoBench dataset, including the benchmark model, is available on Open Science Framework at https://osf.io/pz4a9/. The code and tutorial are available in the GitHub repository at https://github.com/skrhakv/CryptoBench/.
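The abstract compares methods on four residue-level metrics: area under the curve, area under the precision-recall curve, Matthews correlation coefficient, and F1 score. The sketch below shows one common way to compute them, assuming each residue carries a binary label (1 = part of a cryptic binding site) and a predicted score. The function names, the toy data, and the 0.5 decision threshold are illustrative assumptions and are not taken from the CryptoBench code; the area under the curve is interpreted here as the ROC AUC, and the area under the precision-recall curve is approximated by average precision.

```python
import math


def confusion_counts(labels, scores, threshold=0.5):
    """Return (tp, fp, tn, fn) after thresholding the scores."""
    tp = fp = tn = fn = 0
    for y, s in zip(labels, scores):
        predicted = 1 if s >= threshold else 0
        if predicted == 1 and y == 1:
            tp += 1
        elif predicted == 1 and y == 0:
            fp += 1
        elif predicted == 0 and y == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall; 0.0 if undefined."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def mcc(tp, fp, tn, fn):
    """Matthews correlation coefficient; 0.0 if any marginal is empty."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


def roc_auc(labels, scores):
    """Probability that a random positive outscores a random negative.

    Ties count as half a win. Quadratic in the number of residues,
    which is fine for a sketch but slow for large inputs.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(labels, scores):
    """Mean of the precision values at the rank of each positive.

    Residues are ranked by descending score; tied scores are not
    treated specially in this sketch.
    """
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    hits = 0
    total = 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0


# Toy example: eight residues, three of them labelled as binding.
labels = [1, 0, 1, 0, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.6, 0.2, 0.1]

tp, fp, tn, fn = confusion_counts(labels, scores)
print("F1:", f1_score(tp, fp, fn))
print("MCC:", mcc(tp, fp, tn, fn))
print("ROC AUC:", roc_auc(labels, scores))
print("Average precision:", average_precision(labels, scores))
```

F1 and the Matthews correlation coefficient depend on the chosen threshold, whereas the two area-based metrics summarise the ranking of residues over all thresholds. Binding residues are typically a small minority of all residues, which is one reason precision-recall and correlation-based metrics are commonly reported alongside ROC AUC.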