Most cited article - PubMed ID 30109435
P2Rank: machine learning based tool for rapid and accurate prediction of ligand binding sites from protein structure
MOTIVATION: Structure-based methods for detecting protein-ligand binding sites play a crucial role in various domains, from fundamental research to biomedical applications. However, current prediction methodologies often rely on holo (ligand-bound) protein conformations for training and evaluation, overlooking the significance of the apo (ligand-free) states. This oversight is particularly problematic in the case of cryptic binding sites (CBSs), where holo-based assessment yields unrealistic performance expectations.
RESULTS: To advance the development in this domain, we introduce CryptoBench, a benchmark dataset tailored for training and evaluating novel CBS prediction methodologies. CryptoBench is constructed upon a large collection of apo-holo protein pairs, grouped by UniProt ID, clustered by sequence identity, and filtered to contain only structures with substantial structural change in the binding site. CryptoBench comprises 1107 structures with predefined cross-validation splits, making it the most extensive CBS dataset to date. To establish a performance baseline, we measured the predictive power of sequence- and structure-based CBS residue prediction methods using the benchmark. We selected PocketMiner as the state-of-the-art representative of the structure-based methods for CBS detection, and P2Rank, a widely used structure-based method for general binding site prediction that is not specifically tailored for cryptic sites. For sequence-based approaches, we trained a neural network to classify binding residues using protein language model embeddings. Our sequence-based approach outperformed PocketMiner and P2Rank across key metrics, including area under the curve, area under the precision-recall curve, Matthews correlation coefficient, and F1 scores. These results provide baseline benchmark results for future CBS and potentially also non-CBS prediction endeavors, leveraging CryptoBench as the foundational platform for further advancements in the field.
AVAILABILITY AND IMPLEMENTATION: The CryptoBench dataset, including the benchmark model, is available on Open Science Framework at https://osf.io/pz4a9/. The code and tutorial are available in the GitHub repository at https://github.com/skrhakv/CryptoBench/.
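As a rough illustration of two of the residue-level metrics named in the abstract, the following sketch computes the F1 score and the Matthews correlation coefficient from binary labels. The toy labels and the helper names (`confusion_counts`, `f1_score`, `mcc`) are assumptions made for this example and are not taken from the CryptoBench code.

```python
import math

def confusion_counts(y_true, y_pred):
    """Count TP, FP, TN, FN for binary labels (1 = binding residue)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, fp, tn, fn

def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall; 0.0 if undefined."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def mcc(tp, fp, tn, fn):
    """Matthews correlation coefficient; 0.0 if any marginal is empty."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

# Hypothetical per-residue labels for one small protein.
y_true = [1, 1, 0, 0, 0, 1, 0, 0]
y_pred = [1, 0, 0, 0, 1, 1, 0, 0]

tp, fp, tn, fn = confusion_counts(y_true, y_pred)
f1_value = f1_score(tp, fp, fn)
mcc_value = mcc(tp, fp, tn, fn)
print(f"F1 = {f1_value:.3f}, MCC = {mcc_value:.3f}")
```

Because binding residues are usually a small minority of all residues, MCC is often reported alongside F1: it also accounts for true negatives.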
- MeSH
- Benchmarking MeSH
- Databases, Protein MeSH
- Protein Conformation MeSH
- Ligands MeSH
- Proteins * chemistry metabolism MeSH
- Software * MeSH
- Protein Binding MeSH
- Binding Sites MeSH
- Computational Biology * methods MeSH
- Publication type
- Journal Article MeSH
- Names of Substances
- Ligands MeSH
- Proteins * MeSH
Next-generation sequencing technology has created many new opportunities for clinical diagnostics, but it faces the challenge of functional annotation of identified mutations. Various algorithms have been developed to predict the impact of missense variants that influence oncogenic drivers. However, computational pipelines that handle biological data must integrate multiple software tools, which can add complexity and hinder non-specialist users from accessing the pipeline. Here, we have developed an online user-friendly web server tool PredictONCO that is fully automated and has a low barrier to access. The tool models the structure of the mutant protein in the first step. Next, it calculates the protein stability change, pocket level information, evolutionary conservation, and changes in ionisation of catalytic amino acid residues, and uses them as the features in the machine-learning predictor. The XGBoost-based predictor was validated on an independent subset of held-out data, demonstrating areas under the receiver operating characteristic curve (ROC) of 0.97 and 0.94, and the average precision from the precision-recall curve of 0.99 and 0.94 for structure-based and sequence-based predictions, respectively. Finally, PredictONCO calculates the docking results of small molecules approved by regulatory authorities. We demonstrate the applicability of the tool by presenting its usage for variants in two cancer-associated proteins, cellular tumour antigen p53 and fibroblast growth factor receptor FGFR1. Our free web tool will assist with the interpretation of data from next-generation sequencing and navigate treatment strategies in clinical oncology: https://loschmidt.chemi.muni.cz/predictonco/.
- Keywords
- Automation, Machine learning, Mutation, Next-generation sequencing, Oncogenicity, Precision oncology, Prediction, Treatment, Virtual screening, Webserver
- Publication type
- Journal Article MeSH
Tunnels in enzymes with buried active sites are key structural features allowing the entry of substrates and the release of products, thus contributing to the catalytic efficiency. Targeting the bottlenecks of protein tunnels is also a powerful protein engineering strategy. However, the identification of functional tunnels in multiple protein structures is a non-trivial task that can only be addressed computationally. We present a pipeline integrating automated structural analysis with an in-house machine-learning predictor for the annotation of protein pockets, followed by the calculation of the energetics of ligand transport via biochemically relevant tunnels. A thorough validation using eight distinct molecular systems revealed that CaverDock analysis of ligand un/binding is on par with time-consuming molecular dynamics simulations, but much faster. The optimized and validated pipeline was applied to annotate more than 17,000 cognate enzyme-ligand complexes. Analysis of ligand un/binding energetics indicates that the top priority tunnel has the most favourable energies in 75% of cases. Moreover, energy profiles of cognate ligands revealed that a simple geometry analysis can correctly identify tunnel bottlenecks only in 50% of cases. Our study provides essential information for the interpretation of results from tunnel calculation and energy profiling in mechanistic enzymology and protein engineering. We formulated several simple rules allowing identification of biochemically relevant tunnels based on the binding pockets, tunnel geometry, and ligand transport energy profiles.
SCIENTIFIC CONTRIBUTIONS: The pipeline introduced in this work allows for the detailed analysis of a large set of protein-ligand complexes, focusing on transport pathways. We are introducing a novel predictor for determining the relevance of binding pockets for tunnel calculation. For the first time in the field, we present a high-throughput energetic analysis of ligand binding and unbinding, showing that approximate methods for these simulations can identify additional mutagenesis hotspots in enzymes compared to purely geometrical methods. The predictor is included in the supplementary material and can also be accessed at https://github.com/Faranehhad/Large-Scale-Pocket-Tunnel-Annotation.git. The tunnel data calculated in this study has been made publicly available as part of the ChannelsDB 2.0 database, accessible at https://channelsdb2.biodata.ceitec.cz/.
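The difference between a geometric and an energetic bottleneck can be made concrete with a small sketch. It assumes a tunnel is sampled at evenly ordered points from the active site outward, with a radius and a ligand binding energy recorded at each point. The profiles, the function names and the one-step tolerance are all hypothetical and do not reproduce the pipeline's actual definitions.

```python
def geometric_bottleneck(radii):
    """Index of the narrowest point along the tunnel (minimum radius)."""
    return min(range(len(radii)), key=lambda i: radii[i])

def energetic_bottleneck(energies):
    """Index of the highest-energy point along the transport profile."""
    return max(range(len(energies)), key=lambda i: energies[i])

def unbinding_barrier(energies):
    """Highest energy minus the energy at index 0.

    Assumes index 0 is the bound state in the active site.
    """
    return max(energies) - energies[0]

def bottlenecks_agree(radii, energies, tolerance=1):
    """True if both bottlenecks fall within `tolerance` sampling steps."""
    gap = abs(geometric_bottleneck(radii) - energetic_bottleneck(energies))
    return gap <= tolerance

# Hypothetical profiles: radius in angstroms, energy in kcal/mol.
radii = [2.4, 1.9, 1.2, 1.6, 2.0, 2.8]
energies = [-6.0, -5.1, -4.0, -3.2, -1.5, -4.4]

geo_index = geometric_bottleneck(radii)
energy_index = energetic_bottleneck(energies)
barrier = unbinding_barrier(energies)
agree = bottlenecks_agree(radii, energies)
print(geo_index, energy_index, barrier, agree)
```

In this toy case the narrowest point and the energy maximum lie two steps apart, which is the kind of disagreement the abstract reports for about half of the analysed tunnels.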
- Keywords
- Bottleneck, Cavity, Cognate ligand, Enzyme, Machine learning, Pocket, Transport, Tunnel
- Publication type
- Journal Article MeSH
Every year, more than 19 million cancer cases are diagnosed, and this number continues to increase annually. Since standard treatment options have varying success rates for different types of cancer, understanding the biology of an individual's tumour becomes crucial, especially for cases that are difficult to treat. Personalised high-throughput profiling, using next-generation sequencing, allows for a comprehensive examination of biopsy specimens. Furthermore, the widespread use of this technology has generated a wealth of information on cancer-specific gene alterations. However, there exists a significant gap between identified alterations and their proven impact on protein function. Here, we present a bioinformatics pipeline that enables fast analysis of a missense mutation's effect on stability and function in known oncogenic proteins. This pipeline is coupled with a predictor that summarises the outputs of different tools used throughout the pipeline, providing a single probability score, achieving a balanced accuracy above 86%. The pipeline incorporates a virtual screening method to suggest potential FDA/EMA-approved drugs to be considered for treatment. We showcase three case studies to demonstrate the timely utility of this pipeline. To facilitate access and analysis of cancer-related mutations, we have packaged the pipeline as a web server, which is freely available at https://loschmidt.chemi.muni.cz/predictonco/.
SCIENTIFIC CONTRIBUTION: This work presents a novel bioinformatics pipeline that integrates multiple computational tools to predict the effects of missense mutations on proteins of oncological interest. The pipeline uniquely combines fast protein modelling, stability prediction, and evolutionary analysis with virtual drug screening, while offering actionable insights for precision oncology. This comprehensive approach surpasses existing tools by automating the interpretation of mutations and suggesting potential treatments, thereby striving to bridge the gap between sequencing data and clinical application.
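The balanced accuracy quoted in the abstract is the mean of sensitivity and specificity. A minimal sketch, using invented labels where 1 marks an oncogenic variant, shows how it is computed and how it differs from plain accuracy on imbalanced data. The function name and the data are assumptions for illustration only.

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of sensitivity (recall on 1s) and specificity (recall on 0s)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("both classes must be present in y_true")
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return (sensitivity + specificity) / 2

# Hypothetical labels: 4 oncogenic (1) and 6 neutral (0) variants.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]

ba = balanced_accuracy(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(f"balanced accuracy = {ba:.3f}, accuracy = {accuracy:.3f}")
```

Weighting both classes equally keeps the score from being dominated by whichever class is more common in the validation set.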
- Keywords
- Bioinformatics, Cancer, Function, High-performance computing, Machine learning, Molecular modelling, Oncology, Personalised medicine, Single nucleotide polymorphism, Stability, Treatment
- Publication type
- Journal Article MeSH
Core mitochondrial processes such as the electron transport chain, protein translation and the formation of Fe-S clusters (ISC) are of prokaryotic origin and were present in the bacterial ancestor of mitochondria. In animal and fungal models, a family of small Leu-Tyr-Arg motif-containing proteins (LYRMs) uniformly regulates the function of mitochondrial complexes involved in these processes. The action of LYRMs is contingent upon their binding to the acylated form of acyl carrier protein (ACP). This study demonstrates that LYRMs are structurally and evolutionarily related proteins characterized by a core triplet of α-helices. Their widespread distribution across eukaryotes suggests that 12 specialized LYRMs were likely present in the last eukaryotic common ancestor to regulate the assembly and folding of the subunits that are conserved in bacteria but that lack LYRM homologues. The secondary reduction of mitochondria to anoxic environments has rendered the function of LYRMs and their interaction with acylated ACP dispensable. Consequently, these findings strongly suggest that early eukaryotes installed LYRMs in aerobic mitochondria as orchestrated switches, essential for regulating core metabolism and ATP production.
- Keywords
- LECA, LYRM proteins, acyl-ACP, mitochondrial evolution
- MeSH
- Eukaryota metabolism MeSH
- Phylogeny MeSH
- Humans MeSH
- Mitochondrial Proteins * metabolism genetics MeSH
- Mitochondria * metabolism MeSH
- Evolution, Molecular MeSH
- Models, Molecular MeSH
- Acyl Carrier Protein metabolism genetics MeSH
- Amino Acid Sequence MeSH
- Animals MeSH
- Check Tag
- Humans MeSH
- Animals MeSH
- Publication type
- Journal Article MeSH
PredictONCO 1.0 is a unique web server that analyzes the effects of mutations on proteins frequently altered in various cancer types. The server can assess the impact of mutations on the sequence and structural properties of the protein and apply virtual screening to identify potential inhibitors that could be used as a highly individualized therapeutic approach, possibly based on drug repurposing. PredictONCO integrates predictive algorithms and state-of-the-art computational tools combined with information from established databases. The user interface was carefully designed for the target specialists in precision oncology, molecular pathology, clinical genetics and clinical sciences. The tool summarizes the effect of the mutation on protein stability and function and currently covers 44 common oncological targets. The binding affinities of Food and Drug Administration/European Medicines Agency-approved drugs with the wild-type and mutant proteins are calculated to facilitate treatment decisions. The reliability of predictions was confirmed against 108 clinically validated mutations. The server provides a fast and compact output, ideal for the often time-sensitive decision-making process in oncology. Three use cases of missense mutations, (i) K22A in cyclin-dependent kinase 4 identified in melanoma, (ii) E1197K mutation in anaplastic lymphoma kinase 4 identified in lung carcinoma and (iii) V765A mutation in epidermal growth factor receptor in a patient with congenital mismatch repair deficiency, highlight how the tool can increase levels of confidence regarding the pathogenicity of the variants and identify the most effective inhibitors. The server is available at https://loschmidt.chemi.muni.cz/predictonco.
- Keywords
- cancer, oncology, personalized medicine, single-nucleotide polymorphism, targeted therapy
- MeSH
- Precision Medicine * MeSH
- Humans MeSH
- Melanoma * MeSH
- Mutation MeSH
- Proteins MeSH
- Reproducibility of Results MeSH
- Machine Learning MeSH
- Computational Biology MeSH
- Check Tag
- Humans MeSH
- Publication type
- Journal Article MeSH
- Research Support, Non-U.S. Gov't MeSH
- Names of Substances
- Proteins MeSH
Knowledge of protein-ligand binding sites (LBSs) enables research ranging from protein function annotation to structure-based drug design. To this end, we have previously developed a stand-alone tool, P2Rank, and the web server PrankWeb (https://prankweb.cz/) for fast and accurate LBS prediction. Here, we present significant enhancements to PrankWeb. First, a new, more accurate evolutionary conservation estimation pipeline based on the UniRef50 sequence database and the HMMER3 package is introduced. Second, PrankWeb now allows users to enter a UniProt ID to carry out LBS predictions in situations where no experimental structure is available by utilizing the AlphaFold model database. Additionally, a range of minor improvements has been implemented. These include the ability to deploy PrankWeb and P2Rank as Docker containers, support for the mmCIF file format, improved public REST API access, and the ability to batch download the LBS predictions for the whole PDB archive and parts of the AlphaFold database.
- MeSH
- Databases, Protein MeSH
- Internet MeSH
- Ligands MeSH
- Protein Domains MeSH
- Proteins * chemistry MeSH
- Software * MeSH
- Protein Binding MeSH
- Binding Sites MeSH
- Publication type
- Journal Article MeSH
- Research Support, Non-U.S. Gov't MeSH
- Names of Substances
- Ligands MeSH
- Proteins * MeSH
PrankWeb is an online resource providing an interface to P2Rank, a state-of-the-art method for ligand binding site prediction. P2Rank is a template-free machine learning method based on the prediction of local chemical neighborhood ligandability centered on points placed on a solvent-accessible protein surface. Points with a high ligandability score are then clustered to form the resulting ligand binding sites. In addition, PrankWeb provides a web interface enabling users to easily carry out the prediction and visually inspect the predicted binding sites via an integrated sequence-structure view. Moreover, PrankWeb can determine sequence conservation for the input molecule and use this in both the prediction and result visualization steps. Alongside its online visualization options, PrankWeb also offers the possibility of exporting the results as a PyMOL script for offline visualization. The web frontend communicates with the server side via a REST API. In high-throughput scenarios, therefore, users can utilize the server API directly, bypassing the need for a web-based frontend or installation of the P2Rank application. PrankWeb is available at http://prankweb.cz/, while the web application source code and the P2Rank method can be accessed at https://github.com/jendelel/PrankWebApp and https://github.com/rdk/p2rank, respectively.
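The clustering step described above, in which surface points with a high ligandability score are grouped into binding sites, can be sketched as single-linkage clustering with a distance cutoff. This is an illustrative reading of the abstract, not P2Rank's implementation: the score threshold, the 3.0 distance cutoff, the ranking by summed score and all names below are assumptions.

```python
import math

def cluster_ligandable_points(points, scores,
                              score_threshold=0.5, distance_cutoff=3.0):
    """Group high-scoring points into clusters by single linkage.

    Two points end up in the same cluster if they are connected by a
    chain of points, each within `distance_cutoff` of the next.
    Returns lists of point indices.
    """
    unvisited = {i for i, s in enumerate(scores) if s >= score_threshold}
    clusters = []
    while unvisited:
        seed = min(unvisited)  # lowest index first, for reproducible output
        unvisited.remove(seed)
        cluster, frontier = [seed], [seed]
        while frontier:
            current = frontier.pop()
            near = [j for j in unvisited
                    if math.dist(points[current], points[j]) <= distance_cutoff]
            for j in near:
                unvisited.remove(j)
                cluster.append(j)
                frontier.append(j)
        clusters.append(sorted(cluster))
    return clusters

# Hypothetical surface points (x, y, z) with ligandability scores.
points = [(0, 0, 0), (2, 0, 0), (4, 0, 0), (20, 0, 0), (21, 0, 0), (10, 10, 10)]
scores = [0.9, 0.8, 0.7, 0.6, 0.95, 0.1]

clusters = cluster_ligandable_points(points, scores)
# Rank candidate sites by summed score (an assumed ranking rule).
ranked = sorted(clusters, key=lambda c: sum(scores[i] for i in c), reverse=True)
print(ranked)
```

Points 0 and 2 are farther apart than the cutoff but still share a cluster because point 1 links them, which is the defining property of single linkage.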
- MeSH
- Benchmarking MeSH
- Datasets as Topic MeSH
- Protein Interaction Domains and Motifs MeSH
- Internet MeSH
- Protein Conformation, alpha-Helical MeSH
- Protein Conformation, beta-Strand MeSH
- Humans MeSH
- Ligands MeSH
- Proteins chemistry metabolism MeSH
- Amino Acid Sequence MeSH
- Software * MeSH
- Machine Learning * MeSH
- Thermodynamics MeSH
- Protein Binding MeSH
- Binding Sites MeSH
- Check Tag
- Humans MeSH
- Publication type
- Journal Article MeSH
- Research Support, Non-U.S. Gov't MeSH
- Names of Substances
- Ligands MeSH
- Proteins MeSH