Recent advances in protein 3D structure prediction using deep learning have focused on the importance of amino acid residue-residue connections (i.e., pairwise atomic contacts) for accuracy at the expense of mechanistic interpretability. Therefore, we decided to perform a series of analyses based on an alternative framework of residue-residue connections making primary use of the TOP2018 dataset. This framework of residue-residue connections is derived from amino acid residue pairing models both historic and new, all based on genetic principles complemented by relevant biophysical principles. Of these pairing models, three new models (named the GU, Transmuted and Shift pairing models) exhibit the highest observed-over-expected ratios and highest correlations in statistical analyses with various intra- and inter-chain datasets, in comparison to the remaining models. In addition, these new pairing models are universally frequent across different connection ranges, secondary structure connections, and protein sizes. Accordingly, following further statistical and other analyses described herein, we have come to a major conclusion that all three pairing models together could represent the basis of a universal proteomic code (second genetic code) sufficient, in and of itself, to "encode" for both protein folding mechanisms and protein-protein interactions.
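The observed-over-expected ratio mentioned above can be illustrated with a minimal sketch. The null model used here (residues pairing independently according to background composition) and the function name `observed_over_expected` are assumptions made for illustration; the abstract does not state how the expected counts were derived.

```python
from collections import Counter
from itertools import combinations_with_replacement


def observed_over_expected(contacts, composition):
    """Return {(res_a, res_b): observed/expected} for unordered residue pairs.

    contacts: iterable of (res_a, res_b) one-letter codes, one item per contact.
    composition: dict mapping residue -> background frequency (summing to 1).
    Assumption: expected counts come from independent pairing of residues.
    """
    # Treat (A, L) and (L, A) as the same unordered pair
    observed = Counter(tuple(sorted(pair)) for pair in contacts)
    total = sum(observed.values())
    ratios = {}
    for a, b in combinations_with_replacement(sorted(composition), 2):
        # Probability of an unordered pair: p_a^2 if identical, else 2 * p_a * p_b
        p = composition[a] * composition[b] * (1 if a == b else 2)
        expected = p * total
        if expected > 0:
            ratios[(a, b)] = observed.get((a, b), 0) / expected
    return ratios


if __name__ == "__main__":
    toy_contacts = [("A", "L"), ("L", "A"), ("A", "L"), ("A", "A")]
    toy_composition = {"A": 0.5, "L": 0.5}
    print(observed_over_expected(toy_contacts, toy_composition))
```

A ratio above 1 indicates a pair seen more often than the null model predicts; a ratio below 1 indicates depletion.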
- Keywords
- Contact map, Protein 3D structure, Protein folding, Protein-protein interactions, Proteomic code, Sense-antisense
- MeSH
- amino acids * chemistry genetics MeSH
- databases, protein MeSH
- humans MeSH
- models, molecular * MeSH
- proteins * chemistry genetics metabolism MeSH
- proteomics * MeSH
- protein folding * MeSH
- Check Tag
- humans MeSH
- Publication type
- journal article MeSH
- Substance names
- amino acids * MeSH
- proteins * MeSH
The easiest and often most useful way to work with experimentally determined or computationally predicted structures of biomolecules is by viewing their three-dimensional (3D) shapes using a molecular visualization tool. Mol* was collaboratively developed by RCSB Protein Data Bank (RCSB PDB, RCSB.org) and Protein Data Bank in Europe (PDBe, PDBe.org) as an open-source, web-based, 3D visualization software suite for examination and analysis of biostructures. It is capable of displaying atomic coordinates and related experimental data of biomolecular structures together with a variety of annotations, facilitating basic and applied research, training, education, and information dissemination. Across RCSB.org, the RCSB PDB research-focused web portal, Mol* has been implemented to support single-mouse-click atomic-level visualization of biomolecules (e.g., proteins, nucleic acids, carbohydrates) with bound cofactors, small-molecule ligands, ions, water molecules, or other macromolecules. RCSB.org Mol* can seamlessly display 3D structures from various sources, allowing structure interrogation, superimposition, and comparison. Using influenza A H5N1 virus as a topical case study of an important pathogen, we exemplify how Mol* has been embedded within various RCSB.org tools, allowing users to view polymer sequence and structure-based annotations integrated from trusted bioinformatics data resources, assess patterns and trends in groups of structures, and view structures of any size and compositional complexity. In addition to being linked to every experimentally determined biostructure and Computed Structure Model made available at RCSB.org, Standalone Mol* is freely available for visualizing any atomic-level or multi-scale biostructure at rcsb.org/3d-view.
- Keywords
- 3D biostructure, Protein Data Bank, global health, influenza A H5N1 virus, molecular visualization, open-source, pandemic preparedness, viral pathogen, virus life cycle, web-based
- MeSH
- databases, protein MeSH
- internet MeSH
- protein conformation MeSH
- models, molecular MeSH
- proteome * chemistry MeSH
- software * MeSH
- viral proteins * chemistry MeSH
- influenza A virus, H5N1 subtype * chemistry MeSH
- Publication type
- journal article MeSH
- Substance names
- proteome * MeSH
- viral proteins * MeSH
MOTIVATION: Structure-based methods for detecting protein-ligand binding sites play a crucial role in various domains, from fundamental research to biomedical applications. However, current prediction methodologies often rely on holo (ligand-bound) protein conformations for training and evaluation, overlooking the significance of the apo (ligand-free) states. This oversight is particularly problematic in the case of cryptic binding sites (CBSs), where holo-based assessment yields unrealistic performance expectations. RESULTS: To advance development in this domain, we introduce CryptoBench, a benchmark dataset tailored for training and evaluating novel CBS prediction methodologies. CryptoBench is constructed upon a large collection of apo-holo protein pairs, grouped by UniProt ID, clustered by sequence identity, and filtered to contain only structures with substantial structural change in the binding site. CryptoBench comprises 1107 structures with predefined cross-validation splits, making it the most extensive CBS dataset to date. To establish a performance baseline, we measured the predictive power of sequence- and structure-based CBS residue prediction methods using the benchmark. We selected PocketMiner as the state-of-the-art representative of the structure-based methods for CBS detection, and P2Rank, a widely used structure-based method for general binding site prediction that is not specifically tailored for cryptic sites. For sequence-based approaches, we trained a neural network to classify binding residues using protein language model embeddings. Our sequence-based approach outperformed PocketMiner and P2Rank across key metrics, including area under the curve, area under the precision-recall curve, Matthews correlation coefficient, and F1 scores. These results provide baseline benchmark results for future CBS and potentially also non-CBS prediction endeavors, leveraging CryptoBench as the foundational platform for further advancements in the field.
AVAILABILITY AND IMPLEMENTATION: The CryptoBench dataset, including the benchmark model, is available on Open Science Framework: https://osf.io/pz4a9/. The code and tutorial are available at the GitHub repository: https://github.com/skrhakv/CryptoBench/.
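Of the metrics listed above, the Matthews correlation coefficient is the least familiar, so a minimal sketch of how it can be computed for per-residue binary labels may help. The function names and the 0/1 label encoding are assumptions for illustration and are not taken from the CryptoBench code.

```python
import math


def confusion_counts(y_true, y_pred):
    """Count TP, FP, TN, FN for binary labels (1 = binding residue, 0 = non-binding)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, fp, tn, fn


def matthews_corrcoef(tp, fp, tn, fn):
    """MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)).

    Returns 0.0 when the denominator is zero, a common convention.
    """
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return numerator / denominator if denominator else 0.0


if __name__ == "__main__":
    labels = [1, 1, 0, 0, 0, 0]
    predictions = [1, 0, 0, 0, 1, 0]
    print(matthews_corrcoef(*confusion_counts(labels, predictions)))
```

Because binding residues are usually a small minority of all residues, MCC is often preferred over accuracy: it stays near zero for a predictor that labels everything as non-binding.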
- MeSH
- benchmarking MeSH
- databases, protein MeSH
- protein conformation MeSH
- ligands MeSH
- proteins * chemistry metabolism MeSH
- software * MeSH
- protein binding MeSH
- binding sites MeSH
- computational biology * methods MeSH
- Publication type
- journal article MeSH
- Substance names
- ligands MeSH
- proteins * MeSH
Mass spectral libraries are collections of reference spectra, usually associated with specific analytes from which the spectra were generated, that are used for further downstream analysis of new spectra. There are many different formats used for encoding spectral libraries, but none have undergone a standardization process to ensure broad applicability to many applications. As part of the Human Proteome Organization Proteomics Standards Initiative (PSI), we have developed a standardized format for encoding spectral libraries, called mzSpecLib (https://psidev.info/mzSpecLib). It is primarily a data model that flexibly encodes metadata about the library entries using the extensible PSI-MS controlled vocabulary and can be encoded in and converted between different serialization formats. We have also developed a standardized data model and serialization for fragment ion peak annotations, called mzPAF (https://psidev.info/mzPAF). It is defined as a separate standard, since it may be used for other applications besides spectral libraries. The mzSpecLib and mzPAF standards are compatible with existing PSI standards such as ProForma 2.0 and the Universal Spectrum Identifier. The mzSpecLib and mzPAF standards have been primarily defined for peptides in proteomics applications with basic small molecule support. They could be extended in the future to other fields that need to encode spectral libraries for nonpeptidic analytes.
- MeSH
- databases, protein standards MeSH
- mass spectrometry standards MeSH
- humans MeSH
- proteomics * standards MeSH
- software MeSH
- Check Tag
- humans MeSH
- Publication type
- journal article MeSH
Time-resolved X-ray crystallography experiments were first performed in the 1980s, yet they remained a niche technique for decades. With the recent advent of X-ray free electron laser (XFEL) sources and serial crystallographic techniques, time-resolved crystallography has received renewed interest and has become more accessible to a wider user base. Despite this, time-resolved structures represent <1% of models deposited in the worldwide Protein Data Bank, indicating that the tools and techniques currently available require further development before such experiments can become truly routine. In this chapter, we demonstrate how applying data multiplexing to time-resolved crystallography can enhance the achievable time resolution at moderately intense monochromatic X-ray sources, ranging from synchrotrons to bench-top sources. We discuss the principles of multiplexing, where this technique may be advantageous, potential pitfalls, and experimental design considerations.
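The general idea behind multiplexing can be sketched with a toy example. The abstract does not specify which transform is used; the Hadamard-type S-matrix encoding below, the order-3 matrix, and the names `encode` and `decode` are all assumptions chosen for illustration. Each measurement sums the signal from several time slots according to a 0/1 pattern, and the per-slot signal is recovered afterwards by inverting the pattern matrix.

```python
# Order-3 S-matrix (assumed encoding pattern): row i says which time
# slots contribute to measurement i (1 = probe on, 0 = probe off).
S3 = [
    [1, 0, 1],
    [0, 1, 1],
    [1, 1, 0],
]


def encode(signal, s_matrix):
    """Simulate multiplexed measurements: each one sums the selected time slots."""
    return [sum(s * x for s, x in zip(row, signal)) for row in s_matrix]


def decode(measurements, s_matrix):
    """Recover per-slot signal using the S-matrix inverse, 2/(n+1) * (2*S^T - J)."""
    n = len(s_matrix)
    scale = 2.0 / (n + 1)
    return [
        scale * sum((2 * s_matrix[i][j] - 1) * measurements[i] for i in range(n))
        for j in range(n)
    ]


if __name__ == "__main__":
    per_slot_signal = [5.0, 2.0, 7.0]        # hypothetical intensities per time slot
    multiplexed = encode(per_slot_signal, S3)
    print(multiplexed)                        # each value combines two slots
    print(decode(multiplexed, S3))            # recovers the per-slot signal
```

In this scheme every measurement collects photons from several time slots at once, which is why multiplexing can help at sources that are too weak to resolve each slot individually.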
- Keywords
- Mathematical transforms, Multiplexing, Protein dynamics, Time-resolved, X-ray crystallography
- MeSH
- databases, protein MeSH
- protein conformation MeSH
- crystallography, X-ray methods MeSH
- models, molecular MeSH
- proteins * chemistry MeSH
- synchrotrons MeSH
- Publication type
- journal article MeSH
- Research Support, Non-U.S. Gov't MeSH
- Research Support, U.S. Gov't, Non-P.H.S. MeSH
- Substance names
- proteins * MeSH
Recent advances in AI-based methods have revolutionized the field of structural biology. Concomitantly, high-throughput sequencing and functional genomics have generated genetic variants at an unprecedented scale. However, efficient tools and resources are needed to link disparate data types, to 'map' variants onto protein structures, to better understand how the variation causes disease, and thereby to design therapeutics. Here we present the Genomics 2 Proteins portal (https://g2p.broadinstitute.org/): a human proteome-wide resource that maps 20,076,998 genetic variants onto 42,413 protein sequences and 77,923 structures, with a comprehensive set of structural and functional features. Additionally, the Genomics 2 Proteins portal allows users to interactively upload protein residue-wise annotations (for example, variants and scores) as well as protein structures beyond those in databases, to establish the connection between genomics and proteins. The portal serves as an easy-to-use discovery tool for researchers and scientists to hypothesize the structure-function relationship between natural or synthetic variations and their molecular phenotypes.
- MeSH
- databases, protein * MeSH
- genetic variation MeSH
- genetic testing methods MeSH
- genomics * methods MeSH
- protein conformation MeSH
- humans MeSH
- proteins genetics chemistry MeSH
- proteome genetics MeSH
- amino acid sequence MeSH
- software MeSH
- Check Tag
- humans MeSH
- Publication type
- journal article MeSH
- Substance names
- proteins MeSH
- proteome MeSH
We present a novel system that leverages curators in the loop to develop a dataset and model for detecting structure features and functional annotations at the residue level from standard publication text. Our approach involves the integration of data from multiple resources, including PDBe, EuropePMC, PubMedCentral, and PubMed, combined with annotation guidelines from UniProt, and LitSuggest and HuggingFace models as tools in the annotation process. A team of seven annotators manually curated ten articles for named entities, which we used to train a starting PubMedBERT model from HuggingFace. Using a human-in-the-loop annotation system, we iteratively developed the best model, which achieved 0.90 precision, 0.92 recall, and 0.91 F1-measure. Our proposed system showcases a successful synergy of machine learning techniques and human expertise in curating a dataset for residue-level functional annotations and protein structure features. The results demonstrate the potential for broader applications in protein research, bridging the gap between advanced machine learning models and the indispensable insights of domain experts.
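As a quick consistency check on the reported figures, the F1-measure is the harmonic mean of precision and recall. The short sketch below (the function name `f1_score` is chosen for illustration) shows that the stated precision of 0.90 and recall of 0.92 do give an F1 of about 0.91.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 if both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Values reported in the abstract above
    print(round(f1_score(0.90, 0.92), 2))  # 0.91
```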
- MeSH
- databases, protein MeSH
- humans MeSH
- proteins * chemistry MeSH
- machine learning * MeSH
- Check Tag
- humans MeSH
- Publication type
- journal article MeSH
- dataset MeSH
- Substance names
- proteins * MeSH
SUMMARY: Protein design requires information about how mutations affect protein stability. Many web-based predictors are available for this purpose, yet comparing them or using them en masse is difficult. Here, we present BenchStab, a console tool/Python package for easy and quick execution of 19 predictors and result collection on a list of mutants. Moreover, the tool is easily extensible with additional predictors. We created an independent dataset derived from the FireProtDB and evaluated 24 different prediction methods. AVAILABILITY AND IMPLEMENTATION: BenchStab is an open-source Python package available at https://github.com/loschmidt/BenchStab with a detailed README and example usage at https://loschmidt.chemi.muni.cz/benchstab. The BenchStab dataset is available on Zenodo: https://zenodo.org/records/10637728.
- MeSH
- databases, protein MeSH
- internet * MeSH
- proteins chemistry MeSH
- software * MeSH
- protein stability MeSH
- computational biology methods MeSH
- Publication type
- journal article MeSH
- Substance names
- proteins MeSH
A single protein structure is rarely sufficient to capture the conformational variability of a protein. Both bound and unbound (holo and apo) forms of a protein are essential for understanding its geometry and making meaningful comparisons. Nevertheless, docking or drug design studies often still consider only single protein structures in their holo form, which are for the most part rigid. With the recent explosion in the field of structural biology, large, curated datasets are urgently needed. Here, we use a previously developed application (AHoJ) to perform a comprehensive search for apo-holo pairs for 468,293 biologically relevant protein-ligand interactions across 27,983 proteins. In each search, the binding pocket is captured and mapped across existing structures within the same UniProt entry, and the mapped pockets are annotated as apo or holo, based on the presence or absence of ligands. We assemble the results into a database, AHoJ-DB (www.apoholo.cz/db), that captures the variability of proteins with identical sequences, thereby exposing the agents responsible for the observed differences in geometry. We report several metrics for each annotated pocket, and we also include binding pockets that form at the interface of multiple chains. Analysis of the database shows that about 24% of the binding sites occur at the interface of two or more chains and that less than 50% of the total binding sites processed have an apo form in the PDB. These results can be used to train and evaluate predictors, discover potentially druggable proteins, and reveal protein- and ligand-specific relationships that were previously obscured by intermittent or partial data. Availability: www.apoholo.cz/db.
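The apo/holo annotation step described above can be pictured with a minimal sketch. This is not the AHoJ implementation: the function name `annotate_pocket`, the coordinate-tuple input format, and the 4.0 Å distance cutoff are all assumptions made for illustration. A mapped pocket is labelled holo if any ligand atom lies within the cutoff of any pocket atom, and apo otherwise.

```python
import math


def annotate_pocket(pocket_atoms, ligand_atoms, cutoff=4.0):
    """Label a mapped pocket 'holo' or 'apo'.

    pocket_atoms: list of (x, y, z) coordinates of pocket residue atoms, in Å.
    ligand_atoms: list of (x, y, z) coordinates of all ligand atoms in the structure.
    cutoff: assumed contact distance in Å (hypothetical default).
    """
    for pocket_xyz in pocket_atoms:
        for ligand_xyz in ligand_atoms:
            # math.dist computes the Euclidean distance between two points
            if math.dist(pocket_xyz, ligand_xyz) <= cutoff:
                return "holo"
    return "apo"


if __name__ == "__main__":
    pocket = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
    print(annotate_pocket(pocket, [(3.0, 0.0, 0.0)]))   # ligand 2 Å from the pocket
    print(annotate_pocket(pocket, [(10.0, 0.0, 0.0)]))  # ligand far from the pocket
    print(annotate_pocket(pocket, []))                  # no ligand in the structure
```

A production tool would also need to decide which heteroatoms count as ligands (for example, excluding water), a choice this sketch leaves to the caller.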
- Keywords
- Apo-holo, binding sites, drug design, ligands, protein structure
- MeSH
- apoproteins chemistry metabolism MeSH
- databases, protein * MeSH
- protein conformation * MeSH
- humans MeSH
- ligands MeSH
- models, molecular MeSH
- proteins * chemistry metabolism MeSH
- protein binding * MeSH
- binding sites MeSH
- computational biology methods MeSH
- Check Tag
- humans MeSH
- Publication type
- journal article MeSH
- Substance names
- apoproteins MeSH
- ligands MeSH
- proteins * MeSH
Channels, tunnels, and pores serve as pathways for the transport of molecules and ions through protein structures, thus contributing to their functions. MOLEonline (https://mole.upol.cz) is an interactive web-based tool with enhanced capabilities for detecting and characterizing channels, tunnels, and pores within protein structures. MOLEonline has two distinct calculation modes: one for the analysis of channels and tunnels, and one for transmembrane pores. This application gives researchers rich analytical insights into channel detection, structural characterization, and physicochemical properties. ChannelsDB 2.0 (https://channelsdb2.biodata.ceitec.cz/) is a comprehensive database that offers information on the location, geometry, and physicochemical characteristics of tunnels and pores within macromolecular structures deposited in the Protein Data Bank and AlphaFill databases. These tunnels are sourced from manual deposition from literature and automatic detection using the software tools MOLE and CAVER. MOLEonline and ChannelsDB visualization is powered by the LiteMol Viewer and Mol* viewer, ensuring a user-friendly workspace. This chapter provides an overview of user applications and usage.
- Keywords
- Biomacromolecule, PDB, Physicochemical properties, Pore, Protein, Residues, Tunnel, Visualization, Voronoi, mmCIF, Channel
- MeSH
- databases, protein * MeSH
- web browser MeSH
- ion channels metabolism chemistry MeSH
- protein conformation MeSH
- models, molecular MeSH
- proteins chemistry metabolism MeSH
- software * MeSH
- user-computer interface MeSH
- computational biology methods MeSH
- Publication type
- journal article MeSH
- Substance names
- ion channels MeSH
- proteins MeSH