Time-resolved X-ray crystallography experiments were first performed in the 1980s, yet they remained a niche technique for decades. With the recent advent of X-ray free electron laser (XFEL) sources and serial crystallographic techniques, time-resolved crystallography has received renewed interest and has become accessible to a wider user base. Despite this, time-resolved structures represent less than 1% of models deposited in the Worldwide Protein Data Bank, indicating that the tools and techniques currently available require further development before such experiments can become truly routine. In this chapter, we demonstrate how applying data multiplexing to time-resolved crystallography can enhance the achievable time resolution at moderately intense monochromatic X-ray sources, ranging from synchrotrons to bench-top sources. We discuss the principles of multiplexing, where this technique may be advantageous, potential pitfalls, and experimental design considerations.
We present a novel system that keeps curators in the loop to develop a dataset and model for detecting structure features and functional annotations at the residue level in standard publication text. Our approach integrates data from multiple resources, including PDBe, Europe PMC, PubMed Central, and PubMed, combined with annotation guidelines from UniProt, and uses LitSuggest and HuggingFace models as tools in the annotation process. A team of seven annotators manually curated ten articles for named entities, which we used to train a starting PubMedBERT model from HuggingFace. Using a human-in-the-loop annotation system, we iteratively developed the best model, which achieved 0.90 for precision, 0.92 for recall, and 0.91 for F1-measure. Our proposed system showcases a successful synergy of machine learning techniques and human expertise in curating a dataset for residue-level functional annotations and protein structure features. The results demonstrate the potential for broader applications in protein research, bridging the gap between advanced machine learning models and the indispensable insights of domain experts.
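The three metrics reported in this abstract are related: the F1-measure is the harmonic mean of precision and recall. The short sketch below only checks that the quoted values are mutually consistent. The function name `f1_score` is introduced here for illustration and is not taken from the authors' code.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (illustrative helper)."""
    if precision + recall == 0:
        return 0.0  # convention: F1 is 0 when both inputs are 0
    return 2 * precision * recall / (precision + recall)


# Values quoted in the abstract: precision 0.90, recall 0.92.
f1 = f1_score(0.90, 0.92)
print(round(f1, 2))  # 0.91, matching the reported F1-measure
```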
- MeSH
- Databases, Protein MeSH
- Humans MeSH
- Proteins * chemistry MeSH
- Machine Learning * MeSH
- Check Tag
- Humans MeSH
- Publication type
- Journal Article MeSH
- Dataset MeSH
Recent advances in AI-based methods have revolutionized the field of structural biology. Concomitantly, high-throughput sequencing and functional genomics have generated genetic variants at an unprecedented scale. However, efficient tools and resources are needed to link these disparate data types, that is, to 'map' variants onto protein structures, to better understand how the variation causes disease, and thereby to design therapeutics. Here we present the Genomics 2 Proteins portal (https://g2p.broadinstitute.org/): a human proteome-wide resource that maps 20,076,998 genetic variants onto 42,413 protein sequences and 77,923 structures, with a comprehensive set of structural and functional features. Additionally, the Genomics 2 Proteins portal allows users to interactively upload protein residue-wise annotations (for example, variants and scores) as well as protein structures beyond those in databases, to establish the connection between genomics and proteins. The portal serves as an easy-to-use discovery tool for researchers and scientists to hypothesize the structure-function relationship between natural or synthetic variations and their molecular phenotypes.
Mass spectrometry proteomics data are typically evaluated against publicly available annotated sequences, but the proteogenomics approach is a useful alternative. A single genome is commonly utilized in custom proteomic and proteogenomic data analysis. We pose the question of whether utilizing numerous different genome assemblies in a search database would be beneficial. We reanalyzed raw data from the exoprotein fraction of four reference Enterobacterial Repetitive Intergenic Consensus (ERIC) I-IV genotypes of the honey bee bacterial pathogen Paenibacillus larvae and evaluated them against three reference databases (from NCBI-protein, RefSeq, and UniProt) together with an array of protein sequences generated by six-frame direct translation of 15 genome assemblies from GenBank. The wide search yielded 453 protein hits/groups, which UpSet analysis categorized into 50 groups based on the success of protein identification by the 18 database components. Nine hits that were not identified by a unique peptide were not considered for marker selection, which discarded the only protein that was not identified by the reference databases. We propose that the variability in successful identifications between genome assemblies is useful for marker mining. The results suggest that various strains of P. larvae can exhibit specific traits that set them apart from the established genotypes ERIC I-V.
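The six-frame direct translation mentioned in this abstract can be sketched as follows. This is a minimal illustration, not the authors' pipeline: it assumes the standard codon assignments (the bacterial table shares them and differs only in permitted start codons), translates stop codons as `*` without splitting into open reading frames, and uses function names invented for this example.

```python
from itertools import product

# Standard codon table, bases ordered T, C, A, G (NCBI convention).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna: str) -> str:
    """Return the reverse complement of an upper-case DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna: str) -> str:
    """Translate complete codons; unknown codons (e.g. with N) become 'X'."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translation(dna: str) -> dict:
    """Translate three forward (+1..+3) and three reverse (-1..-3) frames."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


# Toy sequence, not taken from any P. larvae assembly.
for name, protein in six_frame_translation("ATGGCCTAA").items():
    print(name, protein)
```

In a search-database setting, each frame would typically be split at stop codons and short fragments discarded before the sequences are added to the database; that step is omitted here.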
- MeSH
- Bacterial Proteins * genetics metabolism MeSH
- Databases, Protein MeSH
- Virulence Factors * genetics metabolism MeSH
- Genome, Bacterial * genetics MeSH
- Paenibacillus larvae * genetics pathogenicity metabolism MeSH
- Proteogenomics * methods MeSH
- Proteomics methods MeSH
- Bees microbiology MeSH
- Animals MeSH
- Check Tag
- Animals MeSH
- Publication type
- Journal Article MeSH
A single protein structure is rarely sufficient to capture the conformational variability of a protein. Both bound and unbound (holo and apo) forms of a protein are essential for understanding its geometry and making meaningful comparisons. Nevertheless, docking or drug design studies often still consider only single protein structures in their holo form, which are for the most part rigid. With the recent explosion in the field of structural biology, large, curated datasets are urgently needed. Here, we use a previously developed application (AHoJ) to perform a comprehensive search for apo-holo pairs for 468,293 biologically relevant protein-ligand interactions across 27,983 proteins. In each search, the binding pocket is captured and mapped across existing structures within the same UniProt entry, and the mapped pockets are annotated as apo or holo, based on the absence or presence of ligands. We assemble the results into a database, AHoJ-DB (www.apoholo.cz/db), that captures the variability of proteins with identical sequences, thereby exposing the agents responsible for the observed differences in geometry. We report several metrics for each annotated pocket, and we also include binding pockets that form at the interface of multiple chains. Analysis of the database shows that about 24% of the binding sites occur at the interface of two or more chains and that less than 50% of the total binding sites processed have an apo form in the PDB. These results can be used to train and evaluate predictors, discover potentially druggable proteins, and reveal protein- and ligand-specific relationships that were previously obscured by intermittent or partial data. Availability: www.apoholo.cz/db.
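The annotation rule described in this abstract (a mapped pocket is holo if a ligand is present and apo otherwise, and a pocket may span several chains) can be illustrated with a toy sketch. The `MappedPocket` class, its fields, and the structure labels are hypothetical and do not reflect AHoJ's actual data model or output format.

```python
from dataclasses import dataclass, field


@dataclass
class MappedPocket:
    """Hypothetical record for one pocket mapped onto one structure."""
    structure: str                    # placeholder label, not a real PDB ID
    chains: tuple                     # chains contributing pocket residues
    ligands: list = field(default_factory=list)  # ligands found in the pocket


def annotate(pocket: MappedPocket) -> str:
    """Label a pocket 'holo' if any ligand occupies it, else 'apo'."""
    return "holo" if pocket.ligands else "apo"


def is_interface(pocket: MappedPocket) -> bool:
    """True when residues from two or more chains form the pocket."""
    return len(set(pocket.chains)) >= 2


pockets = [
    MappedPocket("structure_A", ("A",), ["ATP"]),
    MappedPocket("structure_B", ("A", "B")),
]
for p in pockets:
    print(p.structure, annotate(p), "interface" if is_interface(p) else "single-chain")
```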
- MeSH
- Apoproteins chemistry metabolism MeSH
- Databases, Protein * MeSH
- Protein Conformation * MeSH
- Humans MeSH
- Ligands MeSH
- Models, Molecular MeSH
- Proteins * chemistry metabolism MeSH
- Protein Binding * MeSH
- Binding Sites MeSH
- Computational Biology methods MeSH
- Check Tag
- Humans MeSH
- Publication type
- Journal Article MeSH
- Keywords
- AlphaFold
- MeSH
- Databases, Protein MeSH
- Protein Conformation * MeSH
The archiving and dissemination of protein and nucleic acid structures as well as their structural, functional and biophysical annotations is an essential task that enables the broader scientific community to conduct impactful research in multiple fields of the life sciences. The Protein Data Bank in Europe (PDBe; pdbe.org) team develops and maintains several databases and web services to address this fundamental need. From data archiving as a member of the Worldwide PDB consortium (wwPDB; wwpdb.org), to the PDBe Knowledge Base (PDBe-KB; pdbekb.org), we provide data, data-access mechanisms, and visualizations that facilitate basic and applied research and education across the life sciences. Here, we provide an overview of the structural data and annotations that we integrate and make freely available. We describe the web services and data visualization tools we offer, and provide information on how to effectively use or even further develop them. Finally, we discuss the direction of our data services, and how we aim to tackle new challenges that arise from the recent, unprecedented advances in the field of structure determination and protein structure modeling.
- MeSH
- Databases, Protein MeSH
- Protein Conformation MeSH
- Nucleic Acids * MeSH
- Proteins * chemistry MeSH
- Publication type
- Journal Article MeSH
- Research Support, Non-U.S. Gov't MeSH
- Geographic names
- Europe MeSH
Engineered small non-antibody protein scaffolds are a promising alternative to antibodies and are especially attractive for use in protein therapeutics and diagnostics. Their advantages include smaller size and a more robust, single-domain structural framework with a defined binding surface amenable to mutation. This calls for a more systematic approach to designing new scaffolds suitable for use in one or more methods of directed evolution. Here we describe a process based on an analysis of protein structures from the Protein Data Bank and their experimental examination. The candidate protein scaffolds were subjected to a thorough screening, including computational evaluation of mutability and experimental determination of their expression yield in E. coli, solubility, and thermostability. In the next step, we examined several variants of the candidate scaffolds, including their wild types and alanine mutants. We demonstrated the applicability of this systematic procedure by selecting a monomeric single-domain human protein with a fold different from previously known scaffolds. The newly developed scaffold, called ProBi (Protein Binder), contains two independently mutable surface patches. We demonstrated its functionality by training it as a binder against human interleukin-10, a medically important cytokine. The procedure yielded scaffold-related variants with nanomolar affinity.
- MeSH
- Databases, Protein MeSH
- Interleukin-10 metabolism MeSH
- Protein Conformation MeSH
- Computer Simulation MeSH
- Protein Engineering MeSH
- Proteins chemistry genetics metabolism MeSH
- Recombinant Proteins chemistry genetics metabolism MeSH
- Ribosomes metabolism MeSH
- Directed Molecular Evolution methods MeSH
- Amino Acid Sequence MeSH
- Protein Stability MeSH
- Protein Binding MeSH
- Publication type
- Journal Article MeSH
- Research Support, Non-U.S. Gov't MeSH
CATH (https://www.cathdb.info) identifies domains in protein structures from wwPDB and classifies these into evolutionary superfamilies, thereby providing structural and functional annotations. There are two levels: CATH-B, a daily snapshot of the latest domain structures and superfamily assignments, and CATH+, with additional derived data, such as predicted sequence domains and functionally coherent sequence subsets (Functional Families or FunFams). The latest CATH+ release, version 4.3, significantly increases coverage of structural and sequence data, with an addition of 65,351 fully classified domain structures (+15%), providing 500,238 structural domains, and 151 million predicted sequence domains (+59%) assigned to 5,481 superfamilies. The FunFam generation pipeline has been re-engineered to cope with the increased influx of data. Three times more sequences are captured in FunFams, with a concomitant increase in functional purity, information content and structural coverage. FunFam expansion increases the structural annotations provided for experimental GO terms (+59%). We also present CATH-FunVar web pages displaying variations in protein sequences and their proximity to known or predicted functional sites. We present two case studies: (1) putative cancer drivers and (2) SARS-CoV-2 proteins. Finally, we have improved links to and from CATH, including SCOP, InterPro, Aquaria and 2DProt.
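The structural-domain figures in this abstract can be cross-checked with a few lines of arithmetic. The sketch assumes that the quoted +15% is measured relative to the domain count before the 65,351 additions; the abstract does not state the baseline explicitly, and the function name is illustrative.

```python
def previous_and_growth(added: int, total: int) -> tuple:
    """Return (count before the additions, percentage growth over that count)."""
    previous = total - added
    return previous, 100 * added / previous


# Figures quoted in the abstract: 65,351 added domains, 500,238 in total.
previous, growth = previous_and_growth(65_351, 500_238)
print(previous)          # 434887 domains before the additions
print(round(growth, 1))  # 15.0, consistent with the quoted +15%
```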
- MeSH
- Molecular Sequence Annotation MeSH
- COVID-19 epidemiology prevention & control virology MeSH
- Databases, Protein statistics & numerical data MeSH
- Epidemics MeSH
- Internet MeSH
- Humans MeSH
- Protein Domains * MeSH
- Proteins chemistry genetics metabolism MeSH
- SARS-CoV-2 genetics metabolism physiology MeSH
- Amino Acid Sequence MeSH
- Sequence Analysis, Protein methods MeSH
- Sequence Homology, Amino Acid MeSH
- Viral Proteins chemistry genetics metabolism MeSH
- Computational Biology methods statistics & numerical data MeSH
- Check Tag
- Humans MeSH
- Publication type
- Journal Article MeSH
- Research Support, Non-U.S. Gov't MeSH
The majority of naturally occurring proteins have evolved to function under mild conditions inside living organisms. One of the critical obstacles to the use of proteins in biotechnological applications is their insufficient stability at elevated temperatures or in the presence of salts. Since experimental screening for stabilizing mutations is typically laborious and expensive, in silico predictors are often used for narrowing down the mutational landscape. Recent advances in machine learning and artificial intelligence further facilitate the development of such computational tools. However, the accuracy of these predictors strongly depends on the quality and amount of data used for training and testing, which have often been reported as the current bottleneck of the approach. To address this problem, we present FireProtDB, a novel database of experimental thermostability data for single-point mutants. The database combines published datasets, data extracted manually from the recent literature, and data collected in our laboratory. Its user interface is designed to facilitate both expected types of use: (i) the interactive exploration of individual entries at the level of a protein or mutation and (ii) the construction of highly customized and machine learning-friendly datasets using advanced searching and filtering. The database is freely available at https://loschmidt.chemi.muni.cz/fireprotdb.
- MeSH
- Molecular Sequence Annotation MeSH
- Point Mutation * MeSH
- Databases, Protein * MeSH
- Datasets as Topic MeSH
- Internet MeSH
- Models, Molecular MeSH
- Proteins chemistry genetics MeSH
- Software MeSH
- Protein Stability MeSH
- Machine Learning statistics & numerical data MeSH
- Computational Biology methods MeSH
- Publication type
- Journal Article MeSH
- Research Support, Non-U.S. Gov't MeSH