Time-resolved X-ray crystallography experiments were first performed in the 1980s, yet they remained a niche technique for decades. With the recent advent of X-ray free electron laser (XFEL) sources and serial crystallographic techniques, time-resolved crystallography has received renewed interest and has become accessible to a wider user base. Despite this, time-resolved structures represent <1% of models deposited in the Worldwide Protein Data Bank, indicating that the tools and techniques currently available require further development before such experiments can become truly routine. In this chapter, we demonstrate how applying data multiplexing to time-resolved crystallography can enhance the achievable time resolution at moderately intense monochromatic X-ray sources, ranging from synchrotrons to bench-top sources. We discuss the principles of multiplexing, where this technique may be advantageous, potential pitfalls, and experimental design considerations.
We present a novel system that uses curators in the loop to develop a dataset and model for detecting structure features and functional annotations at the residue level from standard publication text. Our approach integrates data from multiple resources, including PDBe, Europe PMC, PubMed Central, and PubMed, combined with annotation guidelines from UniProt, and uses LitSuggest and HuggingFace models as tools in the annotation process. A team of seven annotators manually curated ten articles for named entities, which we used to train a starting PubMedBERT model from HuggingFace. Using a human-in-the-loop annotation system, we iteratively developed the best model, which achieved 0.90 precision, 0.92 recall, and an F1-measure of 0.91. Our proposed system showcases a successful synergy of machine learning techniques and human expertise in curating a dataset for residue-level functional annotations and protein structure features. The results demonstrate the potential for broader applications in protein research, bridging the gap between advanced machine learning models and the indispensable insights of domain experts.
- MeSH
- Databases, Protein MeSH
- Humans MeSH
- Proteins * chemistry MeSH
- Machine Learning * MeSH
- Check Tag
- Humans MeSH
- Publication type
- Journal Article MeSH
- Dataset MeSH
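As a quick consistency check on the metrics reported for the human-in-the-loop model above, and assuming the F1-measure is the plain harmonic mean of the reported precision P and recall R (the abstract does not state the averaging scheme, so this is an assumption), the rounded values agree with one another:

$$
F_1 = \frac{2PR}{P + R} = \frac{2 \times 0.90 \times 0.92}{0.90 + 0.92} = \frac{1.656}{1.82} \approx 0.91
$$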
Recent advances in AI-based methods have revolutionized the field of structural biology. Concomitantly, high-throughput sequencing and functional genomics have generated genetic variants at an unprecedented scale. However, efficient tools and resources are needed to link disparate data types, that is, to 'map' variants onto protein structures, to better understand how the variation causes disease, and thereby to design therapeutics. Here we present the Genomics 2 Proteins portal (https://g2p.broadinstitute.org/): a human proteome-wide resource that maps 20,076,998 genetic variants onto 42,413 protein sequences and 77,923 structures, with a comprehensive set of structural and functional features. Additionally, the Genomics 2 Proteins portal allows users to interactively upload protein residue-wise annotations (for example, variants and scores) as well as protein structures from outside the databases, to establish the connection between genomics and proteins. The portal serves as an easy-to-use discovery tool for researchers and scientists to hypothesize the structure-function relationship between natural or synthetic variations and their molecular phenotypes.
Molecular identification of micro- and macroorganisms based on nuclear markers has revolutionized our understanding of their taxonomy, phylogeny and ecology. Today, research on the diversity of eukaryotes in global ecosystems heavily relies on nuclear ribosomal RNA (rRNA) markers. Here, we present the research community-curated reference database EUKARYOME for nuclear ribosomal 18S rRNA, internal transcribed spacer (ITS) and 28S rRNA markers for all eukaryotes, including metazoans (animals), protists, fungi and plants. It is particularly useful for the identification of arbuscular mycorrhizal fungi as it bridges the four commonly used molecular markers: the ITS1, ITS2, 18S V4-V5 and 28S D1-D2 subregions. The key benefits of this database over other annotated reference sequence databases are that it is not restricted to certain taxonomic groups and that it includes all rRNA markers. EUKARYOME also offers a number of reference long-read sequences that are derived from (meta)genomic and (meta)barcoding data, a unique feature that can be used for taxonomic identification and chimera control of third-generation, long-read, high-throughput sequencing data. Taxonomic assignments of rRNA genes in the database are verified based on phylogenetic approaches. The reference datasets are available in multiple formats from the project homepage, http://www.eukaryome.org.
- MeSH
- Databases, Genetic MeSH
- Databases, Nucleic Acid MeSH
- Eukaryota * genetics MeSH
- Phylogeny MeSH
- Genes, rRNA genetics MeSH
- RNA, Ribosomal, 18S genetics MeSH
- Animals MeSH
- Check Tag
- Animals MeSH
- Publication type
- Journal Article MeSH
Mass spectrometry proteomics data are typically evaluated against publicly available annotated sequences, but the proteogenomics approach is a useful alternative. A single genome is commonly utilized in custom proteomic and proteogenomic data analysis. We pose the question of whether utilizing numerous different genome assemblies in a search database would be beneficial. We reanalyzed raw data from the exoprotein fraction of four reference Enterobacterial Repetitive Intergenic Consensus (ERIC) I-IV genotypes of the honey bee bacterial pathogen Paenibacillus larvae and evaluated them against three reference databases (from NCBI-protein, RefSeq, and UniProt) together with an array of protein sequences generated by six-frame direct translation of 15 genome assemblies from GenBank. The wide search yielded 453 protein hits/groups, which UpSet analysis categorized into 50 groups based on the success of protein identification by the 18 database components. Nine hits that were not identified by a unique peptide were not considered for marker selection, which discarded the only protein that was not identified by the reference databases. We propose that the variability in successful identifications between genome assemblies is useful for marker mining. The results suggest that various strains of P. larvae can exhibit specific traits that set them apart from the established genotypes ERIC I-V.
- MeSH
- Bacterial Proteins * genetics metabolism MeSH
- Databases, Protein MeSH
- Virulence Factors * genetics metabolism MeSH
- Genome, Bacterial * genetics MeSH
- Paenibacillus larvae * genetics pathogenicity metabolism MeSH
- Proteogenomics * methods MeSH
- Proteomics methods MeSH
- Bees microbiology MeSH
- Animals MeSH
- Check Tag
- Animals MeSH
- Publication type
- Journal Article MeSH
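The six-frame direct translation mentioned in the Paenibacillus larvae abstract above can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the function names are hypothetical, the standard genetic code is assumed (its codon-to-amino-acid assignments match the bacterial table, which differs only in start codons), and stop codons are kept as `*` rather than being used to split the output into open reading frames.

```python
from itertools import product
from typing import Dict

# Standard genetic code, codons enumerated in TCAG order (assumption: table 1).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE: Dict[str, str] = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO)
}
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an uppercase DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def translate(seq: str) -> str:
    """Translate complete codons only; unknown codons become 'X'."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def six_frame_translation(seq: str) -> Dict[str, str]:
    """Translate three offsets on the forward strand and three on the reverse."""
    seq = seq.upper()
    frames: Dict[str, str] = {}
    for strand, strand_seq in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(strand_seq[offset:])
    return frames


if __name__ == "__main__":
    for frame, protein in six_frame_translation("ATGGCCTAA").items():
        print(frame, protein)
```

In a proteogenomic search database, each translated frame would typically be split at stop codons and filtered by a minimum length before being written to FASTA; those steps are omitted here for brevity.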
A single protein structure is rarely sufficient to capture the conformational variability of a protein. Both bound and unbound (holo and apo) forms of a protein are essential for understanding its geometry and making meaningful comparisons. Nevertheless, docking or drug design studies often still consider only single protein structures in their holo form, which are for the most part rigid. With the recent explosion in the field of structural biology, large, curated datasets are urgently needed. Here, we use a previously developed application (AHoJ) to perform a comprehensive search for apo-holo pairs for 468,293 biologically relevant protein-ligand interactions across 27,983 proteins. In each search, the binding pocket is captured and mapped across existing structures within the same UniProt entry, and the mapped pockets are annotated as apo or holo, based on the presence or absence of ligands. We assemble the results into a database, AHoJ-DB (www.apoholo.cz/db), that captures the variability of proteins with identical sequences, thereby exposing the agents responsible for the observed differences in geometry. We report several metrics for each annotated pocket, and we also include binding pockets that form at the interface of multiple chains. Analysis of the database shows that about 24% of the binding sites occur at the interface of two or more chains and that less than 50% of the total binding sites processed have an apo form in the PDB. These results can be used to train and evaluate predictors, discover potentially druggable proteins, and reveal protein- and ligand-specific relationships that were previously obscured by intermittent or partial data. Availability: www.apoholo.cz/db.
- MeSH
- Apoproteins chemistry metabolism MeSH
- Databases, Protein * MeSH
- Protein Conformation * MeSH
- Humans MeSH
- Ligands MeSH
- Models, Molecular MeSH
- Proteins * chemistry metabolism MeSH
- Protein Binding * MeSH
- Binding Sites MeSH
- Computational Biology methods MeSH
- Check Tag
- Humans MeSH
- Publication type
- Journal Article MeSH
- MeSH
- Anthropology methods MeSH
- Databases, Nucleic Acid MeSH
- Genome, Human genetics MeSH
- Humans MeSH
- Metadata MeSH
- DNA, Mitochondrial analysis genetics MeSH
- DNA, Ancient * analysis isolation & purification MeSH
- Human Development MeSH
- Check Tag
- Humans MeSH
- Publication type
- Review MeSH
- MeSH
- Databases, Genetic classification MeSH
- Heredity genetics MeSH
- DNA * analysis genetics MeSH
- Genetic Testing methods MeSH
- Humans MeSH
- Genetic Privacy ethics MeSH
- Consanguinity * MeSH
- Whole Genome Sequencing methods MeSH
- Criminals MeSH
- Check Tag
- Humans MeSH
- Publication type
- Review MeSH
- Keywords
- AlphaFold,
- MeSH
- Databases, Protein MeSH
- Protein Conformation * MeSH