Molecular identification of micro- and macroorganisms based on nuclear markers has revolutionized our understanding of their taxonomy, phylogeny and ecology. Today, research on the diversity of eukaryotes in global ecosystems heavily relies on nuclear ribosomal RNA (rRNA) markers. Here, we present the research community-curated reference database EUKARYOME for nuclear ribosomal 18S rRNA, internal transcribed spacer (ITS) and 28S rRNA markers for all eukaryotes, including metazoans (animals), protists, fungi and plants. It is particularly useful for the identification of arbuscular mycorrhizal fungi as it bridges the four commonly used molecular markers: the ITS1, ITS2, 18S V4-V5 and 28S D1-D2 subregions. The key benefits of this database over other annotated reference sequence databases are that it is not restricted to certain taxonomic groups and it includes all rRNA markers. EUKARYOME also offers a number of reference long-read sequences that are derived from (meta)genomic and (meta)barcoding data, a unique feature that can be used for taxonomic identification and chimera control of third-generation, long-read, high-throughput sequencing data. Taxonomic assignments of rRNA genes in the database are verified based on phylogenetic approaches. The reference datasets are available in multiple formats from the project homepage, http://www.eukaryome.org.
- MeSH
- Databases, Genetic MeSH
- Databases, Nucleic Acid MeSH
- Eukaryota * genetics MeSH
- Phylogeny MeSH
- Genes, rRNA genetics MeSH
- RNA, Ribosomal, 18S genetics MeSH
- Animals MeSH
- Check Tag
- Animals MeSH
- Publication type
- Journal Article MeSH
Genetic variation occurring within conserved functional protein domains warrants special attention when examining DNA variation in the context of disease causation. Here we introduce a resource, freely available at www.prot2hg.com, that addresses the question of whether a particular variant falls onto an annotated protein domain and directly translates chromosomal coordinates onto protein residues. The tool can perform a multiple-site query in a simple way, and the whole dataset is available for download as well as incorporated into our own accessible pipeline. To create this resource, National Center for Biotechnology Information protein data were retrieved using the Entrez Programming Utilities. After processing all human protein domains, residue positions were reverse translated and mapped to the reference genome hg19 and stored in a MySQL database. In total, 760 487 protein domains from 42 371 protein models were mapped to hg19 coordinates and made publicly available for search or download (www.prot2hg.com). In addition, this annotation was implemented into the genomics research platform GENESIS in order to query nearly 8000 exomes and genomes of families with rare Mendelian disorders (tgp-foundation.org). When applied to patient genetic data, we found that rare (<1%) variants in the Genome Aggregation Database were significantly more often annotated onto a protein domain than common (>1%) variants. Similarly, variants described as pathogenic or likely pathogenic in ClinVar were more likely to be annotated onto a domain. In addition, we tested a dataset consisting of 60 causal variants in a cohort of patients with epileptic encephalopathy and found that 71% of them (43 variants) were propagated onto protein domains. In summary, we developed a resource that annotates variants in the coding part of the genome onto conserved protein domains in order to increase variant prioritization efficiency. Database URL: www.prot2hg.com.
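The mapping of protein residues back onto chromosomal coordinates described above can be illustrated with a minimal Python sketch. It is not the pipeline behind www.prot2hg.com: the function names, the exon representation (1-based, inclusive coding segments) and the toy coordinates mentioned afterwards are all assumptions made for illustration, and real transcripts would also need handling of UTRs, incomplete codons and transcript versions.

```python
from typing import List, Tuple

Exon = Tuple[int, int]  # 1-based, inclusive genomic start/end of one coding segment (assumed layout)


def cds_positions(cds_exons: List[Exon], strand: str) -> List[int]:
    """List every coding genomic position in transcript (5'->3') order."""
    if strand not in ("+", "-"):
        raise ValueError("strand must be '+' or '-'")
    # On the minus strand the transcript runs from high to low coordinates.
    ordered = sorted(cds_exons, reverse=(strand == "-"))
    positions: List[int] = []
    for start, end in ordered:
        if strand == "+":
            positions.extend(range(start, end + 1))
        else:
            positions.extend(range(end, start - 1, -1))
    return positions


def residue_to_genomic(cds_exons: List[Exon], strand: str, residue: int) -> List[int]:
    """Return the three genomic positions of the codon for a 1-based residue."""
    positions = cds_positions(cds_exons, strand)
    first = (residue - 1) * 3
    if residue < 1 or first + 3 > len(positions):
        raise ValueError("residue lies outside the coding sequence")
    return positions[first:first + 3]


def domain_to_intervals(cds_exons: List[Exon], strand: str,
                        dom_start: int, dom_end: int) -> List[Exon]:
    """Map a domain given as a residue range onto sorted genomic intervals."""
    positions = cds_positions(cds_exons, strand)
    if dom_start < 1 or dom_end < dom_start or dom_end * 3 > len(positions):
        raise ValueError("domain lies outside the coding sequence")
    covered = sorted(positions[(dom_start - 1) * 3:dom_end * 3])
    intervals: List[List[int]] = []
    for pos in covered:
        if intervals and pos == intervals[-1][1] + 1:
            intervals[-1][1] = pos  # extend the current contiguous interval
        else:
            intervals.append([pos, pos])  # a gap (intron) starts a new interval
    return [(start, end) for start, end in intervals]


def variant_in_domain(pos: int, intervals: List[Exon]) -> bool:
    """Check whether a genomic position falls inside any domain interval."""
    return any(start <= pos <= end for start, end in intervals)
```

With a hypothetical two-exon coding sequence `[(100, 104), (200, 206)]` on the plus strand, residue 2 is encoded by a codon that spans the intron, which is why a domain is mapped to a list of intervals rather than to a single start and end.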
- MeSH
- Molecular Sequence Annotation methods MeSH
- Data Mining methods MeSH
- Databases, Genetic * MeSH
- Data Curation methods MeSH
- Genetic Variation * MeSH
- Genome, Human genetics MeSH
- Genomics methods MeSH
- Internet MeSH
- Humans MeSH
- Protein Domains genetics MeSH
- Proteins chemistry genetics metabolism MeSH
- Computational Biology methods MeSH
- Check Tag
- Humans MeSH
- Publication type
- Journal Article MeSH
- Research Support, Non-U.S. Gov't MeSH
BACKGROUND: Rapid, accurate and high-throughput identification of vector arthropods is of paramount importance in surveillance programmes that are becoming more common due to the changing geographic occurrence and extent of many arthropod-borne diseases. Protein profiling by MALDI-TOF mass spectrometry fulfils these requirements for identification, and reference databases have recently been established for several vector taxa, mostly with specimens from laboratory colonies. METHODS: We established and validated a reference database containing 20 phlebotomine sand fly (Diptera: Psychodidae, Phlebotominae) species by using specimens from colonies or field-collections that had been stored for various periods of time. RESULTS: Identical biomarker mass patterns ('superspectra') were obtained with colony- or field-derived specimens of the same species. In the validation study, high quality spectra (i.e. more than 30 evaluable masses) were obtained with all fresh insects from colonies, and with 55/59 insects deep-frozen (liquid nitrogen/-80 °C) for up to 25 years. In contrast, only 36/52 specimens stored in ethanol could be identified. This resulted in an overall sensitivity of 87 % (140/161); specificity was 100 %. Duration of storage impaired data counts in the high mass range, and thus cluster analyses of closely related specimens might reflect their storage conditions rather than phenotypic distinctness. A major drawback of MALDI-TOF MS is the restricted availability of in-house databases and the fact that mass spectrometers from 2 companies (Bruker, Shimadzu) are widely used. We have analysed fingerprints of phlebotomine sand flies obtained by an automatic routine procedure on a Bruker instrument by using our database and the software established on a Shimadzu system. The sensitivity with 312 specimens from 8 sand fly species from laboratory colonies when evaluating only high quality spectra was 98.3 %; the specificity was 100 %. The corresponding diagnostic values with 55 field-collected specimens from 4 species were 94.7 % and 97.4 %, respectively. CONCLUSIONS: A centralized high-quality database (created by expert taxonomists and experienced users of mass spectrometers) that is easily amenable to customer-oriented identification services is a highly desirable resource. As shown in the present work, spectra obtained from different specimens with different instruments can be analysed using a centralized database, which should be available in the near future via an online platform in a cost-efficient manner.
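The sensitivity figures quoted above follow from simple proportions, which the short Python sketch below reproduces. It assumes that a specimen without an identification counts as a false negative; the abstract does not report the counts behind the specificity values, so `specificity` is shown only as a general helper and is not applied to any data here.

```python
def sensitivity(true_positives: int, false_negatives: int) -> float:
    """Fraction of specimens of a species that were identified as that species."""
    total = true_positives + false_negatives
    if total == 0:
        raise ValueError("no specimens to evaluate")
    return true_positives / total


def specificity(true_negatives: int, false_positives: int) -> float:
    """Fraction of non-members that were correctly not assigned to the species."""
    total = true_negatives + false_positives
    if total == 0:
        raise ValueError("no specimens to evaluate")
    return true_negatives / total


# Counts taken from the abstract; unidentified specimens are treated
# as false negatives (an assumption about how the authors counted).
deep_frozen = sensitivity(55, 59 - 55)
ethanol = sensitivity(36, 52 - 36)
overall = sensitivity(140, 161 - 140)

print(f"deep-frozen: {deep_frozen:.1%}")
print(f"ethanol:     {ethanol:.1%}")
print(f"overall:     {overall:.1%}")
```

Splitting the calculation by storage condition makes the effect of ethanol storage on identification success directly visible next to the overall value.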
- MeSH
- Entomology methods MeSH
- Insect Proteins analysis MeSH
- Molecular Sequence Data MeSH
- Psychodidae chemistry classification MeSH
- Electron Transport Complex IV genetics MeSH
- Sequence Analysis, DNA MeSH
- Sensitivity and Specificity MeSH
- Spectrometry, Mass, Matrix-Assisted Laser Desorption-Ionization methods MeSH
- Temperature MeSH
- Animals MeSH
- Check Tag
- Animals MeSH
- Publication type
- Journal Article MeSH
- Research Support, Non-U.S. Gov't MeSH
- Validation Study MeSH
IRESite is an exhaustive, manually annotated non-redundant relational database focused on the IRES elements (Internal Ribosome Entry Site) and containing information not available in the primary public databases. IRES elements were originally found in eukaryotic viruses hijacking initiation of translation of their host. Later on, they were also discovered in 5'-untranslated regions of some eukaryotic mRNA molecules. Currently, IRESite presents up to 92 biologically relevant aspects of every experiment, e.g. the nature of an IRES element, its functionality/defectivity, origin, size, sequence, structure, its relative position with respect to surrounding protein coding regions, positive/negative controls used in the experiment, the reporter genes used to monitor IRES activity, the measured reporter protein yields/activities, and references to original publications as well as cross-references to other databases, and also comments from submitters and our curators. Furthermore, the site presents the known similarities to rRNA sequences as well as RNA-protein interactions. Special care is given to the annotation of promoter-like regions. The annotated data in IRESite are bound to mostly complete, full-length mRNA, and whenever possible, accompanied by original plasmid vector sequences. New data can be submitted through the publicly available web-based interface at http://www.iresite.org and are curated by a team of lab-experienced biologists.
- MeSH
- Databases, Nucleic Acid MeSH
- Financing, Organized MeSH
- Peptide Chain Initiation, Translational MeSH
- Peptide Initiation Factors metabolism MeSH
- Internet MeSH
- RNA, Messenger chemistry MeSH
- Untranslated Regions chemistry MeSH
- Plasmids chemistry MeSH
- Promoter Regions, Genetic MeSH
- Regulatory Sequences, Ribonucleic Acid MeSH
- RNA, Viral chemistry MeSH
- User-Computer Interface MeSH
Twelve Y-chromosomal short tandem repeats (Y-STR) (DYS19, DYS389I, DYS389II, DYS390, DYS391, DYS392, DYS393, DYS385a, DYS385b, DYS437, DYS438, and DYS439) included in the PowerPlex Y Kit (Promega Corporation, Madison, USA) were studied for 1750 unrelated males living in 14 regions of the Czech Republic. A total of 1148 different haplotypes were found. The overall haplotype diversity (HD) was determined as 0.998. Analysis of Molecular Variance (AMOVA) reveals non-significant distances between regions concerning their haplotype distribution, thus allowing the whole sample to be used as a representative reference database of the Czech Republic. Median network analysis shows a remarkable bipartite composition of the Czech haplotypes, falling in distinct clusters with Eastern and Western European roots.
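Haplotype diversity is commonly computed with Nei's estimator, HD = n/(n-1) * (1 - sum of squared haplotype frequencies). The abstract does not state which formula was used, so the Python sketch below is an assumption about the method, and the function name is hypothetical. Each haplotype can be any hashable value, for example a tuple of the twelve allele calls.

```python
from collections import Counter
from typing import Hashable, Iterable


def haplotype_diversity(haplotypes: Iterable[Hashable]) -> float:
    """Nei's unbiased haplotype diversity for a sample of haplotypes."""
    counts = Counter(haplotypes)
    n = sum(counts.values())
    if n < 2:
        raise ValueError("at least two haplotypes are required")
    # Sum of squared relative frequencies of the distinct haplotypes.
    sum_sq = sum((count / n) ** 2 for count in counts.values())
    return n / (n - 1) * (1.0 - sum_sq)
```

A sample in which every individual carries a different haplotype gives a value of 1, and a sample with a single shared haplotype gives 0, so a value close to 1 indicates that two randomly drawn males rarely share a haplotype.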
- MeSH
- Databases, Nucleic Acid MeSH
- DNA Fingerprinting MeSH
- Haplotypes MeSH
- Humans MeSH
- Chromosomes, Human, Y MeSH
- Polymerase Chain Reaction MeSH
- Genetics, Population MeSH
- Tandem Repeat Sequences MeSH
- Check Tag
- Humans MeSH
- Male MeSH
- Publication type
- Journal Article MeSH
- Geographicals
- Czech Republic MeSH
Following the discovery of serious errors in the structure of biomacromolecules, structure validation has become a key topic of research, especially for ligands and non-standard residues. ValidatorDB (freely available at http://ncbr.muni.cz/ValidatorDB) offers a new step in this direction, in the form of a database of validation results for all ligands and non-standard residues from the Protein Data Bank (all molecules with seven or more heavy atoms). Model molecules from the wwPDB Chemical Component Dictionary are used as reference during validation. ValidatorDB covers the main aspects of validation of annotation, and additionally introduces several useful validation analyses. The most significant is the classification of chirality errors, allowing the user to distinguish between serious issues and minor inconsistencies. Other such analyses are able to report, for example, completely erroneous ligands, alternate conformations or complete identity with the model molecules. All results are systematically classified into categories, and statistical evaluations are performed. In addition to detailed validation reports for each molecule, ValidatorDB provides summaries of the validation results for the entire PDB, for sets of molecules sharing the same annotation (three-letter code) or the same PDB entry, and for user-defined selections of annotations or PDB entries.
- MeSH
- Amino Acids chemistry MeSH
- Molecular Sequence Annotation MeSH
- Databases, Protein * MeSH
- Internet MeSH
- Protein Conformation MeSH
- Ligands MeSH
- Models, Molecular MeSH
- Proteins chemistry MeSH
- Reproducibility of Results MeSH
- Publication type
- Journal Article MeSH
- Research Support, Non-U.S. Gov't MeSH
All of the more than 3000 species of Agrilus beetles are phytophagous and some cause economically significant damage to trees and shrubs. Facilitated by international trade, Agrilus species regularly invade new countries and continents. This necessitates a rapid identification of Agrilus species, as the first step for subsequent protective measures. This study provides the first DNA reference library for ~100 Agrilus species from the Northern Hemisphere based on three mitochondrial markers: cox1-5' (DNA barcode fragment), cox1-3', and rrnL. All 329 Agrilus records available in the Barcode of Life Database format, including specimen images and geo data, are released through a public dataset 'Agrilus1 329' available at: dx.doi.org/10.5883/DS-AGRILUS1. All Agrilus species were identified using adult morphology and by using molecular phylogenetic trees, as well as distance- and tree-based algorithms. Most DNA-based species limits agree well with the morphology-based identification. Our results include cases of high intraspecific variability and multiple species para- and polyphyly. DNA barcoding is a powerful species identification tool in Agrilus, although it frequently fails to recover morphologically delimited Agrilus species-groups. Even though the current three-gene database covers only ~3% of the known Agrilus diversity, it contains representatives of all principal lineages from the Northern Hemisphere and represents the most extensive dataset built for DNA-delimited species identification within this genus so far. Molecular data analyses can rapidly and cost-effectively identify an unknown sample, including immature stages and/or non-native taxa, or species not yet formally named.
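Distance-based identification against a barcode reference library can be sketched in Python as a nearest-neighbour search on uncorrected p-distances. This is a simplified illustration, not the algorithms used in the study: it assumes the sequences are already aligned to equal length, skips any site where either sequence has a gap or an ambiguous base, and uses hypothetical function names.

```python
from typing import Dict, Tuple

VALID_BASES = set("ACGT")


def p_distance(seq_a: str, seq_b: str) -> float:
    """Uncorrected p-distance between two aligned sequences of equal length."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    differences = 0
    for base_a, base_b in zip(seq_a.upper(), seq_b.upper()):
        # Only sites with an unambiguous base in both sequences are compared.
        if base_a in VALID_BASES and base_b in VALID_BASES:
            compared += 1
            if base_a != base_b:
                differences += 1
    if compared == 0:
        raise ValueError("no comparable sites between the two sequences")
    return differences / compared


def best_match(query: str, library: Dict[str, str]) -> Tuple[str, float]:
    """Return the reference record closest to the query and its p-distance."""
    if not library:
        raise ValueError("reference library is empty")
    scored = ((name, p_distance(query, seq)) for name, seq in library.items())
    return min(scored, key=lambda item: item[1])
```

A nearest match alone does not settle an identification: with high intraspecific variability or para- and polyphyletic species, as reported above, the distance to the best hit should be judged against the spread of distances within and between reference species.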
- MeSH
- Coleoptera genetics MeSH
- Phylogeny * MeSH
- Forestry MeSH
- DNA, Mitochondrial MeSH
- DNA Barcoding, Taxonomic * MeSH
- Animals MeSH
- Check Tag
- Animals MeSH
- Publication type
- Journal Article MeSH
Throughout the years, DNA barcoding has gained in importance in forensic entomology as it leads to fast and reliable species determination. High-quality results, however, can only be achieved with a comprehensive DNA barcode reference database at hand. In collaboration with the Bavarian State Criminal Police Office, we have initiated at the Bavarian State Collection of Zoology the establishment of a reference library containing arthropods of potential forensic relevance to be used for DNA barcoding applications. CO1-5P' DNA barcode sequences of hundreds of arthropods were obtained via DNA extraction, PCR and Sanger Sequencing, leading to the establishment of a database containing 502 high-quality sequences which provide coverage for 88 arthropod species. Furthermore, we demonstrate an application example of this library using it as a backbone to a high throughput sequencing analysis of arthropod bulk samples collected from human corpses, which enabled the identification of 31 different arthropod Barcode Index Numbers.
- MeSH
- Arthropods genetics MeSH
- Databases, Nucleic Acid * MeSH
- Entomology MeSH
- Polymerase Chain Reaction MeSH
- Electron Transport Complex IV genetics MeSH
- Sequence Analysis, DNA MeSH
- Forensic Sciences * MeSH
- DNA Barcoding, Taxonomic * MeSH
- High-Throughput Nucleotide Sequencing MeSH
- Animals MeSH
- Check Tag
- Animals MeSH
- Publication type
- Journal Article MeSH
Taxonomic and functional research of microorganisms has increasingly relied upon genome-based data and methods. As the depository of the Global Catalogue of Microorganisms (GCM) 10K prokaryotic type strain sequencing project, Global Catalogue of Type Strain (gcType) has published 1049 type strain genomes sequenced by the GCM 10K project; the strains are preserved in global culture collections and have a valid published status. Additionally, the information provided through gcType includes >12 000 publicly available type strain genome sequences from GenBank incorporated using quality control criteria and standard data annotation pipelines to form a high-quality reference database. This database integrates type strain sequences with their phenotypic information to facilitate phenotypic and genotypic analyses. Multiple formats of cross-genome searches and interactive interfaces have allowed extensive exploration of the database's resources. In this study, we describe web-based data analysis pipelines for genomic analyses and genome-based taxonomy, which could serve as a one-stop platform for the identification of prokaryotic species. The number of type strain genomes that are published will continue to increase as the GCM 10K project increases its collaboration with culture collections worldwide. Data from this project are shared with the International Nucleotide Sequence Database Collaboration. Access to gcType is free at http://gctype.wdcm.org/.
This dataset presents comprehensive and easy-to-use information on 29 functional traits of clonal growth, bud banks, and lifespan of members of the Central European flora. The source data were compiled from a number of published sources (see the reference file) and the authors' own observations or studies. In total, 2,909 species are included (2,745 herbs and 164 woody species), out of which 1,532 (i.e., 52.7% of total) are classified as possessing clonal growth organs (1,480, i.e., 53.9%, if woody plants are excluded). This provides a unique, and largely unexplored, set of traits of clonal growth that can be used in studies on comparative plant ecology, plant evolution, community assembly, and ecosystem functioning across the large flora of Central Europe. It can be directly imported into a number of programs and packages that perform trait-based and phylogenetic analyses aimed to answer a variety of open and pressing ecological questions.