BACKGROUND: The advancement of sequencing technologies results in the rapid release of hundreds of new genome assemblies a year, providing unprecedented resources for the study of genome evolution. Within this context, the significance of in-depth analyses of repetitive elements, transposable elements (TEs) in particular, is increasingly recognized in understanding genome evolution. Despite the plethora of available bioinformatic tools for identifying and annotating TEs, the phylogenetic distance of the target species from a curated and classified database of repetitive element sequences constrains any automated annotation effort. Moreover, manual curation of raw repeat libraries is deemed essential due to the frequent incompleteness of automatically generated consensus sequences. RESULTS: Here, we present an example of a crowd-sourcing effort aimed at curating and annotating TE libraries of two non-model species built around a collaborative, peer-reviewed teaching process. Manual curation and classification are time-consuming processes that offer limited short-term academic rewards and are typically confined to a few research groups where methods are taught through hands-on experience. Crowd-sourcing efforts could therefore offer a significant opportunity to bridge the gap between learning the methods of curation effectively and empowering the scientific community with high-quality, reusable repeat libraries. CONCLUSIONS: The collaborative manual curation of TEs from two tardigrade species, for which there were no TE libraries available, resulted in the successful characterization of hundreds of new and diverse TEs in a reasonable time frame. Our crowd-sourcing setting can be used as a teaching reference guide for similar projects: a hidden treasure awaits discovery within non-model organisms.
- Publication type
- Journal Article MeSH
The majority of naturally occurring proteins have evolved to function under mild conditions inside living organisms. One of the critical obstacles to the use of proteins in biotechnological applications is their insufficient stability at elevated temperatures or in the presence of salts. Since experimental screening for stabilizing mutations is typically laborious and expensive, in silico predictors are often used for narrowing down the mutational landscape. The recent advances in machine learning and artificial intelligence further facilitate the development of such computational tools. However, the accuracy of these predictors strongly depends on the quality and amount of data used for training and testing, which have often been reported as the current bottleneck of the approach. To address this problem, we present FireProtDB, a novel database of experimental thermostability data for single-point mutants. The database combines the published datasets, data extracted manually from the recent literature, and the data collected in our laboratory. Its user interface is designed to facilitate both expected types of use: (i) the interactive exploration of individual entries on the level of a protein or mutation and (ii) the construction of highly customized and machine learning-friendly datasets using advanced searching and filtering. The database is freely available at https://loschmidt.chemi.muni.cz/fireprotdb.
- MeSH
- Molecular Sequence Annotation MeSH
- Point Mutation * MeSH
- Databases, Protein * MeSH
- Datasets as Topic MeSH
- Internet MeSH
- Models, Molecular MeSH
- Proteins chemistry genetics MeSH
- Software MeSH
- Protein Stability MeSH
- Machine Learning statistics & numerical data MeSH
- Computational Biology methods MeSH
- Publication type
- Journal Article MeSH
- Research Support, Non-U.S. Gov't MeSH
Light detection in animals is predominantly based on the photopigment composed of a protein moiety, the opsin, and the chromophore retinal. Animal opsins originated very early in metazoan evolution from within the G-Protein Coupled Receptor (GPCR) gene superfamily and diversified into several distinct branches prior to the cnidarian-bilaterian split. The origin of opsin diversity, opsin classification and interfamily relationships have been a matter of long-standing debate. Comparative studies of opsins from various Metazoa provide key insight into the evolutionary history of opsins and visual perception in animals. Here, we have analyzed the genome assembly of the cephalochordate Branchiostoma lanceolatum, applying BLAST, gene prediction tools and manual curation in order to predict de novo its complete opsin repertoire. We investigated the structure of predicted opsin genes, encoded proteins, their phylogenetic placement, and expression. We identified a total of 22 opsin genes in B. lanceolatum, of which 21 are expressed and the remaining one appears to be a pseudogene. According to our phylogenetic analysis, representatives from the three major opsin groups, namely the C-type, the R-type and Group 4, can be identified in B. lanceolatum. Most of the B. lanceolatum opsins exhibit a stage-specific, but not a tissue-specific, expression pattern. The large number of opsins detected in B. lanceolatum and the observed similarities and differences in sequence characteristics and expression patterns lead us to conclude that there may be a fine tuning in opsin utilization in order to facilitate visually-guided behavior of European amphioxus under various environmental settings.
- MeSH
- Photoreceptor Cells metabolism MeSH
- Phylogeny MeSH
- Genomics methods MeSH
- Lancelets genetics MeSH
- Evolution, Molecular MeSH
- Multigene Family * MeSH
- Opsins classification genetics MeSH
- Gene Expression Profiling MeSH
- Animals MeSH
- Check Tag
- Animals MeSH
- Publication type
- Journal Article MeSH
- Research Support, Non-U.S. Gov't MeSH
A phylogenetic tree at the species level is still far off for highly diverse insect orders, including the Coleoptera, but the taxonomic breadth of public sequence databases is growing. In addition, new types of data may contribute to increasing taxon coverage, such as metagenomic shotgun sequencing for assembly of mitogenomes from bulk specimen samples. The current study explores the application of these techniques for large-scale efforts to build the tree of Coleoptera. We used shotgun data from 17 different ecological and taxonomic datasets (5 unpublished) to assemble a total of 1942 mitogenome contigs of >3000 bp. These sequences were combined into a single dataset together with all mitochondrial data available at GenBank, in addition to nuclear markers widely used in molecular phylogenetics. The resulting matrix of nearly 16,000 species with two or more loci produced trees (RAxML) showing overall congruence with the Linnaean taxonomy at hierarchical levels from suborders to genera. We tested the role of full-length mitogenomes in stabilizing the tree from GenBank data, as mitogenomes might link terminals with non-overlapping gene representation. However, the mitogenome data were only partly useful in this respect, presumably because of the purely automated approach to assembly and gene delimitation; future improvements may be possible by using multiple assemblers and manual curation. In conclusion, the combination of data mining and metagenomic sequencing of bulk samples provided the largest phylogenetic tree of Coleoptera to date, which represents a summary of existing phylogenetic knowledge and a defensible tree of great utility, in particular for studies at the intra-familial level, despite some shortcomings for resolving basal nodes.
- MeSH
- Algorithms MeSH
- Coleoptera classification genetics MeSH
- Databases, Genetic MeSH
- Phylogeny * MeSH
- Metagenomics * MeSH
- Mitochondria genetics MeSH
- Base Sequence MeSH
- Animals MeSH
- Check Tag
- Animals MeSH
- Publication type
- Journal Article MeSH
- Research Support, Non-U.S. Gov't MeSH
Motivation: Sanger sequencing is still being employed for sequence variant detection by many laboratories, especially in a clinical setting. However, chromatogram interpretation often requires manual inspection and in some cases, considerable expertise. Results: We present GLASS, a web-based Sanger sequence trace viewer, editor, aligner and variant caller, built to assist with the assessment of variations in 'curated' or user-provided genes. Critically, it produces a standardized variant output as recommended by the Human Genome Variation Society. Availability and implementation: GLASS is freely available at http://bat.infspire.org/genomepd/glass/ with source code at https://github.com/infspiredBAT/GLASS. Contact: nikos.darzentas@gmail.com or malcikova.jitka@fnbrno.cz. Supplementary information: Supplementary data are available at Bioinformatics online.
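To illustrate what a standardized variant output in the style recommended by the Human Genome Variation Society can look like, the following minimal Python sketch formats a single-nucleotide substitution at a coding-DNA position. It is not taken from GLASS: the function name `hgvs_substitution` and the accession `NM_000000.0` are hypothetical placeholders, and only the simplest case (a substitution at a positive coding position) is handled.

```python
def hgvs_substitution(reference: str, position: int, ref_base: str, alt_base: str) -> str:
    """Format a coding-DNA substitution as '<reference>:c.<position><ref>><alt>'.

    Illustrative only: covers single-base substitutions at positive
    coding positions, not UTR, intronic, or indel variants.
    """
    bases = {"A", "C", "G", "T"}
    if ref_base not in bases or alt_base not in bases:
        raise ValueError("bases must each be one of A, C, G, T")
    if ref_base == alt_base:
        raise ValueError("reference and alternate base are identical")
    if position < 1:
        raise ValueError("this sketch only handles positions >= 1")
    return f"{reference}:c.{position}{ref_base}>{alt_base}"


# 'NM_000000.0' is a made-up placeholder accession, not a real transcript.
print(hgvs_substitution("NM_000000.0", 76, "A", "T"))
```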
We present a novel system that leverages curators in the loop to develop a dataset and model for detecting structure features and functional annotations at residue-level from standard publication text. Our approach involves the integration of data from multiple resources, including PDBe, EuropePMC, PubMedCentral, and PubMed, combined with annotation guidelines from UniProt, and LitSuggest and HuggingFace models as tools in the annotation process. A team of seven annotators manually curated ten articles for named entities, which we utilized to train a starting PubmedBert model from HuggingFace. Using a human-in-the-loop annotation system, we iteratively developed the best model with commendable performance metrics of 0.90 for precision, 0.92 for recall, and 0.91 for F1-measure. Our proposed system showcases a successful synergy of machine learning techniques and human expertise in curating a dataset for residue-level functional annotations and protein structure features. The results demonstrate the potential for broader applications in protein research, bridging the gap between advanced machine learning models and the indispensable insights of domain experts.
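The reported F1-measure can be checked against the reported precision and recall, since F1 is the harmonic mean of the two. The short Python sketch below does that arithmetic; the function name `f1_score` is chosen here for illustration and is not taken from the authors' code.

```python
def f1_score(precision: float, recall: float) -> float:
    """Return the harmonic mean of precision and recall (the F1-measure)."""
    if precision + recall == 0:
        return 0.0  # convention when both inputs are zero
    return 2 * precision * recall / (precision + recall)


# Values reported in the abstract above.
print(round(f1_score(0.90, 0.92), 2))  # 0.91
```

Rounded to two decimals, the result agrees with the F1-measure of 0.91 stated in the abstract.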
- MeSH
- Databases, Protein MeSH
- Humans MeSH
- Proteins * chemistry MeSH
- Machine Learning * MeSH
- Check Tag
- Humans MeSH
- Publication type
- Journal Article MeSH
- Dataset MeSH
BACKGROUND: Insertion sequence elements (IS elements) are the smallest and most abundant mobile elements in prokaryotic genomes. It has been shown that they play a significant role in genome organization and evolution. To better understand their function in the host genome, it is desirable to have an effective detection and annotation tool. This need becomes even more crucial when considering rapidly growing genomic and metagenomic data. The existing tools for IS element detection and annotation are usually based on comparing sequence similarity with a database of known IS families. Thus, they have limited ability to discover distant and putative novel IS elements. RESULTS: In this paper, we present digIS, a software tool based on profile hidden Markov models assembled from catalytic domains of transposases. It performs very well in detecting known IS elements when tested on datasets with manually curated annotation. The main contribution of digIS is its ability to detect distant and putative novel IS elements while maintaining a moderate level of false positives. In this category it outperforms existing tools, especially when tested on large datasets of archaeal and bacterial genomes. CONCLUSION: We provide digIS, a software tool using a novel approach based on manually curated profile hidden Markov models, which is able to detect distant and putative novel IS elements. Although digIS can find known IS elements as well, we expect it to be used primarily by scientists interested in finding novel IS elements. The tool is available at https://github.com/janka2012/digIS.
- MeSH
- Genome, Bacterial genetics MeSH
- Genomics MeSH
- Humans MeSH
- Prokaryotic Cells * MeSH
- Software MeSH
- DNA Transposable Elements * genetics MeSH
- Check Tag
- Humans MeSH
- Publication type
- Journal Article MeSH
IRESite is an exhaustive, manually annotated non-redundant relational database focused on the IRES elements (Internal Ribosome Entry Site) and containing information not available in the primary public databases. IRES elements were originally found in eukaryotic viruses hijacking initiation of translation of their host. Later on, they were also discovered in 5'-untranslated regions of some eukaryotic mRNA molecules. Currently, IRESite presents up to 92 biologically relevant aspects of every experiment, e.g. the nature of an IRES element, its functionality/defectivity, origin, size, sequence, structure, its relative position with respect to surrounding protein coding regions, positive/negative controls used in the experiment, the reporter genes used to monitor IRES activity, the measured reporter protein yields/activities, and references to original publications as well as cross-references to other databases, and also comments from submitters and our curators. Furthermore, the site presents the known similarities to rRNA sequences as well as RNA-protein interactions. Special care is given to the annotation of promoter-like regions. The annotated data in IRESite are bound to mostly complete, full-length mRNA, and whenever possible, accompanied by original plasmid vector sequences. New data can be submitted through the publicly available web-based interface at http://www.iresite.org and are curated by a team of lab-experienced biologists.
- MeSH
- Databases, Nucleic Acid MeSH
- Financing, Organized MeSH
- Peptide Chain Initiation, Translational MeSH
- Peptide Initiation Factors metabolism MeSH
- Internet MeSH
- RNA, Messenger chemistry MeSH
- Untranslated Regions chemistry MeSH
- Plasmids chemistry MeSH
- Promoter Regions, Genetic MeSH
- Regulatory Sequences, Ribonucleic Acid MeSH
- RNA, Viral chemistry MeSH
- User-Computer Interface MeSH
Phylogenomic analyses of hundreds of protein-coding genes aimed at resolving phylogenetic relationships are now common practice. However, no software currently exists that includes tools for dataset construction and subsequent analysis with diverse validation strategies to assess robustness. Furthermore, there are no publicly available high-quality curated databases designed to assess deep (>100 million years) relationships in the tree of eukaryotes. To address these issues, we developed an easy-to-use software package, PhyloFisher (https://github.com/TheBrownLab/PhyloFisher), written in Python 3. PhyloFisher includes a manually curated database of 240 protein-coding genes from 304 eukaryotic taxa covering known eukaryotic diversity, a novel tool for ortholog selection, and utilities that will perform diverse analyses required by state-of-the-art phylogenomic investigations. Through phylogenetic reconstructions of the tree of eukaryotes and of the Saccharomycetaceae clade of budding yeasts, we demonstrate the utility of the PhyloFisher workflow and the provided starting database to address phylogenetic questions across a large range of evolutionary time points for diverse groups of organisms. We also demonstrate that undetected paralogy can remain in phylogenomic "single-copy orthogroup" datasets constructed using widely accepted methods such as all vs. all BLAST searches followed by Markov Cluster Algorithm (MCL) clustering and application of automated tree pruning algorithms. Finally, we show how the PhyloFisher workflow helps detect inadvertent paralog inclusions, allowing the user to make more informed decisions regarding orthology assignments, leading to a more accurate final dataset.
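The MCL clustering step mentioned in the abstract above can be illustrated with a minimal pure-Python sketch of the expansion/inflation iteration on a toy similarity matrix. This is neither PhyloFisher code nor the `mcl` program itself: the sequence labels, the scores (assumed to be scaled to the range 0 to 1), the self-loop weight of 1.0 and the inflation value of 2.0 are all assumptions made for illustration.

```python
from typing import FrozenSet, List, Set

Matrix = List[List[float]]


def normalize_columns(m: Matrix) -> Matrix:
    """Scale every column so that it sums to 1 (column-stochastic matrix)."""
    n = len(m)
    sums = [sum(m[i][j] for i in range(n)) for j in range(n)]
    return [[m[i][j] / sums[j] if sums[j] else 0.0 for j in range(n)]
            for i in range(n)]


def expand(m: Matrix) -> Matrix:
    """Expansion step: square the matrix (two-step random walks)."""
    n = len(m)
    return [[sum(m[i][k] * m[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def inflate(m: Matrix, r: float) -> Matrix:
    """Inflation step: raise entries to the power r, then renormalize columns."""
    return normalize_columns([[v ** r for v in row] for row in m])


def mcl(similarity: Matrix, inflation: float = 2.0,
        max_iter: int = 100, tol: float = 1e-9) -> Set[FrozenSet[int]]:
    """Cluster node indices from a symmetric, non-negative similarity matrix."""
    n = len(similarity)
    # Add self-loops (weight 1.0 is an assumption for scores in [0, 1]).
    m = [[1.0 if i == j else similarity[i][j] for j in range(n)]
         for i in range(n)]
    m = normalize_columns(m)
    for _ in range(max_iter):
        nxt = inflate(expand(m), inflation)
        change = max(abs(nxt[i][j] - m[i][j])
                     for i in range(n) for j in range(n))
        m = nxt
        if change < tol:
            break
    # Each row that still carries weight defines one cluster: its non-zero columns.
    clusters: Set[FrozenSet[int]] = set()
    for row in m:
        members = frozenset(j for j, v in enumerate(row) if v > 1e-6)
        if members:
            clusters.add(members)
    return clusters


# Toy all-vs-all similarity scores for five hypothetical sequences.
labels = ["seqA", "seqB", "seqC", "seqD", "seqE"]
scores = [
    [0.0, 0.9, 0.8, 0.0, 0.0],
    [0.9, 0.0, 0.7, 0.0, 0.0],
    [0.8, 0.7, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.6],
    [0.0, 0.0, 0.0, 0.6, 0.0],
]
for cluster in mcl(scores):
    print(sorted(labels[i] for i in cluster))
```

Sequences without any similarity between them can never share a cluster, because the matrix stays block-diagonal through every step. How finely a connected group is split depends on the inflation value, which is one reason clusters built this way may still contain paralogs.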
The Complex Portal (www.ebi.ac.uk/complexportal) is a manually curated, encyclopaedic database that collates and summarizes information on stable, macromolecular complexes of known function. It captures complex composition, topology and function and links out to a large range of domain-specific resources that hold more detailed data, such as PDB or Reactome. We have made several significant improvements since our last update, including improving compliance with the FAIR data principles by providing complex-specific, stable identifiers that include versioning. Protein complexes are now available from 20 species for download in standards-compliant formats such as PSI-XML, MI-JSON and ComplexTAB or can be accessed via an improved REST API. A component-based JS front-end framework has been implemented to drive a new website and this has allowed the use of APIs from linked services to import and visualize information such as the 3D structure of protein complexes, their role in reactions and pathways and the co-expression of complex components in the tissues of multi-cellular organisms. A first draft of the complete complexome of Saccharomyces cerevisiae is now available to browse and download.
- MeSH
- Databases, Protein * MeSH
- Protein Conformation MeSH
- Humans MeSH
- Macromolecular Substances chemistry MeSH
- Multiprotein Complexes chemistry metabolism MeSH
- Mice MeSH
- Nucleic Acids chemistry MeSH
- Computer Graphics MeSH
- Animals MeSH
- Check Tag
- Humans MeSH
- Mice MeSH
- Animals MeSH
- Publication type
- Journal Article MeSH
- Research Support, Non-U.S. Gov't MeSH