BACKGROUND: The mammalian Natural Killer Complex (NKC) harbors genes and gene families encoding a variety of C-type lectin-like proteins expressed on various immune cells. The NKC is a complex genomic region well-characterized in mice, humans and domestic animals. The major limitations of automatic annotation of the NKC in non-model animals include short-read based sequencing, methods of assembling highly homologous and repetitive sequences, orthologues missing from reference databases and weak expression. In this situation, manual annotations of complex genomic regions are necessary. METHODS: This study presents a manual annotation of the genomic structure of the NKC region in a high-quality reference genome of the domestic cat and compares it with other felid species and with representatives of other carnivore families. Reference genomes of Carnivora, irrespective of sequencing and assembly methods, were screened by BLAST to retrieve information on their killer cell lectin-like receptor (KLR) gene content. Phylogenetic analysis of in silico translated proteins of expanded subfamilies was carried out. RESULTS: The overall genomic structure of the NKC in Carnivora is rather conservative in terms of its C-type lectin receptor gene content. A novel KLRH-like gene subfamily (KLRL) was identified in all Carnivora and a novel KLRJ-like gene was annotated in the Mustelidae. In all six families studied, one subfamily (KLRC) expanded and experienced pseudogenization. The KLRH gene subfamily expanded in all carnivore families except the Canidae. The KLRL gene subfamily expanded in carnivore families except the Felidae and Canidae, and in the Canidae it eroded to fragments. CONCLUSIONS: Knowledge of the genomic structure and gene content of the NKC region is a prerequisite for accurate annotations of newly sequenced genomes, especially of endangered wildlife species. 
Identification of expressed genes, pseudogenes and gene fragments in the context of expanded gene families would allow the assessment of functionally important variability in particular species.
- MeSH
- Molecular Sequence Annotation MeSH
- Killer Cells, Natural * immunology metabolism MeSH
- Carnivora * genetics MeSH
- Phylogeny * MeSH
- Genome MeSH
- Genomics * methods MeSH
- Cats genetics MeSH
- Lectins, C-Type genetics MeSH
- Animals MeSH
- Check Tag
- Cats genetics MeSH
- Animals MeSH
- Publication type
- Journal Article MeSH
- Comparative Study MeSH
Many enhancers control gene expression by assembling regulatory factor clusters, also referred to as condensates. This process is vital for facilitating enhancer communication and establishing cellular identity. However, how DNA sequence and transcription factor (TF) binding instruct the formation of high regulatory factor environments remains poorly understood. Here we developed a new approach leveraging enhancer-centric chromatin accessibility quantitative trait loci (caQTLs) to nominate regulatory factor clusters genome-wide. By analyzing TF-binding signatures within the context of caQTLs and comparing episomal versus endogenous enhancer activities, we discovered a class of regulators, 'context-only' TFs, that amplify the activity of cell type-specific caQTL-binding TFs, that is, 'context-initiator' TFs. Similar to super-enhancers, enhancers enriched for context-only TF-binding sites display high coactivator binding and sensitivity to bromodomain-inhibiting molecules. We further show that binding sites for context-only and context-initiator TFs underlie enhancer coordination, providing a mechanistic rationale for how a loose TF syntax confers regulatory specificity.
- MeSH
- Chromatin * genetics metabolism MeSH
- Humans MeSH
- Quantitative Trait Loci * MeSH
- Mice MeSH
- Gene Expression Regulation MeSH
- Transcription Factors * metabolism genetics MeSH
- Protein Binding MeSH
- Binding Sites genetics MeSH
- Enhancer Elements, Genetic * MeSH
- Animals MeSH
- Check Tag
- Humans MeSH
- Mice MeSH
- Animals MeSH
- Publication type
- Journal Article MeSH
Mass spectrometry proteomics data are typically evaluated against publicly available annotated sequences, but the proteogenomics approach is a useful alternative. A single genome is commonly utilized in custom proteomic and proteogenomic data analysis. We pose the question of whether utilizing numerous different genome assemblies in a search database would be beneficial. We reanalyzed raw data from the exoprotein fraction of four reference Enterobacterial Repetitive Intergenic Consensus (ERIC) I-IV genotypes of the honey bee bacterial pathogen Paenibacillus larvae and evaluated them against three reference databases (from NCBI-protein, RefSeq, and UniProt) together with an array of protein sequences generated by six-frame direct translation of 15 genome assemblies from GenBank. The wide search yielded 453 protein hits/groups, which UpSet analysis categorized into 50 groups based on the success of protein identification by the 18 database components. Nine hits that were not identified by a unique peptide were not considered for marker selection, which discarded the only protein that was not identified by the reference databases. We propose that the variability in successful identifications between genome assemblies is useful for marker mining. The results suggest that various strains of P. larvae can exhibit specific traits that set them apart from the established genotypes ERIC I-V.
- MeSH
- Bacterial Proteins * genetics metabolism MeSH
- Databases, Protein MeSH
- Virulence Factors * genetics metabolism MeSH
- Genome, Bacterial * genetics MeSH
- Paenibacillus larvae * genetics pathogenicity metabolism MeSH
- Proteogenomics * methods MeSH
- Proteomics methods MeSH
- Bees microbiology MeSH
- Animals MeSH
- Check Tag
- Animals MeSH
- Publication type
- Journal Article MeSH
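The Paenibacillus larvae abstract above builds part of its search database by six-frame direct translation of genome assemblies. A minimal sketch of that idea follows, assuming plain uppercase or lowercase DNA strings; the function names (`reverse_complement`, `translate`, `six_frame_translation`) are illustrative and are not taken from the study, and the standard codon assignments are used without start-codon handling or ORF filtering.

```python
from itertools import product

# Standard codon assignments, enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def translate(seq):
    """Translate complete codons only; unknown codons (e.g. with N) become X."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def six_frame_translation(seq):
    """Return translations for frames +1..+3 and -1..-3 keyed by frame label."""
    seq = seq.upper()
    frames = {}
    for strand, strand_seq in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(strand_seq[offset:])
    return frames


if __name__ == "__main__":
    for frame, protein in six_frame_translation("ATGGCCTAA").items():
        print(frame, protein)
```

In a real proteogenomic workflow the translated frames would typically be split at stop codons and filtered by length before being written to a search database; those steps are omitted here.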
BACKGROUND: The advancement of sequencing technologies results in the rapid release of hundreds of new genome assemblies a year providing unprecedented resources for the study of genome evolution. Within this context, the significance of in-depth analyses of repetitive elements, transposable elements (TEs) in particular, is increasingly recognized in understanding genome evolution. Despite the plethora of available bioinformatic tools for identifying and annotating TEs, the phylogenetic distance of the target species from a curated and classified database of repetitive element sequences constrains any automated annotation effort. Moreover, manual curation of raw repeat libraries is deemed essential due to the frequent incompleteness of automatically generated consensus sequences. RESULTS: Here, we present an example of a crowd-sourcing effort aimed at curating and annotating TE libraries of two non-model species built around a collaborative, peer-reviewed teaching process. Manual curation and classification are time-consuming processes that offer limited short-term academic rewards and are typically confined to a few research groups where methods are taught through hands-on experience. Crowd-sourcing efforts could therefore offer a significant opportunity to bridge the gap between learning the methods of curation effectively and empowering the scientific community with high-quality, reusable repeat libraries. CONCLUSIONS: The collaborative manual curation of TEs from two tardigrade species, for which there were no TE libraries available, resulted in the successful characterization of hundreds of new and diverse TEs in a reasonable time frame. Our crowd-sourcing setting can be used as a teaching reference guide for similar projects: A hidden treasure awaits discovery within non-model organisms.
- Publication type
- Journal Article MeSH
Cimex lectularius, known as the common bed bug, is a widespread hematophagous human ectoparasite and urban pest that is not known to be a vector of any human infectious disease agents. However, few studies in the era of molecular biology have profiled the microorganisms harbored by field populations of bed bugs. The objective of this study was to examine the viruses present in a large sampling of common bed bugs and related bat bugs (Cimex pipistrelli). RNA sequencing was undertaken on an international sampling of >500 field-collected bugs, and multiple workflows were used to assemble contigs and query these against reference nucleotide databases to identify viral genomes. Shuangao bed bug virus 2, an uncharacterized rhabdovirus previously discovered in Cimex hemipterus from China, was found in several bed bug pools from the USA and Europe, as well as in C. pipistrelli, suggesting that this virus is common among bed bug populations. In addition, Shuangao bed bug virus 1 was detected in a bed bug pool from China, and sequences matching Enterobacteria phage P7 were found in all bed bug pools, indicating the ubiquitous presence of phage-derived elements in the genome of the bed bug or its enterobacterial symbiont. However, viral diversity was low in bed bugs in our study, as no other viral genomes were detected with significant coverage. These results provide evidence against frequent virus infection in bed bugs. Nonetheless, our investigation had several important limitations, and additional studies should be conducted to better understand the prevalence and composition of viruses in bed bugs. Most notably, our study largely focused on insects from urban areas in industrialized nations, thus likely missing infrequent virus infections and those that could occur in rural or tropical environments or developing nations.
Streptococcus pyogenes causes a variety of human diseases ranging from uncomplicated respiratory tract and skin infections to severe invasive diseases possibly involving toxic shock syndrome. Besides the emm gene-encoded M protein, important virulence factors are pyrogenic exotoxins, referred to as superantigens. The National Reference Laboratory for Streptococcal Infections has newly introduced bioinformatics tools for processing S. pyogenes whole genome sequencing data. Using the SRST2 software and BV-BRC platform, WGS data of 10 S. pyogenes isolates from patients with invasive disease were analysed, and emm type, sequence type, and superantigen encoding gene profiles were determined. The Unicycler assembly pipeline with the SPAdes de novo assembler was used to assemble genome sequences from short reads.
- MeSH
- Clinical Studies as Topic MeSH
- Clinical Laboratory Techniques methods MeSH
- Humans MeSH
- Whole Genome Sequencing methods MeSH
- Streptococcus pyogenes genetics isolation & purification pathogenicity MeSH
- Superantigens * analysis genetics isolation & purification classification MeSH
- Check Tag
- Humans MeSH
- Publication type
- Research Support, Non-U.S. Gov't MeSH
- Review MeSH
BACKGROUND: The mammalian Leukocyte Receptor Complex (LRC) chromosomal region may contain gene families for the killer cell immunoglobulin-like receptor (KIR) and/or leukocyte immunoglobulin-like receptor (LILR) collections as well as various framing genes. This complex region is well described in humans, mice, and some domestic animals. Although single KIR genes are known in some Carnivora, their complements of LILR genes remain largely unknown due to obstacles in the assembly of regions of high homology in short-read based genomes. METHODS: As part of the analysis of felid immunogenomes, this study focuses on the search for LRC genes in reference genomes and the annotation of LILR genes in Felidae. Chromosome-level genomes based on single-molecule long-read sequencing were preferentially sought and compared to representatives of the Carnivora. RESULTS: Seven putatively functional LILR genes were found across the Felidae and in the Californian sea lion, four to five genes in Canidae, and four to nine genes in Mustelidae. They form two lineages, as seen in the Bovidae. The ratio of functional genes for activating LILRs to inhibitory LILRs is slightly in favor of inhibitory genes in the Felidae and the Canidae; the reverse is seen in the Californian sea lion. This ratio is even in all of the Mustelidae except the Eurasian otter, which has a predominance of activating LILRs. Various numbers of LILR pseudogenes were identified. CONCLUSIONS: The structure of the LRC is rather conservative in felids and the other Carnivora studied. The LILR sub-region is conserved within the Felidae and has slight differences in the Canidae, but it has taken various evolutionary paths in the Mustelidae. Overall, the process of pseudogenization of LILR genes seems to be more frequent for activating receptors. Phylogenetic analysis found no direct orthologues across the Carnivora, which corroborates the rapid evolution of LILRs seen in mammals.
- MeSH
- Canidae * MeSH
- Carnivora * genetics MeSH
- Felidae * MeSH
- Phylogeny MeSH
- Genomics MeSH
- Sea Lions * MeSH
- Leukocytes MeSH
- Humans MeSH
- Mustelidae * MeSH
- Mice MeSH
- Receptors, Immunologic genetics MeSH
- Receptors, KIR genetics MeSH
- Animals MeSH
- Check Tag
- Humans MeSH
- Mice MeSH
- Animals MeSH
- Publication type
- Journal Article MeSH
- Research Support, Non-U.S. Gov't MeSH
BACKGROUND: Diplonemid flagellates are among the most abundant and species-rich of known marine microeukaryotes, colonizing all habitats, depths, and geographic regions of the world ocean. However, little is known about their genomes, biology, and ecological role. RESULTS: We present the first nuclear genome sequence from a diplonemid, the type species Diplonema papillatum. The ~ 280-Mb genome assembly contains about 32,000 protein-coding genes, likely co-transcribed in groups of up to 100. Gene clusters are separated by long repetitive regions that include numerous transposable elements, which also reside within introns. Analysis of gene-family evolution reveals that the last common diplonemid ancestor underwent considerable metabolic expansion. D. papillatum-specific gains of carbohydrate-degradation capability were apparently acquired via horizontal gene transfer. The predicted breakdown of polysaccharides including pectin and xylan is at odds with reports of peptides being the predominant carbon source of this organism. Secretome analysis together with feeding experiments suggest that D. papillatum is predatory, able to degrade cell walls of live microeukaryotes, macroalgae, and water plants, not only for protoplast feeding but also for metabolizing cell-wall carbohydrates as an energy source. The analysis of environmental barcode samples shows that D. papillatum is confined to temperate coastal waters, presumably acting in bioremediation of eutrophication. CONCLUSIONS: Nuclear genome information will allow systematic functional and cell-biology studies in D. papillatum. It will also serve as a reference for the highly diverse diplonemids and provide a point of comparison for studying gene complement evolution in the sister group of Kinetoplastida, including human-pathogenic taxa.
- MeSH
- Euglenozoa genetics MeSH
- Eukaryota * genetics MeSH
- Phylogeny MeSH
- Kinetoplastida * genetics MeSH
- Humans MeSH
- Multigene Family MeSH
- Meiotic Prophase I MeSH
- Check Tag
- Humans MeSH
- Publication type
- Journal Article MeSH
- Research Support, Non-U.S. Gov't MeSH
Giardia duodenalis (syn. G. intestinalis, G. lamblia) is a widespread gastrointestinal protozoan parasite with debated taxonomic status. Currently, eight distinct genetic sub-groups, termed assemblages A-H, are defined based on a few genetic markers. Assemblages A and B may represent distinct species and are both of human public health relevance. Genomic studies are scarce and the few reference genomes available, in particular for assemblage B, are insufficient for adequate comparative genomics. Here, by combining long- and short-read sequences generated by PacBio and Illumina sequencing technologies, we provide nine annotated reference genome sequences from new clinical isolates (four assemblage A and five assemblage B parasite isolates). The isolates chosen represent the currently accepted classification of sub-assemblages AI, AII, BIII and BIV. Synteny over the whole genome was generally high, but we report chromosome-level translocations as a feature that distinguishes assemblage A from B parasites. Orthologue gene group analysis was used to define gene content differences between assemblage A and B and to contribute a gene-set-based operational definition of the respective taxonomic units. Giardia is tetraploid, and assemblage B has so far shown higher allelic sequence heterogeneity (ASH) than assemblage A. Notably, we report here an extremely low ASH (0.002%) for one of the assemblage B isolates (a value even lower than that of the reference assemblage A isolate WB-C6). This challenges the view of low ASH being a notable feature that distinguishes assemblage A from B parasites, and the low ASH allowed assembly of the most contiguous assemblage B genome currently available for reference. In conclusion, the description of nine highly contiguous genome assemblies of new isolates of G. duodenalis assemblage A and B adds to our understanding of the genomics and species population structure of this widespread zoonotic parasite.
- MeSH
- Genomics MeSH
- Giardia lamblia * genetics MeSH
- Giardia genetics MeSH
- Giardiasis * parasitology MeSH
- Humans MeSH
- Check Tag
- Humans MeSH
- Publication type
- Journal Article MeSH
- Research Support, Non-U.S. Gov't MeSH
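The Giardia abstract above reports allelic sequence heterogeneity (ASH) as a percentage. As a rough illustration only, the sketch below assumes ASH is computed as the share of called sites whose genotype carries more than one allele; this definition, the VCF-style genotype strings, and the function names are assumptions for the example and not the authors' stated method.

```python
def is_heterozygous(genotype):
    """True if a VCF-style genotype such as '0/0/0/1' carries more than one allele."""
    alleles = genotype.replace("|", "/").split("/")
    return "." not in alleles and len(set(alleles)) > 1


def ash_percent(heterozygous_sites, callable_sites):
    """ASH as a percentage: heterozygous sites over all callable sites (assumed definition)."""
    if callable_sites <= 0:
        raise ValueError("callable_sites must be positive")
    return 100.0 * heterozygous_sites / callable_sites


def ash_from_genotypes(genotypes):
    """Compute ASH from per-site genotype strings, skipping sites with missing calls."""
    called = [g for g in genotypes if "." not in g.replace("|", "/").split("/")]
    het = sum(is_heterozygous(g) for g in called)
    return ash_percent(het, len(called))


if __name__ == "__main__":
    # Hypothetical numbers: 20 heterozygous sites among 1,000,000 callable sites.
    print(ash_percent(20, 1_000_000))
    # Tetraploid-style genotype calls, one per site (illustrative data).
    print(ash_from_genotypes(["0/0/0/0", "0/0/0/1", "1/1/1/1", "./././."]))
```

Under this assumed definition, an ASH of 0.002% corresponds to about 20 heterozygous sites per million callable sites.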
BACKGROUND: Advances in sequencing technologies have made a plethora of whole-genome re-sequenced (WGRS) data publicly available. However, research utilizing the WGRS data without further configuration is nearly impossible. To solve this problem, our research group has developed an interactive Allele Catalog Tool to enable researchers to explore the coding region allelic variation present in over 1,000 re-sequenced accessions each for soybean, Arabidopsis, and maize. RESULTS: The Allele Catalog Tool was originally designed with soybean genomic data and resources. The Allele Catalog datasets were generated using our variant calling pipeline (SnakyVC) and the Allele Catalog pipeline (AlleleCatalog). The variant calling pipeline was developed to process raw sequencing reads in parallel and generate Variant Call Format (VCF) files, and the Allele Catalog pipeline takes VCF files, performs imputation and functional effect prediction, and assembles alleles for each gene to generate curated Allele Catalog datasets. Both pipelines were used to generate the data panels (VCF files and Allele Catalog files), in which the accessions of the WGRS datasets were collected from various sources and currently represent over 1,000 diverse accessions each for soybean, Arabidopsis, and maize. The main features of the Allele Catalog Tool include data query, visualization of results, categorical filtering, and download functions. Queries are performed from user input, and results are presented as tables of summary results by categorical description and of genotype results for the alleles of each gene. The categorical information is specific to each species; additionally, available detailed meta-information is provided in modal popups. The genotypic information contains the variant positions, reference or alternate genotypes, the functional effect classes, and the amino-acid changes of each accession. The results can also be downloaded for other research purposes. CONCLUSIONS: The Allele Catalog Tool is a web-based tool that currently supports three species: soybean, Arabidopsis, and maize. The Soybean Allele Catalog Tool is hosted on the SoyKB website ( https://soykb.org/SoybeanAlleleCatalogTool/ ), while the Allele Catalog Tool for Arabidopsis and maize is hosted on the KBCommons website ( https://kbcommons.org/system/tools/AlleleCatalogTool/Zmays and https://kbcommons.org/system/tools/AlleleCatalogTool/Athaliana ). Researchers can use this tool to connect variant alleles of genes with meta-information of species.
- MeSH
- Alleles * MeSH
- Arabidopsis * genetics MeSH
- Data Mining * methods MeSH
- Datasets as Topic * MeSH
- Gene Frequency MeSH
- Genotype MeSH
- Glycine max * genetics MeSH
- Internet * MeSH
- Zea mays * genetics MeSH
- Metadata MeSH
- Mutation MeSH
- Pigmentation genetics MeSH
- Genes, Plant genetics MeSH
- Software * MeSH
- Amino Acid Substitution MeSH
- Plant Dormancy genetics MeSH
- Data Visualization MeSH
- Publication type
- Journal Article MeSH
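The Allele Catalog abstract above describes assembling alleles for each gene from per-accession genotypes at variant positions. The following is a minimal sketch of that grouping step only, assuming the genotypes have already been called and imputed; the function name `assemble_alleles`, the data layout, and the example accessions are hypothetical and do not reproduce the AlleleCatalog pipeline itself.

```python
from collections import defaultdict


def assemble_alleles(positions, genotypes):
    """Group accessions by their combined genotype across a gene's variant positions.

    positions: list of variant positions within one gene.
    genotypes: dict mapping accession name -> list of base calls, one per position.
    Returns a dict mapping each allele (tuple of calls) -> list of accessions.
    """
    catalog = defaultdict(list)
    for accession, calls in genotypes.items():
        if len(calls) != len(positions):
            raise ValueError(f"{accession}: expected {len(positions)} calls")
        catalog[tuple(calls)].append(accession)
    return dict(catalog)


if __name__ == "__main__":
    # Hypothetical gene with two variant positions and three accessions.
    positions = [101, 250]
    genotypes = {
        "acc1": ["A", "G"],
        "acc2": ["A", "G"],
        "acc3": ["T", "G"],
    }
    for allele, accessions in assemble_alleles(positions, genotypes).items():
        print(dict(zip(positions, allele)), accessions)
```

Keying the catalog by the tuple of calls keeps each distinct allele as one entry, which is the shape a summary table by allele would need.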