Genomic data mining
Natural products represent a rich reservoir of small molecule drug candidates utilized as antimicrobial drugs, anticancer therapies, and immunomodulatory agents. These molecules are microbial secondary metabolites synthesized by co-localized genes termed Biosynthetic Gene Clusters (BGCs). The increase in full microbial genomes and similar resources has led to the development of BGC prediction algorithms, although their precision and ability to identify novel BGC classes could be improved. Here we present a deep learning strategy (DeepBGC) that offers reduced false positive rates in BGC identification and an improved ability to extrapolate and identify novel BGC classes compared to existing machine-learning tools. We supplemented this with random forest classifiers that accurately predicted BGC product classes and potential chemical activity. Application of DeepBGC to bacterial genomes uncovered previously undetectable putative BGCs that may code for natural products with novel biologic activities. The improved accuracy and classification ability of DeepBGC represent a major addition to in-silico BGC identification.
As the amount of genome information increases rapidly, there is a correspondingly greater need for methods that provide accurate and automated annotation of gene function. For example, many high-throughput technologies (e.g., next-generation sequencing) are being used today to generate lists of genes associated with specific conditions. However, their functional interpretation remains a challenge, and many tools exist that attempt to characterize the function of gene lists. Such systems typically rely on enrichment analysis and aim to give quick insight into the underlying biology by presenting it in the form of a summary report. While the load of annotation may be alleviated by such computational approaches, the main challenge in modern annotation remains to develop a systems form of analysis in which a pipeline can effectively analyze gene lists quickly and identify aggregated annotations through computerized resources. In this article we survey some of the many such tools and methods that have been developed to automatically interpret the biological functions underlying gene lists. We overview current functional annotation aspects from the perspective of their epistemology (i.e., the underlying theories used to organize information about gene function into a body of verified and documented knowledge) and find that most of the currently used functional annotation methods fall broadly into one of two categories: they are based either on 'known' formally-structured ontology annotations created by 'experts' (e.g., the GO terms used to describe the function of Entrez Gene entries), or, perhaps more adventurously, on annotations inferred from literature (e.g., many text-mining methods use computer-aided reasoning to acquire knowledge represented in natural languages). Overall, however, deriving detailed and accurate insight from such gene lists remains a challenging task, and improved methods are called for.
In particular, future methods need to (1) provide more holistic insight into the underlying molecular systems; (2) provide better follow-up experimental testing and treatment options, and (3) better manage gene lists derived from organisms that are not well-studied. We discuss some promising approaches that may help achieve these advances, especially the use of extended dictionaries of biomedical concepts and molecular mechanisms, as well as greater use of annotation benchmarks.
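The enrichment analysis mentioned in the abstract above is often implemented as a one-sided hypergeometric test. The following is only a minimal sketch under that assumption; the function name and the example counts are illustrative and are not taken from any tool surveyed in the article.

```python
from math import comb


def hypergeom_enrichment_p(population: int, annotated: int, sample: int, hits: int) -> float:
    """Upper-tail probability P(X >= hits) for a hypergeometric variable.

    population: number of genes in the background set
    annotated:  background genes carrying the annotation (e.g. one GO term)
    sample:     size of the gene list under study
    hits:       genes in the list that carry the annotation
    """
    max_hits = min(annotated, sample)
    total = comb(population, sample)  # all ways to draw the gene list
    tail = sum(
        comb(annotated, k) * comb(population - annotated, sample - k)
        for k in range(hits, max_hits + 1)
    )
    return tail / total


# Hypothetical numbers: 20,000 background genes, 200 annotated with a term,
# a list of 100 genes of which 10 carry the term.
p_value = hypergeom_enrichment_p(20_000, 200, 100, 10)
```

A small p-value suggests the term is over-represented in the list; real tools usually add a multiple-testing correction across all terms, which this sketch omits.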
Thousands of eukaryotic transcriptomes have been generated, mainly to investigate nuclear gene expression, and the amount of available data is constantly increasing. A neglected but promising use of this large amount of data is to assemble organelle genomes. To assess the reliability of this approach, we attempted to reconstruct complete mitochondrial genomes from RNA-Seq experiments of Reticulitermes termite species, for which transcriptomes and conspecific mitogenomes are available. We successfully assembled complete molecules, although a few gaps corresponding to tRNAs had to be filled manually. We also reconstructed, for the first time, the mitogenome of Reticulitermes banyulensis. The accuracy and completeness of mitogenome reconstruction appeared independent of transcriptome size, read length and sequencing design (single/paired end), and using reference genomes from congeneric or intra-familial taxa did not significantly affect the assembly. Transcriptome-derived mitogenomes were found to be highly similar to the conspecific ones obtained from genome sequencing (nucleotide divergence ranging from 0% to 3.5%) and yielded a congruent phylogenetic tree. Reads from contaminants and nuclear transcripts, although slowing down the process, did not result in chimeric sequence reconstruction. We suggest that the described approach has the potential to increase the number of available mitogenomes by exploiting the rapidly increasing number of transcriptomes.
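The abstract reports nucleotide divergence between transcriptome-derived and genome-derived mitogenomes but does not say how it was computed. One common choice is the uncorrected p-distance over an alignment; the sketch below assumes that measure and two pre-aligned sequences of equal length, and the function name is hypothetical.

```python
def p_distance(seq_a: str, seq_b: str) -> float:
    """Fraction of differing sites between two aligned nucleotide sequences.

    Columns where either sequence has a gap or an ambiguous base
    (anything other than A, C, G, T) are skipped.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    mismatches = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a not in "ACGT" or b not in "ACGT":
            continue  # ignore gaps and ambiguity codes
        compared += 1
        if a != b:
            mismatches += 1
    if compared == 0:
        raise ValueError("no comparable sites")
    return mismatches / compared
```

Multiplying the result by 100 gives a percentage comparable in form to the 0% to 3.5% range quoted above.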
- MeSH
- Molecular Sequence Annotation methods MeSH
- Data Mining methods MeSH
- Phylogeny MeSH
- Genome, Mitochondrial * MeSH
- Isoptera genetics MeSH
- Reproducibility of Results MeSH
- Base Sequence genetics MeSH
- Sequence Analysis, DNA MeSH
- RNA-Seq MeSH
- Transcriptome genetics MeSH
- High-Throughput Nucleotide Sequencing MeSH
- Animals MeSH
- Check Tag
- Animals MeSH
- Publication type
- Journal Article MeSH
- Research Support, Non-U.S. Gov't MeSH
- Validation Study MeSH
Alzheimer's disease (AD) is the most frequent cause of dementia. The misfolded-protein pathological hallmarks of AD are brain deposits of amyloid-β (Aβ) plaques and phosphorylated tau neurofibrillary tangles. However, doubts about the role of Aβ in AD pathology have been raised, as Aβ is a common component of extracellular brain deposits found, also by in vivo imaging, in non-demented aged individuals. It has been suggested that some individuals are more prone to Aβ neurotoxicity and hence more likely to develop AD when aging brains start accumulating Aβ plaques. Here, we applied genome-wide transcriptomic profiling of lymphoblastoid cell lines (LCLs) from healthy individuals and AD patients to identify genes that predict sensitivity to Aβ. Real-time PCR validation identified 3.78-fold lower expression of RGS2 (regulator of G-protein signaling 2; P=0.0085) in LCLs from healthy individuals exhibiting high vs low Aβ sensitivity. Furthermore, RGS2 showed 3.3-fold lower expression (P=0.0008) in AD LCLs compared with controls. Notably, RGS2 expression in AD LCLs correlated with the patients' cognitive function. Lower RGS2 expression levels were also discovered in published expression data sets from postmortem AD brain tissues as well as in mild cognitive impairment and AD blood samples compared with controls. In conclusion, Aβ sensitivity phenotyping followed by transcriptomic profiling and published patient data mining identified reduced peripheral and brain expression levels of RGS2, a key regulator of G-protein-coupled receptor signaling and neuronal plasticity. RGS2 is suggested as a novel AD biomarker (alongside other genes) toward early AD detection and future disease-modifying therapeutics.
- MeSH
- Alzheimer Disease diagnosis genetics pathology MeSH
- Amyloid beta-Peptides genetics MeSH
- Plaque, Amyloid genetics pathology MeSH
- Cell Line MeSH
- Early Diagnosis MeSH
- Genome-Wide Association Study * MeSH
- Data Mining * MeSH
- Gene Expression genetics MeSH
- Phenotype MeSH
- Genetic Association Studies MeSH
- Genetic Markers genetics MeSH
- Humans MeSH
- Brain pathology MeSH
- Neurofibrillary Tangles genetics pathology MeSH
- RGS Proteins genetics MeSH
- Aged MeSH
- Gene Expression Profiling * MeSH
- Computational Biology MeSH
- Check Tag
- Humans MeSH
- Male MeSH
- Aged MeSH
- Publication type
- Journal Article MeSH
- Research Support, Non-U.S. Gov't MeSH
Genetic variation occurring within conserved functional protein domains warrants special attention when examining DNA variation in the context of disease causation. Here we introduce a resource, freely available at www.prot2hg.com, that addresses the question of whether a particular variant falls onto an annotated protein domain and directly translates chromosomal coordinates onto protein residues. The tool can perform a multiple-site query in a simple way, and the whole dataset is available for download as well as incorporated into our own accessible pipeline. To create this resource, National Center for Biotechnology Information protein data were retrieved using the Entrez Programming Utilities. After processing all human protein domains, residue positions were reverse translated and mapped to the reference genome hg19 and stored in a MySQL database. In total, 760 487 protein domains from 42 371 protein models were mapped to hg19 coordinates and made publicly available for search or download (www.prot2hg.com). In addition, this annotation was implemented into the genomics research platform GENESIS in order to query nearly 8000 exomes and genomes of families with rare Mendelian disorders (tgp-foundation.org). When applied to patient genetic data, we found that rare (<1%) variants in the Genome Aggregation Database were significantly more often annotated onto a protein domain than common (>1%) variants. Similarly, variants described as pathogenic or likely pathogenic in ClinVar were more likely to be annotated onto a domain. In addition, we tested a dataset consisting of 60 causal variants in a cohort of patients with epileptic encephalopathy and found that 71% of them (43 variants) were propagated onto protein domains. In summary, we developed a resource that annotates variants in the coding part of the genome onto conserved protein domains in order to increase variant prioritization efficiency. Database URL: www.prot2hg.com.
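The reverse translation of residue positions to genome coordinates described above can be pictured with a small sketch. It is not the prot2hg implementation: it assumes the coding sequence is given as 1-based inclusive genomic intervals with no frameshifts, and all names are hypothetical.

```python
from typing import List, Tuple

Exon = Tuple[int, int]  # 1-based inclusive genomic start/end of one coding segment


def cds_positions(exons: List[Exon], strand: str) -> List[int]:
    """Genomic coordinate of every coding base, in transcript (5'->3') order."""
    positions: List[int] = []
    for start, end in sorted(exons):
        positions.extend(range(start, end + 1))
    if strand == "-":
        positions.reverse()  # minus-strand transcripts read right to left
    return positions


def residue_to_genome(exons: List[Exon], strand: str, residue: int) -> List[int]:
    """Three genomic positions forming the codon of a 1-based residue index."""
    coding = cds_positions(exons, strand)
    first = 3 * (residue - 1)
    if residue < 1 or first + 3 > len(coding):
        raise ValueError("residue lies outside the coding sequence")
    return coding[first:first + 3]


def variant_in_domain(exons: List[Exon], strand: str,
                      domain_start: int, domain_end: int, variant_pos: int) -> bool:
    """True if a genomic position falls on a codon of residues domain_start..domain_end."""
    coding = cds_positions(exons, strand)
    domain_bases = coding[3 * (domain_start - 1):3 * domain_end]
    return variant_pos in domain_bases
```

For example, with coding segments (100, 104) and (200, 203) on the plus strand, residue 2 spans the splice junction and maps to positions 103, 104 and 200.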
- MeSH
- Molecular Sequence Annotation methods MeSH
- Data Mining methods MeSH
- Databases, Genetic * MeSH
- Data Curation methods MeSH
- Genetic Variation * MeSH
- Genome, Human genetics MeSH
- Genomics methods MeSH
- Internet MeSH
- Humans MeSH
- Protein Domains genetics MeSH
- Proteins chemistry genetics metabolism MeSH
- Computational Biology methods MeSH
- Check Tag
- Humans MeSH
- Publication type
- Journal Article MeSH
- Research Support, Non-U.S. Gov't MeSH
A phylogenetic tree at the species level is still far off for highly diverse insect orders, including the Coleoptera, but the taxonomic breadth of public sequence databases is growing. In addition, new types of data may contribute to increasing taxon coverage, such as metagenomic shotgun sequencing for assembly of mitogenomes from bulk specimen samples. The current study explores the application of these techniques for large-scale efforts to build the tree of Coleoptera. We used shotgun data from 17 different ecological and taxonomic datasets (5 unpublished) to assemble a total of 1942 mitogenome contigs of >3000 bp. These sequences were combined into a single dataset together with all mitochondrial data available at GenBank, in addition to nuclear markers widely used in molecular phylogenetics. The resulting matrix of nearly 16,000 species with two or more loci produced trees (RAxML) showing overall congruence with the Linnaean taxonomy at hierarchical levels from suborders to genera. We tested the role of full-length mitogenomes in stabilizing the tree from GenBank data, as mitogenomes might link terminals with non-overlapping gene representation. However, the mitogenome data were only partly useful in this respect, presumably because of the purely automated approach to assembly and gene delimitation; future improvements may be possible by using multiple assemblers and manual curation. In conclusion, the combination of data mining and metagenomic sequencing of bulk samples provided the largest phylogenetic tree of Coleoptera to date, which represents a summary of existing phylogenetic knowledge and a defensible tree of great utility, in particular for studies at the intra-familial level, despite some shortcomings for resolving basal nodes.
- MeSH
- Algorithms MeSH
- Coleoptera classification genetics MeSH
- Databases, Genetic MeSH
- Phylogeny * MeSH
- Metagenomics * MeSH
- Mitochondria genetics MeSH
- Base Sequence MeSH
- Animals MeSH
- Check Tag
- Animals MeSH
- Publication type
- Journal Article MeSH
- Research Support, Non-U.S. Gov't MeSH
The cell wall of the model actinomycete Streptomyces coelicolor M145 has recently been shown to contain the novel glycopolymer teichulosonic acid. The major building block of this polymer is 2-keto-3-deoxy-D-glycero-D-galacto-nononic acid (Kdn), which offers initial clues about the genetic control of biosynthesis of this cell wall component. Here, through genome mining and gene knockouts, we demonstrate that the sco4879-sco4882 genomic region of S. coelicolor M145 is necessary for biosynthesis of teichulosonic acid. Specifically, mutants carrying individual knockouts of the sco4879, sco4880 and sco4881 genes do not produce Kdn-containing glycopolymer and instead accumulate the minor cell wall component poly(diglycosyl 1-phosphate). Our studies provide evidence that this region is at least partly responsible for biosynthesis of Kdn, whereas flanking genes might control the other steps of teichulosonic acid formation.
- MeSH
- Polysaccharides, Bacterial biosynthesis MeSH
- Cell Wall genetics metabolism MeSH
- Data Mining MeSH
- DNA, Bacterial genetics MeSH
- Mutagenesis, Insertional MeSH
- Cloning, Molecular MeSH
- Sugar Acids metabolism MeSH
- Magnetic Resonance Spectroscopy MeSH
- Streptomyces coelicolor genetics metabolism MeSH
- Publication type
- Journal Article MeSH
- Research Support, Non-U.S. Gov't MeSH
BACKGROUND: Recently, deep neural networks have been successfully applied in many biological fields. In 2020, the deep learning model AlphaFold won the protein folding competition with predicted structures within the error tolerance of experimental methods. However, this solution to the most prominent bioinformatic challenge of the past 50 years has been possible only thanks to a carefully curated benchmark of experimentally determined protein structures. In genomics, we have similar challenges (annotation of genomes and identification of functional elements), but we currently lack benchmarks similar to the protein folding competition. RESULTS: Here we present a collection of curated and easily accessible sequence classification datasets in the field of genomics. The proposed collection is based on a combination of novel datasets constructed from the mining of publicly available databases and existing datasets obtained from published articles. The collection currently contains nine datasets that focus on regulatory elements (promoters, enhancers, open chromatin regions) from three model organisms: human, mouse, and roundworm. A simple convolutional neural network is also included in the repository and can be used as a baseline model. Benchmarks and the baseline model are distributed as the Python package 'genomic-benchmarks', and the code is available at https://github.com/ML-Bioinfo-CEITEC/genomic_benchmarks . CONCLUSIONS: Deep learning techniques have revolutionized many biological fields, but mainly thanks to carefully curated benchmarks. For the field of genomics, we propose a collection of benchmark datasets for the classification of genomic sequences with an interface for the most commonly used deep learning libraries, an implementation of a simple neural network, and a training framework that can be used as a starting point for future research.
The main aim of this effort is to create a repository for shared datasets that will make machine learning for genomics more comparable and reproducible while reducing the overhead of researchers who want to enter the field, leading to healthy competition and new discoveries.
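Sequence classifiers such as the convolutional baseline mentioned above usually take a numeric encoding of DNA as input. The sketch below shows one-hot encoding as an assumed preprocessing step; it does not use or reproduce the API of the 'genomic-benchmarks' package, and the function name is hypothetical.

```python
from typing import List

ALPHABET = "ACGT"  # column order of the encoding


def one_hot(sequence: str) -> List[List[int]]:
    """Encode a DNA string as one row per base and one column per letter in ALPHABET.

    Bases outside ALPHABET (for example N) become an all-zero row.
    """
    rows = []
    for base in sequence.upper():
        rows.append([1 if base == letter else 0 for letter in ALPHABET])
    return rows


encoded = one_hot("AcN")
```

The resulting matrix has shape (sequence length, 4) and can be converted to a tensor by whichever deep learning library is in use.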
BACKGROUND: Advances in sequencing technologies have made a plethora of whole-genome re-sequenced (WGRS) data publicly available. However, research utilizing the WGRS data without further configuration is nearly impossible. To solve this problem, our research group has developed an interactive Allele Catalog Tool to enable researchers to explore the coding region allelic variation present in over 1,000 re-sequenced accessions each for soybean, Arabidopsis, and maize. RESULTS: The Allele Catalog Tool was designed originally with soybean genomic data and resources. The Allele Catalog datasets were generated using our variant calling pipeline (SnakyVC) and the Allele Catalog pipeline (AlleleCatalog). The variant calling pipeline was developed to process raw sequencing reads in parallel and generate Variant Call Format (VCF) files, and the Allele Catalog pipeline takes VCF files, performs imputation and functional effect prediction, and assembles alleles for each gene to generate curated Allele Catalog datasets. Both pipelines were utilized to generate the data panels (VCF files and Allele Catalog files) in which the accessions of the WGRS datasets were collected from various sources, currently representing over 1,000 diverse accessions for soybean, Arabidopsis, and maize individually. The main features of the Allele Catalog Tool include data query, visualization of results, categorical filtering, and download functions. Queries are built from user input, and results are presented in tabular form: summary results by categorical description and genotype results for the alleles of each gene. The categorical information is specific to each species; additionally, available detailed meta-information is provided in modal popups. The genotypic information contains the variant positions, reference or alternate genotypes, the functional effect classes, and the amino-acid changes of each accession.
The results can also be downloaded for other research purposes. CONCLUSIONS: The Allele Catalog Tool is a web-based tool that currently supports three species: soybean, Arabidopsis, and maize. The Soybean Allele Catalog Tool is hosted on the SoyKB website ( https://soykb.org/SoybeanAlleleCatalogTool/ ), while the Allele Catalog Tool for Arabidopsis and maize is hosted on the KBCommons website ( https://kbcommons.org/system/tools/AlleleCatalogTool/Zmays and https://kbcommons.org/system/tools/AlleleCatalogTool/Athaliana ). Researchers can use this tool to connect variant alleles of genes with meta-information of species.
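The step of assembling alleles for each gene can be illustrated roughly as follows. This is a simplified sketch, not the AlleleCatalog pipeline: it assumes genotype calls have already been extracted from VCF files into a dictionary per accession, and the data layout and names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List


def assemble_alleles(genotypes: Dict[str, Dict[int, str]],
                     positions: List[int]) -> Dict[str, List[str]]:
    """Group accessions by their combined genotype across a gene's variant positions.

    genotypes: accession name -> {variant position -> called base}
    positions: variant positions within one gene, in the order to report them
    A missing call is written as "." in the allele string.
    """
    catalog: Dict[str, List[str]] = defaultdict(list)
    for accession, calls in sorted(genotypes.items()):
        allele = "|".join(calls.get(pos, ".") for pos in positions)
        catalog[allele].append(accession)
    return dict(catalog)


# Hypothetical calls for three accessions at two variant positions of one gene.
example = assemble_alleles(
    {"acc1": {10: "A", 25: "T"}, "acc2": {10: "A", 25: "T"}, "acc3": {10: "G"}},
    [10, 25],
)
```

Each key of the returned dictionary is one allele of the gene, and its value lists the accessions that carry it, which is the kind of table a catalog query could then join with per-accession meta-information.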
- MeSH
- Alleles * MeSH
- Arabidopsis * genetics MeSH
- Data Mining * methods MeSH
- Datasets as Topic * MeSH
- Gene Frequency MeSH
- Genotype MeSH
- Glycine max * genetics MeSH
- Internet * MeSH
- Zea mays * genetics MeSH
- Metadata MeSH
- Mutation MeSH
- Pigmentation genetics MeSH
- Genes, Plant genetics MeSH
- Software * MeSH
- Amino Acid Substitution MeSH
- Plant Dormancy genetics MeSH
- Data Visualization MeSH
- Publication type
- Journal Article MeSH