graph-based clustering
BACKGROUND: The investigation of plant genome structure and evolution requires comprehensive characterization of repetitive sequences that make up the majority of higher plant nuclear DNA. Since genome-wide characterization of repetitive elements is complicated by their high abundance and diversity, novel approaches based on massively-parallel sequencing are being adapted to facilitate the analysis. It has recently been demonstrated that the low-pass genome sequencing provided by a single 454 sequencing reaction is sufficient to capture information about all major repeat families, thus providing the opportunity for efficient repeat investigation in a wide range of species. However, the development of appropriate data mining tools is required in order to fully utilize this sequencing data for repeat characterization. RESULTS: We adapted a graph-based approach for similarity-based partitioning of whole genome 454 sequence reads in order to build clusters made of the reads derived from individual repeat families. The information about cluster sizes was utilized for assessing the proportion and composition of repeats in the genomes of two model species, Pisum sativum and Glycine max, differing in genome size and 454 sequencing coverage. Moreover, statistical analysis and visual inspection of the topology of the cluster graphs using a newly developed program tool, SeqGrapheR, were shown to be helpful in distinguishing basic types of repeats and investigating sequence variability within repeat families. CONCLUSIONS: Repetitive regions of plant genomes can be efficiently characterized by the presented graph-based analysis and the graph representation of repeats can be further used to assess the variability and evolutionary divergence of repeat families, discover and characterize novel elements, and aid in subsequent assembly of their consensus sequences.
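The similarity-based partitioning described in this abstract can be illustrated with a minimal sketch. It assumes pairwise read similarities have already been computed and are supplied as a list of read-ID pairs. Connected components (via union-find) are used here only as a simple stand-in for the partitioning step, which the abstract does not specify in detail. The names `cluster_reads` and `genome_proportions` are hypothetical.

```python
def cluster_reads(read_ids, similarity_hits):
    """Group reads into clusters: reads linked by a similarity hit share a cluster.

    Assumption: every read named in similarity_hits also appears in read_ids.
    """
    parent = {r: r for r in read_ids}

    def find(x):
        # Follow parent links to the root, compressing the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in similarity_hits:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b

    clusters = {}
    for r in read_ids:
        clusters.setdefault(find(r), []).append(r)
    # Largest clusters first, since cluster size reflects repeat abundance.
    return sorted(clusters.values(), key=len, reverse=True)


def genome_proportions(clusters, total_reads):
    """Estimate each cluster's share of the genome as its share of all reads."""
    return [len(c) / total_reads for c in clusters]


reads = ["r1", "r2", "r3", "r4", "r5", "r6"]
hits = [("r1", "r2"), ("r2", "r3"), ("r4", "r5")]
clusters = cluster_reads(reads, hits)
print([len(c) for c in clusters], genome_proportions(clusters, len(reads)))
```

Because low-pass reads sample the genome roughly at random, the fraction of reads in a cluster serves as an approximate estimate of the genomic proportion of the corresponding repeat family.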
A large proportion of genomic information, particularly repetitive elements, is usually ignored when researchers are using next-generation sequencing. Here we demonstrate the usefulness of this repetitive fraction in phylogenetic analyses, utilizing comparative graph-based clustering of next-generation sequence reads, which results in abundance estimates of different classes of genomic repeats. Phylogenetic trees are then inferred based on the genome-wide abundance of different repeat types treated as continuously varying characters; such repeats are scattered across chromosomes and in angiosperms can constitute a majority of nuclear genomic DNA. In six diverse examples, five angiosperms and one insect, this method provides generally well-supported relationships at interspecific and intergeneric levels that agree with results from more standard phylogenetic analyses of commonly used markers. We propose that this methodology may prove especially useful in groups where there is little genetic differentiation in standard phylogenetic markers. At the same time as providing data for phylogenetic inference, this method additionally yields a wealth of data for comparative studies of genome evolution.
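To make the idea of repeat abundances as continuously varying characters concrete, the sketch below compares taxa by their abundance vectors. This is not the tree inference used in the study, which the abstract does not detail. It only shows, under the assumption that each taxon is described by the abundances of the same ordered set of repeat clusters, how such vectors can be compared pairwise. The function name `abundance_distances` and the example values are hypothetical.

```python
import math


def abundance_distances(abundances):
    """Return Euclidean distances between taxa from their repeat-abundance vectors.

    abundances maps a taxon name to a list of abundances, one per repeat
    cluster, with clusters in the same order for every taxon.
    """
    taxa = sorted(abundances)
    dist = {}
    for a in taxa:
        for b in taxa:
            dist[(a, b)] = math.sqrt(
                sum((x - y) ** 2 for x, y in zip(abundances[a], abundances[b]))
            )
    return dist


# Hypothetical abundances (arbitrary units) for three taxa and two repeat clusters.
example = {"taxon_A": [0.0, 0.0], "taxon_B": [3.0, 4.0], "taxon_C": [3.0, 4.5]}
d = abundance_distances(example)
print(d[("taxon_A", "taxon_B")], d[("taxon_B", "taxon_C")])
```

A distance matrix like this could feed a distance-based tree method, whereas treating each abundance directly as a continuous character keeps more of the per-repeat signal.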
- MeSH
- DNA, Plant genetics MeSH
- Drosophila classification genetics MeSH
- Phylogeny * MeSH
- Genome genetics MeSH
- Genes, Insect genetics MeSH
- Magnoliopsida genetics MeSH
- Repetitive Sequences, Nucleic Acid genetics MeSH
- Cluster Analysis MeSH
- Animals MeSH
- Check Tag
- Animals MeSH
- Publication Type
- Journal Article MeSH
- Research Support, Non-U.S. Gov't MeSH
MOTIVATION: Repetitive DNA makes up large portions of plant and animal nuclear genomes, yet it remains the least-characterized genome component in most species studied so far. Although the recent availability of high-throughput sequencing data provides necessary resources for in-depth investigation of genomic repeats, its utility is hampered by the lack of specialized bioinformatics tools and appropriate computational resources that would enable large-scale repeat analysis to be run by biologically oriented researchers. RESULTS: Here we present RepeatExplorer, a collection of software tools for characterization of repetitive elements, which is accessible via a web interface. A key component of the server is the computational pipeline using a graph-based sequence clustering algorithm to facilitate de novo repeat identification without the need for reference databases of known elements. Because the algorithm uses short sequences randomly sampled from the genome as input, it is ideal for analyzing next-generation sequence reads. Additional tools are provided to aid in classification of identified repeats, investigate phylogenetic relationships of retroelements and perform comparative analysis of repeat composition between multiple species. The server can analyze several million sequence reads, which typically results in identification of most high- and medium-copy repeats in higher plant genomes.
- MeSH
- Algorithms MeSH
- DNA chemistry MeSH
- Eukaryota genetics MeSH
- Phylogeny MeSH
- Genome MeSH
- Internet MeSH
- Repetitive Sequences, Nucleic Acid * MeSH
- Sequence Analysis, DNA * MeSH
- Cluster Analysis MeSH
- Software * MeSH
- High-Throughput Nucleotide Sequencing * MeSH
- Publication Type
- Journal Article MeSH
- Research Support, Non-U.S. Gov't MeSH
RepeatExplorer2 is a novel version of a computational pipeline that uses graph-based clustering of next-generation sequencing reads for characterization of repetitive DNA in eukaryotes. The clustering algorithm facilitates repeat identification in any genome by using relatively small quantities of short sequence reads, and additional tools within the pipeline perform automatic annotation and quantification of the identified repeats. The pipeline is integrated into the Galaxy platform, which provides a user-friendly web interface for script execution and documentation of the results. Compared to the original version of the pipeline, RepeatExplorer2 provides automated annotation of transposable elements, identification of tandem repeats and enhanced visualization of analysis results. Here, we present an overview of the RepeatExplorer2 workflow and provide procedures for its application to (i) de novo repeat identification in a single species, (ii) comparative repeat analysis in a set of species, (iii) development of satellite DNA probes for cytogenetic experiments and (iv) identification of centromeric repeats based on ChIP-seq data. Each procedure takes approximately 2 days to complete. RepeatExplorer2 is available at https://repeatexplorer-elixir.cerit-sc.cz.
- MeSH
- DNA Probes chemistry genetics MeSH
- DNA chemistry genetics MeSH
- Genomics methods MeSH
- Humans MeSH
- Repetitive Sequences, Nucleic Acid MeSH
- Sequence Analysis, DNA methods MeSH
- Cluster Analysis MeSH
- Software MeSH
- DNA Transposable Elements MeSH
- High-Throughput Nucleotide Sequencing methods MeSH
- Animals MeSH
- Check Tag
- Humans MeSH
- Animals MeSH
- Publication Type
- Journal Article MeSH
- Research Support, Non-U.S. Gov't MeSH
Satellite DNA is one of the major classes of repetitive DNA, characterized by tandemly arranged repeat copies that form contiguous arrays up to megabases in length. This type of genomic organization makes satellite DNA difficult to assemble, which hampers characterization of satellite sequences by computational analysis of genomic contigs. Here, we present tandem repeat analyzer (TAREAN), a novel computational pipeline that circumvents this problem by detecting satellite repeats directly from unassembled short reads. The pipeline first employs graph-based sequence clustering to identify groups of reads that represent repetitive elements. Putative satellite repeats are subsequently detected by the presence of circular structures in their cluster graphs. Consensus sequences of repeat monomers are then reconstructed from the most frequent k-mers obtained by decomposing read sequences from corresponding clusters. The pipeline performance was successfully validated by analyzing low-pass genome sequencing data from five plant species where satellite DNA was previously experimentally characterized. Moreover, novel satellite repeats were predicted for the genome of Vicia faba and three of these repeats were verified by detecting their sequences on metaphase chromosomes using fluorescence in situ hybridization.
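The k-mer decomposition step mentioned in this abstract can be sketched as follows. This is only an illustration of counting k-mers across the reads of one cluster and ranking them by frequency. It does not reproduce how TAREAN assembles a monomer consensus from them, and the function name `most_frequent_kmers` and the example reads are hypothetical.

```python
from collections import Counter


def most_frequent_kmers(reads, k, top_n):
    """Decompose reads into overlapping k-mers and return the top_n most frequent."""
    counts = Counter()
    for read in reads:
        # Slide a window of length k along the read, one base at a time.
        for start in range(len(read) - k + 1):
            counts[read[start:start + k]] += 1
    return counts.most_common(top_n)


# Hypothetical reads from a cluster dominated by a short tandem repeat.
cluster_reads_example = ["ACGTACGTAC", "GTACGTACGT"]
print(most_frequent_kmers(cluster_reads_example, k=4, top_n=3))
```

In a tandemly repeated sequence, k-mers drawn from the monomer recur far more often than k-mers containing sequencing errors or rare variants, which is why the most frequent k-mers are a reasonable starting point for a consensus.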
- MeSH
- DNA, Plant genetics MeSH
- Genome, Plant * MeSH
- Peas genetics MeSH
- In Situ Hybridization, Fluorescence MeSH
- Consensus Sequence MeSH
- Zea mays genetics MeSH
- Magnoliopsida genetics MeSH
- Chromosome Mapping methods MeSH
- Metaphase MeSH
- Computer Graphics MeSH
- Cyperaceae genetics MeSH
- DNA, Satellite classification genetics MeSH
- Base Sequence MeSH
- Sequence Analysis, DNA MeSH
- Cluster Analysis MeSH
- Software * MeSH
- Vicia faba genetics MeSH
- Publication Type
- Journal Article MeSH
Repetitive sequences are ubiquitous components of all eukaryotic genomes. They contribute to genome evolution and the regulation of gene transcription. However, the uncontrolled activity of repetitive sequences can negatively affect genome functions and stability. Therefore, repetitive DNAs are embedded in a highly repressive heterochromatic environment in plant cell nuclei. Here, we analyzed the sequence, composition and the epigenetic makeup of peculiar non-pericentromeric heterochromatic segments in the genome of the Australian crucifer Ballantinia antipoda. By combining high-throughput sequencing, graph-based clustering and cytogenetics, we found that the heterochromatic segments consist of a mixture of unique sequences and an A-T-rich 174 bp satellite repeat (BaSAT1). BaSAT1 occupies about 10% of the B. antipoda nuclear genome in >250 000 copies. Unlike many other highly repetitive sequences, BaSAT1 repeats are hypomethylated; this contrasts with the normal patterns of DNA methylation in the B. antipoda genome. Detailed analysis of several copies revealed that these non-methylated BaSAT1 repeats were also devoid of heterochromatic histone H3K9me2 methylation. However, the factors decisive for the methylation status of BaSAT1 repeats currently remain unknown. In summary, we show that even highly repetitive sequences can exist as hypomethylated in the plant nuclear genome.
- MeSH
- Arabidopsis genetics MeSH
- Tracheophyta chemistry genetics metabolism MeSH
- Epigenesis, Genetic MeSH
- Phylogeny MeSH
- Genome, Plant MeSH
- Heterochromatin genetics metabolism MeSH
- Histones chemistry metabolism MeSH
- DNA Methylation genetics MeSH
- DNA, Satellite chemistry genetics metabolism MeSH
- High-Throughput Nucleotide Sequencing MeSH
- Publication Type
- Journal Article MeSH
- Research Support, Non-U.S. Gov't MeSH
... 6.11 Gene Prediction 193 -- 6.12 Statistical Approaches to Gene Prediction 197 -- 6.13 Similarity-Based ... ... Algorithms 247 -- 8.1 Graphs 247 -- 8.2 Graphs and Genetics 260 -- 8.3 DNA Sequencing 262 -- 8.4 Shortest ... ... and Trees 339 -- 10.1 Gene Expression Analysis 339 -- 10.2 Hierarchical Clustering 343 -- 10.3 k-Means ... ... Clustering 346 -- 10.4 Clustering and Corrupted Cliques 348 -- 10.5 Evolutionary Trees 354 -- 10.6 Distance-Based ... ... 366 -- 10.9 Character-Based Tree Reconstruction 368 -- 10.10 Small Parsimony Problem 370 -- 10.11 Large ...
Computational molecular biology series
[1st ed.] xviii, 435 p. : ill.
- MeSH
- Algorithms MeSH
- Informatics MeSH
- Conspectus
- Medical sciences. Medicine
- NLK Subject Fields
- medical informatics
... Origin of Life Problem, 288 Autocatalytic Sets of Catalytic Polymers, 298 -- Growth on the Infinite Graph ... ... Components in the Genetic Regulatory Systems of Prokaryotes and Eukaryotes, 412 -- An Ensemble Theory Based ... ... on Random Directed Graphs, 419 Summary, 439 -- 12. ...
1st ed. 709 p. : ill.
- Keywords
- Biology, Evolution, Phylogeny
- MeSH
- Biological Evolution MeSH
- Biology MeSH
- Phylogeny MeSH
- Evolution, Molecular MeSH
- Origin of Life MeSH
BACKGROUND: The banana family (Musaceae) includes a genetically diverse group of species and their diploid and polyploid hybrids that are widely cultivated in the tropics. In spite of their socio-economic importance, the knowledge of Musaceae genomes is basically limited to draft genome assemblies of two species, Musa acuminata and M. balbisiana. Here we aimed to complement this information by analyzing repetitive genome fractions of six species selected to represent various phylogenetic groups within the family. RESULTS: Low-pass sequencing of M. acuminata, M. ornata, M. textilis, M. beccarii, M. balbisiana, and Ensete gilletii genomes was performed using a 454/Roche platform. Sequence reads were subjected to analysis of their overall intra- and inter-specific similarities, and all major repeat families were quantified using graph-based clustering. Maximus/SIRE and Angela lineages of Ty1/copia long terminal repeat (LTR) retrotransposons and the chromovirus lineage of Ty3/gypsy elements were found to make up most of the highly repetitive DNA in all species (14-34.5% of the genome). However, there were quantitative differences and sequence variations detected for classified repeat families as well as for the bulk of total repetitive DNA. These differences were most pronounced between species from different taxonomic sections of the Musaceae family, whereas pairs of closely related species (M. acuminata/M. ornata and M. beccarii/M. textilis) shared similar populations of repetitive elements. CONCLUSIONS: This study provided the first insight into the composition and sequence variation of repetitive parts of Musaceae genomes. It allowed identification of repetitive sequences specific for a single species or a group of species that can be utilized as molecular markers in breeding programs, and it generated computational resources that will be instrumental in repeat masking and annotation in future genome assembly projects.
- MeSH
- Musaceae classification genetics MeSH
- DNA, Plant analysis genetics MeSH
- Phylogeny MeSH
- Genetic Variation MeSH
- Genome, Plant * MeSH
- Evolution, Molecular MeSH
- Repetitive Sequences, Nucleic Acid * MeSH
- Sequence Analysis, DNA MeSH
- Computational Biology methods MeSH
- Publication Type
- Journal Article MeSH
- Research Support, Non-U.S. Gov't MeSH
OBJECTIVES: The aim was to describe the contribution of basal ganglia (BG) thalamo-cortical circuitry to the whole-brain functional connectivity in focal epilepsies. METHODS: Interictal resting-state fMRI recordings were acquired in 46 persons with focal epilepsies. Of these 46, 22 had temporal lobe epilepsy: 9 left temporal (LTLE), 13 right temporal (RTLE); 15 had frontal lobe epilepsy (FLE); and 9 had parietal/occipital lobe epilepsy (POLE). There were 20 healthy controls. The complete weighted network was analyzed based on correlation matrices of 90 and 194 regions. The network topology was quantified on a global and regional level by measures based on graph theory, and connection-level changes were analyzed by the partial least squares method. RESULTS: In all patient groups except RTLE, a shift of the functional network topology away from random was observed (normalized clustering coefficient and characteristic path length were higher in patient groups than in controls). Links contributing to this change were found in the cortico-subcortical connections. Weak connections (low correlations) consistently contributed to this modification of the network. The importance of regions changed: decreases in the subcortical areas and both decreases and increases in the cortical areas were observed in node strength, clustering coefficient and eigenvector centrality in patient groups when compared to controls. Node strength decreases of the basal ganglia, i.e. the putamen, caudate, and pallidum, were displayed in LTLE, FLE, and POLE. The connectivity within the basal ganglia-thalamus circuitry was not disturbed; the disturbance concerned the connectivity between the circuitry and the cortex. SIGNIFICANCE: Focal epilepsies affect large-scale brain networks beyond the epileptogenic zones. Cortico-subcortical functional connectivity disturbance was displayed in LTLE, FLE, and POLE. Significant changes in the resting-state functional connectivity between cortical and subcortical structures suggest an important role of the BG and thalamus in focal epilepsies.
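Two of the regional measures named in this abstract, node strength and clustering coefficient, can be illustrated on a small weighted network. The sketch assumes a symmetric matrix of non-negative weights (for example, absolute correlations) with a zero diagonal, and it uses the Onnela-style weighted clustering coefficient. The abstract does not state which definitions the authors applied, so both choices are assumptions, and the function names are hypothetical.

```python
def node_strength(w):
    """Sum of the weights of all edges attached to each node."""
    n = len(w)
    return [sum(w[i][j] for j in range(n) if j != i) for i in range(n)]


def weighted_clustering(w):
    """Onnela-style weighted clustering coefficient for each node.

    Weights are scaled by the largest off-diagonal weight; each triangle
    contributes the geometric mean of its three scaled edge weights.
    """
    n = len(w)
    w_max = max((w[i][j] for i in range(n) for j in range(n) if i != j), default=0.0)
    coeffs = []
    for i in range(n):
        neighbours = [j for j in range(n) if j != i and w[i][j] > 0]
        k = len(neighbours)
        if k < 2 or w_max == 0:
            coeffs.append(0.0)
            continue
        total = 0.0
        for a in neighbours:
            for b in neighbours:
                if a != b and w[a][b] > 0:
                    total += (w[i][a] * w[i][b] * w[a][b] / w_max ** 3) ** (1.0 / 3.0)
        coeffs.append(total / (k * (k - 1)))
    return coeffs


# Hypothetical 3-region network with absolute correlations as weights.
weights = [
    [0.0, 0.8, 0.2],
    [0.8, 0.0, 0.5],
    [0.2, 0.5, 0.0],
]
print(node_strength(weights), weighted_clustering(weights))
```

In a complete weighted network every pair of regions is connected, so differences in the clustering coefficient come from the edge weights rather than from which edges exist.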
- MeSH
- Basal Ganglia diagnostic imaging physiopathology MeSH
- Adult MeSH
- Electroencephalography MeSH
- Epilepsies, Partial diagnostic imaging physiopathology MeSH
- Oxygen blood MeSH
- Middle Aged MeSH
- Humans MeSH
- Magnetic Resonance Imaging MeSH
- Brain Mapping * MeSH
- Young Adult MeSH
- Cerebral Cortex diagnostic imaging MeSH
- Nerve Net diagnostic imaging MeSH
- Neural Pathways diagnostic imaging physiopathology MeSH
- Image Processing, Computer-Assisted MeSH
- Aged MeSH
- Check Tag
- Adult MeSH
- Middle Aged MeSH
- Humans MeSH
- Young Adult MeSH
- Male MeSH
- Aged MeSH
- Female MeSH
- Publication Type
- Journal Article MeSH
- Research Support, Non-U.S. Gov't MeSH