Bioinformatic pipeline
BACKGROUND: High-throughput bioinformatics analyses of next generation sequencing (NGS) data often require challenging pipeline optimization. The key problem is choosing appropriate tools and selecting the best parameters for optimal precision and recall. RESULTS: Here we introduce ToTem, a tool for automated pipeline optimization. ToTem is a stand-alone web application with a comprehensive graphical user interface (GUI). ToTem is written in Java and PHP with an underlying connection to a MySQL database. Its primary role is to automatically generate, execute and benchmark different variant calling pipeline settings. Our tool allows an analysis to be started from any level of the process, with the possibility of plugging in almost any tool or code. To prevent over-fitting of pipeline parameters, ToTem ensures their reproducibility by using cross-validation techniques that penalize the final precision, recall and F-measure. The results are presented as interactive graphs and tables, allowing an optimal pipeline to be selected based on the user's priorities. Using ToTem, we were able to optimize somatic variant calling from ultra-deep targeted gene sequencing (TGS) data and germline variant detection in whole genome sequencing (WGS) data. CONCLUSIONS: ToTem is a tool for automated pipeline optimization which is freely available as a web application at https://totem.software.
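The benchmarking metrics named in the ToTem abstract can be illustrated with a minimal Python sketch. This is not ToTem's code (the abstract states ToTem is written in Java and PHP); the function names, the `(chrom, pos, ref, alt)` variant key, and the idea of ranking settings by F-measure alone are assumptions made for illustration, and the cross-validation penalty is not modelled.

```python
from typing import Dict, Iterable, NamedTuple, Tuple

# Hypothetical variant key: (chromosome, position, reference allele, alternate allele).
Variant = Tuple[str, int, str, str]


class Scores(NamedTuple):
    precision: float
    recall: float
    f_measure: float


def score_calls(called: Iterable[Variant], truth: Iterable[Variant]) -> Scores:
    """Compare one pipeline's calls with a truth set of known variants."""
    called_set, truth_set = set(called), set(truth)
    tp = len(called_set & truth_set)   # called and true
    fp = len(called_set - truth_set)   # called but not true
    fn = len(truth_set - called_set)   # true but missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f_measure = 2 * precision * recall / denom if denom else 0.0
    return Scores(precision, recall, f_measure)


def best_setting(results: Dict[str, Iterable[Variant]], truth: Iterable[Variant]) -> str:
    """Return the name of the pipeline setting with the highest F-measure."""
    truth_set = set(truth)
    return max(results, key=lambda name: score_calls(results[name], truth_set).f_measure)
```

A user who values recall over precision could rank by `recall` instead, which mirrors the abstract's point that the optimal pipeline depends on the user's priorities.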
Background: Extensive genome rearrangements, known as chromothripsis, have been recently identified in several cancer types. Chromothripsis leads to complex structural variants (cSVs) causing aberrant gene expression and the formation of de novo fusion genes, which can trigger cancer development, or worsen its clinical course. The functional impact of cSVs can be studied at the RNA level using whole transcriptome sequencing (total RNA-Seq). It represents a powerful tool for discovering, profiling, and quantifying changes of gene expression in the overall genomic context. However, bioinformatic analysis of transcriptomic data, especially in cases with cSVs, is a complex and challenging task, and the development of proper bioinformatic tools for transcriptome studies is necessary. Methods: We designed a bioinformatic workflow for the analysis of total RNA-Seq data consisting of two separate parts (pipelines): The first pipeline incorporates a statistical solution for differential gene expression analysis in a biologically heterogeneous sample set. We utilized results from transcriptomic arrays which were carried out in parallel to increase the precision of the analysis. The second pipeline is used for the identification of de novo fusion genes. Special attention was given to the filtering of false positives (FPs), which was achieved through consensus fusion calling with several fusion gene callers. We applied the workflow to the data obtained from ten patients with chronic lymphocytic leukemia (CLL) to describe the consequences of their cSVs in detail. The fusion genes identified by our pipeline were correlated with genomic breakpoints detected by genomic arrays. Results: We set up a novel solution for differential gene expression analysis of individual samples and de novo fusion gene detection from total RNA-Seq data. The results of the differential gene expression analysis were concordant with results obtained by transcriptomic arrays, which demonstrates the analytical capabilities of our method. We also showed that the consensus fusion gene detection approach was able to identify true positives (TPs) efficiently. Detected coordinates of fusion gene junctions were in concordance with genomic breakpoints assessed using genomic arrays. Discussion: By applying our methods to real clinical samples, we proved that our approach for total RNA-Seq data analysis generates results consistent with other genomic analytical techniques. The data obtained by our analyses provided clues for the study of the biological consequences of cSVs with far-reaching implications for clinical outcome and management of cancer patients. The bioinformatic workflow is also widely applicable for addressing other research questions in different contexts, for which transcriptomic data are generated.
- Publication type
- journal article MeSH
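The consensus fusion calling described in the chromothripsis abstract above can be sketched as a simple vote across callers. The abstract does not say which callers were used, how fusions were matched between callers, or how many callers had to agree, so the caller names, the gene-pair key, and the `min_callers` threshold below are all hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple

# Hypothetical fusion key: (5' partner gene, 3' partner gene).
Fusion = Tuple[str, str]


def consensus_fusions(
    calls_by_caller: Dict[str, Iterable[Fusion]],
    min_callers: int = 2,  # assumed threshold, not taken from the abstract
) -> List[Fusion]:
    """Keep fusions reported by at least `min_callers` distinct callers."""
    support: Dict[Fusion, Set[str]] = defaultdict(set)
    for caller, fusions in calls_by_caller.items():
        for fusion in fusions:
            support[fusion].add(caller)  # a set, so repeated calls from one caller count once
    return sorted(f for f, callers in support.items() if len(callers) >= min_callers)
```

Raising `min_callers` trades sensitivity for fewer false positives, which is the filtering goal the abstract emphasizes.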
Unique molecular identifiers (UMIs) show outstanding performance in targeted high-throughput resequencing, being the most promising approach for the accurate identification of rare variants in complex DNA samples. This approach has application in multiple areas, including cancer diagnostics, thus demanding dedicated software and algorithms. Here we introduce MAGERI, a computational pipeline that efficiently handles all caveats of UMI-based analysis to obtain high-fidelity mutation profiles and call ultra-rare variants. Using an extensive set of benchmark datasets including gold-standard biological samples with known variant frequencies, cell-free DNA from tumor patient blood samples and publicly available UMI-encoded datasets we demonstrate that our method is both robust and efficient in calling rare variants. The versatility of our software is supported by accurate results obtained for both tumor DNA and viral RNA samples in datasets prepared using three different UMI-based protocols.
- MeSH
- databases, genetic MeSH
- humans MeSH
- biomarkers, tumor blood genetics MeSH
- neoplasms genetics MeSH
- RNA, viral genetics MeSH
- sequence analysis, DNA methods MeSH
- sequence analysis, RNA methods MeSH
- software * MeSH
- computational biology methods MeSH
- high-throughput nucleotide sequencing methods MeSH
- Check Tag
- humans MeSH
- Publication type
- journal article MeSH
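The general idea behind UMI-based error correction, as used by pipelines such as MAGERI in the abstract above, can be sketched as follows: reads sharing a UMI are assumed to come from one original molecule, so a per-position majority vote suppresses sequencing errors. This is a simplified illustration and not MAGERI's algorithm; it assumes reads are already trimmed to the same start position, and the `min_reads` cut-off is a hypothetical parameter.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, List, Tuple


def umi_consensus(reads: Iterable[Tuple[str, str]], min_reads: int = 3) -> Dict[str, str]:
    """Build one consensus sequence per UMI from (umi, sequence) pairs.

    UMIs supported by fewer than `min_reads` reads are dropped, because a
    majority vote over very few reads cannot separate errors from true bases.
    """
    groups: Dict[str, List[str]] = defaultdict(list)
    for umi, seq in reads:
        groups[umi].append(seq)

    consensus: Dict[str, str] = {}
    for umi, seqs in groups.items():
        if len(seqs) < min_reads:
            continue
        length = min(len(s) for s in seqs)  # vote only over positions covered by every read
        consensus[umi] = "".join(
            Counter(s[i] for s in seqs).most_common(1)[0][0] for i in range(length)
        )
    return consensus
```

Variant calling would then run on the consensus sequences rather than on raw reads, so an error present in a single read of a UMI group does not appear as a rare variant.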
Next generation sequencing (NGS) platforms are replacing traditional molecular biology protocols like cloning and Sanger sequencing. However, accuracy of NGS platforms has rarely been measured when quantifying relative frequencies of genotypes or taxa within populations. Here we developed a new bioinformatic pipeline (QRS) that pools similar sequence variants and estimates their frequencies in NGS data sets from populations or communities. We tested whether the estimated frequency of representative sequences, generated by 454 amplicon sequencing, differs significantly from that obtained by Sanger sequencing of cloned PCR products. This was performed by analysing sequence variation of the highly variable first internal transcribed spacer (ITS1) of the ichthyosporean Caullerya mesnili, a microparasite of cladocerans of the genus Daphnia. This analysis also serves as a case example of the usage of this pipeline to study within-population variation. Additionally, a public Illumina data set was used to validate the pipeline on community-level data. Overall, there was a good correspondence in absolute frequencies of C. mesnili ITS1 sequences obtained from Sanger and 454 platforms. Furthermore, analyses of molecular variance (amova) revealed that population structure of C. mesnili differs across lakes and years independently of the sequencing platform. Our results support not only the usefulness of amplicon sequencing data for studies of within-population structure but also the successful application of the QRS pipeline on Illumina-generated data. The QRS pipeline is freely available together with its documentation under GNU Public Licence version 3 at http://code.google.com/p/quantification-representative-sequences.
- MeSH
- Daphnia parasitology MeSH
- genetic variation * MeSH
- Mesomycetozoea classification genetics MeSH
- DNA, ribosomal spacer chemistry genetics MeSH
- sequence analysis, DNA * MeSH
- software MeSH
- computational biology methods MeSH
- high-throughput nucleotide sequencing MeSH
- animals MeSH
- Check Tag
- animals MeSH
- Publication type
- journal article MeSH
- evaluation study MeSH
- research support, non-U.S. gov't MeSH
- comparative study MeSH
Molecular profiling of tumor samples has acquired importance in cancer research, but currently also plays an important role in the clinical management of cancer patients. Rapid identification of genomic aberrations improves diagnosis, prognosis and effective therapy selection. This can be attributed mainly to the development of next-generation sequencing (NGS) methods, especially targeted DNA panels. Such panels enable a relatively inexpensive and rapid analysis of various aberrations with clinical impact specific to particular diagnoses. In this review, we discuss the experimental approaches and bioinformatic strategies available for the development of an NGS panel for a reliable analysis of selected biomarkers. Compliance with defined analytical steps is crucial to ensure accurate and reproducible results. In addition, a careful validation procedure has to be performed before the application of NGS targeted assays in routine clinical practice. With more focus on bioinformatics, we emphasize the need for thorough pipeline validation and management in relation to the particular experimental setting as an integral part of the NGS method establishment. A robust and reproducible bioinformatic analysis running on powerful machines is essential for proper detection of genomic variants in clinical settings since distinguishing between experimental noise and real biological variants is fundamental. This review summarizes state-of-the-art bioinformatic solutions for careful detection of the SNV/Indels and CNVs for targeted sequencing resulting in translation of sequencing data into clinically relevant information. Finally, we share our experience with the development of a custom targeted NGS panel for an integrated analysis of biomarkers in lymphoproliferative disorders.
- Publication type
- journal article MeSH
BACKGROUND: Next generation sequencing (NGS) technology allows laboratories to investigate virome composition in clinical and environmental samples in a culture-independent way. There is a need for bioinformatic tools capable of parallel processing of virome sequencing data by exactly identical methods: this is especially important in studies of multifactorial diseases, or in parallel comparison of laboratory protocols. RESULTS: We have developed a web-based application allowing direct upload of sequences from multiple virome samples using custom parameters. The samples are then processed in parallel using an identical protocol, and can be easily reanalyzed. The pipeline performs de-novo assembly, taxonomic classification of viruses as well as sample analyses based on user-defined grouping categories. Tables of virus abundance are produced from cross-validation by remapping the sequencing reads to a union of all observed reference viruses. In addition, read sets and reports are created after processing unmapped reads against known human and bacterial ribosome references. Secured interactive results are dynamically plotted with population and diversity charts, clustered heatmaps and a sortable and searchable abundance table. CONCLUSIONS: The Vipie web application is a unique tool for multi-sample metagenomic analysis of viral data, producing searchable hits tables, interactive population maps, alpha diversity measures and clustered heatmaps that are grouped in applicable custom sample categories. Known references such as human genome and bacterial ribosomal genes are optionally removed from unmapped ('dark matter') reads. Secured results are accessible and shareable on modern browsers. Vipie is a freely available web-based tool whose code is open source.
- MeSH
- genetic variation MeSH
- genomics methods MeSH
- internet * MeSH
- humans MeSH
- microbiota genetics MeSH
- software * MeSH
- viruses genetics MeSH
- high-throughput nucleotide sequencing * MeSH
- Check Tag
- humans MeSH
- Publication type
- journal article MeSH
- research support, non-U.S. gov't MeSH
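The Vipie abstract above mentions alpha diversity measures computed from virus abundance tables without naming them. As an illustration only, the sketch below computes the Shannon index, one commonly used alpha diversity measure, from per-virus read counts of a single sample; whether Vipie reports this particular index is an assumption.

```python
import math
from typing import Iterable


def shannon_index(counts: Iterable[int]) -> float:
    """Shannon diversity H = -sum(p_i * ln p_i) over taxa with non-zero counts."""
    counts = [c for c in counts if c > 0]
    total = sum(counts)
    if total == 0:
        return 0.0  # an empty sample has no diversity to measure
    return -sum((c / total) * math.log(c / total) for c in counts)
```

A sample dominated by one virus scores close to 0, while a sample with reads spread evenly over many viruses scores higher.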
BACKGROUND: Environmental DNA and metabarcoding allow the identification of a mixture of species and launch a new era in bio- and eco-assessment. Many steps are required to obtain taxonomically assigned matrices from raw data. For most of these, a plethora of tools are available; each tool's execution parameters need to be tailored to reflect each experiment's idiosyncrasy. Adding to this complexity, the computation capacity of high-performance computing systems is frequently required for such analyses. To address the difficulties, bioinformatic pipelines need to combine state-of-the-art technologies and algorithms with a framework that is easy to get, set up, and use, allowing researchers to tune each study. Software containerization technologies ease the sharing and running of software packages across operating systems; thus, they strongly facilitate pipeline development and usage. Likewise, programming languages specialized for big data pipelines incorporate features like roll-back checkpoints and on-demand partial pipeline execution. FINDINGS: PEMA is a containerized assembly of key metabarcoding analysis tools that requires low effort in setting up, running, and customizing to researchers' needs. Based on third-party tools, PEMA performs read pre-processing, (molecular) operational taxonomic unit clustering, amplicon sequence variant inference, and taxonomy assignment for 16S and 18S ribosomal RNA, as well as ITS and COI marker gene data. Owing to its simplified parameterization and checkpoint support, PEMA allows users to explore alternative algorithms for specific steps of the pipeline without the need for a complete re-execution. PEMA was evaluated against both mock communities and previously published datasets and achieved results of comparable quality. CONCLUSIONS: A high-performance computing-based approach was used to develop PEMA; however, it can be used on personal computers as well. PEMA's time-efficient performance and good results will allow it to be used for accurate environmental DNA metabarcoding analysis, thus enhancing the applicability of next-generation biodiversity assessment studies.
- MeSH
- Archaea MeSH
- Bacteria MeSH
- DNA, environmental chemistry genetics MeSH
- fungi MeSH
- metagenomics methods standards MeSH
- reference standards MeSH
- electron transport complex IV genetics MeSH
- RNA, ribosomal, 16S genetics MeSH
- RNA, ribosomal, 18S genetics MeSH
- plants MeSH
- sensitivity and specificity MeSH
- software MeSH
- DNA barcoding, taxonomic methods standards MeSH
- animals MeSH
- Check Tag
- animals MeSH
- Publication type
- journal article MeSH
- research support, non-U.S. gov't MeSH
BACKGROUND: Recent advances in genomics indicate functional significance of a majority of genome sequences and their long range interactions. As a detailed examination of genome organization and function requires very high quality genome sequence, the objective of this study was to improve reference genome assembly of banana (Musa acuminata). RESULTS: We have developed a modular bioinformatics pipeline to improve genome sequence assemblies, which can handle various types of data. The pipeline comprises several semi-automated tools. However, unlike classical automated tools that are based on global parameters, the semi-automated tools proposed an expert mode for a user who can decide on suggested improvements through local compromises. The pipeline was used to improve the draft genome sequence of Musa acuminata. Genotyping by sequencing (GBS) of a segregating population and paired-end sequencing were used to detect and correct scaffold misassemblies. Long insert size paired-end reads identified scaffold junctions and fusions missed by automated assembly methods. GBS markers were used to anchor scaffolds to pseudo-molecules with a new bioinformatics approach that avoids the tedious step of marker ordering during genetic map construction. Furthermore, a genome map was constructed and used to assemble scaffolds into super scaffolds. Finally, a consensus gene annotation was projected on the new assembly from two pre-existing annotations. This approach reduced the total Musa scaffold number from 7513 to 1532 (i.e. by 80%), with an N50 that increased from 1.3 Mb (65 scaffolds) to 3.0 Mb (26 scaffolds). 89.5% of the assembly was anchored to the 11 Musa chromosomes compared to the previous 70%. Unknown sites (N) were reduced from 17.3 to 10.0%. CONCLUSION: The release of the Musa acuminata reference genome version 2 provides a platform for detailed analysis of banana genome variation, function and evolution. Bioinformatics tools developed in this work can be used to improve genome sequence assemblies in other species.
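The N50 statistic quoted in the Musa acuminata abstract can be computed as in the sketch below: scaffolds are sorted from longest to shortest, and N50 is the length of the scaffold at which the running total first reaches half of the assembly size. The function also returns how many scaffolds were needed to get there (often called L50); reading the scaffold counts given in parentheses in the abstract as L50 values is an interpretation, not something the abstract states.

```python
from typing import Iterable, Tuple


def n50(lengths: Iterable[int]) -> Tuple[int, int]:
    """Return (N50, L50) for a collection of scaffold lengths.

    N50: length of the scaffold at which the cumulative length, counted from
         the longest scaffold down, first reaches half of the assembly size.
    L50: number of scaffolds needed to reach that point.
    """
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("no scaffold lengths given")
    half = sum(ordered) / 2
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= half:
            return length, count
    raise AssertionError("unreachable: the full sum always reaches half the total")
```

Merging scaffolds into fewer, longer ones raises N50 and lowers L50, which matches the direction of the change reported in the abstract.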
The emergence of high-throughput methodologies such as next-generation sequencing and proteomics has necessitated significant advancements in biological databases and bioinformatic tools, thereby reshaping the landscape of research into parasitic peptidases. In this review we outline the development of these resources alongside the -omics technologies and their transformative impact on the field. Apart from an extensive summary of general and specific databases and tools, we provide a general pipeline showing how to use these resources effectively to identify candidate peptidases in these large datasets and how to gain as much information about them as possible without leaving the office chair. This pipeline is then applied in an illustrative case study on the endothelin-converting enzyme 1 homologue from Schistosoma mansoni, which highlights the contemporary capabilities of bioinformatics. The case study demonstrates how such an approach can help to hypothesize enzyme functions and interactions through computational analysis alone, and emphasizes how such virtual investigations can guide and optimize subsequent wet lab experiments, potentially saving precious time and resources. Finally, by showing what can be achieved without traditional wet laboratory methods, this review provides a compelling narrative on the use of bioinformatics to bridge the gap between big data and practical research applications, highlighting the key role of these technologies in furthering our understanding of parasitic diseases.
Recent advances in Next-Generation Sequencing (NGS) make comparative analyses of the composition and diversity of whole microbial communities possible at a far greater depth than ever before. This brings new challenges, such as an increased dependence on computation to process these huge datasets. The demand on system resources usually requires migrating from Windows to Linux-based operating systems and prior familiarity with command-line interfaces. To overcome this barrier, we developed a fully automated and easy-to-install package as well as a complete, easy-to-follow pipeline for microbial metataxonomic analysis operating in the Windows Subsystem for Linux (WSL): Bioinformatics Through Windows (BTW). BTW combines several open-access tools for processing marker gene data, including 16S rRNA, bringing the user from raw sequencing reads to diversity-related conclusions. It includes data quality filtering, clustering, taxonomic assignment and further statistical analyses, directly in WSL, without first having to migrate from Windows to Linux. BTW is expected to boost the use of NGS amplicon data by facilitating rapid access to a set of bioinformatics tools for Windows users. Moreover, several Linux command-line tools become more accessible, which will open bioinformatics to a wider range of researchers and practitioners in the life sciences and medicine. BTW is available on GitHub (https://github.com/vpylro/BTW). The package is freely available for noncommercial users.
- Publication type
- journal article MeSH