ToTem: a tool for variant calling pipeline optimization
Language: English
Country: England, Great Britain
Media: electronic
Document type: Journal Article, Research Support, Non-U.S. Gov't
PubMed: 29940847
PubMed Central: PMC6020218
DOI: 10.1186/s12859-018-2227-x
PII: 10.1186/s12859-018-2227-x
Keywords: Benchmarking, Next generation sequencing, Parameter optimization, Variant calling
MeSH:
- Reproducibility of Results
- Software
- Computational Biology, methods
- High-Throughput Nucleotide Sequencing, methods
- Research Design
BACKGROUND: High-throughput bioinformatics analyses of next generation sequencing (NGS) data often require challenging pipeline optimization. The key problem is choosing appropriate tools and selecting the parameters that give the best precision and recall.

RESULTS: Here we introduce ToTem, a tool for automated pipeline optimization. ToTem is a stand-alone web application with a comprehensive graphical user interface (GUI). It is written in Java and PHP and is backed by a MySQL database. Its primary role is to automatically generate, execute and benchmark different variant calling pipeline settings. An analysis can be started from any level of the process, and almost any tool or code can be plugged in. To prevent over-fitting of pipeline parameters, ToTem checks the reproducibility of the results using cross-validation techniques that penalize the final precision, recall and F-measure. The results are presented as interactive graphs and tables, allowing an optimal pipeline to be selected according to the user's priorities. Using ToTem, we were able to optimize somatic variant calling from ultra-deep targeted gene sequencing (TGS) data and germline variant detection in whole genome sequencing (WGS) data.

CONCLUSIONS: ToTem is a tool for automated pipeline optimization which is freely available as a web application at https://totem.software.
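The benchmarking idea described in the abstract can be illustrated with a minimal Python sketch. It is not ToTem's code (ToTem itself is written in Java and PHP), and the abstract does not specify how the cross-validation penalty is computed. The function names, the example parameter names (`caller`, `min_af`) and the penalty rule (mean across folds minus the spread between folds) are all assumptions chosen for illustration only.

```python
from itertools import product
from statistics import mean, pstdev


def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, F-measure) from variant counts.

    tp: called variants that are in the truth set
    fp: called variants that are not in the truth set
    fn: truth-set variants that were not called
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def parameter_grid(options):
    """Expand {name: [values]} into a list of every parameter combination."""
    names = sorted(options)
    return [dict(zip(names, values))
            for values in product(*(options[n] for n in names))]


def penalized_score(fold_scores):
    """Hypothetical penalty (an assumption, not ToTem's rule):
    mean score across folds minus the population standard deviation,
    so settings that vary a lot between folds rank lower."""
    return mean(fold_scores) - pstdev(fold_scores)


# Hypothetical settings: two callers and two allele-frequency thresholds.
settings = parameter_grid({"caller": ["a", "b"], "min_af": [0.01, 0.05]})
```

Each entry of `settings` would correspond to one pipeline run. Scoring every run with `precision_recall_f1` on each fold and ranking by `penalized_score` gives a simple way to prefer settings that perform well consistently rather than on a single data subset.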
Department of Computer Science, Faculty of Science, Palacky University, Olomouc, Czech Republic
Genomics Core Facility, European Molecular Biology Laboratory, Heidelberg, Germany