Accurate sequencing of DNA motifs able to form alternative (non-B) structures
Language: English; Country: United States; Medium: print-electronic
Document type: journal article; Research Support, N.I.H., Extramural
Grant support
R01 GM136684
NIGMS NIH HHS - United States
PubMed
37433640
PubMed Central
PMC10519405
DOI
10.1101/gr.277490.122
PII: gr.277490.122
- MeSH
- DNA / genetics MeSH
- humans MeSH
- nanopores * MeSH
- nucleotide motifs MeSH
- sequence analysis, DNA MeSH
- high-throughput nucleotide sequencing MeSH
- Z-DNA * MeSH
- base composition MeSH
- Check Tag
- humans MeSH
- Publication type
- journal article MeSH
- Research Support, N.I.H., Extramural MeSH
- Substance names
- DNA MeSH
- Z-DNA * MeSH
Approximately 13% of the human genome lies in motifs that have the potential to form noncanonical (non-B) DNA structures (e.g., G-quadruplexes, cruciforms, and Z-DNA), which regulate many cellular processes but also affect the activity of polymerases and helicases. Because sequencing technologies use these enzymes, they might have elevated error rates at non-B structures. To evaluate this, we analyzed error rates, read depth, and base quality of Illumina, Pacific Biosciences (PacBio) HiFi, and Oxford Nanopore Technologies (ONT) sequencing at non-B motifs. All technologies showed altered sequencing success for most non-B motif types, although this could be owing to several factors, including structure formation, biased GC content, and the presence of homopolymers. Single-nucleotide mismatch errors had low biases in HiFi and ONT for all non-B motif types but were increased for G-quadruplexes and Z-DNA in Illumina. Deletion errors were increased for all non-B types except Z-DNA in Illumina and HiFi, and only for G-quadruplexes in ONT. Insertion errors for non-B motifs were highly, moderately, and slightly elevated in Illumina, HiFi, and ONT, respectively. Additionally, we developed a probabilistic approach to determine the number of false positives at non-B motifs depending on sample size and variant frequency, and applied it to publicly available data sets (1000 Genomes, Simons Genome Diversity Project, and gnomAD). We conclude that elevated sequencing errors at non-B DNA motifs should be considered in low-read-depth studies (single-cell, ancient DNA, and pooled-sample population sequencing) and in scoring rare variants. Combining technologies should maximize sequencing accuracy in future studies of non-B DNA.
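The abstract does not spell out the probabilistic approach, so the following is only an illustrative sketch of how a false-positive count could depend on sample size and variant frequency. It is not the authors' model. It assumes a simple binomial setup: reads at a site err independently with a per-read `error_rate`, a sample receives a false call when at least `min_alt_reads` reads carry the same error, and a false variant is reported when at least `min_carriers` of `n_samples` samples receive such a call. All function names, parameters, and numeric values are hypothetical.

```python
from math import exp, lgamma, log, log1p


def binom_pmf(n: int, p: float, i: int) -> float:
    """P(X = i) for X ~ Binomial(n, p), computed in log space to avoid overflow."""
    if i < 0 or i > n:
        return 0.0
    if p <= 0.0:
        return 1.0 if i == 0 else 0.0
    if p >= 1.0:
        return 1.0 if i == n else 0.0
    log_choose = lgamma(n + 1) - lgamma(i + 1) - lgamma(n - i + 1)
    return exp(log_choose + i * log(p) + (n - i) * log1p(-p))


def binom_tail(n: int, p: float, k: int) -> float:
    """P(X >= k) for X ~ Binomial(n, p)."""
    if k <= 0:
        return 1.0
    return min(1.0, sum((binom_pmf(n, p, i) for i in range(k, n + 1)), 0.0))


def per_sample_false_call(depth: int, error_rate: float, min_alt_reads: int) -> float:
    """Probability that one sample shows >= min_alt_reads erroneous reads at a site."""
    return binom_tail(depth, error_rate, min_alt_reads)


def expected_false_positive_sites(
    n_sites: int,
    n_samples: int,
    min_carriers: int,
    depth: int,
    error_rate: float,
    min_alt_reads: int,
) -> float:
    """Expected number of sites at which >= min_carriers samples get a false call.

    Assumes sites and samples are independent and share one depth and error rate
    (a deliberate simplification; real data vary in both).
    """
    q = per_sample_false_call(depth, error_rate, min_alt_reads)
    p_site = binom_tail(n_samples, q, min_carriers)
    return n_sites * p_site


if __name__ == "__main__":
    # Hypothetical comparison: the same cohort and calling rule, with a baseline
    # error rate versus a fivefold higher rate standing in for a non-B motif.
    for label, rate in (("baseline", 0.001), ("elevated", 0.005)):
        fp = expected_false_positive_sites(
            n_sites=1_000_000, n_samples=2500, min_carriers=1,
            depth=7, error_rate=rate, min_alt_reads=2,
        )
        print(label, fp)
```

Raising `min_carriers` mimics restricting attention to more frequent variants, and lowering `depth` mimics low-coverage designs. Under these assumptions, the sketch is consistent with the abstract's point that rare variants and low-read-depth studies are the most exposed to elevated error rates.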
Center for Medical Genomics, The Pennsylvania State University, University Park, Pennsylvania 16802, USA
Department of Biology, The Pennsylvania State University, University Park, Pennsylvania 16802, USA
Department of Operations and Decision Systems, Université Laval, Quebec, Quebec G1V0A6, Canada
Department of Statistics, The Pennsylvania State University, University Park, Pennsylvania 16802, USA
Faculty of Informatics, Masaryk University, 60200 Brno, Czech Republic
Institute of Economics and L'EMbeDS, Sant'Anna School of Advanced Studies, Pisa 56127, Italy
Laboratory of Cell Biology, NCI, CCR, National Institutes of Health, Bethesda, Maryland 20892, USA