PAPerFly: Partial Assembly-based Peak Finder for ab initio binding site reconstruction

2023 Dec 19; 24(1): 487. [epub] 20231219

Language: English. Country: England, United Kingdom. Medium: electronic

Document type: journal article

Persistent link: https://www.medvik.cz/link/pmid38114921

Grant support
360121 Grantová Agentura, Univerzita Karlova
CZ.02.1.01/0.0/0.0/16 019/0000729 European Regional Development Fund

Links

PubMed 38114921
PubMed Central PMC10731698
DOI 10.1186/s12859-023-05613-5
PII: 10.1186/s12859-023-05613-5
Knihovny.cz e-resources

BACKGROUND: The specific recognition of a DNA locus by a given transcription factor is a widely studied problem. It is generally agreed that recognition is influenced not only by the binding motif but also by the larger context of the binding site. In this work, we present a novel heuristic algorithm that reconstructs the unique binding sites captured in a sequencing experiment without using a reference genome.

RESULTS: We present PAPerFly, the Partial Assembly-based Peak Finder, a tool that reconstructs binding sites and their binding context from sequencing data without any prior knowledge, including the reference genome of the respective organism. It employs algorithmic approaches used in genome assembly. The algorithm constructs a de Bruijn graph from the sequencing data. Based on this graph, sequences and their enrichment are reconstructed using a novel heuristic algorithm. The reconstructed sequences are then aligned, and peaks in the sequence enrichment are identified. We tested the approach by processing several ChIP-seq experiments available in the ENCODE database and comparing the results of PAPerFly with those of standard methods.

CONCLUSIONS: We show that PAPerFly, an algorithm tailored for experiment analysis without a reference genome, yields better results than an aggregation of ChIP-seq agnostic tools. The tool is freely available at https://github.com/Caeph/paperfly/ or on Zenodo (https://doi.org/10.5281/zenodo.7116424).
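The graph-construction step named in the abstract can be illustrated with a minimal Python sketch. It is not PAPerFly's implementation or its heuristic: it only shows the textbook idea of counting k-mers, building a de Bruijn graph whose nodes are (k-1)-mers, and walking an unambiguous path while averaging k-mer counts as a crude stand-in for enrichment. The function names (`kmer_counts`, `build_de_bruijn`, `extend_unambiguous`) are hypothetical, and reverse complements, sequencing errors, and branching are all ignored.

```python
from collections import Counter, defaultdict


def kmer_counts(reads, k):
    """Count every k-mer that occurs in the reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def build_de_bruijn(reads, k):
    """Build a de Bruijn graph: nodes are (k-1)-mers, each k-mer is an edge
    from its prefix to its suffix. Returns (edges, counts)."""
    counts = kmer_counts(reads, k)
    edges = defaultdict(set)
    for kmer in counts:
        edges[kmer[:-1]].add(kmer[1:])
    return edges, counts


def extend_unambiguous(start, edges, counts):
    """Walk forward from `start` while the current node has exactly one
    successor. Returns the spelled sequence and the mean count of the
    k-mers traversed (an illustrative proxy for enrichment, an assumption)."""
    seq, node = start, start
    seen = {start}
    traversed = []
    while len(edges.get(node, ())) == 1:
        (nxt,) = edges[node]
        if nxt in seen:  # stop on a cycle instead of looping forever
            break
        traversed.append(counts[node + nxt[-1]])
        seq += nxt[-1]
        seen.add(nxt)
        node = nxt
    mean = sum(traversed) / len(traversed) if traversed else 0.0
    return seq, mean


if __name__ == "__main__":
    edges, counts = build_de_bruijn(["ACGTT", "CGTTA"], k=3)
    print(extend_unambiguous("AC", edges, counts))
```

Here the two overlapping toy reads are merged into a single longer sequence, and the k-mers shared by both reads contribute higher counts than those at the ends. A real tool has to decide what to do at branching nodes, which is where a heuristic such as the one described in the abstract would come in.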


