
This record comes from PubMed

PAPerFly: Partial Assembly-based Peak Finder for ab initio binding site reconstruction

Faltejsková, Kateřina
Institute of Organic Chemistry and Biochemistry of the Czech Academy of Sciences, Flemingovo náměstí 542/2, 160 00, Prague, Czech Republic; Computer Science Institute, Faculty of Mathematics and Physics, Charles University, Malostranské náměstí 25, 118 00, Prague, Czech Republic. katerina.faltejskova@uochb.cas.cz
Vondrášek, Jiří
Institute of Organic Chemistry and Biochemistry of the Czech Academy of Sciences, Flemingovo náměstí 542/2, 160 00, Prague, Czech Republic. jiri.vondrasek@uochb.cas.cz

BMC Bioinformatics. 2023 Dec 19; 24(1): 487. [epub] 20231219

BMC Bioinformatics
ISSN 1471-2105
Source

Language English Country England, Great Britain Media electronic

Document type Journal Article

Persistent link https://www.medvik.cz/link/pmid38114921

Grant support
360121 Grantová Agentura, Univerzita Karlova
CZ.02.1.01/0.0/0.0/16 019/0000729 European Regional Development Fund

Online Full text

PubMed 38114921
PubMed Central PMC10731698
DOI 10.1186/s12859-023-05613-5
PII: 10.1186/s12859-023-05613-5

Keywords
Algorithm, ChIP-seq, DNA recognition, Graph theory, Peak analysis, Transcription factor
MeSH
Algorithms *
Chromatin Immunoprecipitation Sequencing
Genome
Sequence Analysis, DNA / methods
Transcription Factors * / metabolism
Binding Sites
Publication type
Journal Article
Names of Substances
Transcription Factors *

BACKGROUND: The specific recognition of a DNA locus by a given transcription factor is a widely studied issue. It is generally agreed that the recognition can be influenced not only by the binding motif but also by the larger context of the binding site. In this work, we present a novel heuristic algorithm that can reconstruct the unique binding sites captured in a sequencing experiment without using the reference genome.

RESULTS: We present PAPerFly, the Partial Assembly-based Peak Finder, a tool that reconstructs binding sites and their binding context from sequencing data without any prior knowledge. In particular, the tool does not require the reference genome of the respective organism. We employ algorithmic approaches that are used during genome assembly. The proposed algorithm constructs a de Bruijn graph from the sequencing data. Based on this graph, sequences and their enrichment are reconstructed using a novel heuristic algorithm. The reconstructed sequences are aligned and the peaks in the sequence enrichment are identified. Our approach was tested by processing several ChIP-seq experiments available in the ENCODE database and comparing the results of PAPerFly with those of standard methods.

CONCLUSIONS: We show that PAPerFly, an algorithm tailored for experiment analysis without the reference genome, yields better results than an aggregation of ChIP-seq-agnostic tools. Our tool is freely available at https://github.com/Caeph/paperfly/ or on Zenodo (https://doi.org/10.5281/zenodo.7116424).
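
The abstract mentions that the algorithm starts by building a de Bruijn graph from the sequencing reads. As a rough illustration of that general technique only, the following minimal Python sketch counts k-mers and links overlapping (k-1)-mers, keeping the k-mer count as an edge weight. It is not PAPerFly's implementation: the function names `kmer_counts` and `de_bruijn_graph`, the toy reads, and the choice of k=4 are all assumptions made for this example.

```python
from collections import Counter, defaultdict


def kmer_counts(reads, k):
    """Count every k-mer that occurs in the reads (hypothetical helper)."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def de_bruijn_graph(counts):
    """Build a weighted de Bruijn graph from k-mer counts.

    Nodes are (k-1)-mers. Each k-mer contributes one edge from its
    prefix to its suffix, weighted by how often the k-mer was seen.
    """
    graph = defaultdict(dict)
    for kmer, n in counts.items():
        graph[kmer[:-1]][kmer[1:]] = n
    return graph


# Toy reads, invented for illustration only.
reads = ["ACGTAC", "CGTACG", "GTACGT"]
counts = kmer_counts(reads, k=4)
graph = de_bruijn_graph(counts)

for node, successors in sorted(graph.items()):
    print(node, "->", successors)
```

In a graph like this, the edge weights give a crude proxy for how strongly a sequence is represented in the reads; how PAPerFly actually derives enrichment from its graph is described in the full paper, not here.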

Computer Science Institute, Faculty of Mathematics and Physics, Charles University, Malostranské náměstí 25, 118 00, Prague, Czech Republic

Institute of Organic Chemistry and Biochemistry of the Czech Academy of Sciences, Flemingovo náměstí 542/2, 160 00, Prague, Czech Republic

