Solving the transcription start site identification problem with ADAPT-CAGE: a Machine Learning algorithm for the analysis of CAGE data
Status: PubMed-not-MEDLINE; Language: English; Country: Great Britain, England; Media: electronic
Document type: Journal Article, Research Support, Non-U.S. Gov't
PubMed: 31965016
PubMed Central: PMC6972925
DOI: 10.1038/s41598-020-57811-3
PII: 10.1038/s41598-020-57811-3
Cap Analysis of Gene Expression (CAGE) has emerged as a powerful experimental technique for assisting in the identification of transcription start sites (TSSs). There is strong evidence that CAGE also identifies capping sites at various other locations within transcribed loci, such as splicing byproducts, alternative isoforms and capped molecules overlapping introns and exons. We present ADAPT-CAGE, a Machine Learning framework trained to distinguish between CAGE signal derived from TSSs and transcriptional noise. ADAPT-CAGE provides highly accurate, experimentally derived TSSs on a genome-wide scale. It has been specifically designed for flexibility and ease of use, requiring only aligned CAGE data and the underlying genomic sequence. When compared to existing algorithms, ADAPT-CAGE exhibits improved performance on every benchmark that we designed, based on both annotation- and experimentally driven strategies. This performance boost brings ADAPT-CAGE into the spotlight as a computational framework that can assist in the refinement of gene regulatory networks and in the incorporation of accurate information on gene expression regulators and alternative promoter usage in both physiological and pathological conditions.
Central European Institute of Technology, Masaryk University, Kamenice 735 5, 62500 Brno, Czech Republic
Department of Electrical and Computer Engineering, University of Thessaly, Volos, Greece