• This record comes from PubMed

Solving the transcription start site identification problem with ADAPT-CAGE: a Machine Learning algorithm for the analysis of CAGE data

Sci Rep. 2020 Jan 21; 10(1): 877. [epub] 20200121

Status: PubMed-not-MEDLINE; Language: English; Country: Great Britain, England; Media: electronic

Document type: Journal Article; Research Support, Non-U.S. Gov't

Links

PubMed 31965016
PubMed Central PMC6972925
DOI 10.1038/s41598-020-57811-3
PII: 10.1038/s41598-020-57811-3

Cap Analysis of Gene Expression (CAGE) has emerged as a powerful experimental technique for identifying transcription start sites (TSSs). There is strong evidence that CAGE also identifies capping sites at various other locations of transcribed loci, such as splicing byproducts, alternative isoforms and capped molecules overlapping introns and exons. We present ADAPT-CAGE, a Machine Learning framework trained to distinguish CAGE signal derived from TSSs from transcriptional noise. ADAPT-CAGE provides highly accurate, experimentally derived TSSs on a genome-wide scale. It has been specifically designed for flexibility and ease of use, requiring only aligned CAGE data and the underlying genomic sequence. When compared to existing algorithms, ADAPT-CAGE exhibits improved performance on every benchmark that we designed, based on both annotation-driven and experimentally driven strategies. This improvement positions ADAPT-CAGE as a computational framework that can assist in refining gene regulatory networks and in incorporating accurate information on gene expression regulators and alternative promoter usage in both physiological and pathological conditions.

