Genomic benchmarks: a collection of datasets for genomic sequence classification
Jazyk angličtina Země Anglie, Velká Británie Médium electronic
Typ dokumentu časopisecké články, práce podpořená grantem
PubMed
37127596
PubMed Central
PMC10150520
DOI
10.1186/s12863-023-01123-8
PII: 10.1186/s12863-023-01123-8
Knihovny.cz E-zdroje
- Klíčová slova
- Benchmark, Convolutional neural network, Dataset, Deep learning, Genomics,
- MeSH
- benchmarking * MeSH
- chromatin MeSH
- genomika metody MeSH
- lidé MeSH
- myši MeSH
- neuronové sítě (počítačové) * MeSH
- strojové učení MeSH
- zvířata MeSH
- Check Tag
- lidé MeSH
- myši MeSH
- zvířata MeSH
- Publikační typ
- časopisecké články MeSH
- práce podpořená grantem MeSH
- Názvy látek
- chromatin MeSH
BACKGROUND: Recently, deep neural networks have been successfully applied in many biological fields. In 2020, a deep learning model AlphaFold won the protein folding competition with predicted structures within the error tolerance of experimental methods. However, this solution to the most prominent bioinformatic challenge of the past 50 years has been possible only thanks to a carefully curated benchmark of experimentally predicted protein structures. In Genomics, we have similar challenges (annotation of genomes and identification of functional elements) but currently, we lack benchmarks similar to protein folding competition. RESULTS: Here we present a collection of curated and easily accessible sequence classification datasets in the field of genomics. The proposed collection is based on a combination of novel datasets constructed from the mining of publicly available databases and existing datasets obtained from published articles. The collection currently contains nine datasets that focus on regulatory elements (promoters, enhancers, open chromatin region) from three model organisms: human, mouse, and roundworm. A simple convolution neural network is also included in a repository and can be used as a baseline model. Benchmarks and the baseline model are distributed as the Python package 'genomic-benchmarks', and the code is available at https://github.com/ML-Bioinfo-CEITEC/genomic_benchmarks . CONCLUSIONS: Deep learning techniques revolutionized many biological fields but mainly thanks to the carefully curated benchmarks. For the field of Genomics, we propose a collection of benchmark datasets for the classification of genomic sequences with an interface for the most commonly used deep learning libraries, implementation of the simple neural network and a training framework that can be used as a starting point for future research. The main aim of this effort is to create a repository for shared datasets that will make machine learning for genomics more comparable and reproducible while reducing the overhead of researchers who want to enter the field, leading to healthy competition and new discoveries.
Zobrazit více v PubMed
Oubounyt M, Louadi Z, Tayara H, Chong KT. DeePromoter: robust promoter predictor using deep learning. Front Genet. 2019;10:286. doi: 10.3389/fgene.2019.00286. PubMed DOI PMC
Le NQK, Ho QT, Nguyen TTD, Ou YY. A transformer architecture based on BERT and 2D convolutional neural network to identify DNA enhancers from sequence information. Brief Bioinform. 2021;22(5). PubMed
Quang D, Xie X. FactorNet: a deep learning framework for predicting cell type specific transcription factor binding from nucleotide-resolution sequential data. Methods. 2019;166:40–7. doi: 10.1016/j.ymeth.2019.03.020. PubMed DOI PMC
Yin Q, Wu M, Liu Q, Lv H, Jiang R. DeepHistone: a deep learning approach to predicting histone modifications. BMC Genomics. 2019;20(2):11–23. PubMed PMC
Shen Z, Zhang Q, Han K, Huang Ds. A deep learning model for RNA-protein binding preference prediction based on hierarchical LSTM and attention network. IEEE/ACM Trans Comput Biol Bioinforma. 2020;19(2):753–62. PubMed
Georgakilas GK, Grioni A, Liakos KG, Chalupova E, Plessas FC, Alexiou P. Multi-branch convolutional neural network for identification of small non-coding RNA genomic loci. Sci Rep. 2020;10(1):1–10. doi: 10.1038/s41598-020-66454-3. PubMed DOI PMC
Sun C, Shrivastava A, Singh S, Gupta A. Revisiting unreasonable effectiveness of data in deep learning era. In: Proceedings of the IEEE international conference on computer vision. Institute of Electrical and Electronics Engineers Inc., United States. 2017. p. 843–852.
Nawi NM, Atomi WH, Rehman MZ. The effect of data pre-processing on optimized training of artificial neural networks. Procedia Technol. 2013;11:32–9. doi: 10.1016/j.protcy.2013.12.159. DOI
Rajpurkar P, Zhang J, Lopyrev K, Liang P. Squad: 100,000+ questions for machine comprehension of text. 2016. arXiv preprint arXiv:1606.05250.
Maas A, Daly RE, Pham PT, Huang D, Ng AY, Potts C. Learning word vectors for sentiment analysis. In: Proceedings of the 49th annual meeting of the association for computational linguistics: Human language technologies. Association for Computational Linguistics, Portland, Oregon, USA. 2011. p. 142–150.
Deng J, Dong W, Socher R, Li LJ, Li K, Fei-Fei L, Imagenet: A large-scale hierarchical image database. In: 2009 IEEE conference on computer vision and pattern recognition. IEEE. 2009. p. 248–255.
Russakovsky O, Deng J, Su H, Krause J, Satheesh S, Ma S, et al. ImageNet Large Scale Visual Recognition Challenge. Int J Comput Vis. 2015;115(3):211–52. doi: 10.1007/s11263-015-0816-y. DOI
Moult J, Pedersen JT, Judson R, Fidelis K. A large-scale experiment to assess protein structure prediction methods. Wiley Online Library; 1995. PubMed
Jumper J, Evans R, Pritzel A, Green T, Figurnov M, Ronneberger O, et al. Highly accurate protein structure prediction with AlphaFold. Nature. 2021;596(7873):583–9. doi: 10.1038/s41586-021-03819-2. PubMed DOI PMC
Liu B, Fang L, Long R, Lan X, Chou KC. iEnhancer-2L: a two-layer predictor for identifying enhancers and their strength by pseudo k-tuple nucleotide composition. Bioinformatics. 2016;32(3):362–9. doi: 10.1093/bioinformatics/btv604. PubMed DOI
Li W, Godzik A. Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics. 2006;22(13):1658–9. doi: 10.1093/bioinformatics/btl158. PubMed DOI
Liu B, Li K, Huang DS, Chou KC. iEnhancer-EL: identifying enhancers and their strength with ensemble learning approach. Bioinformatics. 2018;34(22):3835–42. doi: 10.1093/bioinformatics/bty458. PubMed DOI
Le NQK, Yapp EKY, Ho QT, Nagasundaram N, Ou YY, Yeh HY. iEnhancer-5Step: identifying enhancers using hidden information of DNA sequences via Chou’s 5-step rule and word embedding. Anal Biochem. 2019;571:53–61. doi: 10.1016/j.ab.2019.02.017. PubMed DOI
Tahir M, Hayat M, Kabir M. Sequence based predictor for discrimination of enhancer and their types by applying general form of Chou’s trinucleotide composition. Comput Methods Prog Biomed. 2017;146:69–75. doi: 10.1016/j.cmpb.2017.05.008. PubMed DOI
Jia C, He W. EnhancerPred: a predictor for discovering enhancers based on the combination and selection of multiple features. Sci Rep. 2016;6(1):1–7. doi: 10.1038/srep38741. PubMed DOI PMC
He W, Jia C. EnhancerPred2. 0: predicting enhancers and their strength based on position-specific trinucleotide propensity and electron–ion interaction potential feature selection. Mol BioSyst. 2017;13(4):767–74. doi: 10.1039/C7MB00054E. PubMed DOI
Nguyen QH, Nguyen-Vo TH, Le NQK, Do TT, Rahardja S, Nguyen BP. iEnhancer-ECNN: identifying enhancers and their strength using ensembles of convolutional neural networks. BMC Genomics. 2019;20(9):1–10. PubMed PMC
Khanal J, Tayara H, Chong KT. Identifying enhancers and their strength by the integration of word embedding and convolution neural network. IEEE Access. 2020;8:58369–76. doi: 10.1109/ACCESS.2020.2982666. DOI
Zhang TH, Flores M, Huang Y. ES-ARCNN: Predicting enhancer strength by using data augmentation and residual convolutional neural network. Anal Biochem. 2021;618:114120. doi: 10.1016/j.ab.2021.114120. PubMed DOI
Inayat N, Khan M, Iqbal N, Khan S, Raza M, Khan DM, et al. iEnhancer-DHF: Identification of Enhancers and Their Strengths Using Optimize Deep Neural Network With Multiple Features Extraction Methods. IEEE Access. 2021;9:40783–96. doi: 10.1109/ACCESS.2021.3062291. DOI
Mu X, Wang Y, Duan M, Liu S, Li F, Wang X, et al. A Novel Position-Specific Encoding Algorithm (SeqPose) of Nucleotide Sequences and Its Application for Detecting Enhancers. Int J Mol Sci. 2021;22(6):3079. doi: 10.3390/ijms22063079. PubMed DOI PMC
Yang R, Wu F, Zhang C, Zhang L. iEnhancer-GAN: A Deep Learning Framework in Combination with Word Embedding and Sequence Generative Adversarial Net to Identify Enhancers and Their Strength. Int J Mol Sci. 2021;22(7):3589. doi: 10.3390/ijms22073589. PubMed DOI PMC
Visel A, Minovitsky S, Dubchak I, Pennacchio LA. VISTA Enhancer browser—a database of tissue-specific human enhancers. Nucleic Acids Res. 2007;35(suppl_1):88–92. doi: 10.1093/nar/gkl822. PubMed DOI PMC
Andersson R, Gebhard C, Miguel-Escalada I, Hoof I, Bornholdt J, Boyd M, et al. An atlas of active enhancers across human cell types and tissues. Nature. 2014;507(7493):455–61. doi: 10.1038/nature12787. PubMed DOI PMC
ENCODE Project Consortium et al. An integrated encyclopedia of DNA elements in the human genome. Nature. 2012;489(7414):57. doi: 10.1038/nature11247. PubMed DOI PMC
Kundaje A, Meuleman W, Ernst J, Bilenky M, Yen A, Heravi-Moussavi A, et al. Integrative analysis of 111 reference human epigenomes. Nature. 2015;518(7539):317–30. doi: 10.1038/nature14248. PubMed DOI PMC
Lin H, Li QZ. Eukaryotic and prokaryotic promoter prediction using hybrid approach. Theory Biosci. 2011;130(2):91–100. doi: 10.1007/s12064-010-0114-8. PubMed DOI
Schmid CD, Perier R, Praz V, Bucher P. EPD in its twentieth year: towards complete promoter coverage of selected model organisms. Nucleic Acids Res. 2006;34(suppl_1):82–5. doi: 10.1093/nar/gkj146. PubMed DOI PMC
Gordon L, Chervonenkis AY, Gammerman AJ, Shahmuradov IA, Solovyev VV. Sequence alignment kernel for recognition of promoter regions. Bioinformatics. 2003;19(15):1964–71. doi: 10.1093/bioinformatics/btg265. PubMed DOI
Ohler U. Identification of core promoter modules in Drosophila and their application in accurate transcription start site prediction. Nucleic Acids Res. 2006;34(20):5943–50. doi: 10.1093/nar/gkl608. PubMed DOI PMC
Yang JY, Zhou Y, Yu ZG, Anh V, Zhou LQ. Human Pol II promoter recognition based on primary sequences and free energy of dinucleotides. BMC Bioinformatics. 2008;9(1):1–13. doi: 10.1186/1471-2105-9-113. PubMed DOI PMC
Rani TS, Bhavani SD, Bapi RS. Analysis of E. coli promoter recognition problem in dinucleotide feature space. Bioinformatics. 2007;23(5):582–8. doi: 10.1093/bioinformatics/btl670. PubMed DOI
Lai HY, Zhang ZY, Su ZD, Su W, Ding H, Chen W, et al. iProEP: a computational predictor for predicting promoter. Mol Ther Nucleic Acids. 2019;17:337–46. doi: 10.1016/j.omtn.2019.05.028. PubMed DOI PMC
Paszke A, Gross S, Massa F, Lerer A, Bradbury J, Chanan G, et al. Pytorch: An imperative style, high-performance deep learning library. Adv Neural Inf Process Syst. 2019;32:8026–37.
Abadi M, Barham P, Chen J, Chen Z, Davis A, Dean J, et al. {TensorFlow}: A System for {Large-Scale} Machine Learning. In: 12th USENIX symposium on operating systems design and implementation (OSDI 16). USENIX Association, Savannah, GA, USA. 2016. p. 265–283.
Umarov RK, Solovyev VV. Recognition of prokaryotic and eukaryotic promoters using convolutional deep learning neural networks. PLoS ONE. 2017;12(2):0171410. doi: 10.1371/journal.pone.0171410. PubMed DOI PMC
Cohn D, Zuk O, Kaplan T. Enhancer identification using transfer and adversarial deep learning of DNA sequences. BioRxiv. 2018:264200.
Kvon EZ, Kazmar T, Stampfel G, Yáñez-Cuna JO, Pagani M, Schernhuber K, et al. Genome-scale functional characterization of Drosophila developmental enhancers in vivo. Nature. 2014;512(7512):91–5. doi: 10.1038/nature13395. PubMed DOI
Zerbino DR, Wilder SP, Johnson N, Juettemann T, Flicek PR. The ensembl regulatory build. Genome Biol. 2015;16(1):1–8. doi: 10.1186/s13059-015-0621-5. PubMed DOI PMC
Hoskins RA, Carlson JW, Kennedy C, Acevedo D, Evans-Holm M, Frise E, et al. Sequence finishing and mapping of Drosophila melanogaster heterochromatin. Science. 2007;316(5831):1625–8. doi: 10.1126/science.1139816. PubMed DOI PMC
dos Santos G, Schroeder AJ, Goodman JL, Strelets VB, Crosby MA, Thurmond J, et al. FlyBase: introduction of the Drosophila melanogaster Release 6 reference genome assembly and large-scale migration of genome annotations. Nucleic Acids Res. 2015;43(D1):690–7. doi: 10.1093/nar/gku1099. PubMed DOI PMC
Howe KL, Achuthan P, Allen J, Allen J, Alvarez-Jarreta J, Amode MR, et al. Ensembl 2021. Nucleic Acids Res. 2021;49(D1):884–91. PubMed PMC
Klimentova E, Polacek J, Simecek P, Alexiou P. PENGUINN: Precise exploration of nuclear G-quadruplexes using interpretable neural networks. Front Genet. 2020;11:1287. doi: 10.3389/fgene.2020.568546. PubMed DOI PMC
Albawi S, Mohammed TA, Al-Zawi S, Understanding of a convolutional neural network. In: 2017 international conference on engineering and technology (ICET). IEEE. 2017. p. 1–6.
Ji Y, Zhou Z, Liu H, Davuluri RV. DNABERT: pre-trained Bidirectional Encoder Representations from Transformers model for DNA-language in genome. Bioinformatics. 2021;37(15):2112–20. PubMed PMC