Detection and classification of long terminal repeat sequences in plant LTR-retrotransposons and their analysis using explainable machine learning
Status: PubMed-not-MEDLINE; Language: English; Country: United Kingdom, England; Medium: electronic
Document type: journal articles
Grant support
21-00580S
Grantová Agentura České Republiky
PubMed
39696434
PubMed Central
PMC11656987
DOI
10.1186/s13040-024-00410-z
PII: 10.1186/s13040-024-00410-z
- Keywords
- CNN-LSTM, DNABERT, Deep learning, Eukaryote, Regulatory mechanisms, Repeat, SHAP score, Sequence analysis, TFBS, Transcription factor binding sites, Transposable elements
- Publication type
- journal articles
BACKGROUND: Long terminal repeats (LTRs) are important parts of LTR retrotransposons and retroviruses, which are found in high copy numbers in a majority of eukaryotic genomes. LTRs contain regulatory sequences essential for the life cycle of the retrotransposon. Previous experimental and sequence studies have provided only limited information about LTR structure and composition, mostly from model systems. To enhance our understanding of these key sequence modules, we focused on the contrasts between LTRs of various retrotransposon families and other genomic regions. This approach can also be used for the classification and prediction of LTRs.
RESULTS: We used machine learning methods suitable for DNA sequence classification and applied them to a large dataset of plant LTR retrotransposon sequences. We trained three machine learning models using (i) traditional model ensembles (Gradient Boosting), (ii) hybrid convolutional/long short-term memory (CNN-LSTM) network models, and (iii) a transformer-based model pre-trained on DNA using k-mer sequence representation. All three approaches were successful in classifying and isolating LTRs in these data, and provided valuable insights into LTR sequence composition. The best F1 score for LTR detection was 0.85, achieved with the hybrid network model. Among the classification tasks, superfamily classification was the most accurate (F1 = 0.89) and family classification the least accurate (F1 = 0.74). The trained models were subjected to explainability analysis. Positional analysis identified a mixture of features, many of which had a preferred absolute position within the LTR and/or were biologically relevant, such as a centrally positioned TATA-box regulatory sequence and TG..CA nucleotide patterns around both LTR edges.
CONCLUSIONS: Our results show that the models used here recognized biologically relevant motifs, such as core promoter elements in the LTR detection task, and a development- and stress-related subclass of transcription factor binding sites in the family classification task. Explainability analysis also highlighted the importance of the 5' and 3' edges in LTR identity and revealed the need to analyze more than just the dinucleotides at these ends. Our work shows the applicability of machine learning models to regulatory sequence analysis and classification, and demonstrates the important role of the identified motifs in LTR detection.
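The abstract mentions a k-mer sequence representation and TG..CA patterns at the LTR edges without giving implementation details. The following is a minimal Python sketch of what those two ideas could look like, not the authors' code. The function names, the choice of k = 6, the stride of 1, and the toy sequence are all assumptions made for illustration.

```python
def kmer_tokens(seq: str, k: int = 6) -> list[str]:
    """Split a DNA sequence into overlapping k-mers with stride 1.

    k = 6 is an assumed value; the abstract does not state which k was used.
    """
    seq = seq.upper()
    # range() is empty when the sequence is shorter than k, giving [].
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def has_tg_ca_edges(ltr: str) -> bool:
    """Return True if a candidate LTR starts with TG and ends with CA."""
    ltr = ltr.upper()
    return len(ltr) >= 4 and ltr.startswith("TG") and ltr.endswith("CA")


# Toy sequence invented for illustration only (not from the study's dataset).
toy_ltr = "TGTTATATAAGGCA"
print(kmer_tokens(toy_ltr, k=6)[:3])  # first three overlapping 6-mers
print(has_tg_ca_edges(toy_ltr))       # edge dinucleotide check
```

A check like `has_tg_ca_edges` looks only at the terminal dinucleotides, which, according to the conclusions above, is not sufficient on its own to characterize LTR ends.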