Detection and classification of long terminal repeat sequences in plant LTR-retrotransposons and their analysis using explainable machine learning

2024 Dec 18; 17(1): 57. [Epub 2024 Dec 18]

Status: PubMed-not-MEDLINE. Language: English. Country: United Kingdom, England. Medium: electronic

Document type: journal article

Persistent link: https://www.medvik.cz/link/pmid39696434

Grant support
21-00580S Czech Science Foundation (Grantová Agentura České Republiky)

Links

PubMed 39696434
PubMed Central PMC11656987
DOI 10.1186/s13040-024-00410-z
PII: 10.1186/s13040-024-00410-z
Knihovny.cz e-resources

BACKGROUND: Long terminal repeats (LTRs) are important parts of LTR retrotransposons and retroviruses, which are found in high copy numbers in the majority of eukaryotic genomes. LTRs contain regulatory sequences essential for the life cycle of the retrotransposon. Previous experimental and sequence studies have provided only limited information about LTR structure and composition, mostly from model systems. To enhance our understanding of these key sequence modules, we focused on the contrasts between LTRs of various retrotransposon families and other genomic regions. Furthermore, this approach can be utilized for the classification and prediction of LTRs.
RESULTS: We used machine learning methods suitable for DNA sequence classification and applied them to a large dataset of plant LTR retrotransposon sequences. We trained three machine learning models using (i) traditional model ensembles (Gradient Boosting), (ii) hybrid convolutional/long short-term memory (LSTM) network models, and (iii) a DNA pre-trained transformer-based model using k-mer sequence representation. All three approaches were successful in classifying and isolating LTRs in this data, as well as providing valuable insights into LTR sequence composition. The best F1 score achieved for LTR detection was 0.85, obtained with the hybrid network model. The most accurate classification task was superfamily classification (F1 = 0.89), while the least accurate was family classification (F1 = 0.74). The trained models were subjected to explainability analysis. Positional analysis identified a mixture of interesting features, many of which had a preferred absolute position within the LTR and/or were biologically relevant, such as a centrally positioned TATA-box regulatory sequence and TG..CA nucleotide patterns around both LTR edges.
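
To make the "k-mer sequence representation" mentioned above more concrete, the following minimal Python sketch splits a DNA sequence into overlapping k-mers and counts them. It is an illustration only: the stride of 1, the values of k, the function names kmer_tokens and kmer_counts, and the example sequence are all assumptions made here and are not taken from the paper.

```python
from collections import Counter


def kmer_tokens(seq: str, k: int = 6) -> list[str]:
    """Split a DNA sequence into overlapping k-mers (stride 1).

    Both k=6 and the stride are illustrative assumptions.
    """
    seq = seq.upper()
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def kmer_counts(seq: str, k: int = 3) -> Counter:
    """Count how often each k-mer occurs in the sequence."""
    return Counter(kmer_tokens(seq, k))


# Made-up 16-nt sequence, used only to demonstrate the functions.
example = "TGTTGGTATATAAGCA"
tokens = kmer_tokens(example, k=6)
counts = kmer_counts(example, k=3)
print(tokens)
print(counts.most_common(3))
```

A token list like this could be fed to a tokenizer-based model, while the count vector is the kind of fixed-length feature a tree ensemble could consume; which of these the authors actually used for each model is not stated in the abstract.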
CONCLUSIONS: Our results show that the models used here recognized biologically relevant motifs, such as core promoter elements in the LTR detection task, and a development- and stress-related subclass of transcription factor binding sites in the family classification task. Explainability analysis also highlighted the importance of the 5' and 3' edges in LTR identity and revealed the need to analyze more than just the dinucleotides at these ends. Our work shows the applicability of machine learning models to regulatory sequence analysis and classification, and demonstrates the important role of the identified motifs in LTR detection.
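
As a small illustration of the edge features discussed above, the sketch below checks a candidate LTR for the TG..CA terminal dinucleotides and extracts wider windows from both ends, since the conclusions suggest looking beyond the dinucleotides alone. The function names, the default window width of 10, and the example sequence are hypothetical choices made here, not the authors' method.

```python
def has_tg_ca_termini(ltr: str) -> bool:
    """Return True if the sequence starts with TG and ends with CA."""
    ltr = ltr.upper()
    return len(ltr) >= 4 and ltr.startswith("TG") and ltr.endswith("CA")


def edge_windows(ltr: str, width: int = 10) -> tuple[str, str]:
    """Return the first and last `width` nucleotides of the sequence.

    The width is an illustrative assumption; for sequences shorter than
    2 * width the two windows overlap.
    """
    ltr = ltr.upper()
    return ltr[:width], ltr[-width:]


# Made-up sequence, used only to demonstrate the functions.
candidate = "TGTTGGTATATAAGCA"
print(has_tg_ca_termini(candidate))
print(edge_windows(candidate, width=5))
```

A simple rule like this only captures the canonical termini; a trained model can weigh the wider edge context, which is the point the explainability analysis makes.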

Erratum in: PubMed


