transXpress: a Snakemake pipeline for streamlined de novo transcriptome assembly and annotation
Language: English. Country: Great Britain, England. Medium: electronic
Document type: journal article
Grant support
9331: Gordon and Betty Moore Foundation
2020-221485: Chan Zuckerberg Foundation
F32-ES032276: NIEHS NIH HHS - United States
21-11563M: Grantová Agentura České Republiky
891397: H2020 Marie Skłodowska-Curie Actions
MCB-1818132: National Science Foundation
F32 ES032276: NIEHS NIH HHS - United States
PubMed: 37016291
PubMed Central: PMC10074830
DOI: 10.1186/s12859-023-05254-8
PII: 10.1186/s12859-023-05254-8
- Keywords
- De novo transcriptome assembly, Differential expression analysis, High-performance computing, Non-model organisms, RNA-seq, Reproducible software, Transcriptome annotation
- MeSH
- Molecular Sequence Annotation
- Sequence Analysis, RNA / methods
- RNA-Seq
- Software*
- Gene Expression Profiling
- Transcriptome*
- Publication type
- Journal Article
BACKGROUND: RNA-seq followed by de novo transcriptome assembly has been a transformative technique in biological research of non-model organisms, but the computational processing of RNA-seq data entails many different software tools. The complexity of these de novo transcriptomics workflows therefore presents a major barrier for researchers to adopt best-practice methods and up-to-date versions of software.
RESULTS: Here we present a streamlined and universal de novo transcriptome assembly and annotation pipeline, transXpress, implemented in Snakemake. transXpress supports two popular assembly programs, Trinity and rnaSPAdes, and allows parallel execution on heterogeneous cluster computing hardware.
CONCLUSIONS: transXpress simplifies the use of best-practice methods and up-to-date software for de novo transcriptome assembly, and produces standardized output files that can be mined using SequenceServer to facilitate rapid discovery of new genes and proteins in non-model organisms.
Department of Biology, Massachusetts Institute of Technology, Cambridge, MA 02139, USA
Scripps Institution of Oceanography, UC San Diego, 9500 Gilman Dr, La Jolla, CA 92093, USA
Whitehead Institute for Biomedical Research, 455 Main Street, Cambridge, MA 02142, USA