A deep learning genome-mining strategy for biosynthetic gene cluster prediction
Language: English; Country: Great Britain, England; Medium: print
Document type: journal articles, grant-supported research
PubMed: 31400112
PubMed Central: PMC6765103
DOI: 10.1093/nar/gkz654
PII: 5545735
- MeSH
- biosynthetic pathways, genetics MeSH
- data mining, methods MeSH
- deep learning MeSH
- genome, bacterial, genetics MeSH
- genome MeSH
- multigene family, genetics MeSH
- computational biology, methods MeSH
- Publication type
- journal articles MeSH
- grant-supported research MeSH
Natural products represent a rich reservoir of small molecule drug candidates utilized as antimicrobial drugs, anticancer therapies, and immunomodulatory agents. These molecules are microbial secondary metabolites synthesized by co-localized genes termed Biosynthetic Gene Clusters (BGCs). The increase in full microbial genomes and similar resources has led to the development of BGC prediction algorithms, although their precision and ability to identify novel BGC classes could be improved. Here we present a deep learning strategy (DeepBGC) that offers reduced false positive rates in BGC identification and an improved ability to extrapolate and identify novel BGC classes compared to existing machine-learning tools. We supplemented this with random forest classifiers that accurately predicted BGC product classes and potential chemical activity. Application of DeepBGC to bacterial genomes uncovered previously undetectable putative BGCs that may code for natural products with novel biologic activities. The improved accuracy and classification ability of DeepBGC represent a major addition to in-silico BGC identification.
AI and Big Data Analytics, MSD Czech Republic s.r.o., Prague, Czech Republic
Big Data Solutions, MSD Czech Republic s.r.o., Prague, Czech Republic
Bioinformatics and Cheminformatics Solutions, MSD Czech Republic s.r.o., Prague, Czech Republic
Data Science, MSD Czech Republic s.r.o., Prague, Czech Republic
Exploratory Science Center, Merck and Co., Inc., Cambridge, Massachusetts, USA
Genetics and Pharmacogenomics, Merck and Co., Inc., Boston, MA, USA
Infectious Diseases and Vaccine Research, MRL, Merck and Co., Inc., West Point, PA, USA