SoluProt: prediction of soluble protein expression in Escherichia coli
Status: PubMed-not-MEDLINE. Language: English. Country: England, United Kingdom. Medium: print
Document type: journal article
Grant support
857560
Czech Ministry of Education
LQ1602
20-15915Y
Czech Grant Agency
857560
European Commission
AI Methods for Cybersecurity and Control Systems project
FIT-S-20-6293
Brno University of Technology
e-INFRA LM2018140
e-Infrastruktura CZ
LM2018131
ELIXIR-CZ
Czech Ministry of Education
PubMed: 33416864
PubMed Central: PMC8034534
DOI: 10.1093/bioinformatics/btaa1102
PII: 6070085
MOTIVATION: Poor protein solubility hinders the production of many therapeutic and industrially useful proteins. Experimental efforts to increase solubility are plagued by low success rates and often reduce biological activity. Computational prediction of protein expressibility and solubility in Escherichia coli using only sequence information could reduce the cost of experimental studies by enabling prioritization of highly soluble proteins.

RESULTS: A new tool for sequence-based prediction of soluble protein expression in E. coli, SoluProt, was created using the gradient boosting machine technique with the TargetTrack database as a training set. When evaluated against a balanced independent test set derived from the NESG database, SoluProt's accuracy of 58.5% and AUC of 0.62 exceeded those of a suite of alternative solubility prediction tools. There is also evidence that it could significantly increase the success rate of experimental protein studies.

AVAILABILITY AND IMPLEMENTATION: SoluProt is freely available as a standalone program and a user-friendly webserver at https://loschmidt.chemi.muni.cz/soluprot/.

SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
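The two figures reported in the abstract, accuracy and AUC on a balanced test set, can be illustrated with a minimal sketch. This is not SoluProt's evaluation code: the function names, the 0.5 decision threshold, and the toy labels and scores are all assumptions chosen for illustration. AUC is computed here in its rank-based (Mann-Whitney) form, the probability that a randomly chosen soluble protein receives a higher score than a randomly chosen insoluble one.

```python
from typing import Sequence


def accuracy(labels: Sequence[int], scores: Sequence[float],
             threshold: float = 0.5) -> float:
    """Fraction of proteins whose thresholded score matches the label (1 = soluble).

    The 0.5 threshold is an assumption, not a value taken from the paper.
    """
    hits = sum((s >= threshold) == bool(y) for y, s in zip(labels, scores))
    return hits / len(labels)


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Probability that a random soluble protein outscores a random insoluble one.

    Ties count as half a win (Mann-Whitney U formulation of the ROC AUC).
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one example of each class")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: balanced classes, made-up prediction scores.
toy_labels = [1, 1, 1, 0, 0, 0]
toy_scores = [0.9, 0.6, 0.4, 0.7, 0.3, 0.2]

toy_accuracy = accuracy(toy_labels, toy_scores)
toy_auc = roc_auc(toy_labels, toy_scores)
print(f"accuracy={toy_accuracy:.3f} AUC={toy_auc:.3f}")
```

On a balanced set, a predictor that guesses at random is expected to score about 50% accuracy and 0.5 AUC, which is the baseline against which figures such as 58.5% and 0.62 are read.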
International Clinical Research Center, St Anne's University Hospital Brno, Brno 656 91, Czech Republic