SoluProt: prediction of soluble protein expression in Escherichia coli

Bioinformatics. 2021 Apr 09; 37 (1): 23-28.

Status: PubMed-not-MEDLINE. Language: English. Country: England, Great Britain. Medium: print

Document type: journal articles

Persistent link   https://www.medvik.cz/link/pmid33416864

Grant support
857560 Czech Ministry of Education
LQ1602
20-15915Y Czech Grant Agency
857560 European Commission
AI Methods for Cybersecurity and Control Systems project
FIT-S-20-6293 Brno University of Technology
e-INFRA LM2018140 e-Infrastruktura CZ
LM2018131 ELIXIR-CZ
Czech Ministry of Education

MOTIVATION: Poor protein solubility hinders the production of many therapeutic and industrially useful proteins. Experimental efforts to increase solubility are plagued by low success rates and often reduce biological activity. Computational prediction of protein expressibility and solubility in Escherichia coli using only sequence information could reduce the cost of experimental studies by enabling prioritization of highly soluble proteins.

RESULTS: A new tool for sequence-based prediction of soluble protein expression in E. coli, SoluProt, was created using the gradient boosting machine technique with the TargetTrack database as a training set. When evaluated against a balanced independent test set derived from the NESG database, SoluProt's accuracy of 58.5% and AUC of 0.62 exceeded those of a suite of alternative solubility prediction tools. There is also evidence that it could significantly increase the success rate of experimental protein studies. SoluProt is freely available as a standalone program and a user-friendly webserver at https://loschmidt.chemi.muni.cz/soluprot/.

AVAILABILITY AND IMPLEMENTATION: https://loschmidt.chemi.muni.cz/soluprot/.

SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
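The two metrics reported in RESULTS can be illustrated with a minimal Python sketch. It is not SoluProt's evaluation code: the function names, the 0.5 decision threshold, and the toy labels and scores are all assumptions made for illustration. On a balanced test set a predictor that guesses at random is expected to reach about 50% accuracy and an AUC of about 0.5, which is the baseline against which the reported 58.5% and 0.62 should be read.

```python
from typing import Sequence


def accuracy(labels: Sequence[int], scores: Sequence[float],
             threshold: float = 0.5) -> float:
    """Fraction of proteins whose thresholded score matches the label."""
    predictions = [1 if s >= threshold else 0 for s in scores]
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the probability that a randomly chosen soluble protein
    scores higher than a randomly chosen insoluble one (ties count 0.5)."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one example of each class")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical toy data (not from the paper): 1 = soluble, 0 = insoluble.
# The set is balanced, with four proteins in each class.
toy_labels = [1, 1, 1, 1, 0, 0, 0, 0]
toy_scores = [0.9, 0.7, 0.6, 0.3, 0.8, 0.4, 0.2, 0.1]

toy_accuracy = accuracy(toy_labels, toy_scores)
toy_auc = roc_auc(toy_labels, toy_scores)
print(f"accuracy = {toy_accuracy:.2f}, AUC = {toy_auc:.2f}")
```

The pairwise AUC loop is quadratic in the number of proteins, which is acceptable for a small illustration; for large test sets a rank-based computation would be the usual choice.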

