SoluProtMutDB: A manually curated database of protein solubility changes upon mutations
Status: PubMed-not-MEDLINE; Language: English; Country: Netherlands; Medium: electronic-eCollection
Document type: journal articles
PubMed: 36420168
PubMed Central: PMC9678803
DOI: 10.1016/j.csbj.2022.11.009
PII: S2001-0370(22)00502-5
- Keywords
- Machine learning, Mutational database, Protein aggregation, Protein engineering, Protein yield, Soluble expression
- Publication type
- Journal articles (MeSH)
Protein solubility is an attractive engineering target, primarily due to its relation to yields in protein production and manufacturing. Moreover, better knowledge of the mutational effects on protein solubility could connect several serious human diseases with protein aggregation. However, we have a limited understanding of the protein structural determinants of solubility, and the available data have mostly been scattered in the literature. Here, we present SoluProtMutDB, the first database containing data on protein solubility changes upon mutations. Our database accommodates 33 000 measurements of 17 000 protein variants in 103 different proteins. The database can serve as an essential source of information for researchers designing improved protein variants or developing machine learning tools to predict the effects of mutations on solubility. The database comprises all the previously published solubility datasets and thousands of new data points from recent publications, including deep mutational scanning experiments. Moreover, it features many available experimental conditions known to affect protein solubility. The datasets have been manually curated with substantial corrections, improving suitability for machine learning applications. The database is available at loschmidt.chemi.muni.cz/soluprotmutdb.