PredictSNP: robust and accurate consensus classifier for prediction of disease-related mutations
Language: English. Country: United States. Medium: print-electronic
Document type: journal article, grant-supported research
PubMed: 24453961
PubMed Central: PMC3894168
DOI: 10.1371/journal.pcbi.1003440
PII: PCOMPBIOL-D-13-01477
- MeSH
- algorithms MeSH
- databases, protein MeSH
- phylogeny MeSH
- genetic variation MeSH
- genetic diseases, inborn / genetics MeSH
- genome, human MeSH
- internet MeSH
- polymorphism, single nucleotide * MeSH
- humans MeSH
- mutation * MeSH
- computer simulation MeSH
- software MeSH
- computational biology / methods MeSH
- Check Tag
- humans MeSH
- Publication type
- journal article MeSH
- grant-supported research MeSH
Single nucleotide variants represent a prevalent form of genetic variation. Mutations in coding regions are frequently associated with the development of various genetic diseases. Computational tools that predict the effects of mutations on protein function are therefore important for analyzing single nucleotide variants and prioritizing them for experimental characterization. Many such tools are already widely employed. Unfortunately, their comparison and further improvement are hindered by large overlaps between training and benchmark datasets, which lead to biased and overly optimistic reported performance. In this study, we constructed three independent datasets by removing all duplicates, inconsistencies and mutations previously used in the training of the evaluated tools. The benchmark dataset, containing over 43,000 mutations, was employed for the unbiased evaluation of eight established prediction tools: MAPP, nsSNPAnalyzer, PANTHER, PhD-SNP, PolyPhen-1, PolyPhen-2, SIFT and SNAP. The six best-performing tools were combined into a consensus classifier, PredictSNP, which achieved significantly improved prediction performance and at the same time returned results for all mutations, confirming that consensus prediction represents an accurate and robust alternative to the predictions delivered by individual tools. A user-friendly web interface enables easy access to all eight prediction tools, the consensus classifier PredictSNP and annotations from the Protein Mutant Database and the UniProt database. The web server and the datasets are freely available to the academic community at http://loschmidt.chemi.muni.cz/predictsnp.
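The two steps described in the abstract, dataset cleaning and consensus prediction, can be illustrated with a minimal Python sketch. It is not the PredictSNP implementation: the abstract does not state the combination rule, so an unweighted majority vote with ties resolved as deleterious is assumed here, and the function names, record layout and label strings are all hypothetical.

```python
from collections import Counter


def clean_benchmark(records, training_mutations):
    """Drop duplicates, inconsistent annotations and training overlaps.

    records: iterable of (protein_id, mutation, label) tuples (hypothetical layout).
    training_mutations: set of (protein_id, mutation) pairs used to train any tool.
    """
    labels = {}
    for protein, mutation, label in records:
        # Collecting labels per mutation collapses exact duplicates.
        labels.setdefault((protein, mutation), set()).add(label)

    cleaned = {}
    for key, seen in labels.items():
        if len(seen) > 1:
            continue  # the same mutation carries conflicting labels
        if key in training_mutations:
            continue  # already seen by a tool during training
        cleaned[key] = next(iter(seen))
    return cleaned


def consensus(predictions):
    """Unweighted majority vote over the tools that returned a result.

    predictions: dict mapping tool name to "deleterious", "neutral" or None.
    Assumption: ties are resolved as "deleterious".
    """
    votes = Counter(p for p in predictions.values() if p is not None)
    if not votes:
        return None  # no tool produced a prediction
    if votes["deleterious"] >= votes["neutral"]:
        return "deleterious"
    return "neutral"


records = [
    ("P1", "A10G", "deleterious"),
    ("P1", "A10G", "deleterious"),  # duplicate
    ("P2", "L5P", "neutral"),
    ("P2", "L5P", "deleterious"),   # inconsistent
    ("P3", "K7R", "neutral"),
    ("P4", "S2T", "neutral"),       # overlaps with training data
]
benchmark = clean_benchmark(records, training_mutations={("P4", "S2T")})
call = consensus({"SIFT": "deleterious", "MAPP": None,
                  "SNAP": "neutral", "PhD-SNP": "deleterious"})
```

Because tools that return no result are skipped rather than counted, the vote still yields a call whenever at least one tool answers, which mirrors the abstract's point that the consensus returned results for all mutations.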