Extraction and analysis of signatures from the Gene Expression Omnibus by the crowd
Status: PubMed-not-MEDLINE; Language: English; Country: United Kingdom, England; Medium: electronic
Document type: journal articles
Grant support
R01 NS099068
NINDS NIH HHS - United States
U54 HL127624
NHLBI NIH HHS - United States
T32 HL007824
NHLBI NIH HHS - United States
R01 GM098316
NIGMS NIH HHS - United States
U54 CA189201
NCI NIH HHS - United States
PubMed
27667448
PubMed Central
PMC5052684
DOI
10.1038/ncomms12846
PII: ncomms12846
- Publication type
- journal articles
Gene expression data are accumulating exponentially in public repositories. Reanalysis and integration of themed collections from these studies may provide new insights, but require further human curation. Here we report a crowdsourcing project to annotate and reanalyse a large number of gene expression profiles from Gene Expression Omnibus (GEO). Through a massive open online course on Coursera, over 70 participants from over 25 countries identify and annotate 2,460 single-gene perturbation signatures, 839 disease versus normal signatures, and 906 drug perturbation signatures. All these signatures are unique and are manually validated for quality. Global analysis of these signatures confirms known associations and identifies novel associations between genes, diseases and drugs. The manually curated signatures are used as a training set to develop classifiers for extracting similar signatures from the entire GEO repository. We develop a web portal to serve these signatures for query, download and visualization.
Anna Blamansingel 216 Amsterdam 102 SW Netherlands
Center for Space Medicine Baylor College of Medicine 1 Baylor Plaza Houston Texas 77030 USA
Daylesford the Fairway Weybridge Surrey KT13 0RZ UK
Department of Biology Faculty of Medicine Masaryk University Brno 625 00 Czech Republic
Department of Neurosurgery Stanford School of Medicine Stanford California 94304 USA
Department of Research Institute of Liver and Biliary Sciences D1 Vasant Kunj New Delhi 110070 India
IBM India Pvt Ltd Bengaluru 560045 India
IMIM Hospital Del Mar PRBB Barcelona Dr Aiguader Barcelona 88 08003 Spain
The Ragon Institute of MGH MIT and Harvard 400 Technology Square Cambridge Massachusetts 02139 USA