BACKGROUND: All currently available methods of network/association inference from microarray gene expression measurements implicitly assume that such measurements represent the actual expression levels of different genes within each cell included in the biological sample under study. Contrary to this common belief, modern microarray technology produces signals aggregated over a random number of individual cells, a "nitty-gritty" aspect of such arrays, thereby causing a random effect that distorts the correlation structure of intra-cellular gene expression levels. RESULTS: This paper provides a theoretical consideration of the random effect of signal aggregation and its implications for correlation analysis and network inference. An attempt is made to quantitatively assess the magnitude of this effect from real data. Some preliminary ideas are offered to mitigate the consequences of random signal aggregation in the analysis of gene expression data. CONCLUSION: Because they result from the summation of expression intensities over a random number of individual cells, the observed signals may not adequately reflect the true dependence structure of intra-cellular gene expression levels needed as a source of information for network reconstruction. Whether or not the reported effect is extreme, the important point is to recognize this source of signal variation and incorporate it for proper inference. The usefulness of inference on genetic regulatory structures from microarray data depends critically on the ability of investigators to overcome this obstacle in a scientifically sound way. REVIEWERS: This article was reviewed by Byung Soo Kim, Jeanne Kowalski and Geoff McLachlan.
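The aggregation effect described in this abstract can be illustrated with a small simulation. The sketch below is not the authors' analysis. It assumes two genes whose per-cell expression levels are independent gamma variables (true intra-cellular correlation zero) and compares arrays built from a fixed number of cells with arrays built from a random number of cells. The cell-count ranges, the gamma parameters and all function names are hypothetical choices made for illustration only.

```python
import math
import random


def pearson(u, v):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    s_uv = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    s_uu = sum((a - mean_u) ** 2 for a in u)
    s_vv = sum((b - mean_v) ** 2 for b in v)
    return s_uv / math.sqrt(s_uu * s_vv)


def simulate_arrays(n_arrays, draw_cell_count, rng):
    """Return aggregated signals (U, V) of two genes over n_arrays arrays.

    Per-cell levels of the two genes are drawn independently, so any
    correlation between U and V comes from the shared cell count alone.
    """
    u, v = [], []
    for _ in range(n_arrays):
        n_cells = draw_cell_count(rng)
        u.append(sum(rng.gammavariate(2.0, 1.0) for _ in range(n_cells)))
        v.append(sum(rng.gammavariate(2.0, 1.0) for _ in range(n_cells)))
    return u, v


rng = random.Random(0)

# Assumption: every array contains exactly 50 cells.
r_fixed = pearson(*simulate_arrays(500, lambda r: 50, rng))

# Assumption: the cell count varies uniformly between 20 and 80.
r_random = pearson(*simulate_arrays(500, lambda r: r.randint(20, 80), rng))

print(f"fixed cell count:  r = {r_fixed:.3f}")
print(f"random cell count: r = {r_random:.3f}")
```

Under these assumptions the covariance of the aggregated signals equals Var(N) times the product of the per-cell means, so a variable cell count N alone produces a strong positive correlation between genes that are independent within each cell.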
- MeSH
- Humans MeSH
- Models, Genetic MeSH
- Statistics, Nonparametric MeSH
- Oligonucleotide Array Sequence Analysis methods statistics & numerical data MeSH
- Gene Expression Profiling methods statistics & numerical data MeSH
- Computational Biology methods statistics & numerical data MeSH
- Animals MeSH
- Check Tag
- Humans MeSH
- Animals MeSH
- Publication type
- Research Support, Non-U.S. Gov't MeSH
- Review MeSH
- Research Support, N.I.H., Extramural MeSH
The currently practiced methods of significance testing in microarray gene expression profiling are highly unstable and tend to be very low in power. These undesirable properties are due to the nature of multiple testing procedures, as well as extremely strong and long-ranged correlations between gene expression levels. In an earlier publication, we identified a special structure in gene expression data that produces a sequence of weakly dependent random variables. This structure, termed the delta-sequence, lies at the heart of a new methodology for selecting differentially expressed genes in nonoverlapping gene pairs. The proposed method has two distinct advantages: (1) it leads to dramatic gains in terms of the mean numbers of true and false discoveries, and in the stability of the results of testing; and (2) its outcomes are entirely free from the log-additive array-specific technical noise. We demonstrate the usefulness of this approach in conjunction with the nonparametric empirical Bayes method. The proposed modification of the empirical Bayes method leads to significant improvements in its performance. The new paradigm arising from the existence of the delta-sequence in biological data offers considerable scope for future developments in this area of methodological research.
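The claim that the outcomes are free from log-additive array-specific noise can be checked with a short sketch. The abstract does not give the construction of the delta-sequence, so the code below assumes it consists of differences of log-expression levels within nonoverlapping gene pairs on the same array; how genes are ordered before pairing is not addressed. All names and the simulated data are hypothetical.

```python
import random


def delta_sequence(log_expr):
    """Differences of log-expression in nonoverlapping gene pairs.

    log_expr is a list of per-gene lists (genes x arrays). Genes are paired
    as (0, 1), (2, 3), ...; a trailing unpaired gene is ignored.
    """
    return [
        [a - b for a, b in zip(log_expr[g], log_expr[g + 1])]
        for g in range(0, len(log_expr) - 1, 2)
    ]


rng = random.Random(1)
n_genes, n_arrays = 6, 5

# Noise-free log-expression levels (simulated, for illustration only).
clean = [[rng.gauss(8.0, 1.0) for _ in range(n_arrays)] for _ in range(n_genes)]

# Log-additive technical noise: one random offset per array, shared by all genes.
array_noise = [rng.gauss(0.0, 0.5) for _ in range(n_arrays)]
noisy = [[x + array_noise[j] for j, x in enumerate(row)] for row in clean]

d_clean = delta_sequence(clean)
d_noisy = delta_sequence(noisy)

# Largest discrepancy between the two delta-sequences (floating-point error only).
max_gap = max(
    abs(a - b)
    for row_clean, row_noisy in zip(d_clean, d_noisy)
    for a, b in zip(row_clean, row_noisy)
)
print(len(d_noisy), max_gap)
```

Because the array offset enters both members of a pair with the same sign, it cancels in the difference, which is why the two delta-sequences agree up to rounding.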
We study the estimation of statistical moments of interspike intervals based on observation of spike counts in many independent short time windows. This scenario corresponds to the situation of a target neuron that receives information from many neurons and has to respond within a short time interval. The precision of the estimation procedures is examined. As the model for neuronal activity, two examples of stationary point processes are considered: a renewal process and a doubly stochastic Poisson process. Both moment and maximum likelihood estimators are investigated. Not only the mean but also the coefficient of variation is estimated. In accordance with our expectations, numerical studies confirm that the estimation of the mean interspike interval is more reliable than the estimation of the coefficient of variation. The error of estimation increases with increasing mean interspike interval, which is equivalent to decreasing the size of the window (fewer events are observed in a window), and with a decreasing number of neurons (fewer windows).
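A moment estimator of the mean interspike interval from window counts can be sketched as follows. For a stationary point process the expected count in a window of length w equals w divided by the mean interspike interval, so the window length divided by the average count is a natural estimate. The sketch assumes the simplest case, a homogeneous Poisson process, rather than the renewal or doubly stochastic models studied in the paper; the rate, window length and function names are hypothetical.

```python
import random


def spike_count(rate, window, rng):
    """Number of spikes of a homogeneous Poisson process in [0, window]."""
    t, count = rng.expovariate(rate), 0
    while t <= window:
        count += 1
        t += rng.expovariate(rate)
    return count


def mean_isi_estimate(counts, window):
    """Moment estimator: window length divided by the average spike count."""
    total = sum(counts)
    if total == 0:
        return float("inf")  # no spikes observed, the mean ISI cannot be bounded
    return window * len(counts) / total


rng = random.Random(2)
rate = 10.0      # assumption: 10 spikes per second, so the true mean ISI is 0.1 s
window = 0.2     # assumption: each neuron is observed for 0.2 s
n_windows = 5000 # assumption: number of independent neurons (windows)

counts = [spike_count(rate, window, rng) for _ in range(n_windows)]
estimate = mean_isi_estimate(counts, window)
print(f"estimated mean ISI: {estimate:.4f} s (true value 0.1 s)")
```

Reducing `n_windows` or `window` in this sketch lowers the total number of observed spikes and makes the estimate noisier, in line with the behaviour reported in the abstract. The coefficient of variation is not estimated here.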
A test-statistic typically employed in the gene set enrichment analysis (GSEA) prevents this method from being genuinely multivariate. In particular, this statistic is insensitive to changes in the correlation structure of the gene sets of interest. The present paper considers the utility of an alternative test-statistic in designing the confirmatory component of the GSEA. This statistic is based on a pertinent distance between joint distributions of expression levels of genes included in the set of interest. The null distribution of the proposed test-statistic, known as the multivariate N-statistic, is obtained by permuting group labels. Our simulation studies and analysis of biological data confirm the conjecture that the N-statistic is a much better choice for multivariate significance testing within the framework of the GSEA. We also discuss some other aspects of the GSEA paradigm and suggest new avenues for future research.
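The permutation procedure for a distance-based two-sample statistic can be sketched as below. The abstract does not specify the kernel or normalization of the N-statistic, so this sketch assumes a Euclidean-distance form (average between-group distance minus half of each average within-group distance); it should be read as one plausible instance rather than the authors' exact definition. The simulated gene set and all function names are hypothetical.

```python
import math
import random


def mean_pairwise(a, b):
    """Average Euclidean distance between all points of a and all points of b."""
    return sum(math.dist(x, y) for x in a for y in b) / (len(a) * len(b))


def n_statistic(x, y):
    """Distance between two multivariate samples (assumed Euclidean kernel)."""
    return mean_pairwise(x, y) - 0.5 * mean_pairwise(x, x) - 0.5 * mean_pairwise(y, y)


def permutation_pvalue(x, y, n_perm, rng):
    """P-value obtained by permuting group labels of the pooled sample."""
    observed = n_statistic(x, y)
    pooled = list(x) + list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        if n_statistic(pooled[:len(x)], pooled[len(x):]) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)


def correlated_pairs(n, rho, rng):
    """n samples of a two-gene set with unit variances and correlation rho."""
    out = []
    for _ in range(n):
        z1 = rng.gauss(0.0, 1.0)
        z2 = rho * z1 + math.sqrt(1.0 - rho * rho) * rng.gauss(0.0, 1.0)
        out.append((z1, z2))
    return out


rng = random.Random(3)

# Two phenotypes with identical marginal distributions but opposite correlation.
group_a = correlated_pairs(30, 0.9, rng)
group_b = correlated_pairs(30, -0.9, rng)

p_value = permutation_pvalue(group_a, group_b, 200, rng)
print(f"permutation p-value: {p_value:.4f}")
```

The two simulated groups differ only in their correlation structure, which is the kind of change a gene-by-gene statistic cannot see; a statistic built on distances between joint distributions is expected to respond to it.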