Metalearning, an important part of artificial intelligence, represents a promising approach to the automatic selection of appropriate methods or algorithms. This paper is concerned with recommending a suitable estimator for nonlinear regression modeling, namely either the standard nonlinear least squares estimator or one of the available alternative estimators that are highly robust to the presence of outliers in the data. The authors hold the opinion that theoretical considerations will never be able to yield such recommendations for the nonlinear regression context. Instead, metalearning is explored here as an original approach suitable for this task. Four different approaches to automatic method selection for nonlinear regression are proposed, and computations are performed over a training database of 643 real, publicly available datasets. In particular, while the metalearning results may be harmed by imbalanced group sizes, an effective approach yields much improved results by performing a novel combination of supervised feature selection by random forests and oversampling by the synthetic minority oversampling technique (SMOTE). As a by-product, the computations bring arguments in favor of the very recent nonlinear least weighted squares estimator, which turns out to outperform other (and much more renowned) estimators on a quite large percentage of the datasets.
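The SMOTE oversampling step mentioned above can be sketched as follows. This is a minimal illustration of the idea only, not the authors' actual pipeline; the function name `smote` and its parameters are hypothetical. Each synthetic minority sample is an interpolation between a randomly chosen minority observation and one of its k nearest minority neighbours:

```python
import numpy as np

def smote(X_min, n_new, k=5, rng=None):
    """Generate n_new synthetic minority samples (a minimal SMOTE sketch).

    For each synthetic point, pick a random minority sample, choose one of
    its k nearest minority neighbours, and interpolate between the two.
    """
    rng = np.random.default_rng(rng)
    n = len(X_min)
    # pairwise Euclidean distances within the minority class
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)            # a point is not its own neighbour
    nbrs = np.argsort(d, axis=1)[:, :k]    # k nearest neighbours per sample
    new = np.empty((n_new, X_min.shape[1]))
    for i in range(n_new):
        j = rng.integers(n)                       # random minority sample
        m = nbrs[j, rng.integers(min(k, n - 1))]  # one of its neighbours
        lam = rng.random()                        # interpolation factor in [0, 1]
        new[i] = X_min[j] + lam * (X_min[m] - X_min[j])
    return new
```

Because every synthetic point is a convex combination of two minority observations, the oversampled class stays inside the region spanned by the original minority data.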
- MeSH
- Algorithms * MeSH
- Least-Squares Analysis MeSH
- Artificial Intelligence * MeSH
- Publication Type
- Journal Article MeSH
- MeSH
- Biomedical Research MeSH
- Biostatistics * methods MeSH
- Multivariate Analysis MeSH
- Publication Type
- Research Support, Non-U.S. Gov't MeSH
The aim of this paper is to overview challenges and principles of Big Data analysis in biomedicine. Recent multivariate statistical approaches to complexity reduction represent a useful (and often irreplaceable) methodology allowing a reliable Big Data analysis to be performed. Attention is paid to principal component analysis, partial least squares, and variable selection based on maximizing conditional entropy. Some important problems, as well as ideas of complexity reduction, are illustrated with examples from biomedical research tasks. These include high-dimensional data in the form of facial images or gene expression measurements from a cardiovascular genetic study.
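The complexity reduction by principal component analysis mentioned above can be sketched as follows; this is a generic textbook illustration via the singular value decomposition of the centred data matrix, not the specific procedure used in the paper:

```python
import numpy as np

def pca(X, n_components):
    """Project X onto its first n_components principal components
    (complexity reduction via SVD of the centred data matrix)."""
    Xc = X - X.mean(axis=0)                 # centre each variable
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = Xc @ Vt[:n_components].T       # low-dimensional representation
    explained = s[:n_components] ** 2 / np.sum(s ** 2)  # variance ratios
    return scores, explained
```

Keeping only a few components turns a high-dimensional dataset (e.g. facial images or gene expressions) into a small set of derived variables that retain most of the variance.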
- MeSH
- Data Analysis MeSH
- Principal Component Analysis methods MeSH
- Big Data * MeSH
- Biostatistics * methods MeSH
- Cardiovascular Diseases genetics prevention & control MeSH
- Humans MeSH
- Least-Squares Analysis MeSH
- Risk MeSH
- Facial Recognition MeSH
- Clinical Decision Support Systems MeSH
- Check Tag
- Humans MeSH
- Publication Type
- Research Support, Non-U.S. Gov't MeSH
Clinical decision support systems represent important telemedicine tools with the ability to help physicians in the decision process leading to determining the diagnosis, therapy, or prognosis of patients. We proposed and implemented a prototype of a clinical decision support system, which has the form of an internet classification service. A specific property of this system is a sophisticated statistical component, which makes it possible to handle even a large number of symptoms and signs. In particular, it optimizes the selection of those symptoms and signs that are the most relevant for determining the diagnosis. The performance of the prototype was verified on an analysis of gene expression data from a cardiovascular genetic study. The paper discusses principles of multivariate statistical thinking and reveals challenges of analyzing high-dimensional data in which the number of observed variables (symptoms and signs) largely exceeds the number of observations (patients).
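A common baseline for selecting the most diagnosis-relevant variables in such high-dimensional two-group data is ranking by a two-sample t-statistic. The sketch below illustrates this generic idea only; the prototype's actual statistical component is not described in enough detail here to reproduce, and the function name is hypothetical:

```python
import numpy as np

def t_statistic_ranking(X, y, top=10):
    """Rank variables (e.g. gene expressions) by a two-sample Welch
    t-statistic between groups y == 0 and y == 1; return indices of the
    top variables with the largest absolute statistic."""
    g0, g1 = X[y == 0], X[y == 1]
    m0, m1 = g0.mean(axis=0), g1.mean(axis=0)
    v0, v1 = g0.var(axis=0, ddof=1), g1.var(axis=0, ddof=1)
    t = (m1 - m0) / np.sqrt(v0 / len(g0) + v1 / len(g1))
    return np.argsort(-np.abs(t))[:top]
```

Variables whose group means differ strongly relative to their variability come out on top, while uninformative variables are discarded before the classifier is learned.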
Decision support systems represent very complicated systems offering assistance with the decision-making process. Learning the classification rule of a decision support system requires solving a complex statistical task, most commonly by means of classification analysis. However, the regression methodology may be useful in this context as well. The aim of this paper is to overview various regression methods, discuss their properties, and show examples within clinical decision making.
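One regression method that directly yields a classification rule is logistic regression. A minimal sketch of learning it by gradient descent on the log-loss is shown below; this is a generic illustration, not a method prescribed by the paper:

```python
import numpy as np

def fit_logistic(X, y, lr=0.1, n_iter=2000):
    """Learn a logistic-regression classification rule by gradient descent.
    Returns weights w and intercept b for P(y=1|x) = sigmoid(x @ w + b)."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))   # predicted probabilities
        grad_w = X.T @ (p - y) / len(y)          # gradient of the log-loss
        grad_b = np.mean(p - y)
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b
```

Classifying a new patient then amounts to thresholding the predicted probability at 0.5, which makes the rule easy to interpret compared to black-box classifiers.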
- MeSH
- Data Interpretation, Statistical MeSH
- Clinical Decision-Making methods MeSH
- Linear Models MeSH
- Logistic Models MeSH
- Least-Squares Analysis MeSH
- Neural Networks, Computer MeSH
- Regression Analysis * MeSH
- Models, Statistical * MeSH
- Statistics as Topic MeSH
- Support Vector Machine MeSH
- Clinical Decision Support Systems MeSH
The amount of available data relevant for clinical decision support is rising not only rapidly but also much faster than our ability to analyze and interpret it. Thus, the potential of the data to contribute to determining the diagnosis, therapy, and prognosis of an individual patient is not appropriately exploited. The hope of obtaining benefit from the data for an individual patient must be accompanied by a reliable and diligent biostatistical analysis, which faces serious challenges not always clear to non-statisticians. The aim of this paper is to discuss principles of statistical analysis of big data in research and routine applications in clinical medicine, focusing on particular aspects of psychiatry. The paper brings arguments in favor of the idea that biostatistical analysis of data in a specialty field requires different approaches and different experience compared to other clinical fields; this is illustrated by a description of common complications of the analysis of psychiatric data. Challenges of the analysis of big data in both psychiatric research and routine practice are explained; such analysis is far from a routine service activity exploiting standard methods of multivariate statistics and/or machine learning. Research questions that are important in current psychiatric research are presented and discussed from the biostatistical point of view.
Gregor Mendel is generally acknowledged not only as the founder of genetics but also as the author of the first mathematical result in biology. Although his education was questioned for a long time, he was profoundly educated in botany as well as physics and in those parts of mathematics (combinatorics, probability theory) applied in his later pea plant experiments. Nevertheless, debates remain in the statistical literature about why Mendel's results are in such suspiciously good accordance with the expected values [22, 28]. The main aim of this paper is to propose new two-stage statistical models that are in better accordance with Mendel's data than the classical model, where the latter considers a fixed sample size. If Mendel performed his experiments following such a two-stage algorithm, which, however, cannot be proven, the results would purify Mendel's legacy and remove the suspicion that he modified his results. Mendel's experiments are described from a statistical point of view, and his data are shown to be close to data randomly generated from the novel models. The model found to be the most suitable is remarkably simpler than the model of [28], while yielding only slightly weaker results. The paper also discusses Mendel's legacy from the point of view of biostatistics.
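The general idea that a two-stage sampling rule pulls observed ratios toward the expectation can be illustrated with a toy simulation. The rule below (grow the sample with a second batch whenever the first batch deviates from the expected 3:1 ratio by more than a tolerance) is a hypothetical sketch for illustration, not the authors' actual model:

```python
import numpy as np

def classical(n, rng):
    """One experiment with fixed sample size: dominant count ~ Binomial(n, 3/4)."""
    return rng.binomial(n, 0.75) / n

def two_stage(n, tol, rng):
    """Hypothetical two-stage rule: if the first batch deviates from the 3:1
    expectation by more than tol, add a second batch of equal size and pool."""
    k = rng.binomial(n, 0.75)
    if abs(k / n - 0.75) > tol:
        k += rng.binomial(n, 0.75)
        n *= 2
    return k / n

rng = np.random.default_rng(1)
reps = 20000
dev_classical = np.mean([abs(classical(100, rng) - 0.75) for _ in range(reps)])
dev_two_stage = np.mean([abs(two_stage(100, 0.02, rng) - 0.75) for _ in range(reps)])
```

Averaged over many replications, the two-stage rule produces phenotype ratios systematically closer to 3:1 than the fixed-sample-size model, without any manipulation of individual counts.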
The Minimum Redundancy Maximum Relevance (MRMR) approach to supervised variable selection represents a successful methodology for dimensionality reduction, suitable for high-dimensional data observed in two or more different groups. Various available versions of the MRMR approach have been designed to search for variables with the largest relevance for a classification task while controlling for redundancy of the selected set of variables. However, the usual relevance and redundancy criteria have the disadvantage of being too sensitive to the presence of outlying measurements and/or being inefficient. We propose a novel approach called Minimum Regularized Redundancy Maximum Robust Relevance (MRRMRR), suitable for noisy high-dimensional data observed in two groups. It combines principles of regularization and robust statistics. Particularly, redundancy is measured by a new regularized version of the coefficient of multiple correlation, and relevance is measured by a highly robust correlation coefficient based on the least weighted squares regression with data-adaptive weights. We compare various dimensionality reduction methods on three real data sets. To investigate the influence of noise or outliers on the data, we also perform the computations for data artificially contaminated by severe noise of various forms. The experimental results confirm the robustness of the method with respect to outliers.
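The rank-based down-weighting behind least weighted squares can be sketched as follows. This linear, iteratively reweighted sketch only illustrates the idea of assigning decreasing weights to ranks of squared residuals; the estimator in the paper (with its data-adaptive weights) is more involved, and the function name is hypothetical:

```python
import numpy as np

def lws_fit(X, y, n_iter=20):
    """Sketch of least weighted squares regression: iterate weighted least
    squares with linearly decreasing weights assigned to the ranks of the
    squared residuals, so observations with large residuals (potential
    outliers) are strongly down-weighted."""
    n = len(y)
    Xi = np.column_stack([np.ones(n), X])         # add intercept column
    beta = np.linalg.lstsq(Xi, y, rcond=None)[0]  # ordinary LS start
    for _ in range(n_iter):
        r2 = (y - Xi @ beta) ** 2
        ranks = np.argsort(np.argsort(r2))        # rank 0 = smallest residual
        w = 1.0 - ranks / n                       # linearly decreasing weights
        sw = np.sqrt(w)
        beta = np.linalg.lstsq(Xi * sw[:, None], y * sw, rcond=None)[0]
    return beta
```

On data contaminated by a few gross outliers, the fitted coefficients stay close to the trend of the clean majority, while an ordinary least squares fit is pulled far away.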