imputation
OBJECTIVES: We aimed to compare various methods for imputing disease activity in longitudinally collected observational data of patients with axial spondyloarthritis (axSpA). METHODS: We conducted a simulation study on data from 8583 axSpA patients from ten European registries. Disease activity was assessed by the Axial Spondyloarthritis Disease Activity Score (ASDAS) and the corresponding low disease activity (LDA; ASDAS<2.1) state at baseline, 6 and 12 months. We focused on cross-sectional methods which impute missing values of an individual at a particular time point based on the available information from other individuals at that time point. We applied nine single and five multiple imputation methods, covering mean, regression and hot deck methods. The performance of each imputation method was evaluated via relative bias and coverage of 95% confidence intervals for the mean ASDAS and the derived proportion of patients in LDA. RESULTS: Hot deck imputation methods outperformed mean and regression methods, particularly when assessing LDA. Multiple imputation procedures provided better coverage than the corresponding single imputation ones. However, none of the evaluated methods produced unbiased estimates with adequate coverage across all time points, with performance for missing baseline data being worse than for missing follow-up data. Predictive mean and weighted predictive mean hot deck imputation procedures consistently provided results with low bias. CONCLUSIONS: This study contributes to the available methods for imputing disease activity in observational research. Hot deck imputation using predictive mean matching exhibited the highest robustness and is thus our suggested approach.
- Keywords
- Axial Spondyloarthritis, Epidemiology, Interleukin-17, Tumour Necrosis Factor Inhibitors
- MeSH
- Axial Spondyloarthritis * epidemiology diagnosis MeSH
- Adult MeSH
- Humans MeSH
- Observational Studies as Topic * MeSH
- Cross-Sectional Studies MeSH
- Registries MeSH
- Spondylarthritis * diagnosis MeSH
- Severity of Illness Index MeSH
- Check Tag
- Adult MeSH
- Humans MeSH
- Male MeSH
- Female MeSH
- Publication type
- Journal Article MeSH
- Geographic names
- Europe epidemiology MeSH
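The axSpA abstract above recommends hot deck imputation using predictive mean matching. As a minimal sketch of that idea, assuming a single numeric covariate and a simple least-squares line (the abstract does not describe the study's actual model or covariates), each missing value is replaced by the observed value of a donor whose predicted score is close to the prediction for the incomplete case. The names `fit_line`, `pmm_impute` and the parameter `k` are hypothetical and chosen only for illustration.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x with one covariate."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b


def pmm_impute(xs, ys, k=3, rng=None):
    """Fill None entries of ys by predictive mean matching on xs.

    Assumption: xs is fully observed and at least two donors with
    distinct x values exist, so the regression line is defined.
    """
    rng = rng or random.Random(0)
    donors = [(x, y) for x, y in zip(xs, ys) if y is not None]
    a, b = fit_line([x for x, _ in donors], [y for _, y in donors])
    # Pair each donor's predicted value with its observed value.
    donor_preds = [(a + b * x, y) for x, y in donors]
    out = []
    for x, y in zip(xs, ys):
        if y is not None:
            out.append(y)
            continue
        target = a + b * x
        # Keep the k donors whose predictions are closest to the target.
        nearest = sorted(donor_preds, key=lambda d: abs(d[0] - target))[:k]
        # Hot deck step: copy an actually observed value from one donor.
        out.append(rng.choice(nearest)[1])
    return out
```

Because the imputed value is always copied from a real donor, it stays within the range of observed scores, which may matter when a threshold such as ASDAS<2.1 is derived from the imputed data.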
Genomic selection (GS) potentially offers an unparalleled advantage over traditional pedigree-based selection (TS) methods by reducing the time required to carry out a single cycle of tree improvement. This quality is particularly appealing to tree breeders, for whom lengthy improvement cycles are the norm. We explored the prospect of implementing GS for interior spruce (Picea engelmannii × glauca) using a genotyped population of 769 trees belonging to 25 open-pollinated families. A series of repeated tree height measurements at ages 3-40 years permitted the testing of GS methods over time. The genotyping-by-sequencing (GBS) platform was used for single nucleotide polymorphism (SNP) discovery in conjunction with three unordered imputation methods applied to a data set with 60% missing information. Further, three diverse GS models were evaluated based on predictive accuracy (PA) and their marker effects. Moderate levels of PA (0.31-0.55) were observed and were sufficient to deliver improved selection response over TS. Additionally, PA varied substantially through time, in accordance with spatial competition among trees. As expected, temporal PA was well correlated with age-age genetic correlation (r=0.99), and decreased substantially with increasing difference in age between the training and validation populations (0.04-0.47). Moreover, our imputation comparisons indicate that k-nearest neighbor and singular value decomposition yielded a greater number of SNPs and gave higher predictive accuracies than imputing with the mean. Furthermore, the ridge regression (rrBLUP) and BayesCπ (BCπ) models yielded equal PA, which was better than that of the generalized ridge regression heteroscedastic effect model for the traits evaluated.
- MeSH
- Genotype MeSH
- Genotyping Techniques methods MeSH
- Polymorphism, Single Nucleotide MeSH
- Models, Genetic * MeSH
- Genetics, Population MeSH
- Selection, Genetic * MeSH
- Picea genetics MeSH
- Publication type
- Journal Article MeSH
- Grant-Supported Work MeSH
- Comparative Study MeSH
- Geographic names
- British Columbia MeSH
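The spruce abstract above contrasts mean imputation with k-nearest neighbor imputation of SNP data. The following is a rough sketch of the difference, assuming genotypes coded 0/1/2 with `None` for missing calls and a mean squared difference over jointly observed markers as the distance. The coding, the distance and the function names are illustrative assumptions and not the study's implementation.

```python
def mean_impute(matrix):
    """Replace each missing call with the mean of its marker (column).

    Assumption: every marker has at least one observed call.
    """
    means = []
    for col in zip(*matrix):
        obs = [v for v in col if v is not None]
        means.append(sum(obs) / len(obs))
    return [[means[j] if v is None else v for j, v in enumerate(row)]
            for row in matrix]


def row_distance(a, b):
    """Mean squared difference over markers observed in both rows."""
    pairs = [(x, y) for x, y in zip(a, b)
             if x is not None and y is not None]
    if not pairs:
        return float("inf")
    return sum((x - y) ** 2 for x, y in pairs) / len(pairs)


def knn_impute(matrix, k=2):
    """Replace each missing call with the average of the k nearest rows.

    Assumption: for every missing call, at least one other row has
    that marker observed.
    """
    out = [list(row) for row in matrix]
    for i, row in enumerate(matrix):
        for j, v in enumerate(row):
            if v is not None:
                continue
            # Candidate donors: other rows with marker j observed.
            candidates = [(row_distance(row, other), other[j])
                          for m, other in enumerate(matrix)
                          if m != i and other[j] is not None]
            candidates.sort(key=lambda c: c[0])
            nearest = candidates[:k]
            out[i][j] = sum(val for _, val in nearest) / len(nearest)
    return out
```

In a toy matrix where the individual with a missing call closely resembles two others that both carry genotype 2, `knn_impute` returns 2.0 for that call, whereas `mean_impute` returns the marker mean regardless of similarity.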
OBJECTIVE: The idiopathic inflammatory myopathies (IIMs) are heterogeneous diseases thought to be initiated by immune activation in genetically predisposed individuals. We imputed variants from the ImmunoChip array using a large reference panel to fine-map associations and identify novel associations in IIM. METHODS: We analyzed 2,565 Caucasian IIM patient samples collected through the Myositis Genetics Consortium (MYOGEN) and 10,260 ethnically matched control samples. We imputed 1,648,116 variants from the ImmunoChip array using the Haplotype Reference Consortium panel and conducted association analysis on IIM and clinical and serologic subgroups. RESULTS: The HLA locus was consistently the most significantly associated region. Four non-HLA regions reached genome-wide significance, SDK2 and LINC00924 (both novel) and STAT4 in the whole IIM cohort, with evidence of independent variants in STAT4, and NAB1 in the polymyositis (PM) subgroup. We also found suggestive evidence of association with loci previously associated with other autoimmune rheumatic diseases (TEC and LTBR). We identified more significant associations than those previously reported in IIM for STAT4 and DGKQ in the total cohort, for NAB1 and FAM167A-BLK loci in PM, and for CCR5 in inclusion body myositis. We found enrichment of variants among DNase I hypersensitivity sites and histone marks associated with active transcription within blood cells. CONCLUSION: We found novel and strong associations in IIM and PM and localized signals to single genes and immune cell types.
INTRODUCTION: Recent advances in machine learning provide new possibilities to process and analyse observational patient data to predict patient outcomes. In this paper, we introduce a data processing pipeline for cardiogenic shock (CS) prediction from the MIMIC III database of intensive cardiac care unit patients with acute coronary syndrome. The ability to identify high-risk patients could possibly allow taking pre-emptive measures and thus prevent the development of CS. METHODS: We mainly focus on techniques for the imputation of missing data by generating a pipeline for imputation and comparing the performance of various multivariate imputation algorithms, including k-nearest neighbours, two singular value decomposition (SVD)-based methods, and Multiple Imputation by Chained Equations. After imputation, we select the final subjects and variables from the imputed dataset and showcase the performance of the gradient-boosted framework that uses a tree-based classifier for cardiogenic shock prediction. RESULTS: We achieved good classification performance thanks to data cleaning and imputation (cross-validated mean area under the curve 0.805) without hyperparameter optimization. CONCLUSION: We believe our pre-processing pipeline would prove helpful also for other classification and regression experiments.
- Keywords
- cardiogenic shock, classification, machine learning, missing data imputation, prediction model, processing pipeline
- Publication type
- Journal Article MeSH
OBJECTIVE: Lung cancer exhibits unpredictable recurrence in low-stage tumors and variable responses to different therapeutic interventions. Predicting relapse in early-stage lung cancer can facilitate precision medicine and improve patient survivability. While existing machine learning models rely on clinical data, incorporating genomic information could enhance their efficiency. This study aims to impute and integrate specific types of genomic data with clinical data to improve the accuracy of machine learning models for predicting relapse in early-stage, non-small cell lung cancer patients. METHODS: The study utilized a publicly available TCGA lung cancer cohort and imputed genetic pathway scores into the Spanish Lung Cancer Group (SLCG) data, specifically in 1348 early-stage patients. Initially, tumor recurrence was predicted without imputed pathway scores. Subsequently, the SLCG data were augmented with pathway scores imputed from TCGA. The integrative approach aimed to enhance relapse risk prediction performance. RESULTS: The integrative approach achieved improved relapse risk prediction with the following evaluation metrics: an area under the precision-recall curve (PR-AUC) score of 0.75, an area under the ROC (ROC-AUC) score of 0.80, an F1 score of 0.61, and a Precision of 0.80. The prediction explanation model SHAP (SHapley Additive exPlanations) was employed to explain the machine learning model's predictions. CONCLUSION: We conclude that our explainable predictive model is a promising tool for oncologists that addresses an unmet clinical need of post-treatment patient stratification based on the relapse risk while also improving the predictive power by incorporating proxy genomic data not available for specific patients.
- Keywords
- Classification, Explanation, Imputation, Recurrence, Regression, Supervised
- MeSH
- Humans MeSH
- Neoplasm Recurrence, Local genetics MeSH
- Small Cell Lung Carcinoma * MeSH
- Lung Neoplasms * diagnosis genetics MeSH
- Carcinoma, Non-Small-Cell Lung * diagnosis genetics MeSH
- Lung MeSH
- Check Tag
- Humans MeSH
- Publication type
- Journal Article MeSH
- Grant-Supported Work MeSH
OBJECTIVE: To compare several methods of missing data imputation for function (Health Assessment Questionnaire) and for disease activity (Disease Activity Score-28 and Clinical Disease Activity Index) in rheumatoid arthritis (RA) patients. METHODS: One thousand RA patients from observational cohort studies with complete data for function and disease activity at baseline, 6, 12 and 24 months were selected to conduct a simulation study. Values were deleted at random or following a predicted attrition bias. Three types of imputation were performed: (1) methods imputing forward in time (last observation carried forward; linear forward extrapolation); (2) methods considering data both forward and backward in time (nearest available observation-NAO; linear extrapolation; polynomial extrapolation); and (3) methods using multi-individual models (linear mixed effects cubic regression-LME3; multiple imputation by chained equations-MICE). The performance of each estimation method was assessed using the difference between the true value and the value obtained after imputation of the missing values, for the mean outcome and for the remission and low disease activity rates. RESULTS: When imputing missing baseline values, all methods underestimated the true value to a similar degree, but LME3 and MICE correctly estimated remission and low disease activity rates. When imputing missing follow-up values at 6, 12, or 24 months, NAO provided the least biased estimate of the mean disease activity and corresponding remission rate. These results were not affected by the presence of attrition bias. CONCLUSION: When imputing function and disease activity in large registers of active RA patients, researchers can consider the use of a simple method such as NAO for missing follow-up data, and the use of mixed-effects regression or multiple imputation for baseline data.
- Keywords
- DAS28, disease activity, epidemiology, outcomes research, rheumatoid arthritis
- MeSH
- Algorithms MeSH
- Remission Induction MeSH
- Data Interpretation, Statistical * MeSH
- Cohort Studies MeSH
- Humans MeSH
- Linear Models MeSH
- Follow-Up Studies MeSH
- Computer Simulation MeSH
- Arthritis, Rheumatoid epidemiology MeSH
- Severity of Illness Index MeSH
- Research Design statistics & numerical data MeSH
- Bias (Epidemiology) MeSH
- Check Tag
- Humans MeSH
- Male MeSH
- Female MeSH
- Publication type
- Journal Article MeSH
- Observational Study MeSH
- Grant-Supported Work MeSH
- Comparative Study MeSH
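Two of the single-patient methods named in the RA abstract above, last observation carried forward (LOCF) and nearest available observation (NAO), can be sketched for one patient's visit series as follows. This is an illustrative reading of the method names, not the study's code. In particular, resolving a tie between an equally distant earlier and later visit in favour of the earlier one is an assumption made here, and the function names are hypothetical.

```python
def locf(values):
    """Carry the last observed value forward in time.

    Missing values before the first observation stay None, because
    there is nothing earlier to carry forward.
    """
    out, last = [], None
    for v in values:
        if v is not None:
            last = v
        out.append(last)
    return out


def nao(times, values):
    """Fill each gap with the observation closest in time.

    Looks both backward and forward. Assumption: on a tie the earlier
    visit wins, because min() returns the first minimum it encounters.
    """
    observed = [(t, v) for t, v in zip(times, values) if v is not None]
    out = []
    for t, v in zip(times, values):
        if v is not None or not observed:
            out.append(v)
            continue
        nearest = min(observed, key=lambda tv: abs(tv[0] - t))
        out.append(nearest[1])
    return out
```

With visits at 0, 6, 12 and 24 months and a missing baseline value, `locf` leaves the baseline empty while `nao` borrows the 6-month value, which shows why only methods that also look backward in time can impute baseline data at all.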
BACKGROUND: Observational data on composite scores often comes with missing component information. When a complete-case (CC) analysis of composite scores is unbiased, preferable approaches of dealing with missing component information should also be unbiased and provide a more precise estimate. We assessed the performance of several methods compared to CC analysis in estimating the means of common composite scores used in axial spondyloarthritis research. METHODS: Individual mean imputation (IMI), the modified formula method (MF), overall mean imputation (OMI), and multiple imputation of missing component values (MI) were assessed either analytically or by means of simulations from available data collected across Europe. Their performance in estimating the means of the Bath Ankylosing Spondylitis Disease Activity Index (BASDAI), the Bath Ankylosing Spondylitis Functional Index (BASFI), and the Ankylosing Spondylitis Disease Activity Score based on C-reactive protein (ASDAS-CRP) in cases where component information was set missing completely at random was compared to the CC approach based on bias, variance, and coverage. RESULTS: Like the MF method, IMI uses a modified formula for observations with missing components resulting in modified composite scores. In the case of an unbiased CC approach, these two methods yielded representative samples of the distribution arising from a mixture of the original and modified composite scores, which, however, could not be considered the same as the distribution of the original score. The IMI and MF method are, thus, intrinsically biased. OMI provided an unbiased mean but displayed a complex dependence structure among observations that, if not accounted for, resulted in severe coverage issues. MI improved precision compared to CC and gave unbiased means and proper coverage as long as the extent of missingness was not too large. 
CONCLUSIONS: MI of missing component values was the only method found successful in retaining CC's unbiasedness and in providing increased precision for estimating the means of BASDAI, BASFI, and ASDAS-CRP. However, since MI is susceptible to incorrect implementation and its performance may become questionable with increasing missingness, we consider the implementation of an error-free CC approach a valid and valuable option. TRIAL REGISTRATION: Not applicable as study uses data from patient registries.
- Keywords
- Axial spondyloarthritis, Complete-case analysis, Composite score, Missing components, Multiple imputation
- MeSH
- Axial Spondyloarthritis * diagnosis MeSH
- C-Reactive Protein analysis MeSH
- Data Interpretation, Statistical MeSH
- Humans MeSH
- Severity of Illness Index MeSH
- Research Design MeSH
- Bias (Epidemiology) MeSH
- Check Tag
- Humans MeSH
- Publication type
- Journal Article MeSH
- Geographic names
- Europe MeSH
- Substance names
- C-Reactive Protein MeSH
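The composite-score abstract above evaluates multiple imputation (MI) by bias, variance and coverage. It does not state how estimates from the imputed data sets were combined. A common choice is Rubin's rules, sketched below under that assumption: the pooled mean is the average of the per-data-set means, and the total variance adds the between-imputation variance, inflated by (1 + 1/m), to the average within-imputation variance. The function name `pool_rubin` is hypothetical.

```python
from statistics import mean, variance


def pool_rubin(estimates, variances):
    """Pool m point estimates and their variances with Rubin's rules.

    estimates: one estimate (e.g. a mean score) per imputed data set
    variances: the squared standard error of each estimate
    Requires m >= 2 so that the between-imputation variance exists.
    """
    m = len(estimates)
    q_bar = mean(estimates)          # pooled point estimate
    within = mean(variances)         # average within-imputation variance
    between = variance(estimates)    # sample variance, divisor m - 1
    total = within + (1 + 1 / m) * between
    return q_bar, total
```

The between-imputation term is what single imputation omits, which is one plausible reason single imputation methods tend to produce confidence intervals that are too narrow.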
Genome-wide association studies (GWAS) have identified common pancreatic cancer susceptibility variants at 13 chromosomal loci in individuals of European descent. To identify new susceptibility variants, we performed imputation based on 1000 Genomes (1000G) Project data and association analysis using 5,107 case and 8,845 control subjects from 27 cohort and case-control studies that participated in the PanScan I-III GWAS. This analysis, in combination with a two-staged replication in an additional 6,076 case and 7,555 control subjects from the PANcreatic Disease ReseArch (PANDoRA) and Pancreatic Cancer Case-Control (PanC4) Consortia, uncovered 3 new pancreatic cancer risk signals marked by single nucleotide polymorphisms (SNPs) rs2816938 at chromosome 1q32.1 (per allele odds ratio (OR) = 1.20, P = 4.88 × 10^-15), rs10094872 at 8q24.21 (OR = 1.15, P = 3.22 × 10^-9) and rs35226131 at 5p15.33 (OR = 0.71, P = 1.70 × 10^-8). These SNPs represent independent risk variants at previously identified pancreatic cancer risk loci on chr1q32.1 (NR5A2), chr8q24.21 (MYC) and chr5p15.33 (CLPTM1L-TERT) as per analyses conditioned on previously reported susceptibility variants. We assessed expression of candidate genes at the three risk loci in histologically normal (n = 10) and tumor (n = 8) derived pancreatic tissue samples and observed a marked reduction of NR5A2 expression (chr1q32.1) in the tumors (fold change -7.6, P = 5.7 × 10^-8). This finding was validated in a second set of paired (n = 20) histologically normal and tumor derived pancreatic tissue samples (average fold change for three NR5A2 isoforms -31.3 to -95.7, P = 7.5 × 10^-4 to 2.0 × 10^-3). Our study has identified new susceptibility variants independently conferring pancreatic cancer risk that merit functional follow-up to identify target genes and explain the underlying biology.
- Keywords
- GWAS, NR5A2, fine-mapping, imputation, pancreatic cancer
- MeSH
- Genome-Wide Association Study methods MeSH
- Datasets as Topic MeSH
- Genetic Predisposition to Disease genetics MeSH
- Genotype MeSH
- Polymorphism, Single Nucleotide genetics MeSH
- Humans MeSH
- Chromosomes, Human, Pair 1 genetics MeSH
- Chromosomes, Human, Pair 5 genetics MeSH
- Chromosomes, Human, Pair 8 genetics MeSH
- Pancreatic Neoplasms genetics MeSH
- Check Tag
- Humans MeSH
- Publication type
- Journal Article MeSH
KEY MESSAGE: The machine learning algorithm extreme gradient boosting can be employed to address the issue of long data gaps in individual trees, without the need for additional tree-growth data or climatic variables. ABSTRACT: The susceptibility of dendrometer devices to technical failures often makes time-series analyses challenging. Resulting data gaps decrease sample size and complicate time-series comparison and integration. Existing methods either focus on bridging smaller gaps, are dependent on data from other trees or rely on climate parameters. In this study, we test eight machine learning (ML) algorithms to fill gaps in dendrometer data of individual trees in urban and non-urban environments. Among these algorithms, extreme gradient boosting (XGB) demonstrates the best skill to bridge artificially created gaps throughout the growing seasons of individual trees. The individual tree models are suited to fill gaps up to 30 consecutive days and perform particularly well at the start and end of the growing season. The method is independent of climate input variables or dendrometer data from neighbouring trees. The varying limitations among existing approaches call for cross-comparison of multiple methods and visual control. Our findings indicate that ML is a valid approach to fill gaps in individual trees, which can be of particular importance in situations of limited inter-tree co-variance, such as in urban environments. SUPPLEMENTARY INFORMATION: The online version contains supplementary material available at 10.1007/s00468-024-02573-y.
- Keywords
- Acer platanoides, Dendroecology, Imputation, Platanus x hispanica, Tree growth, Urban trees
- Publication type
- Journal Article MeSH
INTRODUCTION: Physical fitness benefits health. However, there is a research gap on how physical fitness, particularly aerobic endurance capacity and muscle power, is influenced by residential altitude, blood parameters, weight, and other cofactors in a population living at low to moderate altitudes (300-2100 m above sea level). MATERIALS AND METHODS: We explored how endurance and muscle power performance changes with residential altitude, Body Mass Index (BMI), hemoglobin and creatinine levels among 108,677 Swiss men aged 18-22 years (covering >90% of Swiss birth cohorts) conscripted to the Swiss Armed Forces between 2007 and 2012. Blood test results were available for about 65% of conscripts, physical evaluation results for about 85%, and BMI for all conscripts. RESULTS: Residential altitude was significantly associated with endurance (p < 0.001) but not with muscle power performance (p = 0.858) after adjusting for all available cofactors. Higher BMI showed the greatest negative association with both endurance and muscle power performance. For muscle power performance, the association with creatinine levels was significant. Elevated C-reactive protein (CRP) and hemoglobin levels were stronger contributors in explaining endurance than muscle power performance. CONCLUSION: We found a significant association between low to moderate residential altitude and aerobic endurance capacity even after adjustment for hemoglobin, creatinine, BMI and sociodemographic factors. Non-assessed factors such as vitamin D levels, air pollution, and lifestyle aspects may partially explain the remaining association and could also be associated with residential altitude. Monitoring the health and fitness of young people and their determinants is important and of practical concern for disease prevention and public health.
- Keywords
- C-reactive protein, Switzerland, VO2max, general additive models, hemoglobin, multiple imputation
- Publication type
- Journal Article MeSH