Query: Dataset imbalance
Over the past decade, the use of deep learning in medical image diagnosis has increased widely. The performance of deep learning-based methods (DLMs) relies strongly on training data. Therefore, researchers often focus on collecting as much data as possible from different medical facilities or on developing approaches to avoid the impact of inter-category imbalance (ICI), that is, a difference in data quantity among categories. However, because of the ICI within each medical facility, medical data are often isolated and acquired in different settings across facilities, an issue known as intra-source imbalance (ISI). This imbalance also affects the performance of DLMs but receives negligible attention. In this study, we examine the impact of ISI on DLMs by comparing versions of a deep learning model trained separately on an intra-source imbalanced chest X-ray (CXR) dataset and on an intra-source balanced CXR dataset for COVID-19 diagnosis. We find that using the intra-source imbalanced dataset causes a serious training bias, even though the dataset has a good inter-category balance. In contrast, the deep learning model produced reliable diagnoses when trained on the intra-source balanced dataset. Our study therefore provides clear evidence that intra-source balance is vital for training data to minimize the risk of poor DLM performance.
- MeSH
- COVID-19 * diagnostic imaging MeSH
- deep learning * MeSH
- thorax MeSH
- humans MeSH
- X-rays MeSH
- COVID-19 testing MeSH
- Check Tag
- humans MeSH
- Publication type
- journal article MeSH
- research supported by a grant MeSH
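The intra-source imbalance described in the abstract above can be illustrated with a small check. This is a minimal sketch under stated assumptions, not the authors' method: the function names, the `min_share` threshold, and the toy hospital data are all hypothetical. It takes one simple reading of ISI, in which a source is flagged when its rarest category makes up less than a chosen share of that source's samples.

```python
from collections import Counter


def source_category_table(records):
    """Count samples per (source, category) pair.

    `records` is an iterable of (source, category) tuples; both labels
    are hypothetical and used only for this illustration.
    """
    return Counter(records)


def intra_source_imbalanced(records, min_share=0.2):
    """Return the sources whose rarest category falls below `min_share`.

    The 0.2 threshold is an arbitrary assumption, not a value from the study.
    """
    table = source_category_table(records)
    categories = sorted({cat for _, cat in table})
    sources = sorted({src for src, _ in table})
    flagged = []
    for src in sources:
        counts = [table.get((src, cat), 0) for cat in categories]
        if min(counts) / sum(counts) < min_share:
            flagged.append(src)
    return flagged


# Toy data: categories are balanced overall (100 vs. 100), but each
# hospital contributes almost exclusively one category.
toy = (
    [("hospital_A", "covid")] * 95 + [("hospital_A", "normal")] * 5
    + [("hospital_B", "covid")] * 5 + [("hospital_B", "normal")] * 95
)
overall = Counter(cat for _, cat in toy)
flagged_sources = intra_source_imbalanced(toy)
print(overall)
print(flagged_sources)
```

In the toy data the overall category counts are equal, yet both sources are flagged, which mirrors the situation the abstract describes: good inter-category balance can hide a source-level skew that a model may learn instead of the disease signal.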
PURPOSE: Radiotherapy outcome modelling often suffers from class imbalance in the modelled endpoints. One of the main options to address this issue is to introduce new synthetically generated datapoints using generative models, such as Denoising Diffusion Probabilistic Models (DDPM). In this study, we implemented DDPM to improve the performance of a tumor local control model trained on an imbalanced dataset, and compared this approach with other common techniques. METHODS: A dataset of 535 NSCLC patients treated with SBRT (50 Gy/5 fractions) was used to train a deep learning outcome model for tumor local control prediction. The dataset included complete treatment planning data (planning CT images, 3D planning dose distribution and patient demographics) with sparsely distributed endpoints (6-7% experiencing local failure). Consequently, we trained a novel conditional 3D DDPM model to generate synthetic treatment planning data. Synthetically generated treatment planning datapoints were used to supplement the real training dataset, and the improvement in the model's performance was studied. The results obtained were also compared to other common techniques for class-imbalanced training, such as Oversampling, Undersampling, Augmentation, Class Weights, SMOTE and ADASYN. RESULTS: Synthetic DDPM-generated data were visually trustworthy, with Fréchet inception distance (FID) below 50. Extending the training dataset with the synthetic data improved the model's performance by more than 10%, while other techniques exhibited only about 4% improvement. CONCLUSIONS: DDPM introduces a novel approach to class-imbalanced outcome modelling problems. The model generates realistic synthetic radiotherapy planning data, with a strong potential to increase performance and robustness of outcome models.
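Two of the baseline techniques named in the abstract above, Class Weights and Oversampling, can be sketched in a few lines. This is an illustrative sketch only, not the study's implementation: the function names are hypothetical, the weights use the common inverse-frequency heuristic N / (K * n_c), and the toy label vector (35 events among 535 cases, roughly 6.5%) is an assumption chosen to fall within the reported 6-7% range.

```python
import random
from collections import Counter


def balanced_class_weights(labels):
    """Inverse-frequency weights: N / (K * n_c) for each class c."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {c: n / (k * m) for c, m in counts.items()}


def random_oversample(samples, labels, seed=0):
    """Duplicate randomly chosen minority samples until classes are equal."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_x, out_y = list(samples), list(labels)
    for c, m in counts.items():
        pool = [x for x, y in zip(samples, labels) if y == c]
        extra = [rng.choice(pool) for _ in range(target - m)]
        out_x.extend(extra)
        out_y.extend([c] * len(extra))
    return out_x, out_y


# Hypothetical toy labels: 1 = local failure, 0 = local control.
labels = [1] * 35 + [0] * 500
samples = list(range(len(labels)))  # stand-ins for real feature data

weights = balanced_class_weights(labels)
new_x, new_y = random_oversample(samples, labels)
print(weights)
print(Counter(new_y))
```

Class weights leave the data untouched and rescale each sample's contribution to the loss, whereas oversampling repeats existing minority cases. Neither adds new information, which is the gap that synthetic data generation aims to fill.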
INTRODUCTION: Precise localization of the epileptogenic zone is critical for successful epilepsy surgery. However, imbalanced datasets in terms of epileptic vs. normal electrode contacts and a lack of standardized evaluation guidelines hinder the consistent evaluation of automatic machine learning localization models. METHODS: This study addresses these challenges by analyzing class imbalance in clinical datasets and evaluating common assessment metrics. Data from 139 drug-resistant epilepsy patients across two institutions were analyzed. Metric behaviors were examined using clinical and simulated data. RESULTS: Complementary use of Area Under the Receiver Operating Characteristic (AUROC) and Area Under the Precision-Recall Curve (AUPRC) provides an optimal evaluation approach. This must be paired with an analysis of class imbalance and its impact, because significant variations were found in clinical datasets. CONCLUSIONS: The proposed framework offers a comprehensive and reliable method for evaluating machine learning models in epileptogenic zone localization, improving their precision and clinical relevance. SIGNIFICANCE: Adopting this framework will improve the comparability and multicenter testing of machine learning models in epileptogenic zone localization, enhancing their reliability and ultimately leading to better surgical outcomes for epilepsy patients.
- MeSH
- adult MeSH
- electrocorticography methods standards MeSH
- middle aged MeSH
- humans MeSH
- adolescent MeSH
- young adult MeSH
- drug resistant epilepsy * surgery physiopathology MeSH
- machine learning * MeSH
- Check Tag
- adult MeSH
- middle aged MeSH
- humans MeSH
- adolescent MeSH
- young adult MeSH
- male MeSH
- female MeSH
- Publication type
- journal article MeSH
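The complementary behavior of AUROC and AUPRC under class imbalance, as discussed in the abstract above, can be shown with a small sketch. It assumes binary 0/1 labels and uses hypothetical function names and toy scores. AUPRC is approximated here by average precision (a step-wise sum over distinct score thresholds), which is one common estimator and not necessarily the one used in the study.

```python
def auroc(y_true, scores):
    """Probability that a random positive outscores a random negative
    (ties count as one half)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(y_true, scores):
    """Sum of precision at each distinct threshold, weighted by the
    recall gained at that threshold."""
    n_pos = sum(y_true)
    ranked = sorted(zip(scores, y_true), key=lambda t: -t[0])
    ap, tp, i = 0.0, 0, 0
    while i < len(ranked):
        j, tp_group = i, 0
        # Consume every sample that shares the current score.
        while j < len(ranked) and ranked[j][0] == ranked[i][0]:
            tp_group += ranked[j][1]
            j += 1
        tp += tp_group
        ap += (tp_group / n_pos) * (tp / j)  # recall gain * precision
        i = j
    return ap


# Toy scores (hypothetical): 2 positive and 4 negative contacts.
pos_scores = [0.9, 0.6]
neg_scores = [0.8, 0.3, 0.2, 0.1]

results = {}
for factor in (1, 10):  # replicate negatives to increase the imbalance
    scores = pos_scores + neg_scores * factor
    y_true = [1] * len(pos_scores) + [0] * len(neg_scores) * factor
    results[factor] = (auroc(y_true, scores), average_precision(y_true, scores))
    print(factor, results[factor])
```

Replicating the negatives tenfold leaves AUROC unchanged while average precision falls, so in this toy setting the two metrics answer different questions. This is why reporting both alongside the class ratio is informative.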
Imbalanced datasets are prominent in real-world problems. In such problems, the number of data samples in one class is significantly higher than in the other classes, even though the other classes might be more important. Standard classification algorithms may assign all the data to the majority class, which is a significant drawback of most standard learning algorithms, so imbalanced datasets need to be handled carefully. One of the traditional algorithms, the twin support vector machine (TSVM), performs well on balanced data classification but poorly on imbalanced dataset classification. To improve the TSVM algorithm's classification ability for imbalanced datasets, a reduced universum twin support vector machine for class imbalance learning (RUTSVM) was recently proposed, driven by the universum twin support vector machine (UTSVM). Solving the dual problem and finding the classifiers involve matrix inverse computation, which is one of RUTSVM's key drawbacks. In this paper, we improve the RUTSVM and propose an improved reduced universum twin support vector machine for class imbalance learning (IRUTSVM). In the suggested IRUTSVM approach, we offer alternative Lagrangian functions to tackle the primal problems of RUTSVM by inserting one of the terms in the objective function into the constraints. As a result, we obtain a new dual formulation for each optimization problem, so that we need not compute inverse matrices either in the training process or in finding the classifiers. Moreover, rectangular kernel matrices of smaller size are used to reduce the computational time. Extensive testing is carried out on a variety of synthetic and real-world imbalanced datasets, and the findings show that the IRUTSVM algorithm outperforms the TSVM, UTSVM, and RUTSVM algorithms in terms of generalization performance.
- MeSH
- algorithms * MeSH
- support vector machine * MeSH
- Publication type
- journal article MeSH
Streptococcus pneumoniae is an opportunistic human pathogen that encodes a single eukaryotic-type Ser/Thr protein kinase StkP and its functional counterpart, the protein phosphatase PhpP. These signaling enzymes play critical roles in coordinating cell division and growth in pneumococci. In this study, we determined the proteome and phosphoproteome profiles of relevant mutants. Comparison of those with the wild-type provided a representative dataset of novel phosphoacceptor sites and StkP-dependent substrates. StkP phosphorylates key proteins involved in cell division and cell wall biosynthesis in both the unencapsulated laboratory strain Rx1 and the encapsulated virulent strain D39. Furthermore, we show that StkP plays an important role in triggering an adaptive response induced by a cell wall-directed antibiotic. Phosphorylation of the sensor histidine kinase WalK and downregulation of proteins of the WalRK core regulon suggest crosstalk between StkP and the WalRK two-component system. Analysis of proteomic profiles led to the identification of gene clusters regulated by catabolite control mechanisms, indicating a tight coupling of carbon metabolism and cell wall homeostasis. The imbalance of steady-state protein phosphorylation in the mutants as well as after antibiotic treatment is accompanied by an accumulation of the global Spx regulator, indicating a Spx-mediated envelope stress response. In summary, StkP relays the perceived signal of cell wall status to key cell division and regulatory proteins, controlling the cell cycle and cell wall homeostasis.
- MeSH
- anti-bacterial agents pharmacology MeSH
- bacterial proteins metabolism MeSH
- cell wall drug effects physiology MeSH
- phosphoproteins metabolism MeSH
- phosphorylation MeSH
- stress, physiological * MeSH
- protein kinases metabolism MeSH
- proteome MeSH
- Streptococcus pneumoniae drug effects physiology MeSH
- Publication type
- journal article MeSH
- dataset MeSH
- research supported by a grant MeSH
Burn management has significantly advanced in the past 75 years, resulting in improved mortality rates. However, there are still over one million burn victims in the United States each year, with over 3,000 burn-related deaths annually. The impacts of individual patient, hospital, and regional demographics on length of stay (LOS) and total cost have yet to be fully explored in a large nationally representative cohort. Thus, this study aimed to examine various hospital and patient characteristics using a sample of over 20,000 patients. Inpatient data from the National Inpatient Sample from 2008 to 2015 were analyzed, and only patients with an ICD-9 code for second- or third-degree burns were included. In addition, a major operating room procedure must have been indicated on the discharge summary for patients to be included in the final dataset, ensuring that only severe burns requiring complex care were analyzed. Analysis of covariance models were used to evaluate the impact of various patient, hospital, and regional variables on both LOS and cost. The study found that skin grafts and fasciotomy significantly increased the cost of hospitalization. Having burns on the face, neck, and trunk significantly increased costs for patients with second-degree burns, while burns on the trunk resulted in the longest LOS for patients with third-degree burns. Infections in the hospital and additional procedures, such as flaps and skin grafts, also led to longer stays. The study also found that the prevalence of postoperative complications, such as electrolyte imbalance, was high among patients with burn surgery.
- MeSH
- length of stay MeSH
- fasciotomy MeSH
- hospitalization MeSH
- humans MeSH
- burns * surgery MeSH
- retrospective studies MeSH
- Check Tag
- humans MeSH
- Publication type
- journal article MeSH
- Geographic names
- United States MeSH
Burn management has significantly advanced in the past 75 years, resulting in improved mortality rates. However, there are still over one million burn victims in the US each year, with over 3,000 burn-related deaths annually. The impacts of individual patient, hospital, and regional demographics on length of stay (LOS) and total cost have yet to be fully explored in a large nationally representative cohort. Thus, this study aimed to examine various hospital and patient characteristics using a sample of over 20,000 patients. Inpatient data from the National Inpatient Sample (NIS) from 2008-2015 were analyzed, and only patients with an ICD-9 code for second- or third-degree burns were included. Additionally, a major operating room procedure must have been indicated on the discharge summary for patients to be included in the final dataset, ensuring that only severe burns requiring complex care were analyzed. Analysis of Covariance (ANCOVA) models were used to evaluate the impact of various patient, hospital, and regional variables on both LOS and cost. The study found that skin grafts and fasciotomy significantly increased the cost of hospitalization. Having burns on the face, neck, and trunk significantly increased costs for patients with second-degree burns, while burns on the trunk resulted in the longest LOS for patients with third-degree burns. Infections in the hospital and additional procedures, such as flaps and skin grafts, also led to longer stays. The study also found that the prevalence of post-operative complications, such as electrolyte imbalance, was high among burn surgery patients.
- Publication type
- journal article MeSH
The current study assessed the performance of the fully automated RT-PCR-based Idylla™ GeneFusion Assay, which simultaneously covers the advanced non-small cell lung carcinoma (aNSCLC) actionable ALK, ROS1, RET, and MET exon 14 rearrangements, in a routine clinical setting involving 12 European clinical centers. The Idylla™ GeneFusion Assay detects fusions using fusion-specific as well as expression imbalance detection, the latter enabling detection of uncommon fusions not covered by fusion-specific assays. In total, 326 archival aNSCLC formalin-fixed paraffin-embedded (FFPE) samples were included, of which 44% were resected specimens, 46% tissue biopsies, and 9% cytological specimens. With a total of 179 biomarker-positive cases (i.e., 85 ALK, 33 ROS1, 20 RET fusions and 41 MET exon 14 skipping), this is one of the largest fusion-positive datasets ever tested. The results of the Idylla™ GeneFusion Assay were compared with earlier results of routine reference technologies including fluorescence in situ hybridization, immunohistochemistry, reverse-transcription polymerase chain reaction, and next-generation sequencing, establishing a high sensitivity/specificity of 96.1%/99.6% for ALK, 96.7%/99.0% for ROS1, 100%/99.3% for RET fusion, and 92.5%/99.6% for MET exon 14 skipping, and a low failure rate (0.9%). The Idylla™ GeneFusion Assay was found to be a reliable, sensitive, and specific tool for routine detection of ALK, ROS1, RET fusions and MET exon 14 skipping. Given its short turnaround time of about 3 h, it is a time-efficient upfront screening tool in FFPE samples, supporting rapid clinical decision making. Moreover, expression-imbalance-based detection of potentially novel fusions may be easily verified with other routine technologies without delaying treatment initiation.
- MeSH
- anaplastic lymphoma kinase * genetics MeSH
- exons * genetics MeSH
- oncogene proteins, fusion * genetics MeSH
- gene rearrangement MeSH
- in situ hybridization, fluorescence methods MeSH
- humans MeSH
- multiplex polymerase chain reaction MeSH
- biomarkers, tumor genetics analysis MeSH
- lung neoplasms * genetics pathology MeSH
- carcinoma, non-small-cell lung * genetics pathology MeSH
- proto-oncogene proteins c-met * genetics MeSH
- proto-oncogene proteins c-ret * genetics MeSH
- proto-oncogene proteins * genetics MeSH
- protein-tyrosine kinases * genetics MeSH
- Check Tag
- humans MeSH
- Publication type
- journal article MeSH
- evaluation study MeSH
- multicenter study MeSH
Automated sentiment analysis is becoming increasingly recognized due to the growing importance of social media and e-commerce platform review websites. Deep neural networks outperform traditional lexicon-based and machine learning methods by effectively exploiting contextual word embeddings to generate dense document representation. However, this representation model is not fully adequate to capture topical semantics and the sentiment polarity of words. To overcome these problems, a novel sentiment analysis model is proposed that utilizes richer document representations of word-emotion associations and topic models, which is the main computational novelty of this study. The sentiment analysis model integrates word embeddings with lexicon-based sentiment and emotion indicators, including negations and emoticons, and to further improve its performance, a topic modeling component is utilized together with a bag-of-words model based on a supervised term weighting scheme. The effectiveness of the proposed model is evaluated using large datasets of Amazon product reviews and hotel reviews. Experimental results prove that the proposed document representation is valid for the sentiment analysis of product and hotel reviews, irrespective of their class imbalance. The results also show that the proposed model improves on existing machine learning methods.
- MeSH
- algorithms * MeSH
- emotions MeSH
- humans MeSH
- neural networks, computer * MeSH
- semantics MeSH
- machine learning MeSH
- Check Tag
- humans MeSH
- Publication type
- journal article MeSH