Data clustering
The Aquila Optimizer (AO) is a newly proposed, highly capable metaheuristic algorithm based on the hunting and search behavior of the Aquila bird. However, the AO faces some challenges when dealing with high-dimensional optimization problems due to its narrow exploration capabilities and a tendency to converge prematurely to local optima, which can decrease its performance in complex scenarios. This paper presents a modified form of the previously proposed AO, the Locality Opposition-Based Learning Aquila Optimizer (LOBLAO), aimed at resolving such issues and improving the performance of tasks related to global optimization and data clustering in particular. The proposed LOBLAO incorporates two key advancements: the Opposition-Based Learning (OBL) strategy, which enhances solution diversity and balances exploration and exploitation, and the Mutation Search Strategy (MSS), which mitigates the risk of local optima and ensures robust exploration of the search space. Comprehensive experiments on benchmark test functions and data clustering problems demonstrate the efficacy of LOBLAO. The results reveal that LOBLAO outperforms the original AO and several state-of-the-art optimization algorithms, showcasing superior performance in tackling high-dimensional datasets. In particular, LOBLAO achieved the best average ranking of 1.625 across multiple clustering problems, underscoring its robustness and versatility. These findings highlight the significant potential of LOBLAO to solve diverse and challenging optimization problems, establishing it as a valuable tool for researchers and practitioners.
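The opposition-based learning step mentioned in this abstract can be illustrated with a minimal sketch. It shows only the basic OBL idea (reflect each candidate inside its bounds and keep the better of the pair); the locality variant used in LOBLAO, the Mutation Search Strategy, and the Aquila update rules are not described in the abstract and are not reproduced here. The names `opposite`, `obl_step` and `sphere`, the bounds, and the toy population are assumptions made for illustration.

```python
import random

def opposite(x, lower, upper):
    """Opposite point of x inside the box [lower, upper], per dimension."""
    return [lo + hi - xi for xi, lo, hi in zip(x, lower, upper)]

def obl_step(population, lower, upper, fitness):
    """For each candidate keep the better (lower fitness) of it and its opposite."""
    kept = []
    for x in population:
        x_opp = opposite(x, lower, upper)
        kept.append(min((x, x_opp), key=fitness))
    return kept

def sphere(x):
    """Toy objective to minimise (assumption: any benchmark function would do)."""
    return sum(v * v for v in x)

# Hypothetical, deliberately asymmetric bounds so the opposite point differs in fitness.
lower, upper = [-10.0, -10.0], [30.0, 30.0]
rng = random.Random(0)
population = [[rng.uniform(lo, hi) for lo, hi in zip(lower, upper)] for _ in range(5)]
improved = obl_step(population, lower, upper, sphere)
```

Because each candidate is compared with its own opposite, the step can never make an individual worse under the chosen objective.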
- Keywords
- Aquila optimizer, Data clustering problems, Meta-heuristics optimization algorithms, Opposition-based learning, Optimization problems
- Publication type
- Journal Article MeSH
BACKGROUND AND OBJECTIVES: Recent studies fueled doubts as to whether all currently defined central disorders of hypersomnolence are stable entities, especially narcolepsy type 2 and idiopathic hypersomnia. New reliable biomarkers are needed, and the question arises of whether current diagnostic criteria of hypersomnolence disorders should be reassessed. The main aim of this data-driven observational study was to see whether data-driven algorithms would segregate narcolepsy type 1 and identify more reliable subgrouping of individuals without cataplexy with new clinical biomarkers. METHODS: We used agglomerative hierarchical clustering, an unsupervised machine learning algorithm, to identify distinct hypersomnolence clusters in the large-scale European Narcolepsy Network database. We included 97 variables, covering all aspects of central hypersomnolence disorders such as symptoms, demographics, objective and subjective sleep measures, and laboratory biomarkers. We specifically focused on subgrouping of patients without cataplexy. The number of clusters was chosen to be the minimal number for which patients without cataplexy were put in distinct groups. RESULTS: We included 1,078 unmedicated adolescents and adults. Seven clusters were identified, of which 4 clusters included predominantly individuals with cataplexy. The 2 most distinct clusters consisted of 158 and 157 patients, were dominated by those without cataplexy, and among other variables, significantly differed in presence of sleep drunkenness, subjective difficulty awakening, and weekend-week sleep length difference. Patients formally diagnosed as having narcolepsy type 2 and idiopathic hypersomnia were evenly mixed in these 2 clusters. DISCUSSION: Using a data-driven approach in the largest study on central disorders of hypersomnolence to date, our study identified distinct patient subgroups within the central disorders of hypersomnolence population. 
Our results contest inclusion of sleep-onset REM periods in diagnostic criteria for people without cataplexy and provide promising new variables for reliable diagnostic categories that better resemble different patient phenotypes. Cluster-guided classification will result in a more solid hypersomnolence classification system that is less vulnerable to instability of single features.
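As a rough illustration of agglomerative hierarchical clustering, the sketch below repeatedly merges the two closest clusters until a requested number remains. Average linkage, Euclidean distance, and the six toy points are assumptions; the abstract does not state the linkage or distance used, and the study worked with 97 mixed variables rather than 2-D points.

```python
from itertools import combinations
from math import dist

def average_linkage(a, b, points):
    """Mean pairwise distance between two clusters given as lists of point indices."""
    return sum(dist(points[i], points[j]) for i in a for j in b) / (len(a) * len(b))

def agglomerate(points, n_clusters):
    """Bottom-up clustering: start from singletons, merge the closest pair each round."""
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > n_clusters:
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda pair: average_linkage(clusters[pair[0]], clusters[pair[1]], points),
        )
        clusters[a] = clusters[a] + clusters[b]  # a < b, so deleting b keeps a valid
        del clusters[b]
    return clusters

# Hypothetical toy data: two tight groups and one isolated point.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (20, 0)]
clusters = agglomerate(points, n_clusters=3)
```

In the study the stopping point was chosen by a domain criterion (the smallest number of clusters that separated patients without cataplexy into distinct groups), which corresponds to picking `n_clusters` after inspecting the merge sequence rather than fixing it in advance.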
- MeSH
- Idiopathic Hypersomnia * diagnosis MeSH
- Cataplexy * diagnosis MeSH
- Humans MeSH
- Adolescent MeSH
- Narcolepsy * diagnosis drug therapy MeSH
- Disorders of Excessive Somnolence * diagnosis epidemiology MeSH
- Cluster Analysis MeSH
- Check Tag
- Humans MeSH
- Adolescent MeSH
- Publication type
- Journal Article MeSH
- Observational Study MeSH
- Research Support, Non-U.S. Gov't MeSH
Myeloid-derived suppressor cells (MDSCs) are important regulators of immune processes during sepsis in mice. However, confirming these observations in humans has been challenging due to the lack of defined preparation protocols and phenotyping schemes for MDSC subsets. Thus, it remains unclear how MDSCs are involved in acute sepsis and whether they have a role in the long-term complications seen in survivors. Here, we combined comprehensive flow cytometry phenotyping with unsupervised clustering using self-organizing maps to identify the three recently defined human MDSC subsets in blood from severe sepsis patients, long-term sepsis survivors, and age-matched controls. We demonstrated the expansion of monocytic M-MDSCs and polymorphonuclear PMN-MDSCs, but not early-stage (e)-MDSCs during acute sepsis. High levels of PMN-MDSCs were also present in long-term survivors many months after discharge, suggesting a possible role in sepsis-related complications. Altogether, by employing unsupervised clustering of flow cytometric data we have confirmed the likely involvement of human MDSC subsets in acute sepsis, and revealed their expansion in sepsis survivors at late time points. The application of this strategy in future studies and in the clinical/diagnostic context would enable rapid progress toward a full understanding of the roles of MDSC in sepsis and other inflammatory conditions.
- Keywords
- Flow cytometry, Multidimensional clustering, Myeloid-derived suppressor cells, Sepsis, Septic shock
- MeSH
- Adult MeSH
- Middle Aged MeSH
- Humans MeSH
- Monocytes immunology MeSH
- Myeloid-Derived Suppressor Cells immunology MeSH
- Flow Cytometry methods MeSH
- Aged, 80 and over MeSH
- Aged MeSH
- Sepsis immunology MeSH
- Cluster Analysis MeSH
- Inflammation immunology MeSH
- Check Tag
- Adult MeSH
- Middle Aged MeSH
- Humans MeSH
- Male MeSH
- Aged, 80 and over MeSH
- Aged MeSH
- Female MeSH
- Publication type
- Journal Article MeSH
- Research Support, Non-U.S. Gov't MeSH
BACKGROUND: The investigation of plant genome structure and evolution requires comprehensive characterization of repetitive sequences that make up the majority of higher plant nuclear DNA. Since genome-wide characterization of repetitive elements is complicated by their high abundance and diversity, novel approaches based on massively-parallel sequencing are being adapted to facilitate the analysis. It has recently been demonstrated that the low-pass genome sequencing provided by a single 454 sequencing reaction is sufficient to capture information about all major repeat families, thus providing the opportunity for efficient repeat investigation in a wide range of species. However, the development of appropriate data mining tools is required in order to fully utilize this sequencing data for repeat characterization. RESULTS: We adapted a graph-based approach for similarity-based partitioning of whole genome 454 sequence reads in order to build clusters made of the reads derived from individual repeat families. The information about cluster sizes was utilized for assessing the proportion and composition of repeats in the genomes of two model species, Pisum sativum and Glycine max, differing in genome size and 454 sequencing coverage. Moreover, statistical analysis and visual inspection of the topology of the cluster graphs using a newly developed program tool, SeqGrapheR, were shown to be helpful in distinguishing basic types of repeats and investigating sequence variability within repeat families. CONCLUSIONS: Repetitive regions of plant genomes can be efficiently characterized by the presented graph-based analysis and the graph representation of repeats can be further used to assess the variability and evolutionary divergence of repeat families, discover and characterize novel elements, and aid in subsequent assembly of their consensus sequences.
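The idea of partitioning reads by similarity and reading repeat proportions off the cluster sizes can be sketched as follows. This is a simplified stand-in, not the authors' pipeline: similarity is approximated here by the Jaccard index of shared k-mers, and clusters are taken as connected components of the thresholded similarity graph. The function names, `k`, the threshold, and the toy reads are all assumptions.

```python
from itertools import combinations

def kmers(seq, k=4):
    """Set of all length-k substrings of a read."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def jaccard(a, b):
    """Jaccard index of two k-mer sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def cluster_reads(reads, k=4, threshold=0.3):
    """Connect reads whose k-mer similarity passes the threshold; return components."""
    profiles = [kmers(r, k) for r in reads]
    parent = list(range(len(reads)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(reads)), 2):
        if jaccard(profiles[i], profiles[j]) >= threshold:
            parent[find(i)] = find(j)

    components = {}
    for i in range(len(reads)):
        components.setdefault(find(i), []).append(i)
    return sorted(components.values(), key=len, reverse=True)

# Hypothetical toy reads: three from one repeat-like sequence, two from another.
reads = ["ACGTACGTAC", "CGTACGTACG", "GTACGTACGT", "TTTTGGGGCC", "TTGGGGCCAA"]
clusters = cluster_reads(reads)
proportions = [len(c) / len(reads) for c in clusters]
```

With low-pass sequencing, the fraction of reads falling into a cluster serves as an estimate of the genome proportion occupied by that repeat family, which is what `proportions` mimics on the toy input.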
- MeSH
- DNA, Plant genetics MeSH
- Genome, Plant MeSH
- Glycine max genetics MeSH
- Pisum sativum genetics MeSH
- Chromosome Mapping MeSH
- Repetitive Sequences, Nucleic Acid * MeSH
- Sequence Analysis, DNA * MeSH
- Cluster Analysis MeSH
- Publication type
- Journal Article MeSH
- Research Support, Non-U.S. Gov't MeSH
- Substances
- DNA, Plant MeSH
BACKGROUND: Antineutrophil cytoplasmic antibody (ANCA)-associated vasculitis is a heterogeneous autoimmune disease. While traditionally stratified into two conditions, granulomatosis with polyangiitis (GPA) and microscopic polyangiitis (MPA), the subclassification of ANCA-associated vasculitis is subject to continued debate. Here we aim to identify phenotypically distinct subgroups and develop a data-driven subclassification of ANCA-associated vasculitis, using a large real-world dataset. METHODS: In the collaborative data reuse project FAIRVASC (Findable, Accessible, Interoperable, Reusable, Vasculitis), registry records of patients with ANCA-associated vasculitis were retrieved from six European vasculitis registries: the Czech Registry of ANCA-associated vasculitis (Czech Republic), the French Vasculitis Study Group Registry (FVSG; France), the Joint Vasculitis Registry in German-speaking Countries (GeVas; Germany), the Polish Vasculitis Registry (POLVAS; Poland), the Irish Rare Kidney Disease Registry (RKD; Ireland), and the Skåne Vasculitis Cohort (Sweden). We performed model-based clustering of 17 mixed-type clinical variables using a parsimonious mixture of two latent Gaussian variable models. Clinical validation of the optimal cluster solution was made through summary statistics of the clusters' demography, phenotypic and serological characteristics, and outcome. The predictive value of models featuring the cluster affiliations was compared with classifications based on clinical diagnosis and ANCA specificity. People with lived experience were involved throughout the FAIRVASC project. FINDINGS: A total of 3868 patients diagnosed with ANCA-associated vasculitis between Nov 1, 1966, and March 1, 2023, were included in the study across the six registries (Czech Registry n=371, FVSG n=1780, GeVas n=135, POLVAS n=792, RKD n=439, and Skåne Vasculitis Cohort n=351). There were 2434 (62·9%) patients with GPA and 1434 (37·1%) with MPA. 
Mean age at diagnosis was 57·2 years (SD 16·4); 2006 (51·9%) of 3867 patients were men and 1861 (48·1%) were women. We identified five clusters, with distinct phenotype, biochemical presentation, and disease outcome. Three clusters were characterised by kidney involvement: one severe kidney cluster (555 [14·3%] of 3868 patients) with high C-reactive protein (CRP) and serum creatinine concentrations, and variable ANCA specificity (SK cluster); one myeloperoxidase (MPO)-ANCA-positive kidney involvement cluster (782 [20·2%]) with limited extrarenal disease (MPO-K cluster); and one proteinase 3 (PR3)-ANCA-positive kidney involvement cluster (683 [17·7%]) with widespread extrarenal disease (PR3-K cluster). Two clusters were characterised by relative absence of kidney involvement: one was a predominantly PR3-ANCA-positive cluster (1202 [31·1%]) with inflammatory multisystem disease (IMS cluster), and one was a cluster (646 [16·7%]) with predominantly ear-nose-throat involvement and low CRP, with mainly younger patients (YR cluster). Compared with models fitted with clinical diagnosis or ANCA status, cluster-assigned models demonstrated improved predictive power with respect to both patient and kidney survival. INTERPRETATION: Our study reinforces the view that ANCA-associated vasculitis is not merely a binary construct. Data-driven subclassification of ANCA-associated vasculitis exhibits higher predictive value than current approaches for key outcomes. FUNDING: European Union's Horizon 2020 research and innovation programme under the European Joint Programme on Rare Diseases.
- MeSH
- Anti-Neutrophil Cytoplasmic Antibody-Associated Vasculitis * classification diagnosis epidemiology blood immunology MeSH
- Adult MeSH
- Cohort Studies MeSH
- Middle Aged MeSH
- Humans MeSH
- Microscopic Polyangiitis classification epidemiology blood diagnosis immunology MeSH
- Registries * statistics & numerical data MeSH
- Aged MeSH
- Cluster Analysis MeSH
- Check Tag
- Adult MeSH
- Middle Aged MeSH
- Humans MeSH
- Male MeSH
- Aged MeSH
- Female MeSH
- Publication type
- Journal Article MeSH
- Geographic names
- Europe epidemiology MeSH
The availability of a great range of prior biological knowledge about the roles and functions of genes and gene-gene interactions allows us to simplify the analysis of gene expression data to make it more robust, compact, and interpretable. Here, we objectively analyze the applicability of functional clustering for the identification of groups of functionally related genes. The analysis is performed in terms of gene expression classification and uses predictive accuracy as an unbiased performance measure. Features of biological samples that originally corresponded to genes are replaced by features that correspond to the centroids of the gene clusters and are then used for classifier learning. Using 10 benchmark data sets, we demonstrate that functional clustering significantly outperforms random clustering without biological relevance. We also show that functional clustering performs comparably to gene expression clustering, which groups genes according to the similarity of their expression profiles. Finally, the suitability of functional clustering as a feature extraction technique is evaluated and discussed.
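The feature replacement described in this abstract (per-gene features replaced by gene-cluster centroids) can be sketched in a few lines. The gene identifiers, cluster names, and expression values below are hypothetical, and the centroid is taken as the plain mean of the member genes' expression in one sample, which is an assumption about how the centroid is computed.

```python
def centroid_features(sample, gene_clusters):
    """Replace per-gene expression values with one mean value per gene cluster."""
    return [
        sum(sample[gene] for gene in genes) / len(genes)
        for genes in gene_clusters.values()
    ]

# Hypothetical expression profile of one biological sample.
sample = {"g1": 2.0, "g2": 4.0, "g3": 1.0, "g4": 3.0, "g5": 5.0}

# Hypothetical functional grouping, e.g. derived from shared annotations.
gene_clusters = {"pathway_A": ["g1", "g2"], "pathway_B": ["g3", "g4", "g5"]}

features = centroid_features(sample, gene_clusters)
```

The resulting vector has one entry per cluster instead of one per gene, so any classifier trained on it works with a far smaller and more interpretable feature space.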
Markov random walks (MRW) have proven to be an effective way to understand spectral clustering and embedding. However, because it lacks a global structural measure, the conventional MRW (e.g., the Gaussian kernel MRW) cannot handle data points drawn from a mixture of subspaces. In this paper, we introduce a regularized MRW learning model, using a low-rank penalty to constrain the global subspace structure, for subspace clustering and estimation. In our framework, both the local pairwise similarity and the global subspace structure can be learnt from the transition probabilities of MRW. We prove that under some suitable conditions, our proposed local/global criteria can exactly capture the multiple subspace structure and learn a low-dimensional embedding for the data, which gives the true segmentation of subspaces. To improve robustness in real situations, we also propose an extension of the MRW learning model based on integrating transition matrix learning and error correction in a unified framework. Experimental results on both synthetic data and real applications demonstrate that our proposed MRW learning model and its robust extension outperform the state-of-the-art subspace clustering methods.
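For orientation, the conventional Gaussian kernel MRW that this abstract takes as its baseline can be sketched as a row-normalised affinity matrix. This is only the baseline construction; the low-rank regularised model and its robust extension are not specified in the abstract and are not attempted here. The bandwidth `sigma`, the zeroed diagonal, and the toy points are assumptions.

```python
from math import dist, exp

def gaussian_mrw(points, sigma=1.0):
    """Row-stochastic transition matrix P = D^-1 W built from a Gaussian kernel W."""
    n = len(points)
    weights = [
        [
            0.0 if i == j  # no self-transitions (a modelling choice, not a requirement)
            else exp(-dist(points[i], points[j]) ** 2 / (2 * sigma ** 2))
            for j in range(n)
        ]
        for i in range(n)
    ]
    # Dividing each row by its sum turns affinities into transition probabilities.
    return [[w / sum(row) for w in row] for row in weights]

# Hypothetical toy data: two nearby points and one distant point.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0)]
P = gaussian_mrw(points)
```

Each entry `P[i][j]` depends only on the pairwise distance between points `i` and `j`, which is the purely local behaviour that the abstract contrasts with a global subspace structure.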
- Keywords
- Dimensionality reduction, Markov random walks, Spectral clustering, Subspace clustering and estimation, Transition probability learning
- MeSH
- Algorithms MeSH
- Emotions physiology MeSH
- Humans MeSH
- Limbic System physiology MeSH
- Models, Neurological MeSH
- Neural Networks, Computer * MeSH
- Pattern Recognition, Automated methods MeSH
- Cluster Analysis MeSH
- Models, Theoretical MeSH
- Learning MeSH
- Artificial Intelligence MeSH
- Animals MeSH
- Check Tag
- Humans MeSH
- Animals MeSH
- Publication type
- Journal Article MeSH
- Research Support, Non-U.S. Gov't MeSH
- Review MeSH
PURPOSE: Chronic obstructive pulmonary disease (COPD) is a prevalent and preventable condition that typically worsens over time. Acute exacerbations of COPD significantly impact disease progression, underscoring the importance of prevention efforts. This observational study aimed to achieve two main objectives: (1) identify patients at risk of exacerbations using an ensemble of clustering algorithms, and (2) classify patients into distinct clusters based on disease severity. METHODS: Data from portable medical devices were analyzed post-hoc using hyperparameter optimization with Self-Organizing Maps (SOM), Density-Based Spatial Clustering of Applications with Noise (DBSCAN), Isolation Forest, and Support Vector Machine (SVM) algorithms, to detect flare-ups. Principal Component Analysis (PCA) followed by KMeans clustering was applied to categorize patients by severity. RESULTS: Twenty-five patients were included in the study population; data from 17 patients had the required reliability. Five patients were identified in the highest deterioration group, with one clinically confirmed exacerbation accurately detected by our ensemble algorithm. Then, PCA and KMeans clustering grouped patients into three clusters based on severity: Cluster 0 started with the least severe characteristics but experienced decline, Cluster 1 consistently showed the most severe characteristics, and Cluster 2 showed slight improvement. CONCLUSION: Our approach effectively identified patients at risk of exacerbations and classified them by disease severity. Although promising, the approach would need to be verified on a larger sample with a larger number of recorded clinically verified exacerbations.
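The severity grouping step can be illustrated with a bare-bones k-means (Lloyd's algorithm). The sketch assumes the patients have already been projected onto two principal components, so the PCA step is omitted; the coordinates are invented, and the naive initialisation from the first k points is an assumption chosen to keep the example deterministic.

```python
from math import dist

def kmeans(points, k, iterations=20):
    """Lloyd's algorithm: alternate nearest-centroid assignment and centroid update."""
    centroids = [points[i] for i in range(k)]  # naive init: first k points (assumption)
    labels = [0] * len(points)
    for _ in range(iterations):
        labels = [min(range(k), key=lambda c: dist(p, centroids[c])) for p in points]
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # leave an empty cluster's centroid where it was
                centroids[c] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return labels, centroids

# Hypothetical patients expressed as (PC1, PC2) scores.
points = [(0.0, 0.1), (5.0, 5.1), (9.8, 0.2), (0.2, 0.0), (5.2, 4.9), (10.0, 0.0)]
labels, centroids = kmeans(points, k=3)
```

Cluster numbers produced this way are arbitrary identifiers; mapping them to "least severe" or "most severe" requires inspecting the clinical variables behind each group, as the abstract does.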
- Keywords
- COPD, Clustering, Data analysis, Machine learning
- Publication type
- Journal Article MeSH
BACKGROUND: The amount of data generated in large clinical and phenotyping studies that use single-cell cytometry is constantly growing. Recent technological advances allow the easy generation of data with hundreds of millions of single-cell data points with >40 parameters, originating from thousands of individual samples. The analysis of that amount of high-dimensional data places heavy demands on both the hardware and the software of high-performance computational resources. Current software tools often do not scale to datasets of such size; users are thus forced to downsample the data to bearable sizes, in turn losing accuracy and the ability to detect many underlying complex phenomena. RESULTS: We present GigaSOM.jl, a fast and scalable implementation of clustering and dimensionality reduction for flow and mass cytometry data. The implementation of GigaSOM.jl in the high-level and high-performance programming language Julia makes it accessible to the scientific community and allows for efficient handling and processing of datasets with billions of data points using distributed computing infrastructures. We describe the design of GigaSOM.jl, measure its performance and horizontal scaling capability, and showcase the functionality on a large dataset from a recent study. CONCLUSIONS: GigaSOM.jl facilitates the use of commonly available high-performance computing resources to process the largest available datasets within minutes, while producing results of the same quality as the current state-of-the-art software. Measurements indicate that the performance scales to much larger datasets. The example use on the data from a massive mouse phenotyping effort confirms the applicability of GigaSOM.jl to huge-scale studies.
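One common way to make self-organizing map training distributable is the batch formulation, in which each data chunk contributes partial sums that are added together before the codebook is updated. The sketch below shows that idea in plain Python on a 1-D map; it is not GigaSOM.jl's API, and whether it matches the package's internals is not stated in the abstract. The function names, the Gaussian neighbourhood, the radius, and the toy data are assumptions.

```python
from math import dist, exp

def partial_sums(chunk, codebook, radius):
    """Per-chunk numerators and denominators of the batch SOM update."""
    n, d = len(codebook), len(codebook[0])
    num = [[0.0] * d for _ in range(n)]
    den = [0.0] * n
    for x in chunk:
        bmu = min(range(n), key=lambda i: dist(x, codebook[i]))  # best-matching unit
        for i in range(n):
            h = exp(-((i - bmu) ** 2) / (2 * radius ** 2))  # neighbourhood on a 1-D grid
            den[i] += h
            for j in range(d):
                num[i][j] += h * x[j]
    return num, den

def batch_epoch(chunks, codebook, radius):
    """One batch epoch: combine per-chunk sums, then recompute every codebook vector."""
    n, d = len(codebook), len(codebook[0])
    num = [[0.0] * d for _ in range(n)]
    den = [0.0] * n
    for chunk in chunks:  # each call is independent and could run on a separate worker
        c_num, c_den = partial_sums(chunk, codebook, radius)
        for i in range(n):
            den[i] += c_den[i]
            for j in range(d):
                num[i][j] += c_num[i][j]
    return [[num[i][j] / den[i] for j in range(d)] for i in range(n)]

# Hypothetical toy data and a two-node map.
data = [(0.0, 0.0), (0.1, 0.2), (1.0, 1.0), (0.9, 1.1)]
codebook = [(0.0, 0.5), (1.0, 0.5)]
whole = batch_epoch([data], codebook, radius=0.5)
split = batch_epoch([data[:2], data[2:]], codebook, radius=0.5)
```

Because the update depends only on sums over data points, splitting the data across chunks gives the same codebook up to floating-point rounding, which is what makes the scheme suitable for horizontal scaling.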
- Keywords
- Julia, clustering, dimensionality reduction, high-performance computing, self-organizing maps, single-cell cytometry
- MeSH
- Algorithms * MeSH
- Mice MeSH
- Programming Languages * MeSH
- Cluster Analysis MeSH
- Software MeSH
- Animals MeSH
- Check Tag
- Mice MeSH
- Animals MeSH
- Publication type
- Journal Article MeSH
- Research Support, Non-U.S. Gov't MeSH
We propose a novel, transparent and very simple algorithm to analyze middle-range correlations in genomic nucleotide sequences. Analysis by this algorithm of the EMBL Nucleotide Sequence Database demonstrates that all four nucleotides cluster in the genomic nucleotide sequences of eukaryotes on the scale of several hundred base pairs. In prokaryotes, the clustering is weak but still evident. The non-dominant three bases are deficient in the clusters, while A is the most deficient nucleotide in the clusters of C, and vice versa, and G is the most deficient nucleotide in the clusters of T, and vice versa. The algorithm also detects CG islands, extending over 1 kb, in vertebrate sequences. In plants, the CG islands are shown to be much smaller, if they exist at all. A clustering tendency is also exhibited by the TA doublet. Other doublets do not cluster. We observe no strong correlation between nucleotides separated in genomes by > 1 kb.
- MeSH
- Algorithms * MeSH
- DNA MeSH
- Genome * MeSH
- Humans MeSH
- Molecular Sequence Data MeSH
- Nucleotides genetics MeSH
- Base Sequence MeSH
- Animals MeSH
- Check Tag
- Humans MeSH
- Animals MeSH
- Publication type
- Journal Article MeSH
- Research Support, Non-U.S. Gov't MeSH
- Substances
- DNA MeSH
- Nucleotides MeSH