Preprocessing
Although metagenomic sequencing is now the preferred technique for studying microbiome-host interactions, analyzing and interpreting microbiome sequencing data presents challenges primarily attributable to the statistical specificities of the data (e.g., sparsity, over-dispersion, compositionality, inter-variable dependency). This mini review explores preprocessing and transformation methods applied in recent human microbiome studies to address these challenges. Our results indicate limited adoption of transformation methods targeting the statistical characteristics of microbiome sequencing data. Instead, relative and normalization-based transformations that do not account for the specific attributes of microbiome data are prevalent. The information on preprocessing and transformations applied to the data before analysis was incomplete or missing in many publications, leading to reproducibility concerns, comparability issues, and questionable results. We hope this mini review will provide researchers and newcomers to the field of human microbiome research with an up-to-date point of reference for various data transformation tools and assist them in choosing the most suitable transformation method based on their research questions, objectives, and data characteristics.
- Keywords
- compositionality, data preprocessing, human microbiome, machine learning, metagenomics data, normalization
- Publication type
- journal articles MeSH
- reviews MeSH
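The compositionality challenge discussed in the abstract above is commonly addressed with the centered log-ratio (CLR) transform. A minimal pure-Python sketch; the pseudocount handling of zeros is an illustrative (and debated) choice, not a recommendation from the review:

```python
import math

def clr(counts, pseudocount=0.5):
    """Centered log-ratio transform of one sample's taxon counts.

    Zeros are replaced by a pseudocount so the logarithms are
    defined -- an illustrative choice for sparse count data.
    """
    x = [c if c > 0 else pseudocount for c in counts]
    log_x = [math.log(v) for v in x]
    g = sum(log_x) / len(log_x)  # log of the geometric mean
    return [lv - g for lv in log_x]

# raw taxon counts for one sample; CLR components always sum to ~0
transformed = clr([120, 0, 30, 850])
```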
The paper examines the use of entropy in the field of web usage mining. Entropy offers an alternative way of determining the ratio of auxiliary pages during session identification with the Reference Length method. The experiment was conducted on two different web portals: the first log file was obtained from the web portal of a virtual learning environment course, and the second from a web portal with anonymous access. A comparison of the entropy-based estimate of the ratio of auxiliary pages with a sitemap-based estimate showed that, given an abundant sitemap, entropy can serve as a full-valued substitute for the sitemap-based estimate of the ratio of auxiliary pages.
- Keywords
- Reference Length, data preprocessing, information entropy, session identification, web usage mining
- Publication type
- journal articles MeSH
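The entropy estimate in the abstract above builds on the Shannon entropy of the page-visit distribution. A minimal sketch; the toy clickstream and its connection to the paper's exact Reference Length formula are illustrative assumptions:

```python
import math
from collections import Counter

def shannon_entropy(page_visits):
    """Shannon entropy (in bits) of the page-visit frequency distribution."""
    counts = Counter(page_visits)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# toy clickstream from one session
visits = ["home", "home", "course", "home", "quiz", "course"]
h = shannon_entropy(visits)
```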
The Vehicular Reference Misbehavior Dataset (VeReMi) is a vital resource for advancing Intelligent Transportation Systems (ITS) and the Internet of Vehicles (IoV). However, its large size (∼7 GB) and inherent class imbalance pose significant challenges for machine learning model development. This paper presents a preprocessing framework to enhance VeReMi's usability and relevance. Through 10% down-sampling, the dataset was reduced to ∼724 MB, making it computationally manageable. Biases were addressed by balancing benign and malicious samples through synthesis and identifying benign instances using predefined criteria. A refined feature set, including key attributes like rcvTime, pos_0, pos_1, and attack_type (renamed attacker_type), was selected to improve machine learning compatibility. This preprocessing pipeline effectively maintains data integrity and preserves the representativeness of malicious patterns. The optimized dataset is well-suited for ITS and IoV applications, such as anomaly detection and network security, underscoring the crucial role of preprocessing in overcoming real-world constraints and enhancing model performance.
- Keywords
- Anomaly detection, Cybersecurity, Data preprocessing, Dataset optimization, Intelligent transportation systems (ITS), Internet of vehicles (IoV), Intrusion detection systems (IDS), Machine learning (ML), Network security, Vehicular reference misbehavior dataset (VeReMi)
- Publication type
- journal articles MeSH
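The down-sampling and class-balancing steps described above can be sketched as follows. The record layout (an attacker_type field with 0 marking benign messages) is an assumption about the dataset, and the paper's synthesis step for malicious samples is replaced here by simple sub-sampling:

```python
import random

def downsample_and_balance(records, frac=0.10, seed=42):
    """Sketch: random 10% down-sampling followed by benign/malicious
    balancing. Field names are illustrative assumptions."""
    rng = random.Random(seed)
    subset = rng.sample(records, max(1, int(len(records) * frac)))
    benign = [r for r in subset if r["attacker_type"] == 0]
    malicious = [r for r in subset if r["attacker_type"] != 0]
    n = min(len(benign), len(malicious))
    # keep equal numbers of benign and malicious messages
    return rng.sample(benign, n) + rng.sample(malicious, n)

# toy data: 900 benign and 100 malicious messages
messages = [{"attacker_type": 0}] * 900 + [{"attacker_type": 1}] * 100
balanced = downsample_and_balance(messages)
```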
An introductory review of hardware aspects of on-line experimental data processing reveals that a specialized (hard-wired) preprocessing unit coupled with a programmable laboratory computer is an optimal setup for an electrophysiological laboratory. The paper describes a proposed modular system that makes it possible to assemble a large number of different preprocessing units. Some practical applications of the preprocessing units coupled with a LINC (D.E.C.) computer are presented in conclusion.
AIMS: Some types of monoclonal gammopathies are typified by a very limited availability of aberrant cells. Modern research uses high-throughput technologies and an integrated approach for detailed characterisation of abnormal cells. This strategy requires relatively high amounts of starting material, which cannot be obtained from every diagnosis without causing inconvenience to the patient. The aim of this methodological paper is to reflect our long experience with laboratory work and describe the best protocols for sample collection, sorting, and further preprocessing in terms of the available number of cells and the intended downstream application in monoclonal gammopathies research. Potential pitfalls are also discussed. METHODS: Comparison and optimisation of freezing and sorting protocols for plasma cells in monoclonal gammopathies, followed by testing of various nucleic acid isolation and amplification techniques, to establish a guideline for sample processing in haemato-oncology research. RESULTS: We show the average numbers of aberrant cells that can be obtained from various monoclonal gammopathies (monoclonal gammopathy of undetermined significance/light chain amyloidosis/multiple myeloma (MM)/MM circulating plasma cells/minimal residual disease MM: 10,123/22,846/305,501/68,641/4,000 aberrant plasma cells out of 48/30/10/16/37×10⁶ bone marrow mononuclear cells) and the expected yield of nucleic acids provided by multiple isolation kits (DNA/RNA yield from 1 to 200×10³ cells was 2.14-427/0.12-123 ng). CONCLUSIONS: The kits tested for parallel isolation deliver outputs comparable with kits specialised for just one type of molecule. We also present our positive experience with the whole genome amplification method, which can serve as a very powerful tool to gain complex information from a very small cell population.
- Keywords
- DNA, HAEMATO-ONCOLOGY, METHODOLOGY, MYELOMA
- MeSH
- DNA isolation & purification MeSH
- blood preservation methods MeSH
- blood banking methods MeSH
- cryopreservation methods MeSH
- humans MeSH
- blood specimen collection methods MeSH
- paraproteinemias blood MeSH
- reagent kits, diagnostic MeSH
- RNA isolation & purification MeSH
- Check Tag
- humans MeSH
- Publication type
- journal articles MeSH
- Substance names
- DNA MeSH
- reagent kits, diagnostic MeSH
- RNA MeSH
MOTIVATION: Meticulous selection of chromatographic peak detection parameters and algorithms is a crucial step in preprocessing liquid chromatography-mass spectrometry (LC-MS) data. However, as mass-to-charge ratio and retention time shifts are larger between batches than within batches, finding apt parameters for all samples of a large-scale multi-batch experiment with the aim of minimizing information loss becomes a challenging task. Preprocessing independent batches individually can curtail said problems but requires a method for aligning and combining them for further downstream analysis. RESULTS: We present two methods for aligning and combining individually preprocessed batches in multi-batch LC-MS experiments. Our developed methods were tested on six sets of simulated and six sets of real datasets. Furthermore, by estimating the probabilities of peak insertion, deletion and swap between batches in authentic datasets, we demonstrate that retention order swaps are not rare in untargeted LC-MS data. AVAILABILITY AND IMPLEMENTATION: kmersAlignment and rtcorrectedAlignment algorithms are made available as an R package with raw data at https://metabocombiner.img.cas.cz. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
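A toy illustration of combining independently preprocessed batches by matching features on m/z and retention-time tolerances. This is a naive greedy stand-in for the alignment step, not the kmersAlignment or rtcorrectedAlignment algorithms from the paper (which also handle retention order swaps):

```python
def match_features(batch_a, batch_b, mz_tol=0.01, rt_tol=30.0):
    """Greedily pair each (mz, rt) feature in batch_a with the closest
    unused feature in batch_b that lies within both tolerances."""
    pairs, used = [], set()
    for i, (mz_a, rt_a) in enumerate(batch_a):
        best_j, best_d = None, None
        for j, (mz_b, rt_b) in enumerate(batch_b):
            if j in used:
                continue
            if abs(mz_a - mz_b) <= mz_tol and abs(rt_a - rt_b) <= rt_tol:
                d = abs(rt_a - rt_b)
                if best_d is None or d < best_d:
                    best_j, best_d = j, d
        if best_j is not None:
            used.add(best_j)
            pairs.append((i, best_j))
    return pairs

# two toy batches of (m/z, retention time) features; only the first
# pair falls within both tolerances
pairs = match_features([(100.0, 60.0), (200.0, 120.0)],
                       [(100.005, 75.0), (200.02, 125.0)])
```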
Electroencephalography (EEG) is among the most widely diffused, inexpensive, and adopted neuroimaging techniques. Nonetheless, EEG requires measurements against one or more reference sites, typically chosen by the experimenter, and specific pre-processing steps precede analyses. It is therefore valuable to obtain quantities that are minimally affected by reference and pre-processing choices. Here, we show that the topological structure of embedding spaces, constructed either from multi-channel EEG timeseries or from their temporal structure, is subject-specific and robust to re-referencing and pre-processing pipelines. By contrast, the shape of correlation spaces (discrete spaces in which each point represents an electrode and the distance between points reflects the correlation between the respective timeseries) was neither significantly subject-specific nor robust to changes of reference. Our results suggest that the shape of the spaces describing the observed configurations of EEG signals holds information about the individual specificity of the underlying brain dynamics, and that temporal correlations constrain to a large degree the set of possible dynamics. In turn, these encode the differences between subjects' spaces of resting-state EEG signals. Finally, our results and proposed methodology provide tools to explore individual topographical landscapes and how they are explored dynamically. We therefore propose to augment conventional topographic analyses with an additional, topological, level of analysis, and to consider them jointly. More generally, these results provide a roadmap for the incorporation of topological analyses within EEG pipelines.
- Keywords
- Computational modelling, Network, Reference electrode, Resting-state electroencephalography, Topography, Topology
- MeSH
- electrodes MeSH
- electroencephalography * methods MeSH
- head MeSH
- humans MeSH
- brain * MeSH
- Check Tag
- humans MeSH
- Publication type
- journal articles MeSH
- grant-supported research MeSH
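One standard way to construct an embedding space from a single channel's timeseries is time-delay embedding. The sketch below shows that generic construction only; it is not necessarily the authors' exact pipeline, which works with multi-channel EEG:

```python
def delay_embed(series, dim=3, tau=2):
    """Time-delay embedding: turn a scalar timeseries into a point
    cloud in R^dim, on which topological summaries can be computed."""
    n_points = len(series) - (dim - 1) * tau
    return [tuple(series[i + k * tau] for k in range(dim))
            for i in range(n_points)]

cloud = delay_embed(list(range(10)))  # toy "signal" 0..9
```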
Crack identification plays an essential role in the health diagnosis of various concrete structures. Among different intelligent algorithms, convolutional neural networks (CNNs) have been demonstrated to be a promising tool capable of efficiently identifying the existence and evolution of concrete cracks by adaptively recognizing crack features from large numbers of concrete surface images. However, the accuracy and versatility of conventional CNNs in crack identification are largely limited by noise in the background of the concrete surface images. The noise originates from highly diverse sources, such as light spots, blurs, and surface roughness, wear, or stains. With the aim of enhancing the accuracy, noise immunity, and versatility of CNN-based crack identification methods, this study establishes a framework for enhanced intelligent identification of concrete cracks, based on a hybrid of conventional CNNs with a multi-layered image preprocessing strategy (MLP), whose key components are homomorphic filtering and the Otsu thresholding method. Relying on the comparison and fine-tuning of classic CNN structures, networks for detecting crack position and identifying crack type are built, trained, and tested on a dataset composed of a large number of concrete crack images. The effectiveness and efficiency of the proposed framework are examined in comparative studies with and without the MLP strategy. Crack identification accuracy under different sources and levels of noise is also investigated.
- Keywords
- concrete crack identification, convolutional neural network, homomorphic filtering, signal processing, structural health monitoring
- Publication type
- journal articles MeSH
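Of the two key MLP components named above, Otsu thresholding is easy to sketch in pure Python (homomorphic filtering is omitted here, as it requires frequency-domain machinery). The toy pixel data below is illustrative:

```python
def otsu_threshold(pixels, levels=256):
    """Otsu's method: choose the grey level that maximizes the
    between-class variance of the intensity histogram, here used
    to binarize an image into crack vs. background pixels."""
    hist = [0] * levels
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(i * h for i, h in enumerate(hist))
    best_t, best_var = 0, -1.0
    w_b = sum_b = 0
    for t in range(levels):
        w_b += hist[t]          # background (below-threshold) weight
        if w_b == 0:
            continue
        w_f = total - w_b       # foreground weight
        if w_f == 0:
            break
        sum_b += t * hist[t]
        m_b = sum_b / w_b
        m_f = (sum_all - sum_b) / w_f
        var_between = w_b * w_f * (m_b - m_f) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t

# a toy bimodal "image": dark crack pixels and bright background
threshold = otsu_threshold([10] * 50 + [200] * 50)
```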
Functional connectivity analysis of resting-state fMRI data has recently become one of the most common approaches to characterizing individual brain function. It has been widely suggested that the functional connectivity matrix is a useful approximate representation of the brain's connectivity, potentially providing behaviorally or clinically relevant markers. However, functional connectivity estimates are known to be detrimentally affected by various artifacts, including those due to in-scanner head motion. Moreover, as individual functional connections generally covary only very weakly with head motion estimates, motion influence is difficult to quantify robustly, and prone to be neglected in practice. Although the use of individual estimates of head motion, or group-level correlation of motion and functional connectivity has been suggested, a sufficiently sensitive measure of individual functional connectivity quality has not yet been established. We propose a new intuitive summary index, Typicality of Functional Connectivity, to capture deviations from standard brain functional connectivity patterns. In a resting-state fMRI dataset of 245 healthy subjects, this measure was significantly correlated with individual head motion metrics. The results were further robustly reproduced across atlas granularity, preprocessing options, and other datasets, including 1,081 subjects from the Human Connectome Project. In principle, Typicality of Functional Connectivity should be sensitive also to other types of artifacts, processing errors, and possibly also brain pathology, allowing extensive use in data quality screening and quantification in functional connectivity studies as well as methodological investigations.
- Keywords
- atlas, functional connectivity, motion, quality, rs-fMRI
- MeSH
- artifacts MeSH
- atlases as topic * MeSH
- datasets as topic * MeSH
- adult MeSH
- head movements MeSH
- connectome * methods standards MeSH
- humans MeSH
- magnetic resonance imaging * methods standards MeSH
- young adult MeSH
- brain diagnostic imaging physiology MeSH
- image processing, computer-assisted * methods standards MeSH
- Check Tag
- adult MeSH
- humans MeSH
- young adult MeSH
- male MeSH
- female MeSH
- Publication type
- journal articles MeSH
- grant-supported research MeSH
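One plausible reading of the proposed index is the Pearson correlation between a subject's functional connectivity (FC) matrix (upper triangle) and the group-mean FC; the exact definition in the paper may differ, so treat this as a sketch:

```python
import math

def upper_triangle(mat):
    """Flatten the strictly upper triangle of a square matrix."""
    n = len(mat)
    return [mat[i][j] for i in range(n) for j in range(i + 1, n)]

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def typicality(subject_fc, group_mean_fc):
    """Correlate a subject's FC upper triangle with the group mean."""
    return pearson(upper_triangle(subject_fc), upper_triangle(group_mean_fc))

fc = [[1.0, 0.5, 0.2],
      [0.5, 1.0, 0.3],
      [0.2, 0.3, 1.0]]
t = typicality(fc, fc)  # a subject identical to the group mean
```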