Data pre-processing
Spectroscopic data often contain artifacts or noise related to the sample characteristics, instrumental variations, or experimental design flaws. Therefore, classifying the raw data is not recommended and might lead to biased results. Nevertheless, most issues may be addressed through appropriate data pre-processing. Effective pre-processing is particularly crucial in critical applications like liquid biopsy for disease detection, where even minor performance improvements may impact patient outcomes. Unfortunately, there is no consensus regarding optimal pre-processing, which complicates cross-study comparisons. This study presents a comprehensive evaluation of various pre-processing methods and their combinations to assess their influence on classification results. The goal was to determine whether particular pre-processing methods are associated with better classification performance and to find an optimal strategy for the given data. Data from Raman optical activity and infrared and Raman spectroscopy were processed by applying tens of thousands of possible pre-processing pipelines. The resulting data were classified using three algorithms to distinguish between subjects with liver cirrhosis and those who had developed hepatocellular carcinoma. The results showed that certain pre-processing methods frequently appeared in the best-performing pipelines, such as the Rolling Ball for correcting the baseline of Raman spectra or the Doubly Reweighted Penalized Least Squares and Mixture model in the case of Raman optical activity. On the other hand, the choice of filtering and/or normalization approach usually did not have a significant impact. Nonetheless, the composition of the top-scoring pipelines also depended on the classifier used. The best pipelines yielded an AUROC of 0.775-0.823, varying with the evaluated spectroscopic data and classifier.
- Keywords
- Chiroptical spectroscopy, Classification, Data pre-processing, Diagnostics, Liquid biopsy, Machine learning, Vibrational spectroscopy
- MeSH
- Algorithms MeSH
- Carcinoma, Hepatocellular * diagnosis pathology MeSH
- Liver Cirrhosis diagnosis pathology MeSH
- Humans MeSH
- Least-Squares Analysis MeSH
- Liver Neoplasms * diagnosis pathology MeSH
- Spectrum Analysis, Raman * methods MeSH
- Spectrophotometry, Infrared methods MeSH
- Liquid Biopsy methods MeSH
- Check Tag
- Humans MeSH
- Publication type
- Journal Article MeSH
BACKGROUND: Red wine is a common target of fraud owing to its high market value and popularity. Although there has been much effort to assess the geographical and varietal origin of wine, this is not the case for wine vintage. Vintage is a crucial parameter for the market price, especially in the case of reputable wines. Considering the season-to-season variations affecting wine quality and the increasingly unstable climatic conditions due to climate change, developing analytical strategies to accurately assess wine vintage is topical and of high interest. RESULTS: In this study, we successfully employed ultraviolet-visible spectroscopy, fluorescence spectroscopy and mid-infrared spectroscopy to identify the vintage of a protected designation of origin red wine produced during four different vintages (n = 36). Class-based clustering and strong discriminatory performance were achieved for the majority of the developed multivariate models, and the impact of the applied spectral pre-processing was significant. Importantly, the tested scatter correction methods resulted in the best cross-validation parameters (goodness of fit, R2Y > 0.9 and goodness of prediction, Q2Y > 0.8) with calculated recognition and prediction abilities in the range 77-100% and 65-96%, respectively, when using partial least squares discriminant analysis. In addition, in the case of fluorescence spectroscopy, a batch effect was revealed, which was compensated by the spectral pre-processing methods. Spectral feature selection was performed in all cases to use only the analytically important spectral signals and avoid model overfitting. CONCLUSIONS: The developed method is simple, cost-efficient and non-destructive, indicating its high potential for industrial applications as a rapid screening tool. © 2025 The Author(s). Journal of the Science of Food and Agriculture published by John Wiley & Sons Ltd on behalf of Society of Chemical Industry.
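The abstract above does not name the scatter correction methods that were tested. As an illustration only, the sketch below shows standard normal variate (SNV), one widely used scatter correction, under the assumption that each spectrum is a plain list of intensities; the function name and the sample values are hypothetical.

```python
import statistics


def snv(spectrum):
    """Standard normal variate: centre and scale one spectrum by its own statistics."""
    mean = statistics.fmean(spectrum)
    std = statistics.stdev(spectrum)  # sample standard deviation (n - 1)
    if std == 0:
        raise ValueError("a constant spectrum cannot be scaled")
    return [(v - mean) / std for v in spectrum]


# Hypothetical spectrum and a copy distorted by a multiplicative and an additive effect.
base = [0.2, 0.5, 0.9, 0.4, 0.3]
scaled = [2.0 * v + 0.1 for v in base]

snv_base = snv(base)
snv_scaled = snv(scaled)
```

Both versions map to the same corrected spectrum, which is why SNV-type corrections can remove per-sample offset and scaling differences before a model such as partial least squares discriminant analysis is fitted.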
- Keywords
- absorption spectroscopy, attenuated total reflectance Fourier transform infrared spectroscopy, chemometrics, spectral pre‐processing, wine authenticity
- MeSH
- Discriminant Analysis MeSH
- Spectrometry, Fluorescence methods MeSH
- Seasons MeSH
- Spectrum Analysis * methods MeSH
- Wine * analysis MeSH
- Vitis * chemistry growth & development MeSH
- Publication type
- Journal Article MeSH
- Evaluation Study MeSH
Photonic signals are broadly exploited in communication and sensing and they typically exhibit Poisson-like statistics. In a common scenario where the intensity of the photonic signals is low and one needs to remove a nonstationary trend of the signals for any further analysis, one faces an obstacle: due to the dependence between the mean and variance typical for a Poisson-like process, information about the trend remains in the variance even after the trend has been subtracted, possibly yielding artifactual results in further analyses. Commonly available detrending or normalizing methods cannot cope with this issue. To alleviate this issue we developed a suitable pre-processing method for the signals that originate from a Poisson-like process. In this paper, a Poisson pre-processing method for nonstationary time series with Poisson distribution is developed and tested on computer-generated model data and experimental data of chemiluminescence from human neutrophils and mung seeds. The presented method transforms a nonstationary Poisson signal into a stationary signal with a Poisson distribution while preserving the type of photocount distribution and phase-space structure of the signal. The importance of the suggested pre-processing method is shown in Fano factor and Hurst exponent analysis of both computer-generated model signals and experimental photonic signals. It is demonstrated that our pre-processing method is superior to standard detrending-based methods whenever further signal analysis is sensitive to variance of the signal.
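The obstacle described in the abstract above, that the variance of a Poisson-like signal still carries the trend after the trend is subtracted, can be reproduced with a short simulation. This is a sketch of the problem only, not of the authors' pre-processing method; the rate profile, the seed and the helper names are assumptions made for illustration.

```python
import math
import random
import statistics


def poisson_sample(rate, rng):
    """Draw one Poisson count with Knuth's multiplication method (fine for small rates)."""
    limit = math.exp(-rate)
    count, product = 0, rng.random()
    while product > limit:
        count += 1
        product *= rng.random()
    return count


def fano_factor(counts):
    """Variance-to-mean ratio; close to 1 for a stationary Poisson process."""
    return statistics.pvariance(counts) / statistics.fmean(counts)


rng = random.Random(0)
n = 4000

# Nonstationary signal: the rate rises linearly from 2 to 50 counts per bin.
rates = [2.0 + 48.0 * i / (n - 1) for i in range(n)]
counts = [poisson_sample(r, rng) for r in rates]

# Naive detrending: subtract the (here exactly known) trend.
detrended = [c - r for c, r in zip(counts, rates)]

quarter = n // 4
var_start = statistics.pvariance(detrended[:quarter])
var_end = statistics.pvariance(detrended[-quarter:])

# Reference: a stationary Poisson signal with a constant rate.
stationary = [poisson_sample(10.0, rng) for _ in range(n)]
fano_stationary = fano_factor(stationary)
```

Even with a perfectly known trend, the variance in the last quarter of the detrended signal remains several times larger than in the first quarter, because the variance of a Poisson variable equals its mean. Variance-sensitive measures such as the Fano factor or the Hurst exponent would therefore still see the trend.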
- MeSH
- Photons * MeSH
- Humans MeSH
- Neutrophils metabolism MeSH
- Computer Simulation MeSH
- Poisson Distribution * MeSH
- Check Tag
- Humans MeSH
- Publication type
- Journal Article MeSH
Schizophrenia is a severe neuropsychiatric disease whose diagnosis, unfortunately, lacks an objective diagnostic tool supporting a thorough psychiatric examination of the patient. We took advantage of today's computational abilities, structural magnetic resonance imaging, and modern machine learning methods, such as stacked autoencoders (SAE) and 3D convolutional neural networks (3D CNN), and trained these models to classify 52 patients with schizophrenia and 52 healthy controls. The main aim of this study was to explore whether complex feature extraction methods can help improve the accuracy of deep learning-based classifiers compared to minimally preprocessed data. Our experiments employed three commonly used preprocessing steps to extract three different feature types. They included voxel-based morphometry (VBM), deformation-based morphometry, and simple spatial normalization of brain tissue. In addition to classifier models, features and their combination, other model parameters such as network depth, number of neurons, number of convolutional filters, and input data size were also investigated. Autoencoders were trained on feature pools of 1000 and 5000 voxels selected by Mann-Whitney tests, and 3D CNNs were trained on whole images. The most successful model architecture (autoencoders) achieved the highest average accuracy of 69.62% (sensitivity 68.85%, specificity 70.38%). The results of all experiments were statistically compared (the Mann-Whitney test). In conclusion, SAE outperformed 3D CNN, while preprocessing using VBM helped SAE improve the results.
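The voxel pools mentioned in the abstract above were selected by Mann-Whitney tests. The sketch below shows one way such a selection could work: it ranks features by how far the U statistic lies from its expected value under the null hypothesis, which for fixed group sizes orders features in roughly the same way as the p-value (tie corrections are ignored). The function names and the tiny data set are hypothetical; the study's exact procedure is not given in the abstract.

```python
def mann_whitney_u(group_a, group_b):
    """U statistic of group_a versus group_b; each tie counts one half."""
    u = 0.0
    for a in group_a:
        for b in group_b:
            if a > b:
                u += 1.0
            elif a == b:
                u += 0.5
    return u


def select_features(patients, controls, k):
    """Return the indices of the k features whose U deviates most from its null mean."""
    n_features = len(patients[0])
    null_mean = len(patients) * len(controls) / 2.0
    scores = []
    for j in range(n_features):
        u = mann_whitney_u([row[j] for row in patients],
                           [row[j] for row in controls])
        scores.append((abs(u - null_mean), j))
    # Largest deviation first; the feature index breaks ties deterministically.
    scores.sort(key=lambda s: (-s[0], s[1]))
    return sorted(j for _, j in scores[:k])


# Hypothetical data: rows are subjects, columns are voxel features.
patients = [[5.1, 1.0, 3.3], [5.4, 0.9, 2.9], [5.0, 1.2, 3.1], [5.6, 1.1, 3.0]]
controls = [[4.0, 1.1, 3.2], [4.2, 1.0, 3.0], [3.9, 0.8, 3.4], [4.1, 1.3, 2.8]]
selected = select_features(patients, controls, k=1)
```

Only the first feature separates the two groups completely, so it is the one retained. To avoid optimistic accuracy estimates, a selection like this would normally be run inside each cross-validation fold rather than on the full data set.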
- Keywords
- 3D CNN, autoencoders, classification, deep learning, deformation-based morphometry, schizophrenia, voxel-based morphometry
- Publication type
- Journal Article MeSH
INTRODUCTION: Recent advances in machine learning provide new possibilities to process and analyse observational patient data to predict patient outcomes. In this paper, we introduce a data processing pipeline for cardiogenic shock (CS) prediction from the MIMIC III database of intensive cardiac care unit patients with acute coronary syndrome. The ability to identify high-risk patients could allow pre-emptive measures to be taken and thus prevent the development of CS. METHODS: We mainly focus on techniques for the imputation of missing data by generating a pipeline for imputation and comparing the performance of various multivariate imputation algorithms, including k-nearest neighbours, two singular value decomposition (SVD)-based methods, and Multiple Imputation by Chained Equations. After imputation, we select the final subjects and variables from the imputed dataset and showcase the performance of the gradient-boosted framework that uses a tree-based classifier for cardiogenic shock prediction. RESULTS: We achieved good classification performance thanks to data cleaning and imputation (cross-validated mean area under the curve 0.805) without hyperparameter optimization. CONCLUSION: We believe our pre-processing pipeline would also prove helpful for other classification and regression experiments.
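Of the imputation algorithms listed in the abstract above, k-nearest neighbours is the simplest to illustrate. The following is a minimal sketch, assuming missing values are encoded as `None` and that the variables are already on comparable scales; the function name and the toy rows are hypothetical and are not taken from the MIMIC III pipeline.

```python
import math


def knn_impute(rows, k):
    """Fill None entries with the mean of the k nearest rows that observe that column."""

    def distance(r, s):
        # Compare only the columns observed in both rows.
        shared = [(a, b) for a, b in zip(r, s) if a is not None and b is not None]
        if not shared:
            return math.inf
        # Root mean squared difference, so rows with few shared columns stay comparable.
        return math.sqrt(sum((a - b) ** 2 for a, b in shared) / len(shared))

    imputed = [list(r) for r in rows]  # the input is left untouched
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            donors = sorted(
                (distance(row, other), other[j])
                for m, other in enumerate(rows)
                if m != i and other[j] is not None
            )
            nearest = [v for d, v in donors[:k] if d < math.inf]
            if nearest:
                imputed[i][j] = sum(nearest) / len(nearest)
    return imputed


# Hypothetical patient rows; the last one is missing its third variable.
rows = [
    [1.0, 2.0, 10.0],
    [1.1, 2.1, 12.0],
    [5.0, 6.0, 50.0],
    [1.05, 2.05, None],
]
filled = knn_impute(rows, k=2)
```

The missing value is filled with the average of the two most similar rows (10.0 and 12.0), while the dissimilar third row is ignored. On real clinical data the variables would usually be standardized first so that no single variable dominates the distance.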
- Keywords
- cardiogenic shock, classification, machine learning, missing data imputation, prediction model, processing pipeline
- Publication type
- Journal Article MeSH
Clinical metabolomics aims at finding statistically significant differences in the metabolic statuses of patient and control groups, with the intention of understanding pathobiochemical processes and identifying clinically useful biomarkers of particular diseases. After the raw measurements are integrated and pre-processed as intensities of chromatographic peaks, the differences between controls and patients are evaluated by both univariate and multivariate statistical methods. The traditional univariate approach relies on t-tests (or their nonparametric alternatives) and the results from multiple testing are misleadingly compared merely by p-values using the so-called volcano plot. This paper proposes a Bayesian counterpart to the widespread univariate analysis, taking into account the compositional character of a metabolome. Since each metabolome is a collection of small-molecule metabolites in a biological material, the relative structure of metabolomic data, which is inherently contained in ratios between metabolites, is of main interest. Therefore, a proper choice of logratio coordinates is an essential step for any statistical analysis of such data. In addition, a concept of b-values is introduced together with a Bayesian version of the volcano plot incorporating distance levels of the posterior highest density intervals from zero. The theoretical background of the contribution is illustrated using two data sets containing samples of patients suffering from 3-hydroxy-3-methylglutaryl-CoA lyase deficiency and medium-chain acyl-CoA dehydrogenase deficiency. To evaluate the stability of the proposed method as well as the benefits of the compositional approach, two simulations are added, designed to mimic a loss of samples and a systematic measurement error, respectively.
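The abstract above stresses that the information in metabolomic data lies in ratios between metabolites. As a minimal illustration, the sketch below applies the centered logratio (clr) transformation, one common logratio representation; the abstract does not state which logratio coordinates the authors chose, and the function name and peak intensities are hypothetical.

```python
import math


def clr(composition):
    """Centered logratio: the log of each part relative to the geometric mean."""
    if any(x <= 0 for x in composition):
        raise ValueError("clr requires strictly positive parts")
    logs = [math.log(x) for x in composition]
    mean_log = sum(logs) / len(logs)
    return [v - mean_log for v in logs]


# Hypothetical peak intensities of four metabolites in one sample.
peaks = [120.0, 30.0, 850.0, 5.0]
coords = clr(peaks)

# The same sample measured at three times the overall intensity.
rescaled = clr([3.0 * x for x in peaks])
```

The transformed values are identical for both versions of the sample and sum to zero, so the result depends only on the ratios between metabolites and not on the overall signal level. Zero intensities have to be handled (for example, replaced) before any logratio transformation is applied.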
- Keywords
- Bayesian inference, Compositional data, High-dimensional data, Multiple hypotheses testing, Untargeted metabolomics, Volcano plot
- MeSH
- Acetyl-CoA C-Acetyltransferase deficiency metabolism MeSH
- Acyl-CoA Dehydrogenase deficiency metabolism MeSH
- Bayes Theorem * MeSH
- Datasets as Topic MeSH
- Humans MeSH
- Metabolomics * MeSH
- Amino Acid Metabolism, Inborn Errors metabolism MeSH
- Lipid Metabolism, Inborn Errors metabolism MeSH
- Check Tag
- Humans MeSH
- Publication type
- Journal Article MeSH
- Substance names
- Acetyl-CoA C-Acetyltransferase MeSH
- Acyl-CoA Dehydrogenase MeSH
OBJECTIVE: This scoping review aims to identify, catalogue, and characterize previously reported tools, techniques, methods, and processes that have been recommended or used by evidence synthesizers to detect fraudulent or erroneous data and mitigate its impact. INTRODUCTION: Decision-making for policy and practice should always be underpinned by the best available evidence, typically peer-reviewed scientific literature. Evidence synthesis literature should be collated and organized using the appropriate evidence synthesis methodology, best exemplified by the role systematic reviews play in evidence-based health care. However, with the rise of "predatory journals," fraudulent or erroneous data may be invading this literature, which may negatively affect evidence syntheses that use this data. This, in turn, may compromise decision-making processes. INCLUSION CRITERIA: This review will include peer-reviewed articles, commentaries, books, and editorials that describe at least 1 tool, technique, method, or process with the explicit purpose of identifying or mitigating the impact of fraudulent or erroneous data for any evidence synthesis, in any topic area. Manuals, handbooks, and guidance from major organizations, universities, and libraries will also be considered. METHODS: This review will be conducted using the JBI methodology for scoping reviews and reported according to the Preferred Reporting Items for Systematic Reviews and Meta-Analyses extension for Scoping Reviews (PRISMA-ScR). Databases and relevant organizational websites will be searched for eligible studies. Title and abstract, and, subsequently, full-text screening will be conducted in duplicate. Data from identified full texts will be extracted using a pre-determined checklist, while the findings will be summarized descriptively and presented in tables. REVIEW REGISTRATION: Open Science Framework https://osf.io/u8yrn.
- MeSH
- Humans MeSH
- Fraud * prevention & control MeSH
- Decision Making MeSH
- Scoping Review as Topic MeSH
- Scientific Misconduct * MeSH
- Research Design MeSH
- Check Tag
- Humans MeSH
- Publication type
- Journal Article MeSH
- MeSH
- Electronic Data Processing * MeSH
- Humans MeSH
- Software MeSH
- Dental Records * MeSH
- Check Tag
- Humans MeSH
- Publication type
- Journal Article MeSH
MEMS (micro-electro-mechanical system)-based inertial sensors, i.e., accelerometers and angular rate sensors, are commonly used as a cost-effective solution for navigation in a broad spectrum of terrestrial and aerospace applications. These tri-axial inertial sensors form an inertial measurement unit (IMU), which is a core unit of navigation systems. Although MEMS sensors have an advantage in their size, cost, weight and power consumption, they suffer from bias instability, noisy output and insufficient resolution. Furthermore, the sensor's behavior can be significantly affected by strong vibration when it operates in harsh environments. All of these conditions require treatment through data processing. For cases where the navigation solution is primarily based on inertial data only, this paper proposes a novel concept in adaptive data pre-processing using variable bandwidth filtering. This approach utilizes sinusoidal estimation to continuously adapt the filtering bandwidth of the accelerometer's data in order to reduce the effects of vibration and sensor noise before attitude estimation is processed. Low frequency vibration generally limits the conditions under which the accelerometers can be used to aid the attitude estimation process, which is primarily based on angular rate data, and thus decreases its accuracy. In contrast, the proposed pre-processing technique enables using accelerometers as an aiding source by effective data smoothing, even when they are affected by low frequency vibration. Verification of the proposed concept is performed on simulation and real-flight data obtained on an ultra-light aircraft. The results of both types of experiments confirm the suitability of the concept for inertial data pre-processing.
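The idea of variable bandwidth filtering in the abstract above can be illustrated with a first-order low-pass filter whose cutoff frequency may change at every sample. This is a simplified sketch, not the authors' method: in the paper the bandwidth is adapted from a sinusoidal estimate of the vibration, whereas here the cutoff sequence is simply supplied by the caller. The function names, sampling rate and signal are hypothetical.

```python
import math


def variable_bandwidth_lowpass(samples, cutoffs_hz, dt):
    """First-order low-pass filter with a per-sample cutoff frequency."""
    if len(samples) != len(cutoffs_hz):
        raise ValueError("one cutoff frequency is needed per sample")
    output = []
    state = samples[0]  # start from the first sample to avoid a start-up transient
    for x, fc in zip(samples, cutoffs_hz):
        rc = 1.0 / (2.0 * math.pi * fc)  # time constant for this cutoff
        alpha = dt / (rc + dt)           # smoothing factor between 0 and 1
        state += alpha * (x - state)
        output.append(state)
    return output


def ripple(signal):
    """Peak-to-peak amplitude over the second half of the signal."""
    tail = signal[len(signal) // 2:]
    return max(tail) - min(tail)


# Hypothetical accelerometer axis: gravity plus a 5 Hz vibration, sampled at 100 Hz.
dt = 0.01
accel = [9.81 + 0.5 * math.sin(2.0 * math.pi * 5.0 * i * dt) for i in range(500)]

wide = variable_bandwidth_lowpass(accel, [20.0] * len(accel), dt)
narrow = variable_bandwidth_lowpass(accel, [0.5] * len(accel), dt)
```

Narrowing the bandwidth suppresses the vibration far more strongly, at the cost of a slower response to genuine changes in acceleration. An adaptive scheme would lower the cutoff only while vibration is detected and widen it again afterwards.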
- MeSH
- Equipment Design MeSH
- Geographic Information Systems MeSH
- Aircraft standards MeSH
- Aerospace Medicine instrumentation MeSH
- Humans MeSH
- Micro-Electrical-Mechanical Systems instrumentation MeSH
- Software MeSH
- Remote Sensing Technology * MeSH
- Check Tag
- Humans MeSH
- Publication type
- Journal Article MeSH
- Research Support, Non-U.S. Gov't MeSH
One of the biggest challenges of training deep neural networks is the need for massive data annotation. To train a neural network for object detection, millions of annotated training images are required. However, there are currently no large-scale thermal image datasets that could be used to train state-of-the-art neural networks, while voluminous RGB image datasets are available. This paper presents a method for creating hundreds of thousands of annotated thermal images using an RGB pre-trained object detector. A dataset created in this way can be used to train object detectors with improved performance. The main contribution of this work is a novel method for fully automatic thermal image labeling. The proposed system uses an RGB camera, a thermal camera, a 3D LiDAR, and a pre-trained neural network that detects objects in the RGB domain. Using this setup, it is possible to run a fully automated process that annotates the thermal images and creates an automatically annotated thermal training dataset. As a result, we created a dataset containing hundreds of thousands of annotated objects. This approach makes it possible to train deep learning models with performance similar to that of common human-annotation-based methods. This paper also proposes several improvements to fine-tune the results with minimal human intervention. Finally, the evaluation of the proposed solution shows that the method gives significantly better results than training the neural network with standard small-scale hand-annotated thermal image datasets.
- Keywords
- IR, RGB, YOLO, data annotation, deep convolutional neural networks, object detector, thermal, transfer learning
- Publication type
- Journal Article MeSH