Anomaly Detection Algorithm for Real-World Data and Evidence in Clinical Research: Implementation, Evaluation, and Validation Study
Status PubMed-not-MEDLINE Language English Country Canada Media electronic
Document type journal articles
PubMed: 33851576
PubMed Central: PMC8140384
DOI: 10.2196/27172
PII: v9i5e27172
- Keywords
- EDC system, anomaly detection, clinical research data, data quality, real-world evidence, registry database
- Publication type
- journal articles
BACKGROUND: Statistical analysis, which has become an integral part of evidence-based medicine, relies heavily on data quality, which is of critical importance in modern clinical research. Input data are at risk not only of being falsified or fabricated but also of being mishandled by investigators.
OBJECTIVE: The urgent need to assure the highest possible data quality has led to the implementation of various auditing strategies designed to monitor clinical trials and detect the errors of different origins that frequently occur in the field. The objective of this study was to describe a machine learning-based algorithm that detects anomalous patterns in data arising from carelessness, systematic error, or the intentional entry of fabricated values.
METHODS: A particular electronic data capture (EDC) system, which is used for data management in clinical registries, is presented, including its architecture and data structure. This EDC system features a machine learning-based algorithm designed to detect anomalous patterns in quantitative data. The detection algorithm combines clustering with a series of 7 distance metrics that serve to determine the strength of an anomaly. In the detection process, thresholds and combinations of the metrics were used, and detection performance was evaluated and validated in experiments involving simulated anomalous data and real-world data.
RESULTS: Five different clinical registries related to neuroscience were presented, all of them running in the given EDC system. Two of the registries were selected for the evaluation experiments and also served to validate the detection performance on an independent data set. The best-performing combination of distance metrics was that of Canberra, Manhattan, and Mahalanobis, whereas the Cosine and Chebyshev metrics were excluded from further analysis because they had the lowest performance when used as single distance metric-based classifiers.
CONCLUSIONS: The experimental results demonstrate that the algorithm is universal in nature, and as such may be implemented in other EDC systems, and is capable of anomalous data detection with a sensitivity exceeding 85%.
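The abstract describes the approach only in outline, so the following is a minimal sketch under stated assumptions rather than the study's implementation. It assumes k-means as the clustering step, uses only the three metrics reported as the best-performing combination (Canberra, Manhattan, Mahalanobis), sets each threshold at the 95th percentile of distances in clean reference data, and flags a record when at least two of the three metrics exceed their thresholds. It is limited to records with two numeric features. All function names, parameters, and data values are hypothetical.

```python
import math
import random

METRICS = ("canberra", "manhattan", "mahalanobis")

def sq_dist(p, c):
    return sum((a - b) ** 2 for a, b in zip(p, c))

def nearest(p, centroids):
    return min(range(len(centroids)), key=lambda j: sq_dist(p, centroids[j]))

def kmeans(points, k, iters=10):
    # Assumption: deterministic farthest-first initialisation, then Lloyd's iterations.
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(max(points, key=lambda p: min(sq_dist(p, c) for c in centroids)))
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[nearest(p, centroids)].append(p)
        centroids = [tuple(sum(col) / len(cl) for col in zip(*cl)) if cl else centroids[i]
                     for i, cl in enumerate(clusters)]
    return centroids, clusters

def covariance_2d(cluster, c):
    # Sample variances and covariance of a 2-feature cluster around its centroid.
    n = len(cluster)
    sxx = sum((x - c[0]) ** 2 for x, _ in cluster) / (n - 1)
    syy = sum((y - c[1]) ** 2 for _, y in cluster) / (n - 1)
    sxy = sum((x - c[0]) * (y - c[1]) for x, y in cluster) / (n - 1)
    return sxx, syy, sxy

def distances(p, model):
    # Distances from a record to its nearest cluster centroid.
    j = nearest(p, model["centroids"])
    c = model["centroids"][j]
    sxx, syy, sxy = model["covs"][j]
    dx, dy = p[0] - c[0], p[1] - c[1]
    det = sxx * syy - sxy ** 2
    return {
        # Canberra terms with a zero denominator are skipped.
        "canberra": sum(abs(a - b) / (abs(a) + abs(b))
                        for a, b in zip(p, c) if abs(a) + abs(b) > 0),
        "manhattan": abs(dx) + abs(dy),
        # Closed-form Mahalanobis distance for a 2x2 covariance matrix.
        "mahalanobis": math.sqrt((dx * dx * syy - 2 * dx * dy * sxy + dy * dy * sxx) / det),
    }

def fit(reference, k=2, quantile=0.95):
    centroids, clusters = kmeans(reference, k)
    model = {"centroids": centroids,
             "covs": [covariance_2d(cl, c) for cl, c in zip(clusters, centroids)]}
    ref = [distances(p, model) for p in reference]
    # Assumption: one pooled threshold per metric, taken from the reference data.
    model["thresholds"] = {m: sorted(d[m] for d in ref)[int(quantile * (len(ref) - 1))]
                           for m in METRICS}
    return model

def is_anomalous(p, model, min_votes=2):
    d = distances(p, model)
    return sum(d[m] > model["thresholds"][m] for m in METRICS) >= min_votes

# Hypothetical reference data: two well-separated groups of plausible records.
rng = random.Random(0)
reference = ([(rng.gauss(5, 1), rng.gauss(5, 1)) for _ in range(100)]
             + [(rng.gauss(50, 1), rng.gauss(50, 1)) for _ in range(100)])
model = fit(reference)

# Simulated anomalies placed far from both groups; sensitivity = flagged / total.
simulated = [(200.0, 200.0), (-80.0, 40.0), (5.0, 120.0)]
sensitivity = sum(is_anomalous(p, model) for p in simulated) / len(simulated)
print(f"sensitivity on simulated anomalies: {sensitivity:.2f}")
```

The simulated anomalies here are deliberately extreme, so the printed sensitivity says nothing about performance on realistic errors. In practice the thresholds and the voting rule would need tuning against labelled or simulated data from the registry in question.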
Faculty of Medicine Masaryk University Brno Czech Republic
Institute of Biostatistics and Analyses Ltd Brno Czech Republic