
Published in: BMC Medical Informatics and Decision Making

Open Access 01-12-2019 | Technical advance

A clustering approach for detecting implausible observation values in electronic health records data

Authors: Hossein Estiri, Jeffrey G. Klann, Shawn N. Murphy

Published in: BMC Medical Informatics and Decision Making | Issue 1/2019

Abstract

Background

Identifying implausible clinical observations (e.g., laboratory test and vital sign values) in Electronic Health Record (EHR) data using rule-based procedures is challenging. Anomaly/outlier detection methods can be applied as an alternative algorithmic approach to flagging such implausible values in EHRs.

Methods

The primary objectives of this research were to develop and test an unsupervised clustering-based anomaly/outlier detection approach for detecting implausible observations in EHR data, as an algorithmic alternative to existing rule-based procedures. Our approach rests on two hypotheses: (i) when there is a large number of observations, implausible records should be sparse; and therefore (ii) if these data are clustered properly, sparsely populated clusters should represent implausible observations. To test these hypotheses, we applied an unsupervised clustering algorithm to EHR observation data on 50 laboratory tests from Partners HealthCare. We tested different specifications of the clustering approach and computed confusion matrix indices against a set of silver-standard plausibility thresholds. We compared the results of the proposed approach with those of conventional anomaly detection (CAD) approaches, including standard deviation and Mahalanobis distance.
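The two hypotheses can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not name the clustering algorithm, so plain one-dimensional k-means is assumed here, and the function names, the number of clusters `k`, and the sparsity cut-off `min_share` are hypothetical choices made only for illustration. The values are synthetic.

```python
from collections import Counter


def kmeans_1d(values, k, iters=100):
    """Plain Lloyd's k-means on 1-D data (assumed algorithm, not from the paper)."""
    lo, hi = min(values), max(values)
    # Hypothetical deterministic start: centers spread evenly over the value range.
    centers = [lo + (hi - lo) * (i + 0.5) / k for i in range(k)]
    labels = [0] * len(values)
    for _ in range(iters):
        # Assign every value to its nearest center.
        labels = [min(range(k), key=lambda j: abs(v - centers[j])) for v in values]
        new_centers = []
        for j in range(k):
            members = [v for v, lab in zip(values, labels) if lab == j]
            # An empty cluster keeps its previous center.
            new_centers.append(sum(members) / len(members) if members else centers[j])
        if new_centers == centers:
            break
        centers = new_centers
    return labels


def flag_sparse_clusters(values, k=3, min_share=0.05):
    """Flag values that fall in clusters holding less than min_share of all records."""
    labels = kmeans_1d(values, k)
    counts = Counter(labels)
    sparse = {c for c, n in counts.items() if n / len(values) < min_share}
    return [lab in sparse for lab in labels]


# Synthetic example: 98 values in a narrow band plus two extreme values.
values = [4.0 + 0.01 * i for i in range(98)] + [400.0, 410.0]
flags = flag_sparse_clusters(values)
print([v for v, f in zip(values, flags) if f])
```

In this toy data the two extreme values form their own thinly populated cluster and are the only records flagged. With real laboratory data, the choice of `k` and of the sparsity cut-off would need the kind of testing across specifications that the authors describe.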

Results

We found that the clustering approach produced results with exceptional specificity and high sensitivity. Compared with the conventional anomaly detection approaches, the proposed clustering approach produced a significantly smaller number of false-positive cases.
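The confusion matrix indices behind these results can be sketched as follows. This is an illustration under stated assumptions, not the authors' evaluation code: the plausibility limits `low` and `high` are invented placeholders rather than the paper's silver-standard thresholds, the three-standard-deviation rule is just one common form of the standard deviation approach, and all function names are hypothetical.

```python
from statistics import mean, pstdev


def silver_standard_truth(values, low, high):
    """Label a value implausible when it lies outside [low, high] (placeholder limits)."""
    return [v < low or v > high for v in values]


def flag_by_sd(values, n_sd=3.0):
    """Conventional rule: flag values more than n_sd standard deviations from the mean."""
    m, s = mean(values), pstdev(values)
    return [abs(v - m) > n_sd * s for v in values]


def confusion_indices(flags, truth):
    """Return the confusion-matrix counts plus sensitivity and specificity."""
    tp = sum(1 for f, t in zip(flags, truth) if f and t)
    fp = sum(1 for f, t in zip(flags, truth) if f and not t)
    fn = sum(1 for f, t in zip(flags, truth) if not f and t)
    tn = sum(1 for f, t in zip(flags, truth) if not f and not t)
    return {
        "tp": tp, "fp": fp, "fn": fn, "tn": tn,
        "sensitivity": tp / (tp + fn) if tp + fn else float("nan"),
        "specificity": tn / (tn + fp) if tn + fp else float("nan"),
    }


# Synthetic values and invented limits, for illustration only.
values = [4.0 + 0.01 * i for i in range(98)] + [400.0, 410.0]
truth = silver_standard_truth(values, low=0.5, high=15.0)
sd_result = confusion_indices(flag_by_sd(values), truth)
print(sd_result)
```

Any detector that returns one boolean flag per observation, including a clustering-based one, can be passed to `confusion_indices` in the same way, which makes the false-positive counts of different approaches directly comparable.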

Conclusion

Our contributions include (i) a clustering approach for identifying implausible EHR observations, (ii) evidence that implausible observations are sparse in EHR laboratory test results, (iii) a parallel implementation of the clustering approach on the i2b2 star schema, and (iv) a set of silver-standard plausibility thresholds for 50 laboratory tests that can be used in other studies for validation. The proposed algorithmic solution can augment human decisions to improve data quality. A workflow is therefore needed to complement the algorithm and initiate the actions required to improve data quality.



Title: A clustering approach for detecting implausible observation values in electronic health records data
Authors: Hossein Estiri, Jeffrey G. Klann, Shawn N. Murphy
Publication date: 01-12-2019
Publisher: BioMed Central
Published in: BMC Medical Informatics and Decision Making / Issue 1/2019
Electronic ISSN: 1472-6947
DOI: https://doi.org/10.1186/s12911-019-0852-6
