Published in: BMC Medical Research Methodology 1/2019

Open Access 01-12-2019 | Research article

The utility of multivariate outlier detection techniques for data quality evaluation in large studies: an application within the ONDRI project

Authors: Kelly M. Sunderland, Derek Beaton, Julia Fraser, Donna Kwan, Paula M. McLaughlin, Manuel Montero-Odasso, Alicia J. Peltsch, Frederico Pieruccini-Faria, Demetrios J. Sahlas, Richard H. Swartz, Stephen C. Strother, Malcolm A. Binns, ONDRI Investigators

Abstract

Background

Large and complex studies are now routine, and quality assurance and quality control (QC) procedures ensure reliable results and conclusions. Standard procedures may comprise manual verification and double entry, but these labour-intensive methods often leave errors undetected. Outlier detection uses a data-driven approach to identify patterns exhibited by the majority of the data and highlights data points that deviate from these patterns. Univariate methods consider each variable independently, so observations that appear odd only when two or more variables are considered simultaneously remain undetected. We propose a data quality evaluation process that emphasizes the use of multivariate outlier detection for identifying errors, and show that univariate approaches alone are insufficient. Further, we establish an iterative process that uses multiple multivariate approaches, communication between teams, and visualization for other large-scale projects to follow.
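
The point about univariate screening can be illustrated with a minimal sketch. The toy records, the |z| > 3 rule, and the 97.5% chi-square cutoff below are assumptions chosen for illustration; they are not the ONDRI data or the thresholds used in the study. One record has values that are ordinary on each variable but inconsistent with the relationship between the two.

```python
import math

# Toy data (hypothetical): 20 records where y tracks x closely, plus one
# record (index 20) that is unremarkable on x and on y separately but
# breaks the x-y relationship.
records = [(float(i), i + (0.5 if i % 2 == 0 else -0.5)) for i in range(20)]
records.append((5.0, 15.0))

n = len(records)
mean_x = sum(x for x, _ in records) / n
mean_y = sum(y for _, y in records) / n
var_x = sum((x - mean_x) ** 2 for x, _ in records) / (n - 1)
var_y = sum((y - mean_y) ** 2 for _, y in records) / (n - 1)
cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in records) / (n - 1)

# Univariate screen: flag a record if either variable has |z| > 3.
univariate_flags = [
    i for i, (x, y) in enumerate(records)
    if abs(x - mean_x) / math.sqrt(var_x) > 3
    or abs(y - mean_y) / math.sqrt(var_y) > 3
]

# Multivariate screen: squared Mahalanobis distance from the 2x2 covariance,
# compared with the 97.5% chi-square quantile for 2 degrees of freedom
# (for 2 df the quantile has the closed form -2 * ln(1 - p)).
cutoff = -2.0 * math.log(1.0 - 0.975)
det = var_x * var_y - cov_xy ** 2

def squared_distance(x, y):
    dx, dy = x - mean_x, y - mean_y
    return (var_y * dx * dx - 2 * cov_xy * dx * dy + var_x * dy * dy) / det

multivariate_flags = [
    i for i, (x, y) in enumerate(records) if squared_distance(x, y) > cutoff
]

print("univariate:", univariate_flags)
print("multivariate:", multivariate_flags)
```

In this toy data the univariate screen flags nothing, while the distance-based screen flags only the last record. The sketch uses the classical (non-robust) mean and covariance, which is enough to show the idea but is itself sensitive to outliers.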

Methods

We illustrate this process with preliminary neuropsychology and gait data for the vascular cognitive impairment cohort from the Ontario Neurodegenerative Disease Research Initiative, a multi-cohort observational study that aims to characterize biomarkers within and between five neurodegenerative diseases. Each dataset was evaluated four times: with and without covariate adjustment using two validated multivariate methods – Minimum Covariance Determinant (MCD) and Candès’ Robust Principal Component Analysis (RPCA) – and results were assessed in relation to two univariate methods. Outlying participants identified by multiple multivariate analyses were compiled and communicated to the data teams for verification.
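
As a rough illustration of the idea behind the MCD, the sketch below finds the subset of h observations whose sample covariance has the smallest determinant and then measures every observation against that subset's mean and covariance. It is a simplified toy under stated assumptions, not the authors' pipeline: it enumerates all subsets (feasible only for tiny data; practical implementations use approximate searches), it omits the consistency and small-sample corrections and the reweighting step, and the data, h = 9, and the 97.5% chi-square cutoff are all hypothetical choices.

```python
from itertools import combinations
import math

def mean_cov(points):
    """Sample mean and 2x2 sample covariance (n - 1 denominator)."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    return (mx, my), (sxx, sxy, syy)

def det(cov):
    sxx, sxy, syy = cov
    return sxx * syy - sxy ** 2

def squared_distance(p, center, cov):
    """Squared Mahalanobis distance of p from center under cov."""
    sxx, sxy, syy = cov
    dx, dy = p[0] - center[0], p[1] - center[1]
    return (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det(cov)

def flag(points, center, cov, cutoff):
    return [i for i, p in enumerate(points)
            if squared_distance(p, center, cov) > cutoff]

# Toy data (hypothetical): 10 consistent records and 2 erroneous ones
# (indices 10 and 11) that sit together away from the main pattern.
good = [(float(i), i + (0.3 if i % 2 == 0 else -0.3)) for i in range(10)]
errors = [(1.0, 8.0), (2.0, 9.0)]
data = good + errors

# 97.5% chi-square quantile for 2 degrees of freedom: -2 * ln(1 - p).
cutoff = -2.0 * math.log(1.0 - 0.975)

# Classical distances: mean and covariance from all records, errors included.
classical_flags = flag(data, *mean_cov(data), cutoff)

# MCD-style distances: mean and covariance from the h-subset with the
# smallest covariance determinant (exhaustive search over 220 subsets).
h = 9
best_subset = min(combinations(data, h), key=lambda s: det(mean_cov(s)[1]))
robust_flags = flag(data, *mean_cov(best_subset), cutoff)

print("classical:", classical_flags)
print("MCD-style:", robust_flags)
```

Here the two erroneous records inflate the classical covariance enough to hide themselves (masking), so the classical screen flags nothing, whereas the MCD-style screen flags both. That is the motivation for using a robust estimator rather than the ordinary covariance when the goal is to find errors.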

Results

Of 161 and 148 participants in the neuropsychology and gait data, 44 and 43 were flagged by one or both multivariate methods, and errors were identified for 8 and 5 participants, respectively. MCD identified all participants with errors, while RPCA identified 6/8 and 3/5 for the neuropsychology and gait data, respectively. Both outperformed univariate approaches. Adjusting for covariates had only a minor effect on which participants were identified as outliers, though it did affect error detection.

Conclusions

Manual QC procedures are insufficient for large studies as many errors remain undetected. In these data, the MCD outperforms the RPCA for identifying errors, and both are more successful than univariate approaches. Therefore, data-driven multivariate outlier techniques are essential tools for QC as data become more complex.
Metadata
Title
The utility of multivariate outlier detection techniques for data quality evaluation in large studies: an application within the ONDRI project
Authors
Kelly M. Sunderland
Derek Beaton
Julia Fraser
Donna Kwan
Paula M. McLaughlin
Manuel Montero-Odasso
Alicia J. Peltsch
Frederico Pieruccini-Faria
Demetrios J. Sahlas
Richard H. Swartz
Stephen C. Strother
Malcolm A. Binns
ONDRI Investigators
Publication date
01-12-2019
Publisher
BioMed Central
Published in
BMC Medical Research Methodology / Issue 1/2019
Electronic ISSN: 1471-2288
DOI
https://doi.org/10.1186/s12874-019-0737-5