Published in: BMC Medical Research Methodology 1/2013

Open Access 01-12-2013 | Research article

Diagnosing problems with imputation models using the Kolmogorov-Smirnov test: a simulation study

Authors: Cattram D Nguyen, John B Carlin, Katherine J Lee

Abstract

Background

Multiple imputation (MI) is becoming increasingly popular as a strategy for handling missing data, but there is a scarcity of tools for checking the adequacy of imputation models. The Kolmogorov-Smirnov (KS) test has been identified as a potential diagnostic method for assessing whether the distribution of imputed data deviates substantially from that of the observed data. The aim of this study was to evaluate the performance of the KS test as an imputation diagnostic.

Methods

Using simulation, we examined whether the KS test could reliably identify departures from the assumptions made in the imputation model. To do this, we examined how the p-values from the KS test behaved when skewed and heavy-tailed data were imputed using a normal imputation model. We varied the amount of missing data, the missing data models and the amount of skewness, and evaluated the performance of the KS test in diagnosing issues with the imputation models under these different scenarios.
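
The basic idea of the diagnostic can be illustrated with a minimal Python sketch. It is not the authors' simulation code, and it rests on several simplifying assumptions: values are deleted completely at random, the "imputation model" is a single normal distribution fitted to the observed values (with no covariates and no draw of the parameters from their posterior), only one imputed data set is generated, and the p-value uses the asymptotic Kolmogorov approximation. The function names `ks_two_sample` and `impute_and_test`, the seed and the sample sizes are all hypothetical choices made for illustration.

```python
import math
import random


def ks_two_sample(a, b):
    """Two-sample KS statistic D and an asymptotic (large-sample) p-value."""
    a, b = sorted(a), sorted(b)
    n, m = len(a), len(b)
    i = j = 0
    d = 0.0
    # Walk through the pooled values and track the largest gap between
    # the two empirical distribution functions.
    while i < n and j < m:
        x = min(a[i], b[j])
        while i < n and a[i] <= x:
            i += 1
        while j < m and b[j] <= x:
            j += 1
        d = max(d, abs(i / n - j / m))
    lam = d * math.sqrt(n * m / (n + m))
    if lam < 0.2:  # the series below converges poorly here; p is essentially 1
        return d, 1.0
    p = 2.0 * sum((-1) ** (k - 1) * math.exp(-2.0 * k * k * lam * lam)
                  for k in range(1, 101))
    return d, min(1.0, max(0.0, p))


def impute_and_test(values, prop_missing, rng):
    """Delete values at random, impute from a fitted normal, compare with KS."""
    observed, n_missing = [], 0
    for v in values:
        if rng.random() < prop_missing:
            n_missing += 1
        else:
            observed.append(v)
    mean = sum(observed) / len(observed)
    sd = math.sqrt(sum((v - mean) ** 2 for v in observed) / (len(observed) - 1))
    # Assumption: imputations are plain draws from the fitted normal.
    imputed = [rng.gauss(mean, sd) for _ in range(n_missing)]
    return ks_two_sample(observed, imputed)


rng = random.Random(2013)
skewed = [math.exp(rng.gauss(0.0, 1.0)) for _ in range(1000)]  # log-normal
normal = [rng.gauss(0.0, 1.0) for _ in range(1000)]

d_skew, p_skew = impute_and_test(skewed, 0.3, rng)
d_norm, p_norm = impute_and_test(normal, 0.3, rng)
print(f"skewed data: D = {d_skew:.3f}, p = {p_skew:.3g}")
print(f"normal data: D = {d_norm:.3f}, p = {p_norm:.3g}")
```

In this simplified setting the normal model is clearly wrong for the log-normal variable, so the observed and imputed values should differ markedly; whether such a difference matters for the target analysis is a separate question that the sketch does not address.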

Results

The KS test was able to flag differences between the observations and imputed values; however, these differences did not always correspond to problems with MI inference for the regression parameter of interest. When there was a strong missing at random dependency, the KS p-values were very small regardless of whether the MI estimates were biased, so the KS test could not discriminate between imputed variables that required further investigation and those that did not. The p-values were also sensitive to sample size and the proportion of missing data, adding to the challenge of interpreting the results from the KS test.
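
Why a strong missing at random dependency produces small p-values even without bias can be seen in a short Python sketch. This is an illustrative construction under stated assumptions, not the study's design: the outcome `y` is truly normal given a fully observed covariate `x`, the probability that `y` is missing follows a hypothetical logistic function of `x`, and the imputation model is the correctly specified linear regression of `y` on `x` fitted to the complete cases (a single imputation, without drawing the regression parameters from their posterior). All names, the seed and the sample size are assumptions.

```python
import bisect
import math
import random


def ks_statistic(a, b):
    """Largest gap between two empirical CDFs, checked at every pooled value."""
    a, b = sorted(a), sorted(b)
    gaps = (abs(bisect.bisect_right(a, v) / len(a)
                - bisect.bisect_right(b, v) / len(b)) for v in a + b)
    return max(gaps)


def ks_pvalue(d, n, m):
    """Asymptotic Kolmogorov approximation to the two-sample p-value."""
    lam = d * math.sqrt(n * m / (n + m))
    if lam < 0.2:  # series converges poorly here; p is essentially 1
        return 1.0
    p = 2.0 * sum((-1) ** (k - 1) * math.exp(-2.0 * k * k * lam * lam)
                  for k in range(1, 101))
    return min(1.0, max(0.0, p))


rng = random.Random(144)
n = 1000
x = [rng.gauss(0.0, 1.0) for _ in range(n)]
y = [xi + rng.gauss(0.0, 1.0) for xi in x]  # true slope is 1

# Strong MAR (assumed form): y is more likely to be missing when x is large.
missing = [rng.random() < 1.0 / (1.0 + math.exp(-2.0 * xi)) for xi in x]

# Fit the correctly specified regression of y on x to the complete cases.
obs = [(xi, yi) for xi, yi, mi in zip(x, y, missing) if not mi]
mx = sum(xi for xi, _ in obs) / len(obs)
my = sum(yi for _, yi in obs) / len(obs)
sxx = sum((xi - mx) ** 2 for xi, _ in obs)
sxy = sum((xi - mx) * (yi - my) for xi, yi in obs)
slope = sxy / sxx
intercept = my - slope * mx
rss = sum((yi - intercept - slope * xi) ** 2 for xi, yi in obs)
resid_sd = math.sqrt(rss / (len(obs) - 2))

# Impute missing y from the fitted model plus normal residual noise.
imputed = [intercept + slope * xi + rng.gauss(0.0, resid_sd)
           for xi, mi in zip(x, missing) if mi]
observed_y = [yi for _, yi in obs]

d = ks_statistic(observed_y, imputed)
p = ks_pvalue(d, len(observed_y), len(imputed))
print(f"estimated slope = {slope:.2f}, KS D = {d:.3f}, p = {p:.3g}")
```

Here the imputation model is correct, yet the imputed values come from cases with larger `x` and therefore have a different marginal distribution from the observed values. The KS comparison reflects the missingness mechanism rather than a flaw in the model, which is consistent with the difficulty of interpretation described above.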

Conclusions

Given our study results, it is difficult to establish guidelines or recommendations for using the KS test as a diagnostic tool for MI. The investigation of other imputation diagnostics and their incorporation into statistical software are important areas for future research.
Metadata
Title
Diagnosing problems with imputation models using the Kolmogorov-Smirnov test: a simulation study
Authors
Cattram D Nguyen
John B Carlin
Katherine J Lee
Publication date
01-12-2013
Publisher
BioMed Central
Published in
BMC Medical Research Methodology / Issue 1/2013
Electronic ISSN: 1471-2288
DOI
https://doi.org/10.1186/1471-2288-13-144
