Published in: BMC Medical Research Methodology 1/2017

Open Access 01-12-2017 | Research Article

A nonparametric multiple imputation approach for missing categorical data

Authors: Muhan Zhou, Yulei He, Mandi Yu, Chiu-Hsieh Hsu

Abstract

Background

Incomplete categorical variables with more than two categories are common in public health data. However, most of the existing missing-data methods do not use the information from nonresponse (missingness) probabilities.

Methods

We propose a nearest-neighbour multiple imputation approach to impute a categorical outcome that is missing at random and to estimate the proportion of each category. The donor set for imputation is formed by measuring the distance between each missing value and the non-missing values. The distance function is based on a predictive score derived from two working models: a multinomial logistic regression for predicting the missing categorical outcome (the outcome model) and a logistic regression for predicting missingness probabilities (the missingness model). A weighting scheme balances the contributions of the two working models when generating the predictive score. Each missing value is imputed by randomly selecting one of the non-missing values with the smallest distances. We conduct a simulation study to evaluate the performance of the proposed method and compare it with several alternative methods. A real-data application is also presented.
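The donor-selection step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the abstract does not give the exact distance function, so a weighted Euclidean distance on standardized scores is assumed, and the names `nn_multiple_imputation`, `category_proportions`, `w_outcome`, `n_donors` and `n_imputations` are hypothetical. Fitting the two working models is left out; their predictive scores are passed in as arrays, and missing outcomes are coded as -1.

```python
import numpy as np


def _standardize(scores):
    """Centre and scale each score column; constant columns are only centred."""
    sd = scores.std(axis=0)
    sd[sd == 0] = 1.0
    return (scores - scores.mean(axis=0)) / sd


def nn_multiple_imputation(y, outcome_score, missing_score, w_outcome=0.8,
                           n_donors=5, n_imputations=10, seed=0):
    """Return a list of completed copies of y (missing values coded as -1).

    outcome_score: (n, q) scores from the outcome working model (assumed given).
    missing_score: (n,) scores from the missingness working model (assumed given).
    w_outcome:     weight on the outcome model; the missingness model gets
                   1 - w_outcome (an assumed form of the weighting scheme).
    """
    rng = np.random.default_rng(seed)
    y = np.asarray(y)
    n = len(y)
    s_o = _standardize(np.asarray(outcome_score, dtype=float).reshape(n, -1))
    s_m = _standardize(np.asarray(missing_score, dtype=float).reshape(n, 1))
    w_missing = 1.0 - w_outcome

    donors = np.flatnonzero(y >= 0)      # observed cases can donate values
    recipients = np.flatnonzero(y < 0)   # missing cases receive values

    completed = []
    for _ in range(n_imputations):
        y_imp = y.copy()
        for i in recipients:
            # Weighted squared distance from recipient i to every donor.
            d2 = (w_outcome * ((s_o[donors] - s_o[i]) ** 2).sum(axis=1)
                  + w_missing * ((s_m[donors] - s_m[i]) ** 2).sum(axis=1))
            nearest = donors[np.argsort(d2)[:n_donors]]
            # Impute by drawing one donor at random from the nearest set.
            y_imp[i] = y[rng.choice(nearest)]
        completed.append(y_imp)
    return completed


def category_proportions(completed, n_categories):
    """Average the per-category proportions over the imputed data sets."""
    props = [np.bincount(c, minlength=n_categories) / len(c) for c in completed]
    return np.mean(props, axis=0)
```

Setting `w_outcome` close to 1 makes the donor set depend almost entirely on the outcome model, while values close to 0 favour the missingness model.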

Results

The simulation study suggests that the proposed method performs well under some misspecifications of the working models, provided that missingness probabilities are not extreme. However, the calibration estimator, which is also based on two working models, can be highly unstable when missingness probabilities for some observations are extremely high. In this scenario, the proposed method produces more stable and better estimates than the calibration estimator. In addition, proper weights need to be chosen to balance the contributions from the two working models and achieve optimal results for the proposed method.
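A small numerical illustration may help show why extreme missingness probabilities can destabilize a weighting-based estimator. The abstract does not state the form of the calibration estimator; the sketch below assumes, purely for illustration, that observed cases are weighted by the inverse of their probability of being observed. The function name `inverse_probability_weights` and the probabilities used are hypothetical.

```python
import numpy as np


def inverse_probability_weights(p_missing):
    """Weight each observed case by 1 / P(observed) = 1 / (1 - P(missing))."""
    p_observed = 1.0 - np.asarray(p_missing, dtype=float)
    return 1.0 / p_observed


# Hypothetical missingness probabilities for three observed cases.
moderate = inverse_probability_weights([0.2, 0.5, 0.7])
extreme = inverse_probability_weights([0.2, 0.5, 0.99])

# Share of the total weight carried by the most heavily weighted case.
share_moderate = moderate.max() / moderate.sum()
share_extreme = extreme.max() / extreme.sum()
print(round(share_moderate, 2), round(share_extreme, 2))
```

Under this assumption, a single case with a missingness probability of 0.99 carries nearly all of the total weight, so the estimate hinges on one observation. A nearest-neighbour draw of observed values, by contrast, never inflates any single case in this way.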

Conclusions

We conclude that the proposed multiple imputation method is a reasonable approach to dealing with missing categorical outcome data with more than two levels for assessing the distribution of the outcome. In terms of the choices for the working models, we suggest a multinomial logistic regression for predicting the missing outcome and a binary logistic regression for predicting the missingness probability.
Metadata
Title
A nonparametric multiple imputation approach for missing categorical data
Authors
Muhan Zhou
Yulei He
Mandi Yu
Chiu-Hsieh Hsu
Publication date
01-12-2017
Publisher
BioMed Central
Published in
BMC Medical Research Methodology / Issue 1/2017
Electronic ISSN: 1471-2288
DOI
https://doi.org/10.1186/s12874-017-0360-2
