Published in: BMC Medical Informatics and Decision Making 1/2024

Open Access 01-12-2024 | Type 2 Diabetes | Research

Dirichlet process mixture models to impute missing predictor data in counterfactual prediction models: an application to predict optimal type 2 diabetes therapy

Authors: Pedro Cardoso, John M. Dennis, Jack Bowden, Beverley M. Shields, Trevelyan J. McKinley, the MASTERMIND Consortium


Abstract

Background

Handling missing data is a challenge for inference and regression modelling. Missing predictor information is especially problematic when building models for use in clinical practice and when making predictions from them.

Methods

We utilise a flexible Bayesian approach for handling missing predictor information in regression models. This provides practitioners with full posterior predictive distributions for both the missing predictor information (conditional on the observed predictors) and the outcome-of-interest. We apply this approach to a previously proposed counterfactual treatment selection model for type 2 diabetes second-line therapies. Our approach combines a regression model and a Dirichlet process mixture model (DPMM), where the former defines the treatment selection model, and the latter provides a flexible way to model the joint distribution of the predictors.
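As a rough illustration of how a mixture model over the predictors yields a distribution for missing values conditional on the observed ones, the sketch below draws missing entries from a finite Gaussian mixture with fixed parameters. This is a simplified stand-in, not the authors' implementation: it assumes continuous predictors only, treats the component weights, means and covariances as a single known posterior draw, and requires at least one observed and one missing entry. The function name `conditional_mixture_draws` and all parameter names are hypothetical.

```python
import numpy as np

def conditional_mixture_draws(x, weights, means, covs, n_draws=1000, seed=None):
    """Draw the missing entries of x (marked NaN) given its observed entries,
    under a Gaussian mixture with known component parameters (an assumption)."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    miss = np.isnan(x)
    obs = ~miss
    n_comp = len(weights)
    log_w = np.empty(n_comp)
    cond_means, cond_covs = [], []
    for k in range(n_comp):
        mu = np.asarray(means[k], dtype=float)
        S = np.asarray(covs[k], dtype=float)
        S_oo = S[np.ix_(obs, obs)]
        S_mo = S[np.ix_(miss, obs)]
        S_mm = S[np.ix_(miss, miss)]
        diff = x[obs] - mu[obs]
        sol = np.linalg.solve(S_oo, diff)
        # Log of weight_k times the marginal Gaussian density of the observed part.
        _, logdet = np.linalg.slogdet(S_oo)
        log_w[k] = np.log(weights[k]) - 0.5 * (
            logdet + diff @ sol + obs.sum() * np.log(2 * np.pi)
        )
        # Standard Gaussian conditioning for the missing block.
        cond_means.append(mu[miss] + S_mo @ sol)
        cond_covs.append(S_mm - S_mo @ np.linalg.solve(S_oo, S_mo.T))
    # Component responsibilities given the observed predictors.
    post_w = np.exp(log_w - log_w.max())
    post_w /= post_w.sum()
    # Pick a component per draw, then sample from its conditional Gaussian.
    comps = rng.choice(n_comp, size=n_draws, p=post_w)
    draws = np.stack(
        [rng.multivariate_normal(cond_means[k], cond_covs[k]) for k in comps]
    )
    return draws, post_w
```

In a fully Bayesian fit, a step like this would be repeated for each posterior sample of the mixture parameters, and the imputed values would then be passed through the outcome regression to obtain posterior predictive draws of the outcome.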

Results

We show that DPMMs can model complex relationships between predictor variables and provide a powerful means of fitting models to incomplete data (under missing-completely-at-random and missing-at-random assumptions). Because of its hierarchical structure, the framework ensures that the posterior distributions for the parameters and the conditional average treatment effect estimates automatically reflect the additional uncertainty associated with missing data. We also demonstrate that, when multiple predictors are missing, the DPMM can be used to explore which variable(s), if collected, could provide the most additional information about the likely outcome.
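One way to make the idea of ranking missing variables by their potential information concrete is to compare how much the predictive variance of the outcome would shrink if each one were collected. The sketch below does this under strong simplifying assumptions that are not taken from the paper: a single multivariate Gaussian for the predictors (rather than a DPMM), a linear predictor with fixed coefficients `beta`, and variance reduction as the measure of information. The function names are hypothetical.

```python
import numpy as np

def conditional_cov(S, target, given):
    """Covariance of the `target` variables given the `given` variables (Gaussian)."""
    S_tt = S[np.ix_(target, target)]
    if len(given) == 0:
        return S_tt
    S_tg = S[np.ix_(target, given)]
    S_gg = S[np.ix_(given, given)]
    return S_tt - S_tg @ np.linalg.solve(S_gg, S_tg.T)

def rank_missing_by_variance_reduction(S, beta, observed, missing):
    """Rank each missing predictor by how much observing it would reduce
    the variance of the linear predictor beta @ x."""
    S = np.asarray(S, dtype=float)
    beta = np.asarray(beta, dtype=float)

    def linear_predictor_var(given):
        # Only the predictors that remain unobserved contribute variance.
        rest = [i for i in range(len(beta)) if i not in given]
        if not rest:
            return 0.0
        C = conditional_cov(S, rest, given)
        return float(beta[rest] @ C @ beta[rest])

    base = linear_predictor_var(list(observed))
    gains = {
        j: base - linear_predictor_var(list(observed) + [j]) for j in missing
    }
    # Largest variance reduction first.
    return sorted(gains.items(), key=lambda kv: kv[1], reverse=True)
```

For a Gaussian, the conditional covariance does not depend on the value that would be observed, so the ranking is available in closed form. With a mixture model the reduction depends on the unobserved value, and it would have to be averaged over draws from its conditional distribution.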

Conclusions

When developing clinical prediction models, DPMMs offer a flexible way to model complex covariate structures and handle missing predictor information. DPMM-based counterfactual prediction models can also provide additional information to support clinical decision-making, including allowing predictions with appropriate uncertainty to be made for individuals with incomplete predictor data.
Metadata
Title
Dirichlet process mixture models to impute missing predictor data in counterfactual prediction models: an application to predict optimal type 2 diabetes therapy
Authors
Pedro Cardoso
John M. Dennis
Jack Bowden
Beverley M. Shields
Trevelyan J. McKinley
the MASTERMIND Consortium
Publication date
01-12-2024
Publisher
BioMed Central
Keyword
Type 2 Diabetes
Published in
BMC Medical Informatics and Decision Making / Issue 1/2024
Electronic ISSN: 1472-6947
DOI
https://doi.org/10.1186/s12911-023-02400-3