Published in: BMC Medical Research Methodology 1/2020

01-12-2020 | Influenza Vaccination | Research article

Prevalence estimation by joint use of big data and health survey: a demonstration study using electronic health records in New York city

Authors: Ryung S. Kim, Viswanathan Shankar

Abstract

Background

Electronic health records (EHR) have been increasingly used as a tool to monitor population health. However, subject-level errors in the records can yield biased estimates of health indicators. There is an urgent need for methods that estimate the prevalence of health indicators from large, real-time EHR while correcting for potential bias.

Methods

We demonstrate joint analyses of EHR and a smaller gold-standard health survey. We first adopted Mosteller’s method, which pools two estimators, one of which is potentially biased. It requires only the prevalence estimates from the two data sources and their standard errors. Then, we adopted the method of Schenker et al., which uses multiple imputation of the subject-level health outcomes that are missing for the subjects in EHR. This procedure requires information that links some subjects between the two sources, a model of the misclassification mechanism in EHR, and models of the inclusion probabilities for both sources.
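To make the pooling idea concrete, the following is a minimal Python sketch of one common form of a Mosteller-type composite estimator. It is an illustration under stated assumptions, not the authors' implementation: the function name `mosteller_pool`, the choice to estimate the squared EHR bias by the squared difference between the two estimates, and the numbers in the usage lines are all hypothetical. The sketch assumes the survey estimate is unbiased and the two estimates are independent.

```python
import math


def mosteller_pool(p_survey, se_survey, p_ehr, se_ehr):
    """Pool an unbiased survey estimate with a possibly biased EHR estimate.

    Assumed form (hypothetical helper, not taken from the article): the
    weight on the survey estimate minimizes the mean squared error of the
    combination, with the unknown squared EHR bias replaced by the squared
    difference between the two estimates.
    """
    if se_survey <= 0 or se_ehr < 0:
        raise ValueError("se_survey must be positive and se_ehr non-negative")

    v_survey = se_survey ** 2
    v_ehr = se_ehr ** 2
    bias_sq = (p_survey - p_ehr) ** 2   # plug-in estimate of squared EHR bias
    mse_ehr = v_ehr + bias_sq           # estimated MSE of the EHR estimate

    # MSE-minimizing weight on the survey estimate for independent estimators.
    w_survey = mse_ehr / (v_survey + mse_ehr)
    pooled = w_survey * p_survey + (1.0 - w_survey) * p_ehr

    # Rough plug-in root MSE of the pooled estimate, treating the weight as fixed.
    pooled_rmse = math.sqrt(v_survey * mse_ehr / (v_survey + mse_ehr))
    return pooled, w_survey, pooled_rmse


# Hypothetical inputs: a small survey (SE 0.02) and a large EHR sample (SE 0.005).
agree = mosteller_pool(0.30, 0.02, 0.30, 0.005)     # sources agree
disagree = mosteller_pool(0.30, 0.02, 0.40, 0.005)  # sources disagree
print("weight on survey when sources agree:   ", round(agree[1], 3))
print("weight on survey when sources disagree:", round(disagree[1], 3))
```

When the two sources agree, most of the weight goes to the more precise EHR estimate; when they disagree, the weight shifts toward the survey, which is how the survey guards against EHR bias in this sketch.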

Results

In a simulation study, both estimators yielded negligible bias even when EHR was biased. They performed as well as the health survey estimator when the EHR bias was large and better than the health survey estimator when the EHR bias was moderate. For the subject-level imputation estimator, it may be challenging to model the misclassification mechanism in real data. We illustrated the methods by analyzing six health indicators from the 2013–14 NYC HANES, the 2013 NYC Macroscope, and a study that linked some subjects in both data sources.

Conclusions

When a small gold-standard health survey exists, it can serve as a safeguard against potential bias in EHR through the joint analysis of the two sources.
Literature
1.
Paul MM, Greene CM, Newton-Dame R, Thorpe LE, Perlman SE, McVeigh KH, et al. The state of population health surveillance using electronic health records: a narrative review. Popul Health Manag. 2015;18(3):209–16.
2.
Newton-Dame R, McVeigh KH, Schreibstein L, Perlman S, Lurie-Moroni E, Jacobson L, et al. Design of the New York City Macroscope: innovations in population health surveillance using electronic health records. EGEMS (Washington, DC). 2016;4(1):1265.
3.
Thorpe LE, McVeigh KH, Perlman S, Chan PY, Bartley K, Schreibstein L, et al. Monitoring prevalence, treatment, and control of metabolic conditions in New York City adults using 2013 primary care electronic health records: a surveillance validation study. EGEMS (Washington, DC). 2016;4(1):1266.
4.
McVeigh KH, Newton-Dame R, Chan PY, Thorpe LE, Schreibstein L, Tatem KS, et al. Can electronic health records be used for population health surveillance? Validating population health metrics against established survey data. EGEMS (Washington, DC). 2016;4(1):1267.
5.
McVeigh KH, Lurie-Moroni E, Chan PY, Newton-Dame R, Schreibstein L, Tatem KS, et al. Generalizability of indicators from the New York city macroscope electronic health record surveillance system to systems based on other EHR platforms. EGEMS (Washington, DC). 2017;5(1):25.
6.
Thompson ME. International surveys: motives and methodologies. Surv Methodol. 2008;34(2):131–41.
7.
Lohr SL, Brick JM. Blending domain estimates from two victimization surveys with possible bias. Can J Stat. 2012;40(4):679–96.
8.
Manzi G, Spiegelhalter DJ, Turner RM, Flowers J, Thompson SG. Modelling bias in combining small area prevalence estimates from multiple surveys. J Royal Stat Soc Ser A. 2011;174:31–50.
9.
10.
Raghunathan TE, Xie D, Schenker N, Parsons VL, Davis WW, Dodd KW. Combining information from two surveys to estimate county-level prevalence rates of cancer risk factors and screening. J Am Stat Assoc. 2007;102(478):474–86.
11.
Ybarra LMR, Lohr SL. Small area estimation when auxiliary information is measured with error. Biometrika. 2008;95(4):919–31.
12.
Kim J, Rao J. Combining data from two independent surveys: a model-assisted approach. Biometrika. 2012;99(1):85–100.
13.
Park S, Kim JK, Stukel D. A measurement error model for survey data integration: combining information from two surveys. Metron. 2017;75:345–57.
14.
Schenker N, Raghunathan TE, Bondarenko I. Improving on analyses of self-reported data in a large-scale health survey by using information from an examination-based survey. Stat Med. 2010;29(5):533–45.
15.
Gelman A, King G, Liu C. Not asked and not answered: multiple imputation for multiple surveys: rejoinder. J Am Stat Assoc. 1998;93(443):869–74.
16.
He Y, Landrum MB, Zaslavsky AM. Combining information from two data sources with misreporting and incompleteness to assess hospice-use among cancer patients: a multiple imputation approach. Stat Med. 2014;20(33):3710–24.
18.
R Core Team. R: a language and environment for statistical computing; 2016.
19.
Wang Z, Kim JK, Yang S. Approximate Bayesian inference under informative sampling. Biometrika. 2017;105(1):91–102.
20.
Gelman A, Hill J. Data analysis using regression and multilevel/hierarchical models. Cambridge: Cambridge University Press; 2006.
21.
Rubin DB. Multiple imputation for nonresponse in surveys. New York: Wiley; 2006.
22.
Barnard J, Rubin DB. Small-sample degrees of freedom with multiple imputation. Biometrika. 1999;86(4):948–55.
23.
van Buuren S, Groothuis-Oudshoorn K. Mice: multivariate imputation by chained equations in R. J Stat Softw. 2011;45(3):67.
24.
Thorpe LE, Greene C, Freeman A, Snell E, Rodriguez-Lopez JS, Frankel M, et al. Rationale, design and respondent characteristics of the 2013-2014 New York City health and nutrition examination survey (NYC HANES 2013-2014). Prev Med Rep. 2015;2:580–5.
25.
26.
Valliant R. Poststratification and conditional variance estimation. J Am Stat Assoc. 1993;88(421):89–96.
27.
Chan PY, Zhao Y, Lim S, Perlman SE, McVeigh KH. Using calibration to reduce measurement error in prevalence estimates based on electronic health records. Prev Chronic Dis. 2018;15:E155.
28.
Raghunathan TE. Combining information from multiple surveys for assessing health disparities. Allg Stat Arch. 2006;90:515–26.
Metadata
Title
Prevalence estimation by joint use of big data and health survey: a demonstration study using electronic health records in New York city
Authors
Ryung S. Kim
Viswanathan Shankar
Publication date
01-12-2020
Publisher
BioMed Central
Published in
BMC Medical Research Methodology / Issue 1/2020
Electronic ISSN: 1471-2288
DOI
https://doi.org/10.1186/s12874-020-00956-6
