Published in: BMC Medical Informatics and Decision Making 1/2015

Open Access 01-12-2015 | Research Article

Improving prevalence estimation through data fusion: methods and validation

Authors: Tomàs Aluja-Banet, Josep Daunis-i-Estadella, Núria Brunsó, Anna Mompart-Penina

Abstract

Background

Estimation of health prevalences is usually performed with a single survey, although some attempts have been made to integrate more than one source of data. We propose here to validate this approach through data fusion. Data fusion is the process of integrating two sources of data into one combined file, which allows us to take greater advantage of existing information collected in databases. Here, we use data fusion to improve the estimation of health prevalences for two primary health factors: cardiovascular diseases and diabetes.

Methods

We use a real data fusion operation on population health, where the imputation of basic health risk factors is used to enrich a large-scale survey on self-reported health status. We propose choosing the imputation methodology for this problem through a suite of validation statistics that assess the quality of the fused data. The compared imputation techniques have been chosen from among the main imputation methodologies: k-nearest neighbor, probabilistic modeling and regression. We use the 2006 Health Survey of Catalonia, which provides a complete report of the perceived health status. In order to deal with the uncertainty problem, we compare these methodologies under the single and multiple imputation frames.
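As an illustration of the nearest neighbor family mentioned above, the following is a minimal sketch, not the authors' implementation. It assumes a donor file in which both the common variables and a binary risk factor are observed, and a recipient file with only the common variables. The function names `knn_impute` and `multiple_impute`, the Euclidean distance, and the random draw among the k nearest donors are all assumptions made for the example.

```python
import random
from math import dist

def knn_impute(donors, recipients, k=3, rng=None):
    """Impute one value per recipient from its k nearest donors.

    donors: list of (common_vars, observed_value) pairs (hypothetical layout)
    recipients: list of common_vars tuples whose risk factor is unobserved
    """
    rng = rng or random.Random()
    imputed = []
    for x in recipients:
        # Rank donors by Euclidean distance on the common variables.
        nearest = sorted(donors, key=lambda d: dist(d[0], x))[:k]
        # Random draw among the k nearest donors, so repeated calls can differ.
        imputed.append(rng.choice(nearest)[1])
    return imputed

def multiple_impute(donors, recipients, m=5, k=3, seed=0):
    """Return m imputed copies of the recipient file (multiple imputation)."""
    rng = random.Random(seed)
    return [knn_impute(donors, recipients, k, rng) for _ in range(m)]

# Toy data: two well-separated groups of donors.
donors = [((0.0, 0.0), 0), ((0.1, 0.0), 0), ((0.0, 0.1), 0),
          ((5.0, 5.0), 1), ((5.1, 5.0), 1), ((5.0, 5.1), 1)]
recipients = [(0.05, 0.05), (5.05, 5.05)]
copies = multiple_impute(donors, recipients, m=3, k=3)
```

Setting `m=1` corresponds to single imputation. With `m > 1`, the spread between the imputed copies reflects part of the imputation uncertainty.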

Results

A suite of validation statistics allows us to discern the strengths and weaknesses of the studied imputation methods. Multiple imputation outperforms single imputation by providing better and much more stable estimates, according to the computed validation statistics. The summarized results indicate that the probabilistic methods preserve the multivariate structure better; sequential regression methods deliver greater accuracy of imputed data; and nearest neighbor methods end up with a more realistic distribution of imputed data.
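One common way to turn several imputed data sets into a single prevalence estimate is the combining rule usually called Rubin's rules. The sketch below is illustrative only and is not taken from the paper. The function name `pool_prevalence` is hypothetical, and the within-imputation variance p(1 - p)/n assumes simple random sampling, which a complex survey design would not satisfy.

```python
from statistics import mean, variance

def pool_prevalence(imputed_sets):
    """Pool a prevalence over m >= 2 imputed data sets of 0/1 values.

    Returns (pooled estimate, total variance).
    """
    m = len(imputed_sets)
    # Prevalence estimated separately in each imputed data set.
    estimates = [mean(s) for s in imputed_sets]
    # Within-imputation variance, assuming simple random sampling.
    within = [p * (1 - p) / len(s) for p, s in zip(estimates, imputed_sets)]
    q_bar = mean(estimates)        # pooled point estimate
    w_bar = mean(within)           # average within-imputation variance
    b = variance(estimates)        # between-imputation variance (m - 1 denominator)
    total = w_bar + (1 + 1 / m) * b
    return q_bar, total

# Three imputed copies of a four-person file (toy values).
q, t = pool_prevalence([[1, 0, 0, 0], [1, 1, 0, 0], [0, 0, 0, 0]])
```

The between-imputation term is what single imputation cannot provide: with only one imputed data set, the variance estimate ignores the uncertainty introduced by the imputation itself.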

Conclusions

Data fusion allows us to integrate two sources of information in order to take greater advantage of the available data. Multiple imputed sequential regression models have the advantage of greater interpretability and can be used for health policy. Under certain conditions, more accurate estimates of the prevalences can be obtained using fused data (the original data plus the imputed data) than by using only the observed data.
Metadata
Title
Improving prevalence estimation through data fusion: methods and validation
Authors
Tomàs Aluja-Banet
Josep Daunis-i-Estadella
Núria Brunsó
Anna Mompart-Penina
Publication date
01-12-2015
Publisher
BioMed Central
Published in
BMC Medical Informatics and Decision Making / Issue 1/2015
Electronic ISSN: 1472-6947
DOI
https://doi.org/10.1186/s12911-015-0169-z
