Published in: BMC Medical Research Methodology 1/2016

Open Access 01-12-2016 | Research Article

Comparison of non-parametric methods for ungrouping coarsely aggregated data

Authors: Silvia Rizzi, Mikael Thinggaard, Gerda Engholm, Niels Christensen, Tom Børge Johannesen, James W. Vaupel, Rune Lindahl-Jacobsen


Abstract

Background

Histograms are a common tool for estimating densities non-parametrically. They are widely used in the health sciences to summarize data in a compact format. Examples are age-specific distributions of deaths or disease onset grouped in 5-year age classes, with an open-ended age group at the highest ages. When histogram intervals are too coarse, information is lost and histograms with different boundaries are difficult to compare. In these cases it is useful to estimate detailed distributions from the grouped data.

Methods

From an extensive literature search we identify five methods for ungrouping count data. We compare the performance of two spline interpolation methods, two kernel density estimators and a penalized composite link model, first in a simulation study and then with empirical data obtained from the NORDCAN database. All methods analyzed can estimate differently shaped distributions, handle unequal interval lengths, and allow stretches of zero counts.
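
To make the idea of ungrouping concrete, the following is a minimal sketch of a penalized composite link model fitted by iteratively reweighted least squares. It is written in Python with NumPy rather than R and is not the authors' implementation. The function name `pclm_ungroup`, the toy counts, the unit-width latent bins, the identity basis, the hand-picked smoothing parameter `lam`, and the decision to close the last interval at a finite upper bound are all assumptions made for illustration.

```python
import numpy as np

def pclm_ungroup(counts, edges, lam=10.0, n_iter=100, tol=1e-8):
    """Estimate unit-width counts from grouped counts (illustrative sketch).

    counts : observed counts per group (length n)
    edges  : integer group boundaries (length n + 1); the last group is
             treated as closed at edges[-1], which is an assumption
    lam    : smoothing parameter, fixed by hand here
    """
    counts = np.asarray(counts, dtype=float)
    n = len(counts)
    m = edges[-1] - edges[0]                 # number of latent unit bins

    # Composition matrix: row i sums the latent bins that fall in group i.
    C = np.zeros((n, m))
    for i in range(n):
        C[i, edges[i] - edges[0]:edges[i + 1] - edges[0]] = 1.0

    # Second-order difference penalty on the log of the latent counts.
    D = np.diff(np.eye(m), n=2, axis=0)
    P = lam * D.T @ D

    # Start from a flat latent distribution with the observed total.
    beta = np.full(m, np.log(counts.sum() / m))
    for _ in range(n_iter):
        gamma = np.exp(beta)                 # latent expected counts
        mu = C @ gamma                       # expected grouped counts
        X = C * gamma                        # working design: C @ diag(gamma)
        W = 1.0 / mu                         # Poisson working weights
        lhs = X.T @ (W[:, None] * X) + P
        rhs = X.T @ (W * (counts - mu + X @ beta))
        beta_new = np.linalg.solve(lhs, rhs)
        converged = np.max(np.abs(beta_new - beta)) < tol
        beta = beta_new
        if converged:
            break
    return np.exp(beta)

# Toy data (made up for illustration): four 5-year groups and one wider
# last group standing in for an open-ended class closed at age 30.
edges = [0, 5, 10, 15, 20, 30]
counts = [30, 45, 60, 45, 40]
estimate = pclm_ungroup(counts, edges)
print(np.round(estimate, 2))
```

The sketch returns one estimated count per single-year bin. In practice the smoothing parameter would be chosen by a data-driven criterion rather than fixed, and the upper bound of an open-ended group has to be chosen by the analyst.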

Results

The methods perform similarly when the grouping scheme is relatively narrow, i.e. 5-year age classes. With coarser age intervals, i.e. in the presence of open-ended age groups, the penalized composite link model performs best.

Conclusion

We give an overview of methods for estimating detailed distributions from grouped count data and test them. Health researchers can benefit from these versatile methods, which are ready for use in the statistical software R. We recommend the penalized composite link model when data are grouped in wide age classes.
Metadata
Title
Comparison of non-parametric methods for ungrouping coarsely aggregated data
Authors
Silvia Rizzi
Mikael Thinggaard
Gerda Engholm
Niels Christensen
Tom Børge Johannesen
James W. Vaupel
Rune Lindahl-Jacobsen
Publication date
01-12-2016
Publisher
BioMed Central
Published in
BMC Medical Research Methodology / Issue 1/2016
Electronic ISSN: 1471-2288
DOI
https://doi.org/10.1186/s12874-016-0157-8
