Skip to main content
Top
Published in: BMC Medical Research Methodology 1/2021

Open Access 01-12-2021 | Research article

Consequences of ignoring clustering in linear regression

Authors: Georgia Ntani, Hazel Inskip, Clive Osmond, David Coggon

Published in: BMC Medical Research Methodology | Issue 1/2021

Login to get access

Abstract

Background

Clustering of observations is a common phenomenon in epidemiological and clinical research. Previous studies have highlighted the importance of using multilevel analysis to account for such clustering, but in practice, methods ignoring clustering are often employed. We used simulated data to explore the circumstances in which failure to account for clustering in linear regression could lead to importantly erroneous conclusions.

Methods

We simulated data following the random-intercept model specification under different scenarios of clustering of a continuous outcome and a single continuous or binary explanatory variable. We fitted random-intercept (RI) and ordinary least squares (OLS) models and compared effect estimates with the “true” value that had been used in simulation. We also assessed the relative precision of effect estimates, and explored the extent to which coverage by 95% confidence intervals and Type I error rates were appropriate.

Results

We found that effect estimates from both types of regression model were on average unbiased. However, deviations from the “true” value were greater when the outcome variable was more clustered. For a continuous explanatory variable, they tended also to be greater for the OLS than the RI model, and when the explanatory variable was less clustered. The precision of effect estimates from the OLS model was overestimated when the explanatory variable varied more between than within clusters, and was somewhat underestimated when the explanatory variable was less clustered. The cluster-unadjusted model gave poor coverage rates by 95% confidence intervals and high Type I error rates when the explanatory variable was continuous. With a binary explanatory variable, coverage rates by 95% confidence intervals and Type I error rates deviated from nominal values when the outcome variable was more clustered, but the direction of the deviation varied according to the overall prevalence of the explanatory variable, and the extent to which it was clustered.

Conclusions

In this study we identified circumstances in which application of an OLS regression model to clustered data is more likely to mislead statistical inference. The potential for error is greatest when the explanatory variable is continuous, and the outcome variable more clustered (intraclass correlation coefficient is ≥ 0.01).
Appendix
Available only for authorised users
Literature
1.
go back to reference Rabe-Hesketh S, Skrondal A. Multilevel and longitudinal modeling using stata. USA: Taylor & Francis; 2005. Rabe-Hesketh S, Skrondal A. Multilevel and longitudinal modeling using stata. USA: Taylor & Francis; 2005.
2.
go back to reference Stimson JA. Regression in space and time: a statistical essay. Am J Pol Sci. 1985;29(4):914–47.CrossRef Stimson JA. Regression in space and time: a statistical essay. Am J Pol Sci. 1985;29(4):914–47.CrossRef
3.
go back to reference Bingenheimer JB, Raudenbush SW. Statistical and substantive inferences in public health: issues in the application of multilevel models. Annu Rev Public Health. 2004;25:53–77.CrossRef Bingenheimer JB, Raudenbush SW. Statistical and substantive inferences in public health: issues in the application of multilevel models. Annu Rev Public Health. 2004;25:53–77.CrossRef
4.
go back to reference Goldstein H. Multilevel statistical models. United Kingdom: Wiley; 2011. Goldstein H. Multilevel statistical models. United Kingdom: Wiley; 2011.
5.
go back to reference McNeish D, Kelley K. Fixed effects models versus mixed effects models for clustered data: reviewing the approaches, disentangling the differences, and making recommendations. Psychol Methods. 2019;24(1):20–35.CrossRef McNeish D, Kelley K. Fixed effects models versus mixed effects models for clustered data: reviewing the approaches, disentangling the differences, and making recommendations. Psychol Methods. 2019;24(1):20–35.CrossRef
6.
go back to reference Bland JM. Cluster randomised trials in the medical literature: two bibliometric surveys. BMC Med Res Methodol. 2004;4(1):21.CrossRef Bland JM. Cluster randomised trials in the medical literature: two bibliometric surveys. BMC Med Res Methodol. 2004;4(1):21.CrossRef
7.
go back to reference Crits-Christoph P, Mintz J. Implications of therapist effects for the design and analysis of comparative studies of psychotherapies. J Consult Clin Psychol. 1991;59(1):20.CrossRef Crits-Christoph P, Mintz J. Implications of therapist effects for the design and analysis of comparative studies of psychotherapies. J Consult Clin Psychol. 1991;59(1):20.CrossRef
8.
go back to reference Lee KJ, Thompson SG. Clustering by health professional in individually randomised trials. BMJ (Clinical research ed). 2005;330(7483):142–4.CrossRef Lee KJ, Thompson SG. Clustering by health professional in individually randomised trials. BMJ (Clinical research ed). 2005;330(7483):142–4.CrossRef
9.
go back to reference Simpson JM, Klar N, Donnor A. Accounting for cluster randomization: a review of primary prevention trials, 1990 through 1993. Am J Public Health. 1995;85(10):1378–83.CrossRef Simpson JM, Klar N, Donnor A. Accounting for cluster randomization: a review of primary prevention trials, 1990 through 1993. Am J Public Health. 1995;85(10):1378–83.CrossRef
10.
go back to reference Biau DJ, Halm JA, Ahmadieh H, Capello WN, Jeekel J, Boutron I, et al. Provider and center effect in multicenter randomized controlled trials of surgical specialties: an analysis on patient-level data. Ann Surg. 2008;247(5):892–8.CrossRef Biau DJ, Halm JA, Ahmadieh H, Capello WN, Jeekel J, Boutron I, et al. Provider and center effect in multicenter randomized controlled trials of surgical specialties: an analysis on patient-level data. Ann Surg. 2008;247(5):892–8.CrossRef
11.
go back to reference Oltean H, Gagnier JJ. Use of clustering analysis in randomized controlled trials in orthopaedic surgery. BMC Med Res Methodol. 2015;15(1):1–8.CrossRef Oltean H, Gagnier JJ. Use of clustering analysis in randomized controlled trials in orthopaedic surgery. BMC Med Res Methodol. 2015;15(1):1–8.CrossRef
12.
go back to reference Diaz-Ordaz K, Froud R, Sheehan B, Eldridge S. A systematic review of cluster randomised trials in residential facilities for older people suggests how to improve quality. BMC Med Res Methodol. 2013;13(1):1–10.CrossRef Diaz-Ordaz K, Froud R, Sheehan B, Eldridge S. A systematic review of cluster randomised trials in residential facilities for older people suggests how to improve quality. BMC Med Res Methodol. 2013;13(1):1–10.CrossRef
13.
go back to reference Goldstein H. Multilevel mixed linear model analysis using iterative generalized least squares. Biometrika. 1986;73(1):43–56.CrossRef Goldstein H. Multilevel mixed linear model analysis using iterative generalized least squares. Biometrika. 1986;73(1):43–56.CrossRef
14.
go back to reference Astin AW, Denson N. Multi-campus studies of college impact: which statistical method is appropriate? Res High Educ. 2009;50(4):354–67.CrossRef Astin AW, Denson N. Multi-campus studies of college impact: which statistical method is appropriate? Res High Educ. 2009;50(4):354–67.CrossRef
15.
go back to reference Grieve R, Nixon R, Thompson SG, Normand C. Using multilevel models for assessing the variability of multinational resource use and cost data. Health Econ. 2005;14(2):185–96.CrossRef Grieve R, Nixon R, Thompson SG, Normand C. Using multilevel models for assessing the variability of multinational resource use and cost data. Health Econ. 2005;14(2):185–96.CrossRef
16.
go back to reference Niehaus E, Campbell C, Inkelas K. HLM behind the curtain: unveiling decisions behind the use and interpretation of HLM in higher education research. Res High Educ. 2014;55(1):101–22.CrossRef Niehaus E, Campbell C, Inkelas K. HLM behind the curtain: unveiling decisions behind the use and interpretation of HLM in higher education research. Res High Educ. 2014;55(1):101–22.CrossRef
17.
go back to reference Steenbergen MR, Jones BS. Modeling multilevel data structures. Am J Pol Sci. 2002;46(1):218–37. Steenbergen MR, Jones BS. Modeling multilevel data structures. Am J Pol Sci. 2002;46(1):218–37.
18.
go back to reference Wendel-Vos GCW, van Hooijdonk C, Uitenbroek D, Agyemang C, Lindeman EM, Droomers M. Environmental attributes related to walking and bicycling at the individual and contextual level. J Epidemiol Community Health. 2008;62(8):689–94.CrossRef Wendel-Vos GCW, van Hooijdonk C, Uitenbroek D, Agyemang C, Lindeman EM, Droomers M. Environmental attributes related to walking and bicycling at the individual and contextual level. J Epidemiol Community Health. 2008;62(8):689–94.CrossRef
19.
go back to reference Walters SJ. Therapist effects in randomised controlled trials: what to do about them. J Clin Nurs. 2010;19(7–8):1102–12.CrossRef Walters SJ. Therapist effects in randomised controlled trials: what to do about them. J Clin Nurs. 2010;19(7–8):1102–12.CrossRef
20.
go back to reference Park S, Lake ET. Multilevel modeling of a clustered continuous outcome: nurses’ work hours and burnout. Nurs Res. 2005;54(6):406–13.CrossRef Park S, Lake ET. Multilevel modeling of a clustered continuous outcome: nurses’ work hours and burnout. Nurs Res. 2005;54(6):406–13.CrossRef
21.
go back to reference Newman D, Newman I, Salzman J. Comparing OLS and HLM models and the questions they answer: potential concerns for type VI errors. Mult Linear Regression Viewpoints. 2010;36(1):1–8. Newman D, Newman I, Salzman J. Comparing OLS and HLM models and the questions they answer: potential concerns for type VI errors. Mult Linear Regression Viewpoints. 2010;36(1):1–8.
22.
go back to reference Clarke P. When can group level clustering be ignored? Multilevel models versus single-level models with sparse data. J Epidemiol Community Health. 2008;62(8):752–8.CrossRef Clarke P. When can group level clustering be ignored? Multilevel models versus single-level models with sparse data. J Epidemiol Community Health. 2008;62(8):752–8.CrossRef
23.
go back to reference Bradburn MJ, Deeks JJ, Berlin JA, Russell LA. Much ado about nothing: a comparison of the performance of meta-analytical methods with rare events. Stat Med. 2007;26(1):53–77.CrossRef Bradburn MJ, Deeks JJ, Berlin JA, Russell LA. Much ado about nothing: a comparison of the performance of meta-analytical methods with rare events. Stat Med. 2007;26(1):53–77.CrossRef
24.
go back to reference Nevalainen J, Datta S, Oja H. Inference on the marginal distribution of clustered data with informative cluster size. Stat Pap. 2014;55(1):71–92.CrossRef Nevalainen J, Datta S, Oja H. Inference on the marginal distribution of clustered data with informative cluster size. Stat Pap. 2014;55(1):71–92.CrossRef
25.
go back to reference Huang FL. Alternatives to multilevel modeling for the analysis of clustered data. J Exp Educ. 2016;84(1):175–96.CrossRef Huang FL. Alternatives to multilevel modeling for the analysis of clustered data. J Exp Educ. 2016;84(1):175–96.CrossRef
26.
go back to reference Chu R, Thabane L, Ma J, Holbrook A, Pullenayegum E, Devereaux PJ. Comparing methods to estimate treatment effects on a continuous outcome in multicentre randomized controlled trials: a simulation study. BMC Med Res Methodol. 2011;11(1):1.CrossRef Chu R, Thabane L, Ma J, Holbrook A, Pullenayegum E, Devereaux PJ. Comparing methods to estimate treatment effects on a continuous outcome in multicentre randomized controlled trials: a simulation study. BMC Med Res Methodol. 2011;11(1):1.CrossRef
27.
go back to reference Galbraith S, Daniel J, Vissel B. A study of clustered data and approaches to its analysis. J Neurosci. 2010;30(32):10601–8.CrossRef Galbraith S, Daniel J, Vissel B. A study of clustered data and approaches to its analysis. J Neurosci. 2010;30(32):10601–8.CrossRef
28.
go back to reference Kahan BC, Morris TP. Assessing potential sources of clustering in individually randomised trials. BMC Med Res Methodol. 2013;13(1):58.CrossRef Kahan BC, Morris TP. Assessing potential sources of clustering in individually randomised trials. BMC Med Res Methodol. 2013;13(1):58.CrossRef
29.
go back to reference Arceneaux K, Nickerson DW. Modeling certainty with clustered data: a comparison of methods. Polit Anal. 2009;17(2):177–90.CrossRef Arceneaux K, Nickerson DW. Modeling certainty with clustered data: a comparison of methods. Polit Anal. 2009;17(2):177–90.CrossRef
30.
go back to reference Scott AJ, Holt D. The effect of two-stage sampling on ordinary least squares methods. J Am Stat Assoc. 1982;77(380):848–54.CrossRef Scott AJ, Holt D. The effect of two-stage sampling on ordinary least squares methods. J Am Stat Assoc. 1982;77(380):848–54.CrossRef
31.
go back to reference Barrios T, Diamond R, Imbens GW, Koleśar M. Clustering, spatial correlations, and randomization inference. J Am Stat Assoc. 2012;107(498):578–91.CrossRef Barrios T, Diamond R, Imbens GW, Koleśar M. Clustering, spatial correlations, and randomization inference. J Am Stat Assoc. 2012;107(498):578–91.CrossRef
32.
go back to reference Seaman S, Pavlou M, Copas A. Review of methods for handling confounding by cluster and informative cluster size in clustered data. Stat Med. 2014;33(30):5371–87.CrossRef Seaman S, Pavlou M, Copas A. Review of methods for handling confounding by cluster and informative cluster size in clustered data. Stat Med. 2014;33(30):5371–87.CrossRef
33.
go back to reference Maas CJ, Hox JJ. The influence of violations of assumptions on multilevel parameter estimates and their standard errors. Comput Stat Data Anal. 2004;46(3):427–40.CrossRef Maas CJ, Hox JJ. The influence of violations of assumptions on multilevel parameter estimates and their standard errors. Comput Stat Data Anal. 2004;46(3):427–40.CrossRef
34.
go back to reference Dickinson LM, Basu A. Multilevel modeling and practice-based research. Ann Fam Med. 2005;3(suppl 1):S52–60.CrossRef Dickinson LM, Basu A. Multilevel modeling and practice-based research. Ann Fam Med. 2005;3(suppl 1):S52–60.CrossRef
35.
go back to reference Austin PC, Goel V, van Walraven C. An introduction to multilevel regression models. Can J Public Health. 2001;92(2):150.CrossRef Austin PC, Goel V, van Walraven C. An introduction to multilevel regression models. Can J Public Health. 2001;92(2):150.CrossRef
36.
go back to reference Lemeshow S, Letenneur L, Dartigues JF, Lafont S, Orgogozo JM, Commenges D. Illustration of analysis taking into account complex survey considerations: the association between wine consumption and dementia in the PAQUID study. Am J Epidemiol. 1998;148(3):298–306.CrossRef Lemeshow S, Letenneur L, Dartigues JF, Lafont S, Orgogozo JM, Commenges D. Illustration of analysis taking into account complex survey considerations: the association between wine consumption and dementia in the PAQUID study. Am J Epidemiol. 1998;148(3):298–306.CrossRef
37.
go back to reference Roberts C, Roberts SA. Design and analysis of clinical trials with clustering effects due to treatment. Clin Trials. 2005;2(2):152–62.CrossRef Roberts C, Roberts SA. Design and analysis of clinical trials with clustering effects due to treatment. Clin Trials. 2005;2(2):152–62.CrossRef
38.
go back to reference Maas CJ, Hox JJ. Sufficient sample sizes for multilevel modeling. Methodology. 2005;1:3–86.CrossRef Maas CJ, Hox JJ. Sufficient sample sizes for multilevel modeling. Methodology. 2005;1:3–86.CrossRef
39.
go back to reference Chuang JH, Hripcsak G, Heitjan DF. Design and analysis of controlled trials in naturally clustered environments: implications for medical informatics. JAMIA. 2002;9(3):230–8.PubMedPubMedCentral Chuang JH, Hripcsak G, Heitjan DF. Design and analysis of controlled trials in naturally clustered environments: implications for medical informatics. JAMIA. 2002;9(3):230–8.PubMedPubMedCentral
40.
go back to reference Sainani K. The importance of accounting for correlated observations. PM&R. 2010;2(9):858–61.CrossRef Sainani K. The importance of accounting for correlated observations. PM&R. 2010;2(9):858–61.CrossRef
41.
go back to reference Jones K. Do multilevel models ever give different results? 2009. Jones K. Do multilevel models ever give different results? 2009.
42.
go back to reference Hedeker D, McMahon SD, Jason LA, Salina D. Analysis of clustered data in community psychology: with an example from a worksite smoking cessation project. Am J Community Psychol. 1994;22(5):595–615.CrossRef Hedeker D, McMahon SD, Jason LA, Salina D. Analysis of clustered data in community psychology: with an example from a worksite smoking cessation project. Am J Community Psychol. 1994;22(5):595–615.CrossRef
43.
go back to reference Bliese PD, Hanges PJ. Being both too liberal and too conservative: the perils of treating grouped data as though they were independent. Organ Res Methods. 2004;7(4):400–17.CrossRef Bliese PD, Hanges PJ. Being both too liberal and too conservative: the perils of treating grouped data as though they were independent. Organ Res Methods. 2004;7(4):400–17.CrossRef
Metadata
Title
Consequences of ignoring clustering in linear regression
Authors
Georgia Ntani
Hazel Inskip
Clive Osmond
David Coggon
Publication date
01-12-2021
Publisher
BioMed Central
Published in
BMC Medical Research Methodology / Issue 1/2021
Electronic ISSN: 1471-2288
DOI
https://doi.org/10.1186/s12874-021-01333-7

Other articles of this Issue 1/2021

BMC Medical Research Methodology 1/2021 Go to the issue