Published in: BMC Medical Research Methodology 1/2017

Open Access 01-12-2017 | Research Article

Performance of Firth-and logF-type penalized methods in risk prediction for small or sparse binary data

Authors: M. Shafiqur Rahman, Mahbuba Sultana



Abstract

Background

When risk models for binary outcomes are developed from small or sparse data sets, standard logistic regression fitted by maximum likelihood estimation (MLE) faces several problems, including biased or infinite estimates of the regression coefficients and frequent convergence failure of the likelihood due to separation. Separation occurs commonly even when the sample size is large if there is a sufficient number of strong predictors. In the presence of separation, even if a model can be fitted, it is overfitted and has poor predictive performance. Firth- and logF-type penalized regression methods are popular alternatives to MLE, particularly for solving the separation problem. Despite their attractive advantages, their use in risk prediction is very limited. This paper evaluated these methods in risk prediction in comparison with MLE and other commonly used penalized methods such as ridge.
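
For orientation, the penalties named above are usually written as terms added to the log-likelihood. The forms below are the standard ones from the literature and are not quoted from this article; the notation ($\ell$, $I(\beta)$, $m$, $\lambda$, $p$) is assumed here for illustration.

```latex
\begin{align*}
\text{Firth:} \quad & \ell^{*}(\beta) = \ell(\beta) + \tfrac{1}{2}\log\lvert I(\beta)\rvert \\
\log F(m,m)\text{:} \quad & \ell^{*}(\beta) = \ell(\beta) + \sum_{j=1}^{p}\left[\tfrac{m}{2}\,\beta_j - m\log\!\left(1+e^{\beta_j}\right)\right] \\
\text{ridge:} \quad & \ell^{*}(\beta) = \ell(\beta) - \lambda\sum_{j=1}^{p}\beta_j^{2}
\end{align*}
```

Here $I(\beta)$ is the Fisher information matrix, the sums run over the $p$ non-intercept coefficients, and $\lambda \ge 0$ is a tuning parameter. The logF(m,m) term behaves like $-(m/2)\lvert\beta_j\rvert$ for large $\lvert\beta_j\rvert$, which is why the penalized estimate stays finite under separation.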

Methods

The predictive performance of the methods was evaluated by assessing calibration, discrimination and overall predictive performance in an extensive simulation study. In addition, the methods were illustrated using a real data example with a low prevalence of the outcome.
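
As a rough illustration of what these three aspects measure, the sketch below computes one common summary for each: the difference between the average predicted probability and the observed event rate (a simple calibration check), the c-statistic (discrimination) and the Brier score (overall performance). The choice of these particular measures, the function names and the toy data are assumptions made for illustration; the abstract does not state which measures the study used.

```python
def mean_prediction_bias(y, p):
    """Average predicted probability minus observed event rate."""
    return sum(p) / len(p) - sum(y) / len(y)


def c_statistic(y, p):
    """Share of event/non-event pairs ranked correctly (ties count 0.5)."""
    events = [pi for yi, pi in zip(y, p) if yi == 1]
    nonevents = [pi for yi, pi in zip(y, p) if yi == 0]
    score = 0.0
    for pe in events:
        for pn in nonevents:
            if pe > pn:
                score += 1.0
            elif pe == pn:
                score += 0.5
    return score / (len(events) * len(nonevents))


def brier_score(y, p):
    """Mean squared difference between outcome and predicted probability."""
    return sum((pi - yi) ** 2 for yi, pi in zip(y, p)) / len(y)


# Hypothetical outcomes and predicted risks, for illustration only.
y = [0, 0, 0, 1, 0, 1, 1, 1]
p = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.9]

bias = mean_prediction_bias(y, p)
auc = c_statistic(y, p)
brier = brier_score(y, p)
```

A c-statistic of 0.5 corresponds to no discrimination and 1 to perfect discrimination, while a lower Brier score indicates better overall performance.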

Results

MLE performed poorly in risk prediction for small or sparse data sets. All penalized methods offered some improvement in calibration, discrimination and overall predictive performance. Although the Firth- and logF-type methods showed almost equal improvement, Firth-type penalization produced some bias in the average predicted probability, and this bias was even larger than that produced by MLE. Of the logF(1,1) and logF(2,2) penalizations, logF(2,2) produced a slight bias in the estimated regression coefficient of a binary predictor, and logF(1,1) performed better in all aspects. Ridge also performed well in discrimination and overall predictive performance, but it often produced an underfitted model and had a high rate of convergence failure (even higher than that of MLE), probably due to the separation problem.

Conclusions

The logF-type penalized method, particularly logF(1,1), could be used in practice when developing risk models for small or sparse data sets.
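
To make the logF(1,1) idea concrete, the following is a minimal sketch, not the authors' implementation. It fits a logistic model with one binary predictor by Newton-Raphson, adding a logF(m,m) penalty to the slope only. Leaving the intercept unpenalized, the function names and the toy data set (in which every subject with x = 1 has the event, so the unpenalized slope estimate is infinite) are all assumptions made for illustration.

```python
import math


def expit(z):
    return 1.0 / (1.0 + math.exp(-z))


def penalized_loglik(b0, b1, x, y, m):
    """Logistic log-likelihood plus a logF(m, m) penalty on the slope b1."""
    ll = 0.0
    for xi, yi in zip(x, y):
        eta = b0 + b1 * xi
        ll += yi * eta - math.log1p(math.exp(eta))
    return ll + (m / 2.0) * b1 - m * math.log1p(math.exp(b1))


def fit_logf(x, y, m=1.0, tol=1e-10, max_iter=50):
    """Newton-Raphson for intercept b0 and slope b1; m=1 gives logF(1,1)."""
    b0, b1 = 0.0, 0.0
    for _ in range(max_iter):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi in zip(x, y):
            p = expit(b0 + b1 * xi)
            w = p * (1.0 - p)
            g0 += yi - p              # score for the intercept
            g1 += xi * (yi - p)       # score for the slope
            h00 += w                  # entries of the information matrix
            h01 += xi * w
            h11 += xi * xi * w
        q = expit(b1)
        g1 += m / 2.0 - m * q         # gradient of the logF penalty
        h11 += m * q * (1.0 - q)      # curvature of the logF penalty
        det = h00 * h11 - h01 * h01
        d0 = (h11 * g0 - h01 * g1) / det
        d1 = (h00 * g1 - h01 * g0) / det
        b0, b1 = b0 + d0, b1 + d1
        if max(abs(d0), abs(d1)) < tol:
            break
    return b0, b1


# Hypothetical data with separation: all subjects with x = 1 have y = 1.
x = [0, 0, 0, 0, 0, 1, 1, 1, 1]
y = [0, 0, 0, 1, 1, 1, 1, 1, 1]

b0_hat, b1_hat = fit_logf(x, y, m=1.0)

# Without the penalty (m = 0) the log-likelihood keeps rising with the slope.
b0_mle = math.log(2.0 / 3.0)  # logit of the event rate among x = 0
unpenalized = [penalized_loglik(b0_mle, b, x, y, 0.0) for b in (2.0, 5.0, 10.0)]
```

With the penalty, the fitted slope is finite, whereas the unpenalized log-likelihood values in `unpenalized` increase with the slope, which is the signature of separation.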
Metadata
Title
Performance of Firth-and logF-type penalized methods in risk prediction for small or sparse binary data
Authors
M. Shafiqur Rahman
Mahbuba Sultana
Publication date
01-12-2017
Publisher
BioMed Central
Published in
BMC Medical Research Methodology / Issue 1/2017
Electronic ISSN: 1471-2288
DOI
https://doi.org/10.1186/s12874-017-0313-9
