Published in: BMC Medical Research Methodology 1/2017

Open Access 01-12-2017 | Research Article

Simulation of complex data structures for planning of studies with focus on biomarker comparison

Authors: Andreas Schulz, Daniela Zöller, Stefan Nickels, Manfred E. Beutel, Maria Blettner, Philipp S. Wild, Harald Binder

Abstract

Background

A growing number of observational studies do not focus only on single biomarkers for predicting an outcome event but address questions in a multivariable setting. For example, when quantifying the added value of new biomarkers in addition to established risk factors, the aim might be to rank several new markers with respect to their prediction performance. This makes it important to consider the marker correlation structure when planning such a study. Because of this complexity, a simulation approach may be required to adequately assess sample size or other aspects, such as the choice of a performance measure.

Methods

In a simulation study based on real data, we investigated how to generate covariates with realistic distributions and which generating model should be used for the outcome, aiming to determine the minimum amount of information and complexity needed to obtain realistic results. A large epidemiological cohort study, the Gutenberg Health Study, served as the basis for the simulation. The added value of markers was quantified and ranked in data sets subsampled from this population, and simulation approaches were judged by the quality of the resulting ranking. One of the evaluated approaches, the random forest, requires original data at the individual level. We therefore also investigated the effect of the size of a pilot study on random forest based simulation.
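The general idea of generating correlated markers, drawing a binary outcome from a generating model, and checking how stable a marker ranking is across subsamples can be illustrated with a minimal sketch. It is not the authors' implementation: it uses the simple logistic generating model (the kind of baseline the study compares against), only two synthetic markers, and univariate AUC as a stand-in for added value. All function names, parameter values, and the seed are hypothetical choices made for illustration.

```python
import math
import random


def simulate_cohort(n, rho, betas, intercept, rng):
    """Draw n subjects: two standard-normal markers with correlation rho
    and a binary outcome from a logistic model (illustrative assumption)."""
    rows = []
    for _ in range(n):
        z1 = rng.gauss(0.0, 1.0)
        z2 = rng.gauss(0.0, 1.0)
        x1 = z1
        # Cholesky factor of a 2x2 correlation matrix gives corr(x1, x2) = rho
        x2 = rho * z1 + math.sqrt(1.0 - rho ** 2) * z2
        linear = intercept + betas[0] * x1 + betas[1] * x2
        p_event = 1.0 / (1.0 + math.exp(-linear))
        y = 1 if rng.random() < p_event else 0
        rows.append((x1, x2, y))
    return rows


def auc(scores, labels):
    """Mann-Whitney estimate of the AUC; ties between a case and a
    non-case count as one half."""
    cases = [s for s, y in zip(scores, labels) if y == 1]
    controls = [s for s, y in zip(scores, labels) if y == 0]
    if not cases or not controls:
        raise ValueError("need at least one case and one non-case")
    wins = sum((c > d) + 0.5 * (c == d) for c in cases for d in controls)
    return wins / (len(cases) * len(controls))


def ranking_agreement(population, n_sub, n_rep, rng):
    """Fraction of subsamples in which marker 1 has a higher AUC than marker 2."""
    hits = 0
    for _ in range(n_rep):
        sub = rng.sample(population, n_sub)
        labels = [row[2] for row in sub]
        auc_1 = auc([row[0] for row in sub], labels)
        auc_2 = auc([row[1] for row in sub], labels)
        hits += auc_1 > auc_2
    return hits / n_rep


rng = random.Random(2017)  # fixed seed so the sketch is reproducible
population = simulate_cohort(n=5000, rho=0.5, betas=(0.8, 0.3),
                             intercept=-1.0, rng=rng)
agreement = ranking_agreement(population, n_sub=250, n_rep=100, rng=rng)
print(f"marker 1 ranked first in {agreement:.0%} of subsamples")
```

Varying `n_sub` in such a sketch shows how the stability of the ranking depends on sample size. Replacing the logistic generating step with a model fitted to individual-level pilot data, such as a random forest, would move it closer to the approach the study evaluates.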

Results

We found that simple logistic regression models failed to generate adequately realistic data, even with extensions such as interaction terms or non-linear effects. The random forest approach proved more appropriate for the simulation of complex data structures. Pilot studies with about 250 or more observations provided a reasonable level of information for this approach.

Conclusions

We advise against using oversimplified regression models for simulation, in particular when focusing on multivariable research questions. More generally, a simulation should be based on real data in order to adequately reflect complex observational data structures, such as those found in epidemiological cohort studies.
Metadata
Title
Simulation of complex data structures for planning of studies with focus on biomarker comparison
Authors
Andreas Schulz
Daniela Zöller
Stefan Nickels
Manfred E. Beutel
Maria Blettner
Philipp S. Wild
Harald Binder
Publication date
01-12-2017
Publisher
BioMed Central
Published in
BMC Medical Research Methodology / Issue 1/2017
Electronic ISSN: 1471-2288
DOI
https://doi.org/10.1186/s12874-017-0364-y