Published in: BMC Medical Research Methodology 1/2009

Open Access 01-12-2009 | Research article

Optimal classifier selection and negative bias in error rate estimation: an empirical study on high-dimensional prediction

Authors: Anne-Laure Boulesteix, Carolin Strobl


Abstract

Background

In biometric practice, researchers often apply a large number of different methods in a "trial-and-error" strategy to get as much as possible out of their data and, due to publication pressure or pressure from the consulting customer, present only the most favorable results. This strategy may induce a substantial optimistic bias in prediction error estimation, which is quantitatively assessed in the present manuscript. The focus of our work is on class prediction based on high-dimensional data (e.g. microarray data), since such analyses are particularly exposed to this kind of bias.

Methods

In our study we consider a total of 124 variants of classifiers (possibly including variable selection or tuning steps) within a cross-validation evaluation scheme. The classifiers are applied to original and modified real microarray data sets, some of which are obtained by randomly permuting the class labels to mimic non-informative predictors while preserving their correlation structure.
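The label permutation step can be illustrated with a minimal sketch. It assumes a toy expression matrix with made-up values; the function name `permute_labels` and the variable names are hypothetical and are not taken from the study. Only the class labels are shuffled, so the predictors keep their values and their correlation structure while any association with the class is broken.

```python
import random

def permute_labels(labels, seed=0):
    """Return a shuffled copy of the class labels (hypothetical helper).

    The predictor matrix is not touched, so correlations between
    predictors are preserved; only the link to the class is destroyed.
    """
    rng = random.Random(seed)
    shuffled = list(labels)  # copy, so the original labels stay intact
    rng.shuffle(shuffled)
    return shuffled

# Toy data: 6 samples x 3 "genes" (illustrative values, not real measurements).
X = [
    [2.1, 0.3, 1.7],
    [1.9, 0.4, 1.5],
    [2.3, 0.2, 1.8],
    [0.7, 1.1, 0.2],
    [0.5, 1.3, 0.4],
    [0.8, 1.0, 0.1],
]
y = ["tumor", "tumor", "tumor", "normal", "normal", "normal"]

y_perm = permute_labels(y, seed=1)
print(y_perm)
```

The class frequencies are the same before and after permutation, so a classifier evaluated on `(X, y_perm)` has an expected error rate of about 50% in this balanced two-class setting.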

Results

We assess the minimal misclassification rate over the different classifier variants in order to quantify the bias that arises when the optimal classifier is selected a posteriori in a data-driven manner. The bias resulting from parameter tuning (including gene selection parameters as a special case) and the bias resulting from the choice of the classification method are examined both separately and jointly.
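The effect of reporting only the minimal error rate can be sketched with a small simulation. This is an illustration under stated assumptions, not the authors' procedure: it uses simulated Gaussian noise instead of permuted microarray data, and eight variants of a simple nearest-centroid classifier (differing only in the number of selected genes) instead of the 124 variants in the study. All function and variable names are hypothetical.

```python
import random

def make_noise_data(n=40, p=200, seed=0):
    """Simulate n samples with p pure-noise predictors and balanced random labels."""
    rng = random.Random(seed)
    X = [[rng.gauss(0.0, 1.0) for _ in range(p)] for _ in range(n)]
    y = [i % 2 for i in range(n)]
    rng.shuffle(y)
    return X, y

def class_means(X, y, idx):
    """Per-feature mean of each class, computed over the samples in idx."""
    p = len(X[0])
    sums = {0: [0.0] * p, 1: [0.0] * p}
    counts = {0: 0, 1: 0}
    for i in idx:
        counts[y[i]] += 1
        for j in range(p):
            sums[y[i]][j] += X[i][j]
    return {c: [s / max(counts[c], 1) for s in sums[c]] for c in (0, 1)}

def cv_error(X, y, k_genes, n_folds=5, seed=0):
    """Cross-validated error of a nearest-centroid rule on the top k_genes features.

    Gene selection is repeated inside each training fold, so each single
    error estimate is free of the selection bias caused by selecting genes
    on the full data set.
    """
    n = len(y)
    order = list(range(n))
    random.Random(seed).shuffle(order)
    folds = [order[f::n_folds] for f in range(n_folds)]
    n_errors = 0
    for test in folds:
        test_set = set(test)
        train = [i for i in order if i not in test_set]
        mu = class_means(X, y, train)
        # Rank features by absolute difference of the class means.
        ranked = sorted(range(len(X[0])), key=lambda j: -abs(mu[0][j] - mu[1][j]))
        genes = ranked[:k_genes]
        for i in test:
            d0 = sum((X[i][j] - mu[0][j]) ** 2 for j in genes)
            d1 = sum((X[i][j] - mu[1][j]) ** 2 for j in genes)
            pred = 0 if d0 <= d1 else 1
            n_errors += int(pred != y[i])
    return n_errors / n

X, y = make_noise_data()
variants = [1, 2, 5, 10, 20, 50, 100, 200]  # number of selected genes per variant
errors = {k: cv_error(X, y, k) for k in variants}
min_error = min(errors.values())
mean_error = sum(errors.values()) / len(errors)

for k in variants:
    print(f"top {k:3d} genes: CV error = {errors[k]:.3f}")
print(f"mean over variants: {mean_error:.3f}, minimum over variants: {min_error:.3f}")
```

Because the predictors carry no information, every variant has a true error rate of 50%. The minimum over the variants is never above the average of the individual estimates, and reporting it as the error rate of the "best" classifier is what produces the optimistic bias described above. Repeating the simulation with different seeds gives a rough idea of how far the minimum typically falls below 50%.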

Conclusions

The median minimal error rate over the investigated classifiers was as low as 31% and 41% based on permuted uninformative predictors from studies on colon cancer and prostate cancer, respectively. We conclude that the strategy of presenting only the optimal result is not acceptable because it yields a substantial bias in error rate estimation, and we suggest alternative approaches for properly reporting classification accuracy.
Metadata
Title
Optimal classifier selection and negative bias in error rate estimation: an empirical study on high-dimensional prediction
Authors
Anne-Laure Boulesteix
Carolin Strobl
Publication date
01-12-2009
Publisher
BioMed Central
Published in
BMC Medical Research Methodology / Issue 1/2009
Electronic ISSN: 1471-2288
DOI
https://doi.org/10.1186/1471-2288-9-85
