Published in: Breast Cancer Research 1/2010

Open Access 01-02-2010 | Research article

Effect of training-sample size and classification difficulty on the accuracy of genomic predictors

Authors: Vlad Popovici, Weijie Chen, Brandon D Gallas, Christos Hatzis, Weiwei Shi, Frank W Samuelson, Yuri Nikolsky, Marina Tsyganova, Alex Ishkin, Tatiana Nikolskaya, Kenneth R Hess, Vicente Valero, Daniel Booser, Mauro Delorenzi, Gabriel N Hortobagyi, Leming Shi, W Fraser Symmans, Lajos Pusztai


Abstract

Introduction

As part of the MicroArray Quality Control (MAQC)-II project, this analysis examines how the choice of univariate feature-selection methods and classification algorithms may influence the performance of genomic predictors under varying degrees of prediction difficulty represented by three clinically relevant endpoints.

Methods

We used gene-expression data from 230 breast cancers (grouped into training and independent validation sets), and we examined 40 predictors (five univariate feature-selection methods combined with eight different classifiers) for each of the three endpoints. Their classification performance was estimated on the training set by using two different resampling methods and compared with the accuracy observed in the independent validation set.
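The counts in this design can be checked with a short enumeration. The sketch below is illustrative only: the abstract gives the number of feature-selection methods, classifiers, and endpoints but not their names, so the labels used here are placeholders.

```python
from itertools import product

# Placeholder labels (assumption): the abstract states only the counts,
# not the names of the methods, classifiers, or endpoints.
feature_selectors = [f"feature_selection_{i}" for i in range(1, 6)]  # 5 methods
classifiers = [f"classifier_{j}" for j in range(1, 9)]               # 8 classifiers
endpoints = [f"endpoint_{k}" for k in range(1, 4)]                   # 3 endpoints

# One predictor = one (feature-selection method, classifier) pair.
predictors = list(product(feature_selectors, classifiers))

# One model = one predictor trained for one endpoint.
models = [(endpoint, fs, clf) for endpoint in endpoints for fs, clf in predictors]

print(len(predictors))  # 40
print(len(models))      # 120
```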

Results

The three classification problems were ranked by difficulty, and the performance of 120 models (40 predictors for each of the three endpoints) was estimated on the training set and assessed on an independent validation set. The bootstrapping estimates were closer to the validation performance than were the cross-validation estimates. The required sample size for each endpoint was estimated, and both gene-level and pathway-level analyses were performed on the resulting models.
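To make the two kinds of resampling estimate concrete, the following is a minimal sketch of k-fold cross-validation and a plain out-of-bag bootstrap. It is not the authors' procedure: the abstract does not state which bootstrap or cross-validation variants were used, and the toy data, the threshold classifier, the accuracy score, and all function names are assumptions made for illustration.

```python
import random
from statistics import mean

def kfold_splits(n, k, rng):
    """Yield (train, test) index lists; every sample is tested exactly once."""
    indices = list(range(n))
    rng.shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

def bootstrap_splits(n, n_rounds, rng):
    """Yield (train, test) index lists; train is drawn with replacement and
    test is the out-of-bag remainder."""
    for _ in range(n_rounds):
        train = [rng.randrange(n) for _ in range(n)]
        in_bag = set(train)
        test = [idx for idx in range(n) if idx not in in_bag]
        if test:  # skip the rare round that leaves no out-of-bag samples
            yield train, test

def resampled_estimate(splits, fit_and_score):
    """Average a user-supplied score over all (train, test) splits."""
    scores = [fit_and_score(train, test) for train, test in splits]
    return sum(scores) / len(scores)

# Toy data (assumption): one feature, two balanced classes with shifted means.
rng = random.Random(0)
n = 60
labels = [i % 2 for i in range(n)]
values = [rng.gauss(1.0 if y else 0.0, 1.0) for y in labels]

def fit_and_score(train, test):
    # Fit on the training indices only: threshold halfway between class means.
    # Assumes both classes occur in the training indices, as they do here.
    mean0 = mean(values[i] for i in train if labels[i] == 0)
    mean1 = mean(values[i] for i in train if labels[i] == 1)
    threshold = (mean0 + mean1) / 2
    predictions = [int((values[i] > threshold) == (mean1 > mean0)) for i in test]
    # Score on the held-out indices: fraction classified correctly.
    return mean(int(p == labels[i]) for p, i in zip(predictions, test))

cv_estimate = resampled_estimate(kfold_splits(n, 5, rng), fit_and_score)
boot_estimate = resampled_estimate(bootstrap_splits(n, 200, rng), fit_and_score)
print(f"5-fold CV accuracy:          {cv_estimate:.3f}")
print(f"out-of-bag bootstrap accuracy: {boot_estimate:.3f}")
```

In both schemes, every fitting step (here, the threshold; in a genomic setting, feature selection as well) uses only the training indices, so the held-out samples play the role that the independent validation set plays in the study.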

Conclusions

We showed that genomic predictor accuracy is determined largely by an interplay between sample size and classification difficulty. Variations in univariate feature-selection methods and the choice of classification algorithm have only a modest impact on predictor performance, and several statistically equally good predictors can be developed for any given classification problem.
Metadata
Publication date: 01-02-2010
Publisher: BioMed Central
Published in: Breast Cancer Research | Issue 1/2010
Electronic ISSN: 1465-542X
DOI: https://doi.org/10.1186/bcr2468
