ABSTRACT
Stroke is the third leading cause of death and the principal cause of serious long-term disability in the United States. Accurate prediction of stroke is highly valuable for early intervention and treatment. In this study, we compare the Cox proportional hazards model with a machine learning approach for stroke prediction on the Cardiovascular Health Study (CHS) dataset. Specifically, we consider the common problems of data imputation, feature selection, and prediction in medical datasets. We propose a novel automatic feature selection algorithm that selects robust features based on our proposed heuristic, the conservative mean. Combined with Support Vector Machines (SVMs), our feature selection algorithm achieves a greater area under the ROC curve (AUC) than both the Cox proportional hazards model and the L1-regularized Cox feature selection algorithm. Furthermore, we present a margin-based censored regression algorithm that combines the concept of margin-based classifiers with censored regression to achieve a better concordance index than the Cox model. Overall, our approach outperforms the current state of the art on both AUC and concordance index. In addition, our work identifies potential risk factors that have not been discovered by traditional approaches. Our method can be applied to clinical prediction of other diseases where missing data are common and risk factors are not well understood.
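For readers unfamiliar with the concordance index named above, the following is a minimal sketch of the standard Harrell's C for right-censored data. It is not the margin-based censored regression algorithm from the paper; the function name, argument names, and the choice to skip pairs with tied observed times are assumptions made for illustration only.

```python
from itertools import combinations


def concordance_index(times, events, risk_scores):
    """Harrell's C for right-censored data (illustrative sketch).

    times:       observed time per subject (event time or censoring time)
    events:      1 if the event was observed, 0 if the subject was censored
    risk_scores: higher score means higher predicted risk (earlier event)
    """
    concordant = 0.0
    comparable = 0
    for i, j in combinations(range(len(times)), 2):
        if times[i] == times[j]:
            continue  # simplification: pairs with tied times are skipped
        # Order the pair so that `a` has the shorter observed time.
        a, b = (i, j) if times[i] < times[j] else (j, i)
        if not events[a]:
            continue  # earlier subject was censored, so the pair is not comparable
        comparable += 1
        if risk_scores[a] > risk_scores[b]:
            concordant += 1.0  # higher risk assigned to the earlier event
        elif risk_scores[a] == risk_scores[b]:
            concordant += 0.5  # tied predictions count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable
```

A value of 1.0 means every comparable pair is ranked correctly, 0.5 corresponds to random ranking, and censored subjects contribute only to pairs in which the other subject had an earlier observed event.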
Index Terms
- An integrated machine learning approach to stroke prediction