DOI: 10.1145/1835804.1835830 (KDD conference proceedings)
research-article

An integrated machine learning approach to stroke prediction

Published: 25 July 2010

ABSTRACT

Stroke is the third leading cause of death and the principal cause of serious long-term disability in the United States. Accurate prediction of stroke is highly valuable for early intervention and treatment. In this study, we compare the Cox proportional hazards model with a machine learning approach for stroke prediction on the Cardiovascular Health Study (CHS) dataset. Specifically, we consider the common problems of data imputation, feature selection, and prediction in medical datasets. We propose a novel automatic feature selection algorithm that selects robust features based on our proposed heuristic: conservative mean. Combined with Support Vector Machines (SVMs), our proposed feature selection algorithm achieves a greater area under the ROC curve (AUC) than both the Cox proportional hazards model and the L1-regularized Cox feature selection algorithm. Furthermore, we present a margin-based censored regression algorithm that combines the concept of margin-based classifiers with censored regression to achieve a better concordance index than the Cox model. Overall, our approach outperforms the current state of the art on both metrics, AUC and concordance index. In addition, our work has identified potential risk factors that have not been discovered by traditional approaches. Our method can be applied to clinical prediction of other diseases, where missing data are common and risk factors are not well understood.
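
For readers unfamiliar with the concordance index mentioned in the abstract, the following is a minimal sketch of the standard pairwise definition (often called Harrell's C) for right-censored data. It is not taken from the paper: the function name, argument names, and the simplified handling of tied times are assumptions made for illustration. A pair of subjects is treated as comparable only when the subject with the shorter observed time actually had the event, and the index is the fraction of comparable pairs in which that subject was assigned the higher risk score.

```python
from itertools import combinations


def concordance_index(times, events, risk_scores):
    """Fraction of comparable pairs that the risk scores order correctly.

    times       -- observed follow-up time per subject
    events      -- 1 if the event (e.g. stroke) was observed, 0 if censored
    risk_scores -- predicted risk; higher should mean an earlier event
    All names here are illustrative, not from the paper.
    """
    concordant = 0.0
    comparable = 0
    for i, j in combinations(range(len(times)), 2):
        if times[i] == times[j]:
            continue  # tied times are skipped in this simplified sketch
        # a is the subject with the shorter observed time
        a, b = (i, j) if times[i] < times[j] else (j, i)
        if not events[a]:
            continue  # earlier subject was censored: ordering is unknown
        comparable += 1
        if risk_scores[a] > risk_scores[b]:
            concordant += 1.0
        elif risk_scores[a] == risk_scores[b]:
            concordant += 0.5  # tied predictions count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable
```

A value of 1.0 means every comparable pair is ranked correctly, and 0.5 corresponds to random ordering. For a binary outcome with no censoring, the same pairwise count reduces to the AUC, which is why the two metrics are often reported side by side.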

Supplemental Material

kdd2010_lee_imla_01.mov (MOV, 74.3 MB)

