research-article

An integrated machine learning approach to stroke prediction

Authors:
Aditya Khosla (Stanford University, Stanford, USA)
Yu Cao (Stanford University, Stanford, USA)
Cliff Chiung-Yu Lin (Stanford University, Stanford, USA)
Hsu-Kuang Chiu (Stanford University, Stanford, USA)
Junling Hu (eBay Inc, San Jose, USA)
Honglak Lee (Stanford University, Stanford, USA)

KDD '10: Proceedings of the 16th ACM SIGKDD international conference on Knowledge discovery and data mining, July 2010, Pages 183–192. https://doi.org/10.1145/1835804.1835830

Published: 25 July 2010

ABSTRACT

Stroke is the third leading cause of death and the principal cause of serious long-term disability in the United States. Accurate prediction of stroke is highly valuable for early intervention and treatment. In this study, we compare the Cox proportional hazards model with a machine learning approach for stroke prediction on the Cardiovascular Health Study (CHS) dataset. Specifically, we consider the common problems of data imputation, feature selection, and prediction in medical datasets. We propose a novel automatic feature selection algorithm that selects robust features based on our proposed heuristic: conservative mean. Combined with Support Vector Machines (SVMs), our proposed feature selection algorithm achieves a higher area under the ROC curve (AUC) than both the Cox proportional hazards model and the L1-regularized Cox feature selection algorithm. Furthermore, we present a margin-based censored regression algorithm that combines the concept of margin-based classifiers with censored regression to achieve a better concordance index than the Cox model. Overall, our approach outperforms the current state of the art on both metrics, AUC and concordance index. In addition, our work identifies potential risk factors that have not been discovered by traditional approaches. Our method can be applied to clinical prediction of other diseases where missing data are common and risk factors are not well understood.
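
The abstract compares models on two metrics, AUC and the concordance index. As a rough illustration only, the sketch below computes both from their standard pairwise definitions (the probability that a positive case outscores a negative one, and Harrell's C for right-censored data). It is not the authors' implementation; the function names `auc` and `concordance_index` and the toy inputs are assumptions made for this example.

```python
from typing import Sequence


def auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Pairwise AUC: chance that a positive case scores above a negative one.

    Tied scores count as half a correct ordering.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative case")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


def concordance_index(
    times: Sequence[float],
    events: Sequence[int],
    risk_scores: Sequence[float],
) -> float:
    """Harrell's C for right-censored data.

    A pair (i, j) is comparable when subject i had an observed event
    (events[i] == 1) strictly before subject j's event or censoring time.
    The pair is concordant when the earlier subject has the higher risk score.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a censored subject cannot be the earlier member of a pair
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


if __name__ == "__main__":
    # Toy data (hypothetical): four subjects, the third one is censored.
    print(auc([1, 0, 1, 0], [0.9, 0.8, 0.7, 0.1]))
    print(concordance_index([2, 4, 6, 8], [1, 1, 0, 1], [0.9, 0.4, 0.6, 0.1]))
```

Both functions use a quadratic loop over pairs for clarity; that is fine for small examples but would be slow on a cohort-sized dataset, where a rank-based formulation is the usual choice.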

Published in
KDD '10: Proceedings of the 16th ACM SIGKDD international conference on Knowledge discovery and data mining
July 2010
1240 pages
ISBN: 9781450300551
DOI: 10.1145/1835804
General Chairs: Bharat Rao (Siemens), Balaji Krishnapuram (Siemens)
Program Chairs: Andrew Tomkins (Google Inc.), Qiang Yang (Hong Kong University of Science and Technology)
Copyright © 2010 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher: Association for Computing Machinery, New York, NY, United States
Author Tags
ROC
SVM
benchmark
classification
concordance index
data analysis
feature selection
healthcare
medical data analysis
prediction
stroke
stroke prediction