BMC Medical Informatics and Decision Making

Open Access 01-12-2019 | Stroke | Research article

Using machine learning models to improve stroke risk level classification methods of China national stroke screening

Authors: Xuemeng Li, Di Bian, Jinghui Yu, Mei Li, Dongsheng Zhao

Published in: BMC Medical Informatics and Decision Making | Issue 1/2019

Abstract

Background

Stroke has high incidence, prevalence and mortality, and it places a heavy burden on families and society in China. In 2009, the Ministry of Health of China launched the China national stroke screening and intervention program, which screens for stroke and its risk factors and delivers interventions to the high-risk population among people aged over 40 across China. In this program, the stroke risk factors are hypertension, diabetes, dyslipidemia, smoking, lack of exercise, apparent overweight and family history of stroke. People with more than two risk factors, or with a history of stroke or transient ischemic attack (TIA), are considered high-risk. However, this criterion cannot classify the stroke risk level of people whose risk factor fields contain unknown values. Missing stroke risk levels reduce the efficiency of stroke interventions and make national-level statistics inaccurate. In this paper, we use the 2017 national stroke screening data to develop stroke risk classification models based on machine learning algorithms and so improve classification efficiency.
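The screening criterion described above can be sketched as a small rule function. This is only an illustration: the field names are hypothetical, the "not_high" label is an assumption (the abstract defines only the high-risk group), and the handling of unknown values follows a literal reading of the abstract, in which any unknown field prevents classification.

```python
from typing import Dict, Optional

# The seven risk factors listed in the abstract (field names are hypothetical).
RISK_FACTORS = (
    "hypertension", "diabetes", "dyslipidemia", "smoking",
    "lack_of_exercise", "overweight", "family_history_of_stroke",
)
HISTORY_FIELD = "stroke_or_tia_history"  # hypothetical field name


def rule_based_risk(record: Dict[str, Optional[bool]]) -> Optional[str]:
    """Return "high", "not_high", or None when the rule cannot be applied.

    A value of None (or a missing key) stands for an unknown field.
    """
    fields = RISK_FACTORS + (HISTORY_FIELD,)
    # Assumption: any unknown field blocks the rule-based classification.
    if any(record.get(name) is None for name in fields):
        return None
    # A history of stroke or TIA is sufficient on its own.
    if record[HISTORY_FIELD]:
        return "high"
    # Otherwise, more than two risk factors means high risk.
    count = sum(1 for name in RISK_FACTORS if record[name])
    return "high" if count > 2 else "not_high"


complete = {name: False for name in RISK_FACTORS + (HISTORY_FIELD,)}
complete.update(hypertension=True, diabetes=True, smoking=True)
incomplete = dict(complete, dyslipidemia=None)

print(rule_based_risk(complete))    # three known risk factors
print(rule_based_risk(incomplete))  # one unknown field
```

Records for which the function returns None are the cases that the machine learning models are meant to cover.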

Method

First, we construct the training set and test sets and handle the imbalanced training set with oversampling and undersampling methods. Then, we develop a logistic regression model, a naive Bayes model, a Bayesian network model, a decision tree model, a neural network model, a random forest model, a bagged decision tree model, a voting model and a boosting model with decision trees to classify stroke risk levels.
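As a rough illustration of how oversampling and undersampling can balance a training set, the sketch below uses plain random resampling from the Python standard library. This is a swapped-in, simplified technique and not necessarily the authors' procedure; the function name `rebalance`, the `target_per_class` parameter and the toy data are all hypothetical.

```python
import random
from typing import Dict, List, Tuple

Sample = Tuple[List[float], int]  # (features, label); label 1 = high risk


def rebalance(samples: List[Sample], target_per_class: int,
              seed: int = 0) -> List[Sample]:
    """Resample every class to exactly target_per_class examples."""
    rng = random.Random(seed)
    by_label: Dict[int, List[Sample]] = {}
    for sample in samples:
        by_label.setdefault(sample[1], []).append(sample)

    balanced: List[Sample] = []
    for group in by_label.values():
        if len(group) >= target_per_class:
            # Undersampling: draw without replacement from the larger class.
            balanced.extend(rng.sample(group, target_per_class))
        else:
            # Oversampling: keep every example, then add random duplicates.
            extra = rng.choices(group, k=target_per_class - len(group))
            balanced.extend(group + extra)
    rng.shuffle(balanced)
    return balanced


# Toy imbalanced data: 90 low-risk and 10 high-risk records.
data = [([float(i)], 0) for i in range(90)] + [([float(i)], 1) for i in range(10)]
train = rebalance(data, target_per_class=30)
print(sum(1 for _, y in train if y == 0), sum(1 for _, y in train if y == 1))
```

Only the training set is rebalanced in this sketch; the test sets would keep their original class distribution so that the reported precision and recall reflect real screening data.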

Result

The boosting model with decision trees has the highest recall (99.94%), and the random forest model has the highest precision (97.33%). With the random forest model (recall: 98.44%), recall increases by about 2.8% compared with the method currently used, and several thousand more people at high risk of stroke can be identified each year.
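For readers less familiar with the two metrics reported above, the following minimal sketch shows how precision and recall are computed from confusion counts for the high-risk class. The counts in the example are made up for illustration and are not taken from the paper.

```python
def precision(tp: int, fp: int) -> float:
    """Share of people predicted high-risk who are truly high-risk."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp: int, fn: int) -> float:
    """Share of truly high-risk people that the model identifies."""
    return tp / (tp + fn) if (tp + fn) else 0.0


# Hypothetical confusion counts, not results from the paper.
tp, fp, fn = 950, 30, 20
print(f"precision = {precision(tp, fp):.2%}")
print(f"recall    = {recall(tp, fn):.2%}")
```

In a screening setting, higher recall means fewer high-risk people are missed, while higher precision means fewer people are sent for unnecessary rescreening or intervention.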

Conclusion

The models developed in this paper can improve the current screening method: they avoid the impact of unknown values and avoid unnecessary rescreening and intervention expenditures. The national stroke screening program can choose among the classification models according to practical needs.

1,000,000 × 3% × 19.7% × 95.82% ≈ 6,000

1,000,000 × 3% × (1 − 19.7%) × (1 − 36.35%) ≈ 15,000
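The two estimates above can be checked numerically. The page does not label the individual factors at this point, so the variable names below are neutral placeholders rather than the authors' terms.

```python
# Factors copied from the two expressions above; names are placeholders.
population = 1_000_000
rate_a = 0.03      # 3%
rate_b = 0.197     # 19.7%
rate_c = 0.9582    # 95.82%
rate_d = 0.3635    # 36.35%

first_estimate = population * rate_a * rate_b * rate_c
second_estimate = population * rate_a * (1 - rate_b) * (1 - rate_d)

print(round(first_estimate))   # 5663
print(round(second_estimate))  # 15333
```

The exact products are therefore about 5,700 and 15,300, which are given above in rounded form as roughly 6,000 and 15,000.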

Title: Using machine learning models to improve stroke risk level classification methods of China national stroke screening
Authors: Xuemeng Li, Di Bian, Jinghui Yu, Mei Li, Dongsheng Zhao
Publication date: 01-12-2019
Publisher: BioMed Central
Keyword: Stroke
Published in: BMC Medical Informatics and Decision Making / Issue 1/2019
Electronic ISSN: 1472-6947
DOI: https://doi.org/10.1186/s12911-019-0998-2
