Top

BMC Medical Informatics and Decision Making

Published in:

Open Access 01-12-2020 | Software

A comprehensive tool for creating and evaluating privacy-preserving biomedical prediction models

Authors: Johanna Eicher, Raffael Bild, Helmut Spengler, Klaus A. Kuhn, Fabian Prasser

Published in: BMC Medical Informatics and Decision Making | Issue 1/2020

Abstract

Background

Modern data driven medical research promises to provide new insights into the development and course of disease and to enable novel methods of clinical decision support. To realize this, machine learning models can be trained to make predictions from clinical, paraclinical and biomolecular data. In this process, privacy protection and regulatory requirements need careful consideration, as the resulting models may leak sensitive personal information. To counter this threat, a wide range of methods for integrating machine learning with formal methods of privacy protection have been proposed. However, there is a significant lack of practical tools to create and evaluate such privacy-preserving models. In this software article, we report on our ongoing efforts to bridge this gap.

Results

We have extended the well-known ARX anonymization tool for biomedical data with machine learning techniques to support the creation of privacy-preserving prediction models. Our methods are particularly well suited for applications in biomedicine, as they preserve the truthfulness of data (e.g. no noise is added) and they are intuitive and relatively easy to explain to non-experts. Moreover, our implementation is highly versatile, as it supports binomial and multinomial target variables, different types of prediction models and a wide range of privacy protection techniques. All methods have been integrated into a sound framework that supports the creation, evaluation and refinement of models through intuitive graphical user interfaces. To demonstrate the broad applicability of our solution, we present three case studies in which we created and evaluated different types of privacy-preserving prediction models for breast cancer diagnosis, diagnosis of acute inflammation of the urinary system and prediction of the contraceptive method used by women. In this process, we also used a wide range of different privacy models (k-anonymity, differential privacy and a game-theoretic approach) as well as different data transformation techniques.

Conclusions

With the tool presented in this article, accurate prediction models can be created that preserve the privacy of individuals represented in the training set in a variety of threat scenarios. Our implementation is available as open source software.

Hood L, Friend SH. Predictive, personalized, preventive, participatory (P4) cancer medicine. Nat Rev Clin oncol. 2011; 8(3):184.PubMedCrossRef

Schneeweiss S. Learning from big health care data. N Engl J Med. 2014; 370(23):2161–3.PubMedCrossRef

Esteva A, Robicquet A, Ramsundar B, Kuleshov V, DePristo M, Chou K, et al.A guide to deep learning in healthcare. Nat Med. 2019; 25(1):24.PubMedCrossRef

Liu V, Musen MA, Chou T. Data breaches of protected health information in the United States. JAMA. 2015; 313(14):1471–3.PubMedPubMedCentralCrossRef

Jensen PB, Jensen LJ, Brunak S. Mining electronic health records: towards better research applications and clinical care. Nat Rev Genet. 2012; 13(6):395–405.PubMedCrossRef

Malin B, Karp D, Scheuermann RH. Technical and policy approaches to balancing patient privacy and data sharing in clinical and translational research. J Invest Med. 2010; 58(1):11–18.CrossRef

El Emam K, Malin B. Appendix B: Concepts and Methods for De-identifying Clinical Trial Data. In: Sharing Clinical Trial Data: Maximizing Benefits, Minimizing Risk. Washington, DC: The National Academies Press: 2015.

Jordan MI, Mitchell TM. Machine learning: Trends, perspectives, and prospects. Science. 2015; 349(6245):255–60.PubMedCrossRef

Shokri R, Shmatikov V. Privacy-preserving deep learning. In: Proceedings of the 22nd ACM SIGSAC conference on computer and communications security. New York: ACM: 2015. p. 1310–1321.

10.

Dankar FK, Madathil N, Dankar SK, Boughorbel S. Privacy-Preserving Analysis of Distributed Biomedical Data: Designing Efficient and Secure Multiparty Computations Using Distributed Statistical Learning Theory. JMIR Med Inform. 2019; 7(2):e12702.PubMedPubMedCentralCrossRef

11.

Shokri R, Stronati M, Song C, Shmatikov V. Membership inference attacks against machine learning models. In: 2017 IEEE Symposium on Security and Privacy (SP). IEEE: 2017. https://doi.org/10.1109/sp.2017.41.

12.

El Emam K, Arbuckle L. Anonymizing health data: Case studies and methods to get you started. 1st ed.Sebastopol: O’Reilly Media, Inc.; 2013.

13.

Xia W, Heatherly R, Ding X, Li J, Malin BA. R-U policy frontiers for health data de-identification. J Am Med Inform Assoc. 2015; 22(5):1029–41.PubMedPubMedCentralCrossRef

14.

Narayanan A, Shmatikov V. Robust de-anonymization of large sparse datasets. In: Symposium on Security and Privacy. IEEE: 2008. p. 111–125.

15.

Sweeney L. Computational disclosure control - A primer on data privacy protection. Cambridge: Massachusetts Institute of Technology; 2001.

16.

United States. The Health Insurance Portability and Accountability Act (HIPAA). Washington: U.S. Dept. of Labor, Employee Benefits Security Administration; 2004.

17.

EU General Data Protection Regulation. Regulation (EU) 2016/679 of the European Parliament and of the Council of 27 April 2016 on the Protection of Natural Persons with Regard to the Processing of Personal Data and on the Free Movement of such Data, and Repealing Directive 95/46/EC (General Data Protection Regulation). Off J Eur Union. 2016; 1:119.

18.

Prasser F, Eicher J, Bild R, Spengler H, Kuhn KA. A Tool for Optimizing De-identified Health Data for Use in Statistical Classification. In: 2017 IEEE 30th International Symposium on Computer-Based Medical Systems (CBMS). IEEE: 2017. https://doi.org/10.1109/cbms.2017.105.

19.

Prasser F, Kohlmayer F. Putting statistical disclosure control into practice: The ARX data anonymization tool. In: Medical Data Privacy Handbook. Springer International Publishing: 2015. p. 111–148. https://doi.org/10.1007/978-3-319-23633-9_6.CrossRef

20.

Witten IH, Eibe F. Data mining: Practical machine learning tools and techniques. San Francisco: Morgan Kaufmann; 2016.

21.

Prasser F, Kohlmayer F, Kuhn KA. Efficient and effective pruning strategies for health data de-identification. BMC Med Inform Decis Making. 2016; 16(1):49.CrossRef

22.

ARX - Power Data Anonymization. http://arx.deidentifier.org/. Accessed 21 June 2019.

23.

Dwork C. Differential privacy. In: Encyclopedia of Cryptography and Security. Heidelberg: Springer: 2011. p. 338–340.

24.

Bild R, Kuhn KA, Prasser F. SafePub: A Truthful Data Anonymization Algorithm With Strong Privacy Guarantees. Proc Priv Enhancing Technol. 2018; 2018(1):67–87.CrossRef

25.

Wan Z, Vorobeychik Y, Xia W, Clayton EW, Kantarcioglu M, Ganta R, et al.A game theoretic framework for analyzing re-identification risk. PloS One. 2015; 10(3):e0120592. Cambridge.PubMedPubMedCentralCrossRef

26.

Prasser F, Gaupp J, Wan Z, Xia W, Vorobeychik Y, Kantarcioglu M, et al.An Open Source Tool for Game Theoretic Health Data De-Identification. In: AMIA Annual Symposium Proceedings. AMIA: 2017. Accepted for AMIA 2017 Annual Symposium (AMIA 2017).

27.

Iyengar VS. Transforming data to satisfy privacy constraints. In: International Conference on Knowledge Discovery and Data Mining. ACM: 2002. p. 279–88.

28.

World Health Organization. International statistical classification of diseases and related health problems. 2016. https://www.who.int/classifications/icd/en/. Accessed 21 June 2019.

29.

Domingos P, Pazzani M. On the optimality of the simple Bayesian classifier under zero-one loss. Mach Learn. 1997; 29(2):103–130.CrossRef

30.

Breiman L. Random forests. Mach Learn. 2001; 45(1):5–32.CrossRef

31.

Apache Software Foundation. Apache Mahout: Scalable machine-learning and data-mining library. 2011. http://mahout.apache.org/. Accessed 21 June 2019.

32.

Smile – Statistical Machine Intelligence and Learning Engine. https://haifengl.github.io/smile/. Accessed 21 June 2019.

33.

Bailey TL, Elkan C. Estimating the Accuracy of Learned Concepts. In: Proceedings of the 13th International Joint Conference on Artifical Intelligence. San Francisco: Morgan Kaufmann Publishers Inc.: 1993. p. 895–900.

34.

Li T, Li N. On the tradeoff between privacy and utility in data publishing. In: Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining - KDD ’09. ACM Press: 2009. https://doi.org/10.1145/1557019.1557079.

35.

Inan A, Kantarcioglu M, Bertino E. Using anonymized data for classification. In: 2009 IEEE 25th International Conference on Data Engineering. IEEE: 2009. https://doi.org/10.1109/icde.2009.19.

36.

Fawcett T. An introduction to ROC analysis. Pattern Recog Lett. 2006; 27(8):861–74.CrossRef

37.

Brier GW. Verification of forecasts expressed in terms of probability. Mon Weather Rev. 1950; 78(1):1–3.CrossRef

38.

Wilks DS. Sampling distributions of the Brier score and Brier skill score under serial dependence. Q J R Meteorol Soc. 2010; 136(653):2109–18.CrossRef

39.

Prasser F, Kohlmayer F, Spengler H, Kuhn KA. A scalable and pragmatic method for the safe sharing of high-quality health data. IEEE J Biomed Health Inform. 2017; 22(2):611–22.PubMedCrossRef

40.

Czerniak J, Zarzycki H. Application of rough sets in the presumptive diagnosis of urinary system diseases. In: Artificial Intelligence and Security in Computing Systems. Springer: 2003. p. 41–51. https://doi.org/10.1007/978-1-4419-9226-0_5.CrossRef

41.

Dua D, Graff C. UCI Machine Learning Repository. 2017. http://archive.ics.uci.edu/ml. Accessed 21 June 2019.

42.

European Medicines Agency. External guidance on the implementation of the European Medicines Agency policy on the publication of clinical data for medicinal products for human use. 2016:1–99. EMA/90915/2016.

43.

Wolberg WH, Mangasarian OL. Multisurface method of pattern separation for medical diagnosis applied to breast cytology. Proc Natl Acad Sci. 1990; 87(23):9193–6.PubMedCrossRefPubMedCentral

44.

McCarty CA, Chisholm RL, Chute CG, Kullo IJ, Jarvik GP, Larson EB, et al.The eMERGE Network: a consortium of biorepositories linked to electronic medical records data for conducting genomic studies. BMC Med Genomics. 2011; 4(1):13.PubMedPubMedCentralCrossRef

45.

Brickell J, Shmatikov V. The cost of privacy: Destruction of data-mining utility in anonymized data publishing. In: 14th International Conference on Knowledge Discovery and Data Mining (SIGKDD). Las Vegas: ACM: 2008. p. 70–78.

46.

Aggarwal CC, Yu PS. A general survey of privacy-preserving data mining models and algorithms. In: Privacy-Preserving Data Mining. Springer: 2008. p. 11–52. https://doi.org/10.1007/978-0-387-70992-5_2.CrossRef

47.

Fung BCM, Wang K, Fu AWC, Yu PS. Introduction to privacy-preserving data publishing: Concepts and techniques. 1st ed.Boca Raton: CRC Press; 2010.CrossRef

48.

Malle B, Kieseberg P, Holzinger A. Do not disturb? classifier behavior on perturbed datasets. In: International Cross-Domain Conference for Machine Learning and Knowledge Extraction. Springer: 2017. p. 155–73. https://doi.org/10.1007/978-3-319-66808-6_11.

49.

Li J, Liu J, Baig M, Wong RCW. Information based data anonymization for classification utility. Data Knowl Eng. 2011; 70(12):1030–45.CrossRef

50.

Last M, Tassa T, Zhmudyak A, Shmueli E. Improving accuracy of classification models induced from anonymized datasets. Inf Sci. 2014; 256:138–161.CrossRef

51.

Lin KP, Chen MS. On the design and analysis of the privacy-preserving SVM classifier. IEEE Trans Knowl Data Eng. 2011; 23(11):1704–17.CrossRef

52.

Fong PK, Weber-Jahnke JH. Privacy preserving decision tree learning using unrealized data sets. Trans Knowl Data Eng. 2012; 24(2):353–364.CrossRef

53.

Sazonova V, Matwin S. Combining Binary Classifiers for a Multiclass Problem with Differential Privacy. Trans Data Priv. 2014; 7(1):51–70.

54.

Mancuhan K, Clifton C. Statistical Learning Theory Approach for Data Classification with ℓ-diversity. In: Proceedings of the 2017 SIAM International Conference on Data Mining. SIAM: 2017. p. 651–659. https://doi.org/10.1137/1.9781611974973.73.CrossRef

55.

Abadi M, Chu A, Goodfellow I, McMahan HB, Mironov I, Talwar K, et al.Deep learning with differential privacy. In: Proceedings of the 2016 SIGSAC Conference on Computer and Communications Security. New York: ACM: 2016. p. 308–318.

56.

Esteban C, Hyland SL, Rätsch G. Real-valued (Medical) Time Series Generation with Recurrent Conditional GANs. arXiv preprint arXiv:170602633. 2017.

57.

Choi E, Biswal S, Malin B, Duke J, Stewart WF, Sun J. Generating Multi-label Discrete Electronic Health Records using Generative Adversarial Networks. arXiv preprint arXiv:170306490. 2017.

58.

Friedman A, Schuster A. Data mining with differential privacy. In: Proceedings of the 16th ACM SIGKDD international conference on Knowledge discovery and data mining - KDD ’10. ACM: 2010. https://doi.org/10.1145/1835804.1835868.

59.

Zhang N, Li M, Lou W. Distributed data mining with differential privacy. In: IEEE International Conference on Communications (ICC). IEEE: 2011. https://doi.org/10.1109/icc.2011.5962863.

60.

Jiang X, Ji Z, Wang S, Mohammed N, Cheng S, Ohno-Machado L. Differential-private data publishing through component analysis. Trans Data Priv. 2013; 6(1):19.PubMedPubMedCentral

61.

Zaman ANK, Obimbo C, Dara RA. A Novel Differential Privacy Approach that Enhances Classification Accuracy. In: Proceedings of the Ninth International C* Conference on Computer Science & Software Engineering - C3S2E ’16. ACM: 2016. https://doi.org/10.1145/2948992.2949027.

62.

Zaman ANK, Obimbo C, Dara RA. An Improved Data Sanitization Algorithm for Privacy Preserving Medical Data Publishing. In: Canadian Conference on Artificial Intelligence. Basel: Springer: 2017. p. 64–70.

63.

De Waal A, Hundepool A, Willenborg L. C. R. J. Argus: Software for statistical disclosure control of microdata. US Census Bureau: 1995.

64.

Templ M. Statistical disclosure control for microdata using the R-package sdcMicro. Trans Data Priv. 2008; 1(2):67–85.

65.

Hastie T, Tibshirani R, Friedman J. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. New York: Springer; 2016.

66.

Dankar FK, El Emam K. Practicing differential privacy in health care: A review. Trans Data Priv. 2013; 6(1):35–67.

Title: A comprehensive tool for creating and evaluating privacy-preserving biomedical prediction models
Authors: Johanna Eicher
Raffael Bild
Helmut Spengler
Klaus A. Kuhn
Fabian Prasser
Publication date: 01-12-2020
Publisher: BioMed Central
Published in: BMC Medical Informatics and Decision Making / Issue 1/2020
Electronic ISSN: 1472-6947
DOI: https://doi.org/10.1186/s12911-020-1041-3

At a glance: The STEP trials

Springer Medicine

A comprehensive tool for creating and evaluating privacy-preserving biomedical prediction models

Abstract

Background

Results

Conclusions

At a glance: The STEP trials

Springer Medicine

Abstract

Background

Results

Conclusions

Please log in to get access to this content

Other articles of this Issue 1/2020

A realism-based approach to an ontological representation of symbiotic interactions

Network analysis of autistic disease comorbidities in Chinese children based on ICD-10 codes

WeChat-based intervention to support breastfeeding for Chinese mothers: protocol of a randomised controlled trial

The role of histogram analysis in diffusion-weighted imaging in the differential diagnosis of benign and malignant breast lesions

A mixed methods systematic review of the effects of patient online self-diagnosing in the ‘smart-phone society’ on the healthcare professional-patient relationship and medical authority

Predicting patient outcomes in psychiatric hospitals with routine data: a machine learning approach