Skip to main content
Top
Published in: BMC Medical Informatics and Decision Making 1/2020

Open Access 01-12-2020 | Software

A comprehensive tool for creating and evaluating privacy-preserving biomedical prediction models

Authors: Johanna Eicher, Raffael Bild, Helmut Spengler, Klaus A. Kuhn, Fabian Prasser

Published in: BMC Medical Informatics and Decision Making | Issue 1/2020

Login to get access

Abstract

Background

Modern data driven medical research promises to provide new insights into the development and course of disease and to enable novel methods of clinical decision support. To realize this, machine learning models can be trained to make predictions from clinical, paraclinical and biomolecular data. In this process, privacy protection and regulatory requirements need careful consideration, as the resulting models may leak sensitive personal information. To counter this threat, a wide range of methods for integrating machine learning with formal methods of privacy protection have been proposed. However, there is a significant lack of practical tools to create and evaluate such privacy-preserving models. In this software article, we report on our ongoing efforts to bridge this gap.

Results

We have extended the well-known ARX anonymization tool for biomedical data with machine learning techniques to support the creation of privacy-preserving prediction models. Our methods are particularly well suited for applications in biomedicine, as they preserve the truthfulness of data (e.g. no noise is added) and they are intuitive and relatively easy to explain to non-experts. Moreover, our implementation is highly versatile, as it supports binomial and multinomial target variables, different types of prediction models and a wide range of privacy protection techniques. All methods have been integrated into a sound framework that supports the creation, evaluation and refinement of models through intuitive graphical user interfaces. To demonstrate the broad applicability of our solution, we present three case studies in which we created and evaluated different types of privacy-preserving prediction models for breast cancer diagnosis, diagnosis of acute inflammation of the urinary system and prediction of the contraceptive method used by women. In this process, we also used a wide range of different privacy models (k-anonymity, differential privacy and a game-theoretic approach) as well as different data transformation techniques.

Conclusions

With the tool presented in this article, accurate prediction models can be created that preserve the privacy of individuals represented in the training set in a variety of threat scenarios. Our implementation is available as open source software.
Literature
1.
go back to reference Hood L, Friend SH. Predictive, personalized, preventive, participatory (P4) cancer medicine. Nat Rev Clin oncol. 2011; 8(3):184.PubMedCrossRef Hood L, Friend SH. Predictive, personalized, preventive, participatory (P4) cancer medicine. Nat Rev Clin oncol. 2011; 8(3):184.PubMedCrossRef
3.
go back to reference Esteva A, Robicquet A, Ramsundar B, Kuleshov V, DePristo M, Chou K, et al.A guide to deep learning in healthcare. Nat Med. 2019; 25(1):24.PubMedCrossRef Esteva A, Robicquet A, Ramsundar B, Kuleshov V, DePristo M, Chou K, et al.A guide to deep learning in healthcare. Nat Med. 2019; 25(1):24.PubMedCrossRef
5.
go back to reference Jensen PB, Jensen LJ, Brunak S. Mining electronic health records: towards better research applications and clinical care. Nat Rev Genet. 2012; 13(6):395–405.PubMedCrossRef Jensen PB, Jensen LJ, Brunak S. Mining electronic health records: towards better research applications and clinical care. Nat Rev Genet. 2012; 13(6):395–405.PubMedCrossRef
6.
go back to reference Malin B, Karp D, Scheuermann RH. Technical and policy approaches to balancing patient privacy and data sharing in clinical and translational research. J Invest Med. 2010; 58(1):11–18.CrossRef Malin B, Karp D, Scheuermann RH. Technical and policy approaches to balancing patient privacy and data sharing in clinical and translational research. J Invest Med. 2010; 58(1):11–18.CrossRef
7.
go back to reference El Emam K, Malin B. Appendix B: Concepts and Methods for De-identifying Clinical Trial Data. In: Sharing Clinical Trial Data: Maximizing Benefits, Minimizing Risk. Washington, DC: The National Academies Press: 2015. El Emam K, Malin B. Appendix B: Concepts and Methods for De-identifying Clinical Trial Data. In: Sharing Clinical Trial Data: Maximizing Benefits, Minimizing Risk. Washington, DC: The National Academies Press: 2015.
8.
go back to reference Jordan MI, Mitchell TM. Machine learning: Trends, perspectives, and prospects. Science. 2015; 349(6245):255–60.PubMedCrossRef Jordan MI, Mitchell TM. Machine learning: Trends, perspectives, and prospects. Science. 2015; 349(6245):255–60.PubMedCrossRef
9.
go back to reference Shokri R, Shmatikov V. Privacy-preserving deep learning. In: Proceedings of the 22nd ACM SIGSAC conference on computer and communications security. New York: ACM: 2015. p. 1310–1321. Shokri R, Shmatikov V. Privacy-preserving deep learning. In: Proceedings of the 22nd ACM SIGSAC conference on computer and communications security. New York: ACM: 2015. p. 1310–1321.
10.
go back to reference Dankar FK, Madathil N, Dankar SK, Boughorbel S. Privacy-Preserving Analysis of Distributed Biomedical Data: Designing Efficient and Secure Multiparty Computations Using Distributed Statistical Learning Theory. JMIR Med Inform. 2019; 7(2):e12702.PubMedPubMedCentralCrossRef Dankar FK, Madathil N, Dankar SK, Boughorbel S. Privacy-Preserving Analysis of Distributed Biomedical Data: Designing Efficient and Secure Multiparty Computations Using Distributed Statistical Learning Theory. JMIR Med Inform. 2019; 7(2):e12702.PubMedPubMedCentralCrossRef
12.
go back to reference El Emam K, Arbuckle L. Anonymizing health data: Case studies and methods to get you started. 1st ed.Sebastopol: O’Reilly Media, Inc.; 2013. El Emam K, Arbuckle L. Anonymizing health data: Case studies and methods to get you started. 1st ed.Sebastopol: O’Reilly Media, Inc.; 2013.
13.
14.
go back to reference Narayanan A, Shmatikov V. Robust de-anonymization of large sparse datasets. In: Symposium on Security and Privacy. IEEE: 2008. p. 111–125. Narayanan A, Shmatikov V. Robust de-anonymization of large sparse datasets. In: Symposium on Security and Privacy. IEEE: 2008. p. 111–125.
15.
go back to reference Sweeney L. Computational disclosure control - A primer on data privacy protection. Cambridge: Massachusetts Institute of Technology; 2001. Sweeney L. Computational disclosure control - A primer on data privacy protection. Cambridge: Massachusetts Institute of Technology; 2001.
16.
go back to reference United States. The Health Insurance Portability and Accountability Act (HIPAA). Washington: U.S. Dept. of Labor, Employee Benefits Security Administration; 2004. United States. The Health Insurance Portability and Accountability Act (HIPAA). Washington: U.S. Dept. of Labor, Employee Benefits Security Administration; 2004.
17.
go back to reference EU General Data Protection Regulation. Regulation (EU) 2016/679 of the European Parliament and of the Council of 27 April 2016 on the Protection of Natural Persons with Regard to the Processing of Personal Data and on the Free Movement of such Data, and Repealing Directive 95/46/EC (General Data Protection Regulation). Off J Eur Union. 2016; 1:119. EU General Data Protection Regulation. Regulation (EU) 2016/679 of the European Parliament and of the Council of 27 April 2016 on the Protection of Natural Persons with Regard to the Processing of Personal Data and on the Free Movement of such Data, and Repealing Directive 95/46/EC (General Data Protection Regulation). Off J Eur Union. 2016; 1:119.
18.
go back to reference Prasser F, Eicher J, Bild R, Spengler H, Kuhn KA. A Tool for Optimizing De-identified Health Data for Use in Statistical Classification. In: 2017 IEEE 30th International Symposium on Computer-Based Medical Systems (CBMS). IEEE: 2017. https://doi.org/10.1109/cbms.2017.105. Prasser F, Eicher J, Bild R, Spengler H, Kuhn KA. A Tool for Optimizing De-identified Health Data for Use in Statistical Classification. In: 2017 IEEE 30th International Symposium on Computer-Based Medical Systems (CBMS). IEEE: 2017. https://​doi.​org/​10.​1109/​cbms.​2017.​105.
20.
go back to reference Witten IH, Eibe F. Data mining: Practical machine learning tools and techniques. San Francisco: Morgan Kaufmann; 2016. Witten IH, Eibe F. Data mining: Practical machine learning tools and techniques. San Francisco: Morgan Kaufmann; 2016.
21.
go back to reference Prasser F, Kohlmayer F, Kuhn KA. Efficient and effective pruning strategies for health data de-identification. BMC Med Inform Decis Making. 2016; 16(1):49.CrossRef Prasser F, Kohlmayer F, Kuhn KA. Efficient and effective pruning strategies for health data de-identification. BMC Med Inform Decis Making. 2016; 16(1):49.CrossRef
23.
go back to reference Dwork C. Differential privacy. In: Encyclopedia of Cryptography and Security. Heidelberg: Springer: 2011. p. 338–340. Dwork C. Differential privacy. In: Encyclopedia of Cryptography and Security. Heidelberg: Springer: 2011. p. 338–340.
24.
go back to reference Bild R, Kuhn KA, Prasser F. SafePub: A Truthful Data Anonymization Algorithm With Strong Privacy Guarantees. Proc Priv Enhancing Technol. 2018; 2018(1):67–87.CrossRef Bild R, Kuhn KA, Prasser F. SafePub: A Truthful Data Anonymization Algorithm With Strong Privacy Guarantees. Proc Priv Enhancing Technol. 2018; 2018(1):67–87.CrossRef
25.
go back to reference Wan Z, Vorobeychik Y, Xia W, Clayton EW, Kantarcioglu M, Ganta R, et al.A game theoretic framework for analyzing re-identification risk. PloS One. 2015; 10(3):e0120592. Cambridge.PubMedPubMedCentralCrossRef Wan Z, Vorobeychik Y, Xia W, Clayton EW, Kantarcioglu M, Ganta R, et al.A game theoretic framework for analyzing re-identification risk. PloS One. 2015; 10(3):e0120592. Cambridge.PubMedPubMedCentralCrossRef
26.
go back to reference Prasser F, Gaupp J, Wan Z, Xia W, Vorobeychik Y, Kantarcioglu M, et al.An Open Source Tool for Game Theoretic Health Data De-Identification. In: AMIA Annual Symposium Proceedings. AMIA: 2017. Accepted for AMIA 2017 Annual Symposium (AMIA 2017). Prasser F, Gaupp J, Wan Z, Xia W, Vorobeychik Y, Kantarcioglu M, et al.An Open Source Tool for Game Theoretic Health Data De-Identification. In: AMIA Annual Symposium Proceedings. AMIA: 2017. Accepted for AMIA 2017 Annual Symposium (AMIA 2017).
27.
go back to reference Iyengar VS. Transforming data to satisfy privacy constraints. In: International Conference on Knowledge Discovery and Data Mining. ACM: 2002. p. 279–88. Iyengar VS. Transforming data to satisfy privacy constraints. In: International Conference on Knowledge Discovery and Data Mining. ACM: 2002. p. 279–88.
29.
go back to reference Domingos P, Pazzani M. On the optimality of the simple Bayesian classifier under zero-one loss. Mach Learn. 1997; 29(2):103–130.CrossRef Domingos P, Pazzani M. On the optimality of the simple Bayesian classifier under zero-one loss. Mach Learn. 1997; 29(2):103–130.CrossRef
33.
go back to reference Bailey TL, Elkan C. Estimating the Accuracy of Learned Concepts. In: Proceedings of the 13th International Joint Conference on Artifical Intelligence. San Francisco: Morgan Kaufmann Publishers Inc.: 1993. p. 895–900. Bailey TL, Elkan C. Estimating the Accuracy of Learned Concepts. In: Proceedings of the 13th International Joint Conference on Artifical Intelligence. San Francisco: Morgan Kaufmann Publishers Inc.: 1993. p. 895–900.
36.
go back to reference Fawcett T. An introduction to ROC analysis. Pattern Recog Lett. 2006; 27(8):861–74.CrossRef Fawcett T. An introduction to ROC analysis. Pattern Recog Lett. 2006; 27(8):861–74.CrossRef
37.
go back to reference Brier GW. Verification of forecasts expressed in terms of probability. Mon Weather Rev. 1950; 78(1):1–3.CrossRef Brier GW. Verification of forecasts expressed in terms of probability. Mon Weather Rev. 1950; 78(1):1–3.CrossRef
38.
go back to reference Wilks DS. Sampling distributions of the Brier score and Brier skill score under serial dependence. Q J R Meteorol Soc. 2010; 136(653):2109–18.CrossRef Wilks DS. Sampling distributions of the Brier score and Brier skill score under serial dependence. Q J R Meteorol Soc. 2010; 136(653):2109–18.CrossRef
39.
go back to reference Prasser F, Kohlmayer F, Spengler H, Kuhn KA. A scalable and pragmatic method for the safe sharing of high-quality health data. IEEE J Biomed Health Inform. 2017; 22(2):611–22.PubMedCrossRef Prasser F, Kohlmayer F, Spengler H, Kuhn KA. A scalable and pragmatic method for the safe sharing of high-quality health data. IEEE J Biomed Health Inform. 2017; 22(2):611–22.PubMedCrossRef
42.
go back to reference European Medicines Agency. External guidance on the implementation of the European Medicines Agency policy on the publication of clinical data for medicinal products for human use. 2016:1–99. EMA/90915/2016. European Medicines Agency. External guidance on the implementation of the European Medicines Agency policy on the publication of clinical data for medicinal products for human use. 2016:1–99. EMA/90915/2016.
43.
go back to reference Wolberg WH, Mangasarian OL. Multisurface method of pattern separation for medical diagnosis applied to breast cytology. Proc Natl Acad Sci. 1990; 87(23):9193–6.PubMedCrossRefPubMedCentral Wolberg WH, Mangasarian OL. Multisurface method of pattern separation for medical diagnosis applied to breast cytology. Proc Natl Acad Sci. 1990; 87(23):9193–6.PubMedCrossRefPubMedCentral
44.
go back to reference McCarty CA, Chisholm RL, Chute CG, Kullo IJ, Jarvik GP, Larson EB, et al.The eMERGE Network: a consortium of biorepositories linked to electronic medical records data for conducting genomic studies. BMC Med Genomics. 2011; 4(1):13.PubMedPubMedCentralCrossRef McCarty CA, Chisholm RL, Chute CG, Kullo IJ, Jarvik GP, Larson EB, et al.The eMERGE Network: a consortium of biorepositories linked to electronic medical records data for conducting genomic studies. BMC Med Genomics. 2011; 4(1):13.PubMedPubMedCentralCrossRef
45.
go back to reference Brickell J, Shmatikov V. The cost of privacy: Destruction of data-mining utility in anonymized data publishing. In: 14th International Conference on Knowledge Discovery and Data Mining (SIGKDD). Las Vegas: ACM: 2008. p. 70–78. Brickell J, Shmatikov V. The cost of privacy: Destruction of data-mining utility in anonymized data publishing. In: 14th International Conference on Knowledge Discovery and Data Mining (SIGKDD). Las Vegas: ACM: 2008. p. 70–78.
47.
go back to reference Fung BCM, Wang K, Fu AWC, Yu PS. Introduction to privacy-preserving data publishing: Concepts and techniques. 1st ed.Boca Raton: CRC Press; 2010.CrossRef Fung BCM, Wang K, Fu AWC, Yu PS. Introduction to privacy-preserving data publishing: Concepts and techniques. 1st ed.Boca Raton: CRC Press; 2010.CrossRef
49.
go back to reference Li J, Liu J, Baig M, Wong RCW. Information based data anonymization for classification utility. Data Knowl Eng. 2011; 70(12):1030–45.CrossRef Li J, Liu J, Baig M, Wong RCW. Information based data anonymization for classification utility. Data Knowl Eng. 2011; 70(12):1030–45.CrossRef
50.
go back to reference Last M, Tassa T, Zhmudyak A, Shmueli E. Improving accuracy of classification models induced from anonymized datasets. Inf Sci. 2014; 256:138–161.CrossRef Last M, Tassa T, Zhmudyak A, Shmueli E. Improving accuracy of classification models induced from anonymized datasets. Inf Sci. 2014; 256:138–161.CrossRef
51.
go back to reference Lin KP, Chen MS. On the design and analysis of the privacy-preserving SVM classifier. IEEE Trans Knowl Data Eng. 2011; 23(11):1704–17.CrossRef Lin KP, Chen MS. On the design and analysis of the privacy-preserving SVM classifier. IEEE Trans Knowl Data Eng. 2011; 23(11):1704–17.CrossRef
52.
go back to reference Fong PK, Weber-Jahnke JH. Privacy preserving decision tree learning using unrealized data sets. Trans Knowl Data Eng. 2012; 24(2):353–364.CrossRef Fong PK, Weber-Jahnke JH. Privacy preserving decision tree learning using unrealized data sets. Trans Knowl Data Eng. 2012; 24(2):353–364.CrossRef
53.
go back to reference Sazonova V, Matwin S. Combining Binary Classifiers for a Multiclass Problem with Differential Privacy. Trans Data Priv. 2014; 7(1):51–70. Sazonova V, Matwin S. Combining Binary Classifiers for a Multiclass Problem with Differential Privacy. Trans Data Priv. 2014; 7(1):51–70.
55.
go back to reference Abadi M, Chu A, Goodfellow I, McMahan HB, Mironov I, Talwar K, et al.Deep learning with differential privacy. In: Proceedings of the 2016 SIGSAC Conference on Computer and Communications Security. New York: ACM: 2016. p. 308–318. Abadi M, Chu A, Goodfellow I, McMahan HB, Mironov I, Talwar K, et al.Deep learning with differential privacy. In: Proceedings of the 2016 SIGSAC Conference on Computer and Communications Security. New York: ACM: 2016. p. 308–318.
56.
go back to reference Esteban C, Hyland SL, Rätsch G. Real-valued (Medical) Time Series Generation with Recurrent Conditional GANs. arXiv preprint arXiv:170602633. 2017. Esteban C, Hyland SL, Rätsch G. Real-valued (Medical) Time Series Generation with Recurrent Conditional GANs. arXiv preprint arXiv:170602633. 2017.
57.
go back to reference Choi E, Biswal S, Malin B, Duke J, Stewart WF, Sun J. Generating Multi-label Discrete Electronic Health Records using Generative Adversarial Networks. arXiv preprint arXiv:170306490. 2017. Choi E, Biswal S, Malin B, Duke J, Stewart WF, Sun J. Generating Multi-label Discrete Electronic Health Records using Generative Adversarial Networks. arXiv preprint arXiv:170306490. 2017.
60.
go back to reference Jiang X, Ji Z, Wang S, Mohammed N, Cheng S, Ohno-Machado L. Differential-private data publishing through component analysis. Trans Data Priv. 2013; 6(1):19.PubMedPubMedCentral Jiang X, Ji Z, Wang S, Mohammed N, Cheng S, Ohno-Machado L. Differential-private data publishing through component analysis. Trans Data Priv. 2013; 6(1):19.PubMedPubMedCentral
61.
62.
go back to reference Zaman ANK, Obimbo C, Dara RA. An Improved Data Sanitization Algorithm for Privacy Preserving Medical Data Publishing. In: Canadian Conference on Artificial Intelligence. Basel: Springer: 2017. p. 64–70. Zaman ANK, Obimbo C, Dara RA. An Improved Data Sanitization Algorithm for Privacy Preserving Medical Data Publishing. In: Canadian Conference on Artificial Intelligence. Basel: Springer: 2017. p. 64–70.
63.
go back to reference De Waal A, Hundepool A, Willenborg L. C. R. J. Argus: Software for statistical disclosure control of microdata. US Census Bureau: 1995. De Waal A, Hundepool A, Willenborg L. C. R. J. Argus: Software for statistical disclosure control of microdata. US Census Bureau: 1995.
64.
go back to reference Templ M. Statistical disclosure control for microdata using the R-package sdcMicro. Trans Data Priv. 2008; 1(2):67–85. Templ M. Statistical disclosure control for microdata using the R-package sdcMicro. Trans Data Priv. 2008; 1(2):67–85.
65.
go back to reference Hastie T, Tibshirani R, Friedman J. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. New York: Springer; 2016. Hastie T, Tibshirani R, Friedman J. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. New York: Springer; 2016.
66.
go back to reference Dankar FK, El Emam K. Practicing differential privacy in health care: A review. Trans Data Priv. 2013; 6(1):35–67. Dankar FK, El Emam K. Practicing differential privacy in health care: A review. Trans Data Priv. 2013; 6(1):35–67.
Metadata
Title
A comprehensive tool for creating and evaluating privacy-preserving biomedical prediction models
Authors
Johanna Eicher
Raffael Bild
Helmut Spengler
Klaus A. Kuhn
Fabian Prasser
Publication date
01-12-2020
Publisher
BioMed Central
Published in
BMC Medical Informatics and Decision Making / Issue 1/2020
Electronic ISSN: 1472-6947
DOI
https://doi.org/10.1186/s12911-020-1041-3

Other articles of this Issue 1/2020

BMC Medical Informatics and Decision Making 1/2020 Go to the issue