Published in: BMC Medical Research Methodology 1/2017

Open Access 01-12-2017 | Research article

Addressing data privacy in matched studies via virtual pooling

Authors: P. Saha-Chaudhuri, C.R. Weinberg

Abstract

Background

Data confidentiality and shared use of research data are two desirable but sometimes conflicting goals in multi-center studies with distributed data. A single dataset that includes covariate information for all participants would be ideal for straightforward analysis, but confidentiality restrictions forbid creating one. Current approaches such as aggregate data sharing, distributed regression, meta-analysis and score-based methods can have important limitations.

Methods

We propose a novel application of an existing epidemiologic tool, specimen pooling, to enable confidentiality-preserving analysis of data arising from a matched case-control, multi-center design. Instead of pooling specimens prior to assay, we apply the methodology to virtually pool (aggregate) covariates within nodes. Such virtual pooling retains most of the information used in an analysis with individual data. Because individual participant data are not shared externally, within-node virtual pooling also preserves data confidentiality. We show that aggregated covariate levels can be used in a conditional logistic regression model to estimate individual-level odds ratios of interest.
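The pooling step described above can be sketched in code. The following Python example is a minimal illustration, not the authors' R implementation. It assumes 1:1 matching, a single continuous covariate, four hypothetical nodes and pools of two matched pairs, and it uses simulated data only. Within each node, the covariates of the cases in a pool are summed, as are the covariates of their matched controls, and only these sums would leave the node. Under these assumptions, the conditional likelihood contribution of a pooled set can be written as 1 / (1 + exp(-beta * (S_case - S_control))), the same form as for a single matched pair, so one Newton-Raphson routine serves both fits. All function names (simulate_pairs, pool_within_nodes, fit_clogit_1to1) are hypothetical.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def simulate_pairs(n_pairs, beta, n_nodes, rng):
    """Simulate 1:1 matched pairs as (node, x_case, x_control) tuples."""
    pairs = []
    for _ in range(n_pairs):
        node = rng.randrange(n_nodes)
        shared = rng.gauss(0.0, 1.0)  # stratum effect shared by both members
        xa = shared + rng.gauss(0.0, 1.0)
        xb = shared + rng.gauss(0.0, 1.0)
        # The member with covariate xa is the case with probability
        # exp(beta*xa) / (exp(beta*xa) + exp(beta*xb)).
        if rng.random() < sigmoid(beta * (xa - xb)):
            pairs.append((node, xa, xb))
        else:
            pairs.append((node, xb, xa))
    return pairs


def pool_within_nodes(pairs, pool_size):
    """Sum case and control covariates over pool_size pairs from one node."""
    by_node = {}
    for node, x_case, x_ctrl in pairs:
        by_node.setdefault(node, []).append((x_case, x_ctrl))
    pooled = []
    for node, members in by_node.items():
        # Pairs left over after the last complete pool are dropped (assumption).
        for start in range(0, len(members) - pool_size + 1, pool_size):
            chunk = members[start:start + pool_size]
            case_sum = sum(c for c, _ in chunk)
            ctrl_sum = sum(k for _, k in chunk)
            pooled.append((node, case_sum, ctrl_sum))
    return pooled


def fit_clogit_1to1(sets, n_iter=50, tol=1e-10):
    """Newton-Raphson fit of a one-covariate 1:1 conditional logistic model.

    Returns (beta_hat, approximate standard error).
    """
    diffs = [x_case - x_ctrl for _, x_case, x_ctrl in sets]
    beta = 0.0
    info = 0.0
    for _ in range(n_iter):
        score = 0.0
        info = 0.0
        for d in diffs:
            p = sigmoid(beta * d)
            score += d * (1.0 - p)          # derivative of the log-likelihood
            info += d * d * p * (1.0 - p)   # observed information
        step = score / info
        beta += step
        if abs(step) < tol:
            break
    return beta, 1.0 / math.sqrt(info)


rng = random.Random(2017)
pairs = simulate_pairs(n_pairs=2000, beta=0.5, n_nodes=4, rng=rng)

beta_ind, se_ind = fit_clogit_1to1(pairs)       # individual-level analysis
pooled = pool_within_nodes(pairs, pool_size=2)  # only sums leave each node
beta_pool, se_pool = fit_clogit_1to1(pooled)    # analysis of pooled sums

print(f"individual data: beta = {beta_ind:.3f} (SE {se_ind:.3f})")
print(f"pooled data:     beta = {beta_pool:.3f} (SE {se_pool:.3f})")
```

Comparing the two printed estimates and standard errors gives a rough sense of how much information pooling costs in this toy setting. The exact numbers depend on the seed and on the assumed data-generating model.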

Results

The parameter estimates from the standard conditional logistic regression are compared to the estimates based on a conditional logistic regression model with aggregated data. The parameter estimates are shown to be similar to those without pooling and to have comparable standard errors and confidence interval coverage.

Conclusions

Virtual data pooling can be used to maintain the confidentiality of data from multi-center studies and can be particularly useful in research with large-scale distributed data.
Footnotes
1
We used the clogit function in the programming language R to fit a conditional logistic regression model to data from a matched case-control design.

2
All simulations and statistical analyses were performed using the programming language R. The R function “clogit” was used to estimate the parameters of eq. (4) and eq. (5). R code is available upon request from the corresponding author.
Metadata
Title
Addressing data privacy in matched studies via virtual pooling
Authors
P. Saha-Chaudhuri
C.R. Weinberg
Publication date
01-12-2017
Publisher
BioMed Central
Published in
BMC Medical Research Methodology / Issue 1/2017
Electronic ISSN: 1471-2288
DOI
https://doi.org/10.1186/s12874-017-0419-0
