Skip to main content
Top
Published in: BMC Medical Informatics and Decision Making 1/2009

Open Access 01-12-2009 | Technical advance

Privacy-preserving record linkage using Bloom filters

Authors: Rainer Schnell, Tobias Bachteler, Jörg Reiher

Published in: BMC Medical Informatics and Decision Making | Issue 1/2009

Login to get access

Abstract

Background

Combining multiple databases with disjunctive or additional information on the same person is occurring increasingly throughout research. If unique identification numbers for these individuals are not available, probabilistic record linkage is used for the identification of matching record pairs. In many applications, identifiers have to be encrypted due to privacy concerns.

Methods

A new protocol for privacy-preserving record linkage with encrypted identifiers allowing for errors in identifiers has been developed. The protocol is based on Bloom filters on q-grams of identifiers.

Results

Tests on simulated and actual databases yield linkage results comparable to non-encrypted identifiers and superior to results from phonetic encodings.

Conclusion

We proposed a protocol for privacy-preserving record linkage with encrypted identifiers allowing for errors in identifiers. Since the protocol can be easily enhanced and has a low computational burden, the protocol might be useful for many applications requiring privacy-preserving record linkage.
Appendix
Available only for authorised users
Literature
1.
go back to reference Herzog TN, Scheuren FJ, Winkler WE: Data quality and record linkage techniques. 2007, New York: Springer Herzog TN, Scheuren FJ, Winkler WE: Data quality and record linkage techniques. 2007, New York: Springer
2.
go back to reference Clifton C, Kantarcioglu M, Doan A, Schadow G, Vaidya J, Elmagarmid AK, Suciu D: Privacy-preserving data integration and sharing. Proceedings of the 9th ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery: 13 June 2004; Paris. Edited by: Das G, Liu B, Yu PS. 2004, New York: ACM, 19-26.CrossRef Clifton C, Kantarcioglu M, Doan A, Schadow G, Vaidya J, Elmagarmid AK, Suciu D: Privacy-preserving data integration and sharing. Proceedings of the 9th ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery: 13 June 2004; Paris. Edited by: Das G, Liu B, Yu PS. 2004, New York: ACM, 19-26.CrossRef
3.
go back to reference Churches T, Christen P: Blind data linkage using n-gram similarity comparisons. Advances in knowledge discovery and data mining. Proceedings of the 8th Pacific-Asia Conference: 26–28 May 2004; Sydney. Edited by: Dai H, Srikant R, Zhang C. 2004, Berlin: Springer, 121-126.CrossRef Churches T, Christen P: Blind data linkage using n-gram similarity comparisons. Advances in knowledge discovery and data mining. Proceedings of the 8th Pacific-Asia Conference: 26–28 May 2004; Sydney. Edited by: Dai H, Srikant R, Zhang C. 2004, Berlin: Springer, 121-126.CrossRef
4.
go back to reference Al-Lawati A, Lee D, McDaniel P: Blocking-aware private record linkage. Proceedings of the 2nd International Workshop on Information Quality in Information Systems: 17 June 2005; Baltimore. Edited by: Berti-Equille L, Batini C, Srivastava D. 2005, New York: ACM, 59-68. Al-Lawati A, Lee D, McDaniel P: Blocking-aware private record linkage. Proceedings of the 2nd International Workshop on Information Quality in Information Systems: 17 June 2005; Baltimore. Edited by: Berti-Equille L, Batini C, Srivastava D. 2005, New York: ACM, 59-68.
5.
go back to reference Agrawal R, Evfimievski A, Srikant R: Information sharing across private databases. Proceedings of the ACM SIGMOD International Conference on Management of Data: 9–12 June 2003; San Diego. Edited by: Halevy AY, Ives ZG, Doan A. 2003, New York: ACM, 86-97.CrossRef Agrawal R, Evfimievski A, Srikant R: Information sharing across private databases. Proceedings of the ACM SIGMOD International Conference on Management of Data: 9–12 June 2003; San Diego. Edited by: Halevy AY, Ives ZG, Doan A. 2003, New York: ACM, 86-97.CrossRef
6.
go back to reference Agrawal R, Asonov D, Kantarcioglu M, Li Y: Sovereign joins. Proceedings of the 22nd International Conference on Data Engineering: 3–8 April 2006; Atlanta. Edited by: Liu L, Reuter A, Whang KY, Zhang J. 2006, Los Alamitos: IEEE Agrawal R, Asonov D, Kantarcioglu M, Li Y: Sovereign joins. Proceedings of the 22nd International Conference on Data Engineering: 3–8 April 2006; Atlanta. Edited by: Liu L, Reuter A, Whang KY, Zhang J. 2006, Los Alamitos: IEEE
7.
go back to reference Kantarcioglu M, Jiang W, Malin B: A privacy-preserving framework for integrating person-specific databases. Privacy in statistical databases. Proceedings of the UNESCO Chair in Data Privacy International Conference: 24–26 September 2008; Instanbul, Turkey. Edited by: Domingo-Ferrer J, Saygin Y. 2008, Berlin: Springer, 298-314. Kantarcioglu M, Jiang W, Malin B: A privacy-preserving framework for integrating person-specific databases. Privacy in statistical databases. Proceedings of the UNESCO Chair in Data Privacy International Conference: 24–26 September 2008; Instanbul, Turkey. Edited by: Domingo-Ferrer J, Saygin Y. 2008, Berlin: Springer, 298-314.
8.
go back to reference Xiao X, Tao Y: Dynamic anonymization: accurate statistical analysis with privacy preservation. Proceedings of the ACM SIGMOD International Conference on Management of Data: 10–12 June 2008; Vancouver. Edited by: Wang JTL. 2008, New York: ACM, 107-120.CrossRef Xiao X, Tao Y: Dynamic anonymization: accurate statistical analysis with privacy preservation. Proceedings of the ACM SIGMOD International Conference on Management of Data: 10–12 June 2008; Vancouver. Edited by: Wang JTL. 2008, New York: ACM, 107-120.CrossRef
10.
go back to reference Bellare M, Canetti R, Krawczyk H: Keying hash functions for message authentication. Advances in cryptology-CRYPTO '96. Proceedings of the 16th Annual International Cryptology Conference: 18–22 August 1996; Santa Barbara. Edited by: Koblitz N. 1996, Berlin: Springer, 1-15. Bellare M, Canetti R, Krawczyk H: Keying hash functions for message authentication. Advances in cryptology-CRYPTO '96. Proceedings of the 16th Annual International Cryptology Conference: 18–22 August 1996; Santa Barbara. Edited by: Koblitz N. 1996, Berlin: Springer, 1-15.
11.
go back to reference Churches T, Christen P: Some methods for blindfolded record linkage. BMC Med Inf Decis Mak. 2004, 4 (9): Churches T, Christen P: Some methods for blindfolded record linkage. BMC Med Inf Decis Mak. 2004, 4 (9):
12.
go back to reference O'Keefe CM, Yung M, Gu L, Baxter R: Privacy-preserving data linkage protocols. Proceedings of the 2004 ACM Workshop on Privacy in the Electronic Society: 28 October 2004; Washington, DC. Edited by: De Capitani di Vimercati S, Syverson P. 2004, New York: ACM, 94-102.CrossRef O'Keefe CM, Yung M, Gu L, Baxter R: Privacy-preserving data linkage protocols. Proceedings of the 2004 ACM Workshop on Privacy in the Electronic Society: 28 October 2004; Washington, DC. Edited by: De Capitani di Vimercati S, Syverson P. 2004, New York: ACM, 94-102.CrossRef
13.
go back to reference Berman JJ: Zero-check: a zero-knowledge protocol for reconciling patient identities across institutions. Arch Pathol Lab Med. 2004, 128 (3): 344-346.PubMed Berman JJ: Zero-check: a zero-knowledge protocol for reconciling patient identities across institutions. Arch Pathol Lab Med. 2004, 128 (3): 344-346.PubMed
14.
go back to reference Jaro MA: Advances in record-linkage methodology as applied to matching the 1985 census of Tampa, Florida. J Am Stat Assoc. 1989, 84 (406): 414-420. 10.2307/2289924.CrossRef Jaro MA: Advances in record-linkage methodology as applied to matching the 1985 census of Tampa, Florida. J Am Stat Assoc. 1989, 84 (406): 414-420. 10.2307/2289924.CrossRef
15.
go back to reference Bouzelat H, Quantin C, Dusserre L: Extraction and anonymity protocol of medical file. Proceedings of the 1996 AMIA Annual Fall Symposium: 26–30 October 1996; Washington, DC. Edited by: Cimino JJ. 1996, Philadelphia: Hanley & Belfus, 323-327. Bouzelat H, Quantin C, Dusserre L: Extraction and anonymity protocol of medical file. Proceedings of the 1996 AMIA Annual Fall Symposium: 26–30 October 1996; Washington, DC. Edited by: Cimino JJ. 1996, Philadelphia: Hanley & Belfus, 323-327.
16.
go back to reference Quantin C, Binquet C, Allaert FA, Cornet B, Pattisina R, Leteuff G, Ferdynus C, Gouyon JB: Decision analysis for the assessment of a record linkage procedure: application to a perinatal network. Methods Inf Med. 2005, 44: 72-79.PubMed Quantin C, Binquet C, Allaert FA, Cornet B, Pattisina R, Leteuff G, Ferdynus C, Gouyon JB: Decision analysis for the assessment of a record linkage procedure: application to a perinatal network. Methods Inf Med. 2005, 44: 72-79.PubMed
18.
go back to reference Trepetin S: Privacy-preserving string comparisons in record linkage systems: a review. Information Security Journal. 2008, 17 (5–6): 253-266. 10.1080/19393550802492503. Trepetin S: Privacy-preserving string comparisons in record linkage systems: a review. Information Security Journal. 2008, 17 (5–6): 253-266. 10.1080/19393550802492503.
19.
go back to reference Thoben W, Appelrath HJ, Sauer S: Record linkage of anonymous data by control numbers. From data to knowledge: theoretical and practical aspects of classification, data analysis, and knowledge organization. Edited by: Gaul W, Pfeifer D. 1995, Berlin: Springer, 412-419. [Bock H H, Gaul W and Vichi M (Series Editors): Studies in Classification, Data Analysis, and Knowledge Organization, vol 6] Thoben W, Appelrath HJ, Sauer S: Record linkage of anonymous data by control numbers. From data to knowledge: theoretical and practical aspects of classification, data analysis, and knowledge organization. Edited by: Gaul W, Pfeifer D. 1995, Berlin: Springer, 412-419. [Bock H H, Gaul W and Vichi M (Series Editors): Studies in Classification, Data Analysis, and Knowledge Organization, vol 6]
20.
go back to reference Eichelberg M, Aden T, Thoben W: A distributed patient identification protocol based on control numbers with semantic annotation. International Journal on Semantic Web and Information Systems. 2005, 1 (4): 24-43.CrossRef Eichelberg M, Aden T, Thoben W: A distributed patient identification protocol based on control numbers with semantic annotation. International Journal on Semantic Web and Information Systems. 2005, 1 (4): 24-43.CrossRef
21.
go back to reference Rogers HJ, Willett P: Searching for historical word forms in text databases using spelling-correction methods: reverse error and phonetic coding methods. J Doc. 1991, 47 (4): 333-353. 10.1108/eb026883.CrossRef Rogers HJ, Willett P: Searching for historical word forms in text databases using spelling-correction methods: reverse error and phonetic coding methods. J Doc. 1991, 47 (4): 333-353. 10.1108/eb026883.CrossRef
22.
go back to reference Friedman C, Sideli R: Tolerating spelling errors during patient validation. Comput Biomed Res. 1992, 25 (5): 486-509. 10.1016/0010-4809(92)90005-U.CrossRefPubMed Friedman C, Sideli R: Tolerating spelling errors during patient validation. Comput Biomed Res. 1992, 25 (5): 486-509. 10.1016/0010-4809(92)90005-U.CrossRefPubMed
23.
go back to reference Camps R, Daude J: Improving the efficacy of approximate searching by personal-name. Natural language processing and information systems. Proceedings of the 8th International Conference on Applications of Natural Language to Information Systems: June 2003; Burg (Spreewald), Germany. Edited by: Düsterhöft A, Thalheim B. 2003, Bonn: Gesellschaft für Informatik, 70-76. Camps R, Daude J: Improving the efficacy of approximate searching by personal-name. Natural language processing and information systems. Proceedings of the 8th International Conference on Applications of Natural Language to Information Systems: June 2003; Burg (Spreewald), Germany. Edited by: Düsterhöft A, Thalheim B. 2003, Bonn: Gesellschaft für Informatik, 70-76.
25.
go back to reference Christen P: Privacy-preserving data linkage and geocoding: current approaches and research directions. Workshops Proceedings of the 6th IEEE International Conference on Data Mining: 18–22 December 2006; Hong Kong. Edited by: Tsumoto S. 2006, Los Alamitos: IEEE, 497-501. Christen P: Privacy-preserving data linkage and geocoding: current approaches and research directions. Workshops Proceedings of the 6th IEEE International Conference on Data Mining: 18–22 December 2006; Hong Kong. Edited by: Tsumoto S. 2006, Los Alamitos: IEEE, 497-501.
26.
go back to reference Verykios VS, Karakasidis A, Mitrogiannis VK: Privacy preserving record linkage approaches. International Journal of Data Mining, Modelling and Management. 2009, 1 (2): 206-221. 10.1504/IJDMMM.2009.026076.CrossRef Verykios VS, Karakasidis A, Mitrogiannis VK: Privacy preserving record linkage approaches. International Journal of Data Mining, Modelling and Management. 2009, 1 (2): 206-221. 10.1504/IJDMMM.2009.026076.CrossRef
27.
go back to reference Trepetin S: Privacy in context: the costs and benefits of a new deidentification method. PhD thesis. 2006, Massachusetts Institute of Technology, Department of Electrical Engineering and Computer Science Trepetin S: Privacy in context: the costs and benefits of a new deidentification method. PhD thesis. 2006, Massachusetts Institute of Technology, Department of Electrical Engineering and Computer Science
28.
go back to reference Scannapieco M, Figotin I, Bertino E, Elmagarmid AK: Privacy preserving schema and data matching. Proceedings of the ACM SIGMOD International Conference on Management of Data: 12–14 June 2007; Beijing. Edited by: Chan CY, Ooi BC, Zhou A. 2007, New York: ACM, 653-664.CrossRef Scannapieco M, Figotin I, Bertino E, Elmagarmid AK: Privacy preserving schema and data matching. Proceedings of the ACM SIGMOD International Conference on Management of Data: 12–14 June 2007; Beijing. Edited by: Chan CY, Ooi BC, Zhou A. 2007, New York: ACM, 653-664.CrossRef
32.
go back to reference Inan A, Kantarcioglu M, Bertino E, Scannapieco M: A hybrid approach to private record linkage. Proceedings of the 24th International Conference on Data Engineering: 7–12 April 2008; Cancun, Mexico. 2008, Los Alamitos: IEEE, 496-505.CrossRef Inan A, Kantarcioglu M, Bertino E, Scannapieco M: A hybrid approach to private record linkage. Proceedings of the 24th International Conference on Data Engineering: 7–12 April 2008; Cancun, Mexico. 2008, Los Alamitos: IEEE, 496-505.CrossRef
33.
go back to reference Atallah MJ, Kerschbaum F, Du W: Secure and private sequence comparisons. Proceedings of the ACM Workshop on Privacy in the Electronic Society: 30 October 2003; Washington, DC. Edited by: Samarati P, Syverson P. 2003, New York: ACM, 39-44.CrossRef Atallah MJ, Kerschbaum F, Du W: Secure and private sequence comparisons. Proceedings of the ACM Workshop on Privacy in the Electronic Society: 30 October 2003; Washington, DC. Edited by: Samarati P, Syverson P. 2003, New York: ACM, 39-44.CrossRef
35.
go back to reference Vaidya J, Clifton C: Secure set intersection cardinality with application to association rule mining. Journal of Computer Security. 2005, 13 (4): 593-622.CrossRef Vaidya J, Clifton C: Secure set intersection cardinality with application to association rule mining. Journal of Computer Security. 2005, 13 (4): 593-622.CrossRef
36.
go back to reference Yakout M, Atallah MJ, Elmagarmid AK: Efficient private record linkage. Proceedings of the 25th International Conference on Data Engineering: 29 March–2 April 2009; Shanghai. 2009, 1283-1286.CrossRef Yakout M, Atallah MJ, Elmagarmid AK: Efficient private record linkage. Proceedings of the 25th International Conference on Data Engineering: 29 March–2 April 2009; Shanghai. 2009, 1283-1286.CrossRef
37.
go back to reference Bloom BH: Space/time trade-offs in hash coding with allowable errors. Communications of the ACM. 1970, 13 (7): 422-426.CrossRef Bloom BH: Space/time trade-offs in hash coding with allowable errors. Communications of the ACM. 1970, 13 (7): 422-426.CrossRef
38.
go back to reference Broder A, Mitzenmacher M: Network applications of Bloom filters: a survey. Internet Mathematics. 2003, 1 (4): 485-509.CrossRef Broder A, Mitzenmacher M: Network applications of Bloom filters: a survey. Internet Mathematics. 2003, 1 (4): 485-509.CrossRef
39.
go back to reference Jain N, Dahlin M, Tewari R: Using Bloom filters to refine web search results. Proceedings of the Eight International Workshop on the Web & Databases: 16–17 June 2005; Baltimore. Edited by: Doan A, Neven F, McCann R, Bex GJ. 2005, 25-30. Jain N, Dahlin M, Tewari R: Using Bloom filters to refine web search results. Proceedings of the Eight International Workshop on the Web & Databases: 16–17 June 2005; Baltimore. Edited by: Doan A, Neven F, McCann R, Bex GJ. 2005, 25-30.
40.
go back to reference Mitzenmacher M, Upfal E: Probability and computing: an introduction to randomized algorithms and probabilistic analysis. 2005, New York: Cambridge University PressCrossRef Mitzenmacher M, Upfal E: Probability and computing: an introduction to randomized algorithms and probabilistic analysis. 2005, New York: Cambridge University PressCrossRef
41.
go back to reference Kirsch A, Mitzenmacher M: Less hashing, same performance: building a better Bloom filter. Algorithms-ESA 2006. Proceedings of the 14th Annual European Symposium: 11–13 September 2006; Zürich. Edited by: Azar Y, Erlebach T. 2006, Berlin: Springer, 456-467.CrossRef Kirsch A, Mitzenmacher M: Less hashing, same performance: building a better Bloom filter. Algorithms-ESA 2006. Proceedings of the 14th Annual European Symposium: 11–13 September 2006; Zürich. Edited by: Azar Y, Erlebach T. 2006, Berlin: Springer, 456-467.CrossRef
42.
go back to reference Schneier B: Applied cryptography: protocols, algorithms, and source code in C. 1996, New York: Wiley, 2 Schneier B: Applied cryptography: protocols, algorithms, and source code in C. 1996, New York: Wiley, 2
43.
go back to reference Dice LR: Measures of the amount of ecologic association between species. Ecology. 1945, 26 (3): 297-302.CrossRef Dice LR: Measures of the amount of ecologic association between species. Ecology. 1945, 26 (3): 297-302.CrossRef
44.
go back to reference Salton G, McGill MJ: Introduction to modern information retrieval. 1983, New York: McGraw-Hill Salton G, McGill MJ: Introduction to modern information retrieval. 1983, New York: McGraw-Hill
45.
go back to reference Baeza-Yates R, Ribeiro-Neto B: Modern information retrieval. 1999, Harlow: Addison-Wesley Baeza-Yates R, Ribeiro-Neto B: Modern information retrieval. 1999, Harlow: Addison-Wesley
46.
go back to reference Winkler WE: Matching and record linkage. Business survey methods. Edited by: Cox BG, Binder DA, Chinnappa BN, Christianson A, Colledge MJ, Kott PS. 1995, New York: Wiley, 355-384. Winkler WE: Matching and record linkage. Business survey methods. Edited by: Cox BG, Binder DA, Chinnappa BN, Christianson A, Colledge MJ, Kott PS. 1995, New York: Wiley, 355-384.
47.
go back to reference Knuth DE: The art of computer programming. Sorting and searching. 1998, Reading: Addison-Wesley, 2 Knuth DE: The art of computer programming. Sorting and searching. 1998, Reading: Addison-Wesley, 2
48.
go back to reference Christen P: Probabilistic data generation for deduplication and data linkage. Proceedings of the Sixth International Conference on Intelligent Data Engineering and Automated Learning: 6–8 July 2005; Brisbane, Australia. Edited by: Gallagher M, Hogan JM, Maire F. 2005, Berlin: Springer, 109-116.CrossRef Christen P: Probabilistic data generation for deduplication and data linkage. Proceedings of the Sixth International Conference on Intelligent Data Engineering and Automated Learning: 6–8 July 2005; Brisbane, Australia. Edited by: Gallagher M, Hogan JM, Maire F. 2005, Berlin: Springer, 109-116.CrossRef
49.
go back to reference Schnell R, Bachteler T, Bender S: A toolbox for record linkage. Austrian Journal of Statistics. 2004, 33 (1–2): 125-133. Schnell R, Bachteler T, Bender S: A toolbox for record linkage. Austrian Journal of Statistics. 2004, 33 (1–2): 125-133.
51.
go back to reference Dewri R, Ray I, Ray I, Whitley D: On the comparison of microdata disclosure control algorithms. Proceedings of the 12th International Conference on Extending Database Technology: 24–26 March 2009; Saint Petersburg, Russia. Edited by: Kersten M, Novikov B, Teubner J, Polutin V, Manegold S. 2009, New York: ACM, 240-251.CrossRef Dewri R, Ray I, Ray I, Whitley D: On the comparison of microdata disclosure control algorithms. Proceedings of the 12th International Conference on Extending Database Technology: 24–26 March 2009; Saint Petersburg, Russia. Edited by: Kersten M, Novikov B, Teubner J, Polutin V, Manegold S. 2009, New York: ACM, 240-251.CrossRef
Metadata
Title
Privacy-preserving record linkage using Bloom filters
Authors
Rainer Schnell
Tobias Bachteler
Jörg Reiher
Publication date
01-12-2009
Publisher
BioMed Central
Published in
BMC Medical Informatics and Decision Making / Issue 1/2009
Electronic ISSN: 1472-6947
DOI
https://doi.org/10.1186/1472-6947-9-41

Other articles of this Issue 1/2009

BMC Medical Informatics and Decision Making 1/2009 Go to the issue