Skip to main content

A Comparison of Blocking Methods for Record Linkage

  • Conference paper
Privacy in Statistical Databases (PSD 2014)

Part of the book series: Lecture Notes in Computer Science ((LNISA,volume 8744))

Included in the following conference series:

Abstract

Record linkage seeks to merge databases and to remove duplicates when unique identifiers are not available. Most approaches use blocking techniques to reduce the computational complexity associated with record linkage. We review traditional blocking techniques, which typically partition the records according to a set of field attributes, and consider two variants of a method known as locality sensitive hashing, sometimes referred to as “private blocking.” We compare these approaches in terms of their recall, reduction ratio, and computational complexity. We evaluate these methods using different synthetic datafiles and conclude with a discussion of privacy-related issues.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Chapter
USD 29.95
Price excludes VAT (USA)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever
eBook
USD 39.99
Price excludes VAT (USA)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever
Softcover Book
USD 54.99
Price excludes VAT (USA)
  • Compact, lightweight edition
  • Dispatched in 3 to 5 business days
  • Free shipping worldwide - see info

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

Preview

Unable to display preview. Download preview PDF.

Unable to display preview. Download preview PDF.

References

  1. Christen, P.: Probabilistic Data Generation for Deduplication and Data Linkage. In: Gallagher, M., Hogan, J.P., Maire, F. (eds.) IDEAL 2005. LNCS, vol. 3578, pp. 109–116. Springer, Heidelberg (2005)

    Chapter  Google Scholar 

  2. Christen, P.: A survey of indexing techniques for scalable record linkage and deduplication. IEEE Transactions on Knowledge and Data Engineering 24 (2012)

    Google Scholar 

  3. Christen, P., Pudjijono, A.: Accurate synthetic generation of realistic personal information. In: Theeramunkong, T., Kijsirikul, B., Cercone, N., Ho, T.-B. (eds.) PAKDD 2009. LNCS, vol. 5476, pp. 507–514. Springer, Heidelberg (2009)

    Chapter  Google Scholar 

  4. Christen, P., Pudjijono, A.: Accurate Synthetic Generation of Realistic Personal Information. In: Theeramunkong, T., Kijsirikul, B., Cercone, N., Ho, T.-B. (eds.) PAKDD 2009. LNCS, vol. 5476, pp. 507–514. Springer, Heidelberg (2009), http://dx.doi.org/10.1007/978-3-642-01307-2_47

    Chapter  Google Scholar 

  5. Christen, P., Vatsalan, D.: Flexible and Extensible Generation and Corruption of Personal Data. In: Proceedings of the ACM International Conference on Information and Knowledge Management (CIKM 2013) (2013)

    Google Scholar 

  6. Clauset, A., Newman, M.E., Moore, C.: Finding community structure in very large networks. Physical Review E 70, 066111 (2004)

    Google Scholar 

  7. Durham, E.A.: A framework for accurate, efficient private record linkage. Ph.D. thesis, Vanderbilt University (2012)

    Google Scholar 

  8. Fortunato, S.: Community detection in graphs. Physics Reports 486, 75–174 (2010)

    Article  MathSciNet  Google Scholar 

  9. Goldenberg, A., Zheng, A.X., Fienberg, S.E., Airoldi, E.M.: A survey of statistical network models. Foundations and Trends® in Machine Learning 2, 129–233 (2010)

    Article  Google Scholar 

  10. Hall, R., Fienberg, S.: Valid statistical inference on automatically matched files. In: Domingo-Ferrer, J., Tinnirello, I. (eds.) PSD 2012. LNCS, vol. 7556, pp. 131–142. Springer, Heidelberg (2012)

    Chapter  Google Scholar 

  11. Herzog, T., Scheuren, F., Winkler, W.: Data Quality and Record Linkage Techniques. Springer, New York (2007)

    MATH  Google Scholar 

  12. Herzog, T., Scheuren, F., Winkler, W.: Record linkage. Wiley Interdisciplinary Reviews: Computational Statistics 2 (2010), doi:10.1002/wics.108

    Google Scholar 

  13. Karakasidis, A., Verykios, V.S.: Reference table based k-anonymous private blocking. In: Proceedings of the 27th Annual ACM Symposium on Applied Computing, pp. 859–864. ACM (2012)

    Google Scholar 

  14. Kuzu, M., Kantarcioglu, M., Durham, E., Malin, B.: A constraint satisfaction cryptanalysis of bloom filters in private record linkage. In: Fischer-Hübner, S., Hopper, N. (eds.) PETS 2011. LNCS, vol. 6794, pp. 226–245. Springer, Heidelberg (2011)

    Chapter  Google Scholar 

  15. Liang, H., Wang, Y., Christen, P., Gayler, R.: Noise-tolerant approximate blocking for dynamic real-time entity resolution. In: Tseng, V.S., Ho, T.B., Zhou, Z.-H., Chen, A.L.P., Kao, H.-Y. (eds.) PAKDD 2014, Part II. LNCS, vol. 8444, pp. 449–460. Springer, Heidelberg (2014)

    Chapter  Google Scholar 

  16. McCallum, A., Nigam, K., Ungar, L.H.: Efficient clustering of high-dimensional data sets with application to reference matching. In: Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 169–178. ACM (2000)

    Google Scholar 

  17. Pasula, H., Marthi, B., Milch, B., Russell, S., Shpitser, I.: Identity uncertainty and citation matching. In: Advances in Neural Information Processing Systems, pp. 1425–1432 (2003)

    Google Scholar 

  18. Paulevé, L., Jégou, H., Amsaleg, L.: Locality sensitive hashing: A comparison of hash function types and querying mechanisms. Pattern Recognition Letters 31, 1348–1358 (2010)

    Article  Google Scholar 

  19. Rajaraman, A., Ullman, J.D.: Mining of massive datasets. Cambridge University Press (2012)

    Google Scholar 

  20. Steorts, R.C., Hall, R., Fienberg, S.: A Bayesian approach to graphical record linkage and de-duplication (2013) (submitted)

    Google Scholar 

  21. Vatsalan, D., Christen, P.: Sorted nearest neighborhood clustering for efficient private blocking. In: Pei, J., Tseng, V.S., Cao, L., Motoda, H., Xu, G. (eds.) PAKDD 2013, Part II. LNCS, vol. 7819, pp. 341–352. Springer, Heidelberg (2013)

    Chapter  Google Scholar 

  22. Vatsalan, D., Christen, P., Verykios, V.S.: An efficient two-party protocol for approximate matching in private record linkage. In: Proceedings of the Ninth Australasian Data Mining Conference, vol. 121, pp. 125–136. Australian Computer Society, Inc. (2011)

    Google Scholar 

  23. Winkler, W., Yancey, W., Porter, E.: Fast record linkage of very large files in support of decennial and administrative records projects. In: Proceedings of American Statistical Association Section on Survey Research Methods (2010)

    Google Scholar 

Download references

Author information

Authors and Affiliations

Authors

Editor information

Editors and Affiliations

Rights and permissions

Reprints and permissions

Copyright information

© 2014 Springer International Publishing Switzerland

About this paper

Cite this paper

Steorts, R.C., Ventura, S.L., Sadinle, M., Fienberg, S.E. (2014). A Comparison of Blocking Methods for Record Linkage. In: Domingo-Ferrer, J. (eds) Privacy in Statistical Databases. PSD 2014. Lecture Notes in Computer Science, vol 8744. Springer, Cham. https://doi.org/10.1007/978-3-319-11257-2_20

Download citation

  • DOI: https://doi.org/10.1007/978-3-319-11257-2_20

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-11256-5

  • Online ISBN: 978-3-319-11257-2

  • eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics