Skip to main content
Top
Published in: BMC Medical Informatics and Decision Making 1/2016

Open Access 01-12-2016 | Research article

A proficient cost reduction framework for de-duplication of records in data integration

Authors: Asif Sohail, Muhammad Murtaza Yousaf

Published in: BMC Medical Informatics and Decision Making | Issue 1/2016

Login to get access

Abstract

Background

Record de-duplication is a process of identifying the records referring to the same entity. It has a pivotal role in data mining applications, which involves the integration of multiple data sources and data cleansing. It has been a challenging task due to its computational complexity and variations in data representations across different data sources. Blocking and windowing are the commonly used methods for reducing the number of record comparisons during record de-duplication. Both blocking and windowing require tuning of a certain set of parameters, such as the choice of a particular variant of blocking or windowing, the selection of appropriate window size for different datasets etc.

Methods

In this paper, we have proposed a framework that employs blocking and windowing techniques in succession, such that figuring out the parameters is not required. We have also evaluated the impact of different configurations on dirty and massively dirty datasets. To evaluate the proposed framework, experiments are performed using Febrl (Freely Extensible Biomedical Record Linkage).

Results

The proposed framework is comprehensively evaluated using a variety of quality and complexity parameters such as reduction ratio, precision, recall etc. It is observed that the proposed framework significantly reduces the number of record comparisons.

Conclusions

The selection of the linkage key is a critical performance factor for record linkage.
Appendix
Available only for authorised users
Literature
1.
go back to reference Bleiholder J, Naumann F. Data fusion. ACM Comput Surv. 2008;41(1):Article 1.CrossRef Bleiholder J, Naumann F. Data fusion. ACM Comput Surv. 2008;41(1):Article 1.CrossRef
2.
go back to reference Rahm E, Do HH. Data cleaning: problems and current approaches. IEEE Data Engineering Bulletin. 2000;23(4):3–13. Rahm E, Do HH. Data cleaning: problems and current approaches. IEEE Data Engineering Bulletin. 2000;23(4):3–13.
3.
go back to reference Herzog TN, Scheuren FJ, Winkler WE. Data quality and record linkage techniques. Springer Science & Business Media; 2007. Herzog TN, Scheuren FJ, Winkler WE. Data quality and record linkage techniques. Springer Science & Business Media; 2007.
5.
go back to reference Whang SE, Garcia-Molina H. Entity resolution with evolving rules. PVLDB. 2010;3(1):1326–37. Whang SE, Garcia-Molina H. Entity resolution with evolving rules. PVLDB. 2010;3(1):1326–37.
6.
go back to reference Whang SE, Garcia-Molina H. Developments in generic entity resolution. IEEE Data Engineering Bulletin. 2011;34(3):51–9. Whang SE, Garcia-Molina H. Developments in generic entity resolution. IEEE Data Engineering Bulletin. 2011;34(3):51–9.
7.
go back to reference Elmagarmid AK, Ipeirotis PG, Verykios VS. Duplicate record detection: a survey. IEEE Trans Knowl Data Eng. 2007;19(1):1–16.CrossRef Elmagarmid AK, Ipeirotis PG, Verykios VS. Duplicate record detection: a survey. IEEE Trans Knowl Data Eng. 2007;19(1):1–16.CrossRef
8.
go back to reference Hernandez MA, Stolfo SJ. The merge/purge problem for large databases. San Jose: ACM SIGMOD; 1995. p. 127–38. Hernandez MA, Stolfo SJ. The merge/purge problem for large databases. San Jose: ACM SIGMOD; 1995. p. 127–38.
9.
go back to reference Fellegi IP, Sunter AB. A theory for record linkage. J Am Stat Assoc. 1969;64:1183–210.CrossRef Fellegi IP, Sunter AB. A theory for record linkage. J Am Stat Assoc. 1969;64:1183–210.CrossRef
10.
go back to reference Samwald M et al. Linked open drug data for pharmaceutical research and development. J Cheminform. 2011;3.1:19.CrossRef Samwald M et al. Linked open drug data for pharmaceutical research and development. J Cheminform. 2011;3.1:19.CrossRef
11.
go back to reference Bauer F, Kaltenböck M. Linked open data: the essentials. Vienna: Edition mono/monochrome; 2011. Bauer F, Kaltenböck M. Linked open data: the essentials. Vienna: Edition mono/monochrome; 2011.
12.
go back to reference Christen P, Goiser K. Quality and complexity measures for data linkage and deduplication. In: F. Guillet, H. Hamilton (eds). Quality Measures in Data Mining, Studies in Computational Intelligence, vol. 43. Springer; 2007, pp. 127–151. Christen P, Goiser K. Quality and complexity measures for data linkage and deduplication. In: F. Guillet, H. Hamilton (eds). Quality Measures in Data Mining, Studies in Computational Intelligence, vol. 43. Springer; 2007, pp. 127–151.
13.
go back to reference Christen P. A survey of indexing techniques for scalable record linkage and deduplication. Knowledge and Data Engineering, IEEE Transactions on 24.9. 2012. p. 1537–55. Christen P. A survey of indexing techniques for scalable record linkage and deduplication. Knowledge and Data Engineering, IEEE Transactions on 24.9. 2012. p. 1537–55. 
14.
go back to reference Christen P. Data Matching, Concepts and Techniques of Record Linkage, Entity Resolution, and Duplicate Detection. Springer; 2012. Christen P. Data Matching, Concepts and Techniques of Record Linkage, Entity Resolution, and Duplicate Detection. Springer; 2012.
15.
go back to reference Draisbach, U., Naumann, F., Szott, S., & Wonneberg, O. (2012, April). Adaptive windows for duplicate detection. In Data Engineering (ICDE), 2012 IEEE 28th International Conference on (pp. 1073-1083). IEEE. Draisbach, U., Naumann, F., Szott, S., & Wonneberg, O. (2012, April). Adaptive windows for duplicate detection. In Data Engineering (ICDE), 2012 IEEE 28th International Conference on (pp. 1073-1083). IEEE.
16.
go back to reference Elfeky MG, Verykios VS, Elmagarmid AK. TAILOR: a record linkage toolbox. San Jose: IEEE ICDE; 2002. p. 17–28. Elfeky MG, Verykios VS, Elmagarmid AK. TAILOR: a record linkage toolbox. San Jose: IEEE ICDE; 2002. p. 17–28.
17.
go back to reference Goiser K, Christen P. ‘Australasian Data Mining Conference’ (AusDM’06), vol. 61. Sydney: Conferences in Research and Practice in Information Technology (CRPIT); 2006. p. 23–31. Goiser K, Christen P. ‘Australasian Data Mining Conference’ (AusDM’06), vol. 61. Sydney: Conferences in Research and Practice in Information Technology (CRPIT); 2006. p. 23–31.
18.
go back to reference Gu L, Baxter R, Vickers D, C. Rainsford C. Record linkage: current practice and future directions. Canberra: Technical Report 03/83, CSIRO Mathematical and Information Sciences; 2003. Gu L, Baxter R, Vickers D, C. Rainsford C. Record linkage: current practice and future directions. Canberra: Technical Report 03/83, CSIRO Mathematical and Information Sciences; 2003.
19.
go back to reference Patrick L, Fankhauser P. A Precise Blocking Method for Record Linkage. In: Min Tjoa A, Trujillo J, editors. DaWaK 2005, LNCS 3589. 2005. p. 210–20. Patrick L, Fankhauser P. A Precise Blocking Method for Record Linkage. In: Min Tjoa A, Trujillo J, editors. DaWaK 2005, LNCS 3589. 2005. p. 210–20.
20.
go back to reference Christen P. Febrl: an open source data cleaning, deduplication and record linkage system with a graphical user interface. Las Vegas: ACM SIGKDD; 2008. p. 1065–8. Christen P. Febrl: an open source data cleaning, deduplication and record linkage system with a graphical user interface. Las Vegas: ACM SIGKDD; 2008. p. 1065–8.
21.
go back to reference Michelson M, Knoblock CA. Learning blocking schemes for record linkage. Boston: AAAI’06; 2006. Michelson M, Knoblock CA. Learning blocking schemes for record linkage. Boston: AAAI’06; 2006.
24.
go back to reference Gu L, Baxter R. Adaptive filtering for efficient record linkage. Orlando: SIAM international conference on data mining; 2004.CrossRef Gu L, Baxter R. Adaptive filtering for efficient record linkage. Orlando: SIAM international conference on data mining; 2004.CrossRef
25.
go back to reference Naumann F, Herschel M. An introduction to duplicate detection. Synthesis Lectures on Data Management 2.1. 2010. 1–87. Naumann F, Herschel M. An introduction to duplicate detection. Synthesis Lectures on Data Management 2.1. 2010. 1–87.
26.
go back to reference Gu L, Baxter R. Decision models for record linkage. In: Selected Papers from AusDM, Springer LNCS 3755. 2006. p. 146–60. Gu L, Baxter R. Decision models for record linkage. In: Selected Papers from AusDM, Springer LNCS 3755. 2006. p. 146–60.
27.
go back to reference Maggi F. A Survey of Probabilistic Record Matching Models, Techniques and Tools, Scienti_c Report TR-2008. 2008. Maggi F. A Survey of Probabilistic Record Matching Models, Techniques and Tools, Scienti_c Report TR-2008. 2008.
28.
go back to reference Christen P. A comparison of personal name matching: Techniques and practical issues. In: Workshop on Mining Complex Data, held at IEEE ICDM. Hong Kong. 2006. Christen P. A comparison of personal name matching: Techniques and practical issues. In: Workshop on Mining Complex Data, held at IEEE ICDM. Hong Kong. 2006.
29.
go back to reference Odell M, Russell R. The soundex coding system. US Patents 1261167. 1918. Odell M, Russell R. The soundex coding system. US Patents 1261167. 1918.
30.
go back to reference Baxter R, Christen P, Churches T. “A comparison of fast blocking methods for record linkage.” ACM SIGKDD. Vol. 3. 2003. Baxter R, Christen P, Churches T. “A comparison of fast blocking methods for record linkage.” ACM SIGKDD. Vol. 3. 2003.
31.
go back to reference Köpcke H, Rahm E. Frameworks for entity matching: a comparison. Data Knowl Eng. 2010;69(2):197–210.CrossRef Köpcke H, Rahm E. Frameworks for entity matching: a comparison. Data Knowl Eng. 2010;69(2):197–210.CrossRef
32.
go back to reference Draisbach U, Naumann F. A comparison and generalization of blocking and windowing algorithms for duplicate detection. In: Workshop on Quality in Databases, held at VLDB. Lyon. 2009. Draisbach U, Naumann F. A comparison and generalization of blocking and windowing algorithms for duplicate detection. In: Workshop on Quality in Databases, held at VLDB. Lyon. 2009.
33.
go back to reference Fawcett T. ROC graphs: notes and practical considerations for researchers. Mach Learn. 2004;31:1–38. Fawcett T. ROC graphs: notes and practical considerations for researchers. Mach Learn. 2004;31:1–38.
34.
go back to reference Jiang L et al. Measuring and Comparing Effectiveness of Data Quality Techniques, Advanced Information Systems Engineering. Heidelberg: Springer Berlin; 2009.CrossRef Jiang L et al. Measuring and Comparing Effectiveness of Data Quality Techniques, Advanced Information Systems Engineering. Heidelberg: Springer Berlin; 2009.CrossRef
35.
go back to reference Yan S, Lee D, Kan MY, Giles LC. Adaptive sorted neighborhood methods for efficient record linkage. In: ACM/IEEE-CS joint conference on Digital Libraries. 2007. p. 185–94. Yan S, Lee D, Kan MY, Giles LC. Adaptive sorted neighborhood methods for efficient record linkage. In: ACM/IEEE-CS joint conference on Digital Libraries. 2007. p. 185–94.
36.
go back to reference Han J, Kamber M, Pei J. Data mining: concepts and techniques. Elsevier; 2011. Han J, Kamber M, Pei J. Data mining: concepts and techniques. Elsevier; 2011.
37.
go back to reference Chaudhuri S, Ganti V, Motwani R. Robust identification of fuzzy duplicates. Tokyo: IEEE ICDE; 2005. p. 865–76. Chaudhuri S, Ganti V, Motwani R. Robust identification of fuzzy duplicates. Tokyo: IEEE ICDE; 2005. p. 865–76.
Metadata
Title
A proficient cost reduction framework for de-duplication of records in data integration
Authors
Asif Sohail
Muhammad Murtaza Yousaf
Publication date
01-12-2016
Publisher
BioMed Central
Published in
BMC Medical Informatics and Decision Making / Issue 1/2016
Electronic ISSN: 1472-6947
DOI
https://doi.org/10.1186/s12911-016-0280-9

Other articles of this Issue 1/2016

BMC Medical Informatics and Decision Making 1/2016 Go to the issue