Top

BMC Medical Informatics and Decision Making

Published in:

Open Access 01-12-2016 | Research article

A proficient cost reduction framework for de-duplication of records in data integration

Authors: Asif Sohail, Muhammad Murtaza Yousaf

Published in: BMC Medical Informatics and Decision Making | Issue 1/2016

Abstract

Background

Record de-duplication is a process of identifying the records referring to the same entity. It has a pivotal role in data mining applications, which involves the integration of multiple data sources and data cleansing. It has been a challenging task due to its computational complexity and variations in data representations across different data sources. Blocking and windowing are the commonly used methods for reducing the number of record comparisons during record de-duplication. Both blocking and windowing require tuning of a certain set of parameters, such as the choice of a particular variant of blocking or windowing, the selection of appropriate window size for different datasets etc.

Methods

In this paper, we have proposed a framework that employs blocking and windowing techniques in succession, such that figuring out the parameters is not required. We have also evaluated the impact of different configurations on dirty and massively dirty datasets. To evaluate the proposed framework, experiments are performed using Febrl (Freely Extensible Biomedical Record Linkage).

Results

The proposed framework is comprehensively evaluated using a variety of quality and complexity parameters such as reduction ratio, precision, recall etc. It is observed that the proposed framework significantly reduces the number of record comparisons.

Conclusions

The selection of the linkage key is a critical performance factor for record linkage.

Available only for authorised users

Bleiholder J, Naumann F. Data fusion. ACM Comput Surv. 2008;41(1):Article 1.CrossRef

Rahm E, Do HH. Data cleaning: problems and current approaches. IEEE Data Engineering Bulletin. 2000;23(4):3–13.

Herzog TN, Scheuren FJ, Winkler WE. Data quality and record linkage techniques. Springer Science & Business Media; 2007.

Randall SM et al. The effect of data cleaning on record linkage quality. BMC Med Inform Decis Mak. 2013;13(1):64.CrossRefPubMedPubMedCentral

Whang SE, Garcia-Molina H. Entity resolution with evolving rules. PVLDB. 2010;3(1):1326–37.

Whang SE, Garcia-Molina H. Developments in generic entity resolution. IEEE Data Engineering Bulletin. 2011;34(3):51–9.

Elmagarmid AK, Ipeirotis PG, Verykios VS. Duplicate record detection: a survey. IEEE Trans Knowl Data Eng. 2007;19(1):1–16.CrossRef

Hernandez MA, Stolfo SJ. The merge/purge problem for large databases. San Jose: ACM SIGMOD; 1995. p. 127–38.

Fellegi IP, Sunter AB. A theory for record linkage. J Am Stat Assoc. 1969;64:1183–210.CrossRef

10.

Samwald M et al. Linked open drug data for pharmaceutical research and development. J Cheminform. 2011;3.1:19.CrossRef

11.

Bauer F, Kaltenböck M. Linked open data: the essentials. Vienna: Edition mono/monochrome; 2011.

12.

Christen P, Goiser K. Quality and complexity measures for data linkage and deduplication. In: F. Guillet, H. Hamilton (eds). Quality Measures in Data Mining, Studies in Computational Intelligence, vol. 43. Springer; 2007, pp. 127–151.

13.

Christen P. A survey of indexing techniques for scalable record linkage and deduplication. Knowledge and Data Engineering, IEEE Transactions on 24.9. 2012. p. 1537–55.

14.

Christen P. Data Matching, Concepts and Techniques of Record Linkage, Entity Resolution, and Duplicate Detection. Springer; 2012.

15.

Draisbach, U., Naumann, F., Szott, S., & Wonneberg, O. (2012, April). Adaptive windows for duplicate detection. In Data Engineering (ICDE), 2012 IEEE 28th International Conference on (pp. 1073-1083). IEEE.

16.

Elfeky MG, Verykios VS, Elmagarmid AK. TAILOR: a record linkage toolbox. San Jose: IEEE ICDE; 2002. p. 17–28.

17.

Goiser K, Christen P. ‘Australasian Data Mining Conference’ (AusDM’06), vol. 61. Sydney: Conferences in Research and Practice in Information Technology (CRPIT); 2006. p. 23–31.

18.

Gu L, Baxter R, Vickers D, C. Rainsford C. Record linkage: current practice and future directions. Canberra: Technical Report 03/83, CSIRO Mathematical and Information Sciences; 2003.

19.

Patrick L, Fankhauser P. A Precise Blocking Method for Record Linkage. In: Min Tjoa A, Trujillo J, editors. DaWaK 2005, LNCS 3589. 2005. p. 210–20.

20.

Christen P. Febrl: an open source data cleaning, deduplication and record linkage system with a graphical user interface. Las Vegas: ACM SIGKDD; 2008. p. 1065–8.

21.

Michelson M, Knoblock CA. Learning blocking schemes for record linkage. Boston: AAAI’06; 2006.

22.

Giang P. A machine learning approach to create blocking criteria for record linkage. Health Care Manag Sci. 2014. doi:10.1007/s10729-014-9276-0.PubMed

23.

Data Integration Manual. Statistics New Zealand. 2006. http://www.stats.govt.nz.

24.

Gu L, Baxter R. Adaptive filtering for efficient record linkage. Orlando: SIAM international conference on data mining; 2004.CrossRef

25.

Naumann F, Herschel M. An introduction to duplicate detection. Synthesis Lectures on Data Management 2.1. 2010. 1–87.

26.

Gu L, Baxter R. Decision models for record linkage. In: Selected Papers from AusDM, Springer LNCS 3755. 2006. p. 146–60.

27.

Maggi F. A Survey of Probabilistic Record Matching Models, Techniques and Tools, Scienti_c Report TR-2008. 2008.

28.

Christen P. A comparison of personal name matching: Techniques and practical issues. In: Workshop on Mining Complex Data, held at IEEE ICDM. Hong Kong. 2006.

29.

Odell M, Russell R. The soundex coding system. US Patents 1261167. 1918.

30.

Baxter R, Christen P, Churches T. “A comparison of fast blocking methods for record linkage.” ACM SIGKDD. Vol. 3. 2003.

31.

Köpcke H, Rahm E. Frameworks for entity matching: a comparison. Data Knowl Eng. 2010;69(2):197–210.CrossRef

32.

Draisbach U, Naumann F. A comparison and generalization of blocking and windowing algorithms for duplicate detection. In: Workshop on Quality in Databases, held at VLDB. Lyon. 2009.

33.

Fawcett T. ROC graphs: notes and practical considerations for researchers. Mach Learn. 2004;31:1–38.

34.

Jiang L et al. Measuring and Comparing Effectiveness of Data Quality Techniques, Advanced Information Systems Engineering. Heidelberg: Springer Berlin; 2009.CrossRef

35.

Yan S, Lee D, Kan MY, Giles LC. Adaptive sorted neighborhood methods for efficient record linkage. In: ACM/IEEE-CS joint conference on Digital Libraries. 2007. p. 185–94.

36.

Han J, Kamber M, Pei J. Data mining: concepts and techniques. Elsevier; 2011.

37.

Chaudhuri S, Ganti V, Motwani R. Robust identification of fuzzy duplicates. Tokyo: IEEE ICDE; 2005. p. 865–76.

Title: A proficient cost reduction framework for de-duplication of records in data integration
Authors: Asif Sohail
Muhammad Murtaza Yousaf
Publication date: 01-12-2016
Publisher: BioMed Central
Published in: BMC Medical Informatics and Decision Making / Issue 1/2016
Electronic ISSN: 1472-6947
DOI: https://doi.org/10.1186/s12911-016-0280-9

Keynote webinar | Spotlight on medication adherence

Springer Medicine

A proficient cost reduction framework for de-duplication of records in data integration

Abstract

Background

Methods

Results

Conclusions

Keynote webinar | Spotlight on medication adherence

Springer Medicine

Abstract

Background

Methods

Results

Conclusions

Please log in to get access to this content

Other articles of this Issue 1/2016

Integrating data from an online diabetes prevention program into an electronic health record and clinical workflow, a design phase usability study

A hierarchical method to automatically encode Chinese diagnoses through semantic similarity estimation

Integral strategy to supportive care in breast cancer survivors through occupational therapy and a m-health system: design of a randomized clinical trial

A formative study to inform mHealth based randomized controlled trial intervention to promote exclusive breastfeeding practices in Myanmar: incorporating qualitative study findings

Pupil diameter changes reflect difficulty and diagnostic accuracy during medical image interpretation

Survival functions for defining a clinical management Lost To Follow-Up (LTFU) cut-off in Antiretroviral Therapy (ART) program in Zomba, Malawi