Skip to main content
Top
Published in: BMC Medical Informatics and Decision Making 1/2011

Open Access 01-12-2011 | Research article

An efficient record linkage scheme using graphical analysis for identifier error detection

Authors: John M Finney, A Sarah Walker, Tim EA Peto, David H Wyllie

Published in: BMC Medical Informatics and Decision Making | Issue 1/2011

Login to get access

Abstract

Background

Integration of information on individuals (record linkage) is a key problem in healthcare delivery, epidemiology, and "business intelligence" applications. It is now common to be required to link very large numbers of records, often containing various combinations of theoretically unique identifiers, such as NHS numbers, which are both incomplete and error-prone.

Methods

We describe a two-step record linkage algorithm in which identifiers with high cardinality are identified or generated, and used to perform an initial exact match based linkage. Subsequently, the resulting clusters are studied and, if appropriate, partitioned using a graph based algorithm detecting erroneous identifiers.

Results

The system was used to cluster over 250 million health records from five data sources within a large UK hospital group. Linkage, which was completed in about 30 minutes, yielded 3.6 million clusters of which about 99.8% contain, with high likelihood, records from one patient. Although computationally efficient, the algorithm's requirement for exact matching of at least one identifier of each record to another for cluster formation may be a limitation in some databases containing records of low identifier quality.

Conclusions

The technique described offers a simple, fast and highly efficient two-step method for large scale initial linkage for records commonly found in the UK's National Health Service.
Appendix
Available only for authorised users
Literature
1.
go back to reference Wyllie DH, Crook DW, Peto TE: Mortality after Staphylococcus aureus bacteraemia in two hospitals in Oxfordshire, 1997-2003: cohort study. BMJ. 2006, 333: 281-10.1136/bmj.38834.421713.2F.CrossRefPubMedPubMedCentral Wyllie DH, Crook DW, Peto TE: Mortality after Staphylococcus aureus bacteraemia in two hospitals in Oxfordshire, 1997-2003: cohort study. BMJ. 2006, 333: 281-10.1136/bmj.38834.421713.2F.CrossRefPubMedPubMedCentral
2.
go back to reference Wyllie DH, Peto TE, Crook D: MRSA bacteraemia in patients on arrival in hospital: a cohort study in Oxfordshire 1997-2003. BMJ. 2005, 331: 992-10.1136/bmj.38558.453310.8F.CrossRefPubMedPubMedCentral Wyllie DH, Peto TE, Crook D: MRSA bacteraemia in patients on arrival in hospital: a cohort study in Oxfordshire 1997-2003. BMJ. 2005, 331: 992-10.1136/bmj.38558.453310.8F.CrossRefPubMedPubMedCentral
3.
go back to reference Wyllie DH, Walker AS, Peto TE, Crook DW: Hospital exposure in a UK population, and its association with bacteraemia. J Hosp Infect. 2007, 67: 301-307. 10.1016/j.jhin.2007.08.018.CrossRefPubMed Wyllie DH, Walker AS, Peto TE, Crook DW: Hospital exposure in a UK population, and its association with bacteraemia. J Hosp Infect. 2007, 67: 301-307. 10.1016/j.jhin.2007.08.018.CrossRefPubMed
4.
go back to reference Winkler WE: Overview of Record Linkage and Current Research Directions. Book Overview of Record Linkage and Current Research Directions. 2006, City: Statistical Research Division, U.S. Census Bureau, 2006-2: 44-pp. 44 Winkler WE: Overview of Record Linkage and Current Research Directions. Book Overview of Record Linkage and Current Research Directions. 2006, City: Statistical Research Division, U.S. Census Bureau, 2006-2: 44-pp. 44
5.
go back to reference Levenhstein V: Binary code capable of correcting deletions, insertions and reversals. Soviet Physics Doklady. 1966, 10: 707-710. Levenhstein V: Binary code capable of correcting deletions, insertions and reversals. Soviet Physics Doklady. 1966, 10: 707-710.
6.
go back to reference Fellegi I, Sunter A: A theory for record linkage. Journal of the American Statistical Association. 1969, 64: 1183-1210. 10.2307/2286061.CrossRef Fellegi I, Sunter A: A theory for record linkage. Journal of the American Statistical Association. 1969, 64: 1183-1210. 10.2307/2286061.CrossRef
7.
go back to reference McCallum A, Nigam K, Ungar LH: Efficient Clustering of High Dimensional Data Sets with Application to Reference Matching. Proceedings of the sixth ACM SIGKDD international conference on Knowledge discovery and data mining. 2000, New York: ACM McCallum A, Nigam K, Ungar LH: Efficient Clustering of High Dimensional Data Sets with Application to Reference Matching. Proceedings of the sixth ACM SIGKDD international conference on Knowledge discovery and data mining. 2000, New York: ACM
8.
go back to reference Lyons RA, Jones KH, John G, Brooks CJ, Verplancke JP, Ford DV, Brown G, Leake K: The SAIL databank: linking multiple health and social care datasets. BMC Med Inform Decis Mak. 2009, 9: 3-10.1186/1472-6947-9-3.CrossRefPubMedPubMedCentral Lyons RA, Jones KH, John G, Brooks CJ, Verplancke JP, Ford DV, Brown G, Leake K: The SAIL databank: linking multiple health and social care datasets. BMC Med Inform Decis Mak. 2009, 9: 3-10.1186/1472-6947-9-3.CrossRefPubMedPubMedCentral
9.
go back to reference Schaeffer SE: Graph Clustering. Computer Science Review. 2007, 1: 27-64. 10.1016/j.cosrev.2007.05.001.CrossRef Schaeffer SE: Graph Clustering. Computer Science Review. 2007, 1: 27-64. 10.1016/j.cosrev.2007.05.001.CrossRef
10.
go back to reference Sauleau EA, Paumier JP, Buemi A: Medical record linkage in health information systems by approximate string matching and clustering. BMC Med Inform Decis Mak. 2005, 5: 32-10.1186/1472-6947-5-32.CrossRefPubMedPubMedCentral Sauleau EA, Paumier JP, Buemi A: Medical record linkage in health information systems by approximate string matching and clustering. BMC Med Inform Decis Mak. 2005, 5: 32-10.1186/1472-6947-5-32.CrossRefPubMedPubMedCentral
11.
go back to reference Kalashnikov DV, Mehrotra S: Domain-Independent Data Cleaning via Analysis of Entity-Relationship Graph. ACM Transactions on Database Systems. 2006, 31: 52-10.1145/1138394.1138401.CrossRef Kalashnikov DV, Mehrotra S: Domain-Independent Data Cleaning via Analysis of Entity-Relationship Graph. ACM Transactions on Database Systems. 2006, 31: 52-10.1145/1138394.1138401.CrossRef
12.
go back to reference Chapman S: Simmertics: a Similarity Metric Library. Book Simmertics: a Similarity Metric Library. 2005, A similarity metric library. City: Sheffield University, A similarity metric library, 1.6.2.d07 Chapman S: Simmertics: a Similarity Metric Library. Book Simmertics: a Similarity Metric Library. 2005, A similarity metric library. City: Sheffield University, A similarity metric library, 1.6.2.d07
13.
go back to reference Teorey T, Lightstone S, Nadeau T, Jagadish HV: Database Modelling and Design: Logical Design. 2006, Elsevier, 5 Teorey T, Lightstone S, Nadeau T, Jagadish HV: Database Modelling and Design: Logical Design. 2006, Elsevier, 5
14.
go back to reference Arasu A, Ganti V, Kaushik R: Efficient Exact set-similarity joins. Proceedings of the 32nd international conference on Very large data bases. 2006 Arasu A, Ganti V, Kaushik R: Efficient Exact set-similarity joins. Proceedings of the 32nd international conference on Very large data bases. 2006
15.
go back to reference Venables WN, Ripley BD: Modern Applied Statistics with S. 2002, New York: Springer, 4CrossRef Venables WN, Ripley BD: Modern Applied Statistics with S. 2002, New York: Springer, 4CrossRef
16.
go back to reference Deepayan S: Lattice: Multivariate Data Visualisation with R. Book Lattice: Multivariate Data Visualisation with R. 2008, City: Springer, 1 Deepayan S: Lattice: Multivariate Data Visualisation with R. Book Lattice: Multivariate Data Visualisation with R. 2008, City: Springer, 1
17.
go back to reference Rares V, Li C: Efficient top-k algorithms for fuzzy search in string collections. International Conference on Management of Data Rhode Island. 2009, ACM Rares V, Li C: Efficient top-k algorithms for fuzzy search in string collections. International Conference on Management of Data Rhode Island. 2009, ACM
18.
go back to reference Phillips L: The double metaphone search algorithm. C/C++ Users Journal. 2000 Phillips L: The double metaphone search algorithm. C/C++ Users Journal. 2000
Metadata
Title
An efficient record linkage scheme using graphical analysis for identifier error detection
Authors
John M Finney
A Sarah Walker
Tim EA Peto
David H Wyllie
Publication date
01-12-2011
Publisher
BioMed Central
Published in
BMC Medical Informatics and Decision Making / Issue 1/2011
Electronic ISSN: 1472-6947
DOI
https://doi.org/10.1186/1472-6947-11-7

Other articles of this Issue 1/2011

BMC Medical Informatics and Decision Making 1/2011 Go to the issue