Skip to main content
Top
Published in: BMC Medical Informatics and Decision Making 1/2005

Open Access 01-12-2005 | Research article

Medical record linkage in health information systems by approximate string matching and clustering

Authors: Erik A Sauleau, Jean-Philippe Paumier, Antoine Buemi

Published in: BMC Medical Informatics and Decision Making | Issue 1/2005

Login to get access

Abstract

Background

Multiplication of data sources within heterogeneous healthcare information systems always results in redundant information, split among multiple databases. Our objective is to detect exact and approximate duplicates within identity records, in order to attain a better quality of information and to permit cross-linkage among stand-alone and clustered databases. Furthermore, we need to assist human decision making, by computing a value reflecting identity proximity.

Methods

The proposed method is in three steps. The first step is to standardise and to index elementary identity fields, using blocking variables, in order to speed up information analysis. The second is to match similar pair records, relying on a global similarity value taken from the Porter-Jaro-Winkler algorithm. And the third is to create clusters of coherent related records, using graph drawing, agglomerative clustering methods and partitioning methods.

Results

The batch analysis of 300,000 "supposedly" distinct identities isolates 240,000 true unique records, 24,000 duplicates (clusters composed of 2 records) and 3,000 clusters whose size is greater than or equal to 3 records.

Conclusion

Duplicate-free databases, used in conjunction with relevant indexes and similarity values, allow immediate (i.e.: real-time) proximity detection when inserting a new identity.
Appendix
Available only for authorised users
Literature
1.
go back to reference Belin TR, Rubin DB: A method for calibrating false match rates in record linkage. Journal of the American Statistical Association. 1995, 90: 697-707.CrossRef Belin TR, Rubin DB: A method for calibrating false match rates in record linkage. Journal of the American Statistical Association. 1995, 90: 697-707.CrossRef
2.
go back to reference Newcombe HB, Kennedy JM: Record linkage: making maximum use of the discriminating power of identifying information. Communications of the ACM. 1962, 5: 563-566. 10.1145/368996.369026.CrossRef Newcombe HB, Kennedy JM: Record linkage: making maximum use of the discriminating power of identifying information. Communications of the ACM. 1962, 5: 563-566. 10.1145/368996.369026.CrossRef
3.
go back to reference Vintsyuk T: Speech discrimination by dynamic programming. Cybernetics. 1968, 4: 52-58. 10.1007/BF01074755.CrossRef Vintsyuk T: Speech discrimination by dynamic programming. Cybernetics. 1968, 4: 52-58. 10.1007/BF01074755.CrossRef
4.
go back to reference Sellers P: The theory and computation of evolutionary distances: pattern recognition. Journal of Algorithms. 1980, 1: 359-373. 10.1016/0196-6774(80)90016-4.CrossRef Sellers P: The theory and computation of evolutionary distances: pattern recognition. Journal of Algorithms. 1980, 1: 359-373. 10.1016/0196-6774(80)90016-4.CrossRef
5.
go back to reference Navarro G, Raffinot M: Flexible pattern matching in strings. 2002, Cambridge, Cambridge University PressCrossRef Navarro G, Raffinot M: Flexible pattern matching in strings. 2002, Cambridge, Cambridge University PressCrossRef
6.
go back to reference Navarro G: A guided tour to approximate string matching. Association of Computing Machinery Computing Surveys. 2001, 33: 31-88. Navarro G: A guided tour to approximate string matching. Association of Computing Machinery Computing Surveys. 2001, 33: 31-88.
7.
go back to reference Levenhstein V: Binary code capable of correcting deletions, insertions and reversals. Soviet Physics Doklady. 1966, 10: 707-710. Levenhstein V: Binary code capable of correcting deletions, insertions and reversals. Soviet Physics Doklady. 1966, 10: 707-710.
8.
go back to reference Smith TF, Waterman MS: Identification of common molecular subsequences. Journal of Molecular Biology. 1981, 14: 195-197. 10.1016/0022-2836(81)90087-5.CrossRef Smith TF, Waterman MS: Identification of common molecular subsequences. Journal of Molecular Biology. 1981, 14: 195-197. 10.1016/0022-2836(81)90087-5.CrossRef
9.
go back to reference Dempster AP, Laird N, Rubin DB: Maximum likelihood from incomplete data via the EM algorithm (with discussion). Journal of the Royal Statistical Society, series B. 1977, 39: 1-38. Dempster AP, Laird N, Rubin DB: Maximum likelihood from incomplete data via the EM algorithm (with discussion). Journal of the Royal Statistical Society, series B. 1977, 39: 1-38.
10.
go back to reference Winkler WE: Using the EM algorithm for weight computation in the Fellegi-Sunter model of record linkage. Statistical research report series N°05. 2000, US bureau of census, Washington DC Winkler WE: Using the EM algorithm for weight computation in the Fellegi-Sunter model of record linkage. Statistical research report series N°05. 2000, US bureau of census, Washington DC
11.
go back to reference Newcombe HB: Handbook of record linkage: methods for health and statistical studies, administration and business. 1988, Oxford, Oxford University Press Newcombe HB: Handbook of record linkage: methods for health and statistical studies, administration and business. 1988, Oxford, Oxford University Press
12.
go back to reference Fellegi I, Sunter A: A theory for record linkage. Journal of the American Statistical Association. 1969, 64: 1183-1210.CrossRef Fellegi I, Sunter A: A theory for record linkage. Journal of the American Statistical Association. 1969, 64: 1183-1210.CrossRef
13.
go back to reference Jaro MA: Advances in record linkage methodology as applied to matching the 1985 Census of Tempa, Florida. Journal of the American Statistical Association. 1989, 84: 414-420.CrossRef Jaro MA: Advances in record linkage methodology as applied to matching the 1985 Census of Tempa, Florida. Journal of the American Statistical Association. 1989, 84: 414-420.CrossRef
14.
go back to reference Jaro MA: Probabilistic linkage of large public health data files. Statistics in Medicine. 1995, 14: 491-498.CrossRefPubMed Jaro MA: Probabilistic linkage of large public health data files. Statistics in Medicine. 1995, 14: 491-498.CrossRefPubMed
15.
go back to reference Hartigan J: Clustering algorithms. 1975, New York, John Wiley and Sons Hartigan J: Clustering algorithms. 1975, New York, John Wiley and Sons
16.
go back to reference Everitt B: Cluster analysis. 1993, London, Edward Arnold, 3 Everitt B: Cluster analysis. 1993, London, Edward Arnold, 3
17.
go back to reference Eades P: A heuristique for graph drawing. Congressus Numerantium. 1984, 42: 149-160. Eades P: A heuristique for graph drawing. Congressus Numerantium. 1984, 42: 149-160.
18.
go back to reference Fruchterman T, Reingold E: Graph drawing by force-directed placement. Software-Practice and Experience. 1991, 21: 1129-1164.CrossRef Fruchterman T, Reingold E: Graph drawing by force-directed placement. Software-Practice and Experience. 1991, 21: 1129-1164.CrossRef
19.
go back to reference Kamada T, Kawai S: An algorithm for drawing general undirected graphs. Information Processing Letters. 1989, 31: 7-15. 10.1016/0020-0190(89)90102-6.CrossRef Kamada T, Kawai S: An algorithm for drawing general undirected graphs. Information Processing Letters. 1989, 31: 7-15. 10.1016/0020-0190(89)90102-6.CrossRef
20.
go back to reference Monge AE, Elkan CP: An efficient domain-independent algorithm for detecting approximately duplicate database records. SIGMOD Workshop on Research Issues on Data Mining and Knowledge Discovery, Tuscon. 1997 Monge AE, Elkan CP: An efficient domain-independent algorithm for detecting approximately duplicate database records. SIGMOD Workshop on Research Issues on Data Mining and Knowledge Discovery, Tuscon. 1997
21.
go back to reference Hylthon JA: Identifying and merging related bibliographic records. 1996, Master of Engineering in Electrical and Computer Sciences. MIT Hylthon JA: Identifying and merging related bibliographic records. 1996, Master of Engineering in Electrical and Computer Sciences. MIT
22.
go back to reference Sharan R, Shamir R: Algorithmic approaches to clustering gene expression data. Current Topics in Computational Molecular Biology. Edited by: Jiang T. 2002, The MIT Press, 269-300. Sharan R, Shamir R: Algorithmic approaches to clustering gene expression data. Current Topics in Computational Molecular Biology. Edited by: Jiang T. 2002, The MIT Press, 269-300.
23.
go back to reference Baxter A, Christen P, Churches T: A comparison of fast blocking methods for record linkage. ACM SIGKDD '03 Workshop on Data Cleaning, Record Linkage and Object consolidation. 2003, Washington DC Baxter A, Christen P, Churches T: A comparison of fast blocking methods for record linkage. ACM SIGKDD '03 Workshop on Data Cleaning, Record Linkage and Object consolidation. 2003, Washington DC
24.
go back to reference Hernandez M, Stolfo S: The merge/purge problem for large databases. ACM SIGMOD International Conference on Management of Data, San Jose. 1995 Hernandez M, Stolfo S: The merge/purge problem for large databases. ACM SIGMOD International Conference on Management of Data, San Jose. 1995
25.
go back to reference Baeza-Yates R, Frakes WB: Information Retrieval: Algorithms and Data Structures. 1992, Englewood Cliffs, Prentice-Hall Baeza-Yates R, Frakes WB: Information Retrieval: Algorithms and Data Structures. 1992, Englewood Cliffs, Prentice-Hall
26.
go back to reference McCallum AK, Nigam K, Ungar LH: Efficient clustering of high-dimensional datasets with application to reference matching. Sixth International Conference on Knowledge Discovery and Data Mining, Boston. 2000 McCallum AK, Nigam K, Ungar LH: Efficient clustering of high-dimensional datasets with application to reference matching. Sixth International Conference on Knowledge Discovery and Data Mining, Boston. 2000
27.
go back to reference Cohen W, Ravikumar P, Fienberg S: A comparison of string metrics for matching names and records. ACM SIGKDD '03 Workshop on Data Cleaning, Record Linkage and Object consolidation. 2003, Washington DC Cohen W, Ravikumar P, Fienberg S: A comparison of string metrics for matching names and records. ACM SIGKDD '03 Workshop on Data Cleaning, Record Linkage and Object consolidation. 2003, Washington DC
28.
go back to reference Porter EH, Winkler WE: Approximate string comparison and its effect on an advanced record linkage system. 1997, Washington DC, US Census Bureau Porter EH, Winkler WE: Approximate string comparison and its effect on an advanced record linkage system. 1997, Washington DC, US Census Bureau
29.
go back to reference Hartuv E, Shamir R: A clustering algorithm based on graph connectivity. Information Processing Letters. 2000, 76: 175-181. 10.1016/S0020-0190(00)00142-3.CrossRef Hartuv E, Shamir R: A clustering algorithm based on graph connectivity. Information Processing Letters. 2000, 76: 175-181. 10.1016/S0020-0190(00)00142-3.CrossRef
30.
go back to reference Sharan R, Shamir R: CLICK: a clustering algorithm with application to gene expression analysis. Eighth International Conference on Intelligent Systems for Molecular Biology, La Jolla. 2000 Sharan R, Shamir R: CLICK: a clustering algorithm with application to gene expression analysis. Eighth International Conference on Intelligent Systems for Molecular Biology, La Jolla. 2000
31.
go back to reference Ben-Dor A, Shamir R, Yakhini Z: Clustering gene expression patterns. Journal of Computational Biology. 1999, 6: 281-297. 10.1089/106652799318274.CrossRefPubMed Ben-Dor A, Shamir R, Yakhini Z: Clustering gene expression patterns. Journal of Computational Biology. 1999, 6: 281-297. 10.1089/106652799318274.CrossRefPubMed
32.
go back to reference Pavan M, Pellilo M: A new graph-theoretic approach to clustering and segmentation. IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Madison. 2003 Pavan M, Pellilo M: A new graph-theoretic approach to clustering and segmentation. IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Madison. 2003
33.
go back to reference Kawaji H, Yamaguchi Y, Matsuda H, Hashimoto A: A graph-based clustering method for a large set of sequences using a graph partitioning algorithm. Genome Informatics. 2001, 12: 93-102.PubMed Kawaji H, Yamaguchi Y, Matsuda H, Hashimoto A: A graph-based clustering method for a large set of sequences using a graph partitioning algorithm. Genome Informatics. 2001, 12: 93-102.PubMed
34.
go back to reference Yancey WE: Frequency-dependent probability measures for record linkage. Statistical research report series N°07. 2000, US bureau of census, Washington DC Yancey WE: Frequency-dependent probability measures for record linkage. Statistical research report series N°07. 2000, US bureau of census, Washington DC
35.
go back to reference Winkler WE: Machine learning, information retrieval and record linkage. Proceedings of the Survey Research Methods Section. 2000, American Statistical Association Winkler WE: Machine learning, information retrieval and record linkage. Proceedings of the Survey Research Methods Section. 2000, American Statistical Association
36.
go back to reference Fortini M, Liseo B, Nuccitelli A, Scanu M: On bayesian record linkage. Sixth World Meeting of the International Society for Bayesian Analysis, Hersonissos, Greece. 2000 Fortini M, Liseo B, Nuccitelli A, Scanu M: On bayesian record linkage. Sixth World Meeting of the International Society for Bayesian Analysis, Hersonissos, Greece. 2000
37.
go back to reference Kilss B, Alvey W, Eds: Record linkage techniques – 1985. Proceedings of the Workshop on Exact Matching Methodologiese. 1985, Arlington, US Internal Revenue Service Kilss B, Alvey W, Eds: Record linkage techniques – 1985. Proceedings of the Workshop on Exact Matching Methodologiese. 1985, Arlington, US Internal Revenue Service
38.
go back to reference Monge AE: Matching algortihm within a duplicate detection system. Bulletin of the IEEE Computer Society Technical Committee on Data Engineering. 2000, 23: 14-20. Monge AE: Matching algortihm within a duplicate detection system. Bulletin of the IEEE Computer Society Technical Committee on Data Engineering. 2000, 23: 14-20.
39.
go back to reference Quantin C, Bouzelat H, Allaert FA, Benhamiche AM, Faivre J, Dusserre L: Automatic record hash coding and linkage for epidemiological follow-up data confidentiality. Methods on Information in Medicine. 1998, 37: 271-277. Quantin C, Bouzelat H, Allaert FA, Benhamiche AM, Faivre J, Dusserre L: Automatic record hash coding and linkage for epidemiological follow-up data confidentiality. Methods on Information in Medicine. 1998, 37: 271-277.
Metadata
Title
Medical record linkage in health information systems by approximate string matching and clustering
Authors
Erik A Sauleau
Jean-Philippe Paumier
Antoine Buemi
Publication date
01-12-2005
Publisher
BioMed Central
Published in
BMC Medical Informatics and Decision Making / Issue 1/2005
Electronic ISSN: 1472-6947
DOI
https://doi.org/10.1186/1472-6947-5-32

Other articles of this Issue 1/2005

BMC Medical Informatics and Decision Making 1/2005 Go to the issue