Top

BMC Medical Informatics and Decision Making

Published in:

Open Access 01-12-2005 | Research article

Medical record linkage in health information systems by approximate string matching and clustering

Authors: Erik A Sauleau, Jean-Philippe Paumier, Antoine Buemi

Published in: BMC Medical Informatics and Decision Making | Issue 1/2005

Abstract

Background

Multiplication of data sources within heterogeneous healthcare information systems always results in redundant information, split among multiple databases. Our objective is to detect exact and approximate duplicates within identity records, in order to attain a better quality of information and to permit cross-linkage among stand-alone and clustered databases. Furthermore, we need to assist human decision making, by computing a value reflecting identity proximity.

Methods

The proposed method is in three steps. The first step is to standardise and to index elementary identity fields, using blocking variables, in order to speed up information analysis. The second is to match similar pair records, relying on a global similarity value taken from the Porter-Jaro-Winkler algorithm. And the third is to create clusters of coherent related records, using graph drawing, agglomerative clustering methods and partitioning methods.

Results

The batch analysis of 300,000 "supposedly" distinct identities isolates 240,000 true unique records, 24,000 duplicates (clusters composed of 2 records) and 3,000 clusters whose size is greater than or equal to 3 records.

Conclusion

Duplicate-free databases, used in conjunction with relevant indexes and similarity values, allow immediate (i.e.: real-time) proximity detection when inserting a new identity.

Available only for authorised users

Belin TR, Rubin DB: A method for calibrating false match rates in record linkage. Journal of the American Statistical Association. 1995, 90: 697-707.CrossRef

Newcombe HB, Kennedy JM: Record linkage: making maximum use of the discriminating power of identifying information. Communications of the ACM. 1962, 5: 563-566. 10.1145/368996.369026.CrossRef

Vintsyuk T: Speech discrimination by dynamic programming. Cybernetics. 1968, 4: 52-58. 10.1007/BF01074755.CrossRef

Sellers P: The theory and computation of evolutionary distances: pattern recognition. Journal of Algorithms. 1980, 1: 359-373. 10.1016/0196-6774(80)90016-4.CrossRef

Navarro G, Raffinot M: Flexible pattern matching in strings. 2002, Cambridge, Cambridge University PressCrossRef

Navarro G: A guided tour to approximate string matching. Association of Computing Machinery Computing Surveys. 2001, 33: 31-88.

Levenhstein V: Binary code capable of correcting deletions, insertions and reversals. Soviet Physics Doklady. 1966, 10: 707-710.

Smith TF, Waterman MS: Identification of common molecular subsequences. Journal of Molecular Biology. 1981, 14: 195-197. 10.1016/0022-2836(81)90087-5.CrossRef

Dempster AP, Laird N, Rubin DB: Maximum likelihood from incomplete data via the EM algorithm (with discussion). Journal of the Royal Statistical Society, series B. 1977, 39: 1-38.

10.

Winkler WE: Using the EM algorithm for weight computation in the Fellegi-Sunter model of record linkage. Statistical research report series N°05. 2000, US bureau of census, Washington DC

11.

Newcombe HB: Handbook of record linkage: methods for health and statistical studies, administration and business. 1988, Oxford, Oxford University Press

12.

Fellegi I, Sunter A: A theory for record linkage. Journal of the American Statistical Association. 1969, 64: 1183-1210.CrossRef

13.

Jaro MA: Advances in record linkage methodology as applied to matching the 1985 Census of Tempa, Florida. Journal of the American Statistical Association. 1989, 84: 414-420.CrossRef

14.

Jaro MA: Probabilistic linkage of large public health data files. Statistics in Medicine. 1995, 14: 491-498.CrossRefPubMed

15.

Hartigan J: Clustering algorithms. 1975, New York, John Wiley and Sons

16.

Everitt B: Cluster analysis. 1993, London, Edward Arnold, 3

17.

Eades P: A heuristique for graph drawing. Congressus Numerantium. 1984, 42: 149-160.

18.

Fruchterman T, Reingold E: Graph drawing by force-directed placement. Software-Practice and Experience. 1991, 21: 1129-1164.CrossRef

19.

Kamada T, Kawai S: An algorithm for drawing general undirected graphs. Information Processing Letters. 1989, 31: 7-15. 10.1016/0020-0190(89)90102-6.CrossRef

20.

Monge AE, Elkan CP: An efficient domain-independent algorithm for detecting approximately duplicate database records. SIGMOD Workshop on Research Issues on Data Mining and Knowledge Discovery, Tuscon. 1997

21.

Hylthon JA: Identifying and merging related bibliographic records. 1996, Master of Engineering in Electrical and Computer Sciences. MIT

22.

Sharan R, Shamir R: Algorithmic approaches to clustering gene expression data. Current Topics in Computational Molecular Biology. Edited by: Jiang T. 2002, The MIT Press, 269-300.

23.

Baxter A, Christen P, Churches T: A comparison of fast blocking methods for record linkage. ACM SIGKDD '03 Workshop on Data Cleaning, Record Linkage and Object consolidation. 2003, Washington DC

24.

Hernandez M, Stolfo S: The merge/purge problem for large databases. ACM SIGMOD International Conference on Management of Data, San Jose. 1995

25.

Baeza-Yates R, Frakes WB: Information Retrieval: Algorithms and Data Structures. 1992, Englewood Cliffs, Prentice-Hall

26.

McCallum AK, Nigam K, Ungar LH: Efficient clustering of high-dimensional datasets with application to reference matching. Sixth International Conference on Knowledge Discovery and Data Mining, Boston. 2000

27.

Cohen W, Ravikumar P, Fienberg S: A comparison of string metrics for matching names and records. ACM SIGKDD '03 Workshop on Data Cleaning, Record Linkage and Object consolidation. 2003, Washington DC

28.

Porter EH, Winkler WE: Approximate string comparison and its effect on an advanced record linkage system. 1997, Washington DC, US Census Bureau

29.

Hartuv E, Shamir R: A clustering algorithm based on graph connectivity. Information Processing Letters. 2000, 76: 175-181. 10.1016/S0020-0190(00)00142-3.CrossRef

30.

Sharan R, Shamir R: CLICK: a clustering algorithm with application to gene expression analysis. Eighth International Conference on Intelligent Systems for Molecular Biology, La Jolla. 2000

31.

Ben-Dor A, Shamir R, Yakhini Z: Clustering gene expression patterns. Journal of Computational Biology. 1999, 6: 281-297. 10.1089/106652799318274.CrossRefPubMed

32.

Pavan M, Pellilo M: A new graph-theoretic approach to clustering and segmentation. IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Madison. 2003

33.

Kawaji H, Yamaguchi Y, Matsuda H, Hashimoto A: A graph-based clustering method for a large set of sequences using a graph partitioning algorithm. Genome Informatics. 2001, 12: 93-102.PubMed

34.

Yancey WE: Frequency-dependent probability measures for record linkage. Statistical research report series N°07. 2000, US bureau of census, Washington DC

35.

Winkler WE: Machine learning, information retrieval and record linkage. Proceedings of the Survey Research Methods Section. 2000, American Statistical Association

36.

Fortini M, Liseo B, Nuccitelli A, Scanu M: On bayesian record linkage. Sixth World Meeting of the International Society for Bayesian Analysis, Hersonissos, Greece. 2000

37.

Kilss B, Alvey W, Eds: Record linkage techniques – 1985. Proceedings of the Workshop on Exact Matching Methodologiese. 1985, Arlington, US Internal Revenue Service

38.

Monge AE: Matching algortihm within a duplicate detection system. Bulletin of the IEEE Computer Society Technical Committee on Data Engineering. 2000, 23: 14-20.

39.

Quantin C, Bouzelat H, Allaert FA, Benhamiche AM, Faivre J, Dusserre L: Automatic record hash coding and linkage for epidemiological follow-up data confidentiality. Methods on Information in Medicine. 1998, 37: 271-277.

The pre-publication history for this paper can be accessed here:http://www.biomedcentral.com/1472-6947/5/32/prepub

Title: Medical record linkage in health information systems by approximate string matching and clustering
Authors: Erik A Sauleau
Jean-Philippe Paumier
Antoine Buemi
Publication date: 01-12-2005
Publisher: BioMed Central
Published in: BMC Medical Informatics and Decision Making / Issue 1/2005
Electronic ISSN: 1472-6947
DOI: https://doi.org/10.1186/1472-6947-5-32

At a glance: The STEP trials

Springer Medicine

Medical record linkage in health information systems by approximate string matching and clustering

Abstract

Background

Methods

Results

Conclusion

At a glance: The STEP trials

Springer Medicine

Abstract

Background

Methods

Results

Conclusion

Please log in to get access to this content

Other articles of this Issue 1/2005

Information resource preferences by general pediatricians in office settings: a qualitative study

A controlled trial of automated classification of negation from clinical notes

An overview of the design and methods for retrieving high-quality studies for clinical care

Developing optimal search strategies for detecting clinically sound and relevant causation studies in EMBASE

Automatic extraction of candidate nomenclature terms using the doublet method

Personalized online information search and visualization