Published in: BMC Medical Research Methodology 1/2014

Open Access 01-12-2014 | Research article

Evaluating bias due to data linkage error in electronic healthcare records

Authors: Katie Harron, Angie Wade, Ruth Gilbert, Berit Muller-Pebody, Harvey Goldstein

Abstract

Background

Linkage of electronic healthcare records is becoming increasingly important for research purposes. However, linkage error due to mis-recorded or missing identifiers can lead to biased results. We evaluated the impact of linkage error on estimated infection rates using two different methods for classifying links: highest-weight (HW) classification using probabilistic match weights and prior-informed imputation (PII) using match probabilities.

Methods

A gold-standard dataset was created through deterministic linkage of unique identifiers in admission data from two hospitals and infection data recorded at the hospital laboratories (original data). Unique identifiers were then removed and data were re-linked by date of birth, sex and Soundex using two classification methods: i) HW classification, accepting the candidate record with the highest weight exceeding a threshold, and ii) PII, imputing values from a match probability distribution. To evaluate methods for linking data with different error rates, non-random error and different match rates, we generated simulated data. Each set of simulated files was linked using both classification methods. Infection rates in the linked data were compared with those in the gold-standard data.
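As a rough illustration of HW classification, the sketch below scores each candidate laboratory record with a log-ratio match weight in the Fellegi-Sunter style and accepts the highest-weight candidate only if it exceeds a threshold. The m- and u-probabilities, the threshold values, the record fields and the function names are all assumptions made for this example; they are not taken from the study, and PII is not shown.

```python
import math

# Hypothetical m- and u-probabilities (assumed for illustration only):
# m = P(identifier agrees | records belong to the same patient)
# u = P(identifier agrees | records belong to different patients)
M_U = {
    "dob": (0.95, 0.01),
    "sex": (0.98, 0.50),
    "soundex": (0.90, 0.02),
}


def match_weight(admission, candidate):
    """Sum log2 agreement or disagreement weights over the identifiers."""
    weight = 0.0
    for field, (m, u) in M_U.items():
        if admission[field] == candidate[field]:
            weight += math.log2(m / u)
        else:
            weight += math.log2((1 - m) / (1 - u))
    return weight


def hw_classify(admission, candidates, threshold):
    """Return the highest-weight candidate if its weight exceeds the
    threshold; otherwise return None (the admission stays unlinked)."""
    if not candidates:
        return None
    best = max(candidates, key=lambda c: match_weight(admission, c))
    return best if match_weight(admission, best) > threshold else None


# Made-up records: one admission and two candidate laboratory records.
admission = {"dob": "2005-03-14", "sex": "F", "soundex": "H650"}
candidates = [
    {"id": "lab1", "dob": "2005-03-14", "sex": "F", "soundex": "H650"},
    {"id": "lab2", "dob": "2005-03-14", "sex": "M", "soundex": "H650"},
]

link = hw_classify(admission, candidates, threshold=10.0)
print(link["id"] if link else "no link")
```

Raising the threshold in this sketch leaves more admissions unlinked, while lowering it accepts more doubtful candidates, which is the trade-off that an upper and a lower threshold are meant to bracket.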

Results

In the original gold-standard data, 1496/20924 admissions linked to an infection. In the linked original data, PII provided the least biased results: 1481 and 1457 infections (upper/lower thresholds) compared with 1316 and 1287 (HW upper/lower thresholds). In the simulated data, substantial bias (up to 112%) was introduced when linkage error varied by hospital. Bias was also greater when the match rate was low or the identifier error rate was high; in these cases, PII performed better than HW classification at reducing bias due to false matches.
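The counts reported above can be turned into relative differences from the gold standard with a few lines of arithmetic. The sketch below assumes that bias is summarised as the percentage difference in linked infection counts relative to the gold-standard count; that definition and the function name are assumptions for illustration, and the study's own bias measure may differ.

```python
GOLD_INFECTIONS = 1496   # admissions linked to an infection in the gold standard
ADMISSIONS = 20924       # total admissions

# Infection counts in the re-linked original data, as reported in the abstract.
linked_counts = {
    "PII, upper threshold": 1481,
    "PII, lower threshold": 1457,
    "HW, upper threshold": 1316,
    "HW, lower threshold": 1287,
}


def percent_bias(linked, gold):
    """Percentage difference of a linked count from the gold-standard count
    (an assumed summary measure, not necessarily the one used in the study)."""
    return 100.0 * (linked - gold) / gold


gold_rate = GOLD_INFECTIONS / ADMISSIONS
print(f"Gold-standard infection rate: {100 * gold_rate:.2f}%")

for method, count in linked_counts.items():
    print(f"{method}: {count} infections, {percent_bias(count, GOLD_INFECTIONS):+.1f}%")
```

Under this measure, both PII counts fall within about 3% of the gold standard, while both HW counts fall short by roughly 12% to 14%.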

Conclusions

This study highlights the importance of evaluating the potential impact of linkage error on results. PII can help incorporate linkage uncertainty into analysis and reduce bias due to linkage error, without requiring identifiers.
Metadata
Title
Evaluating bias due to data linkage error in electronic healthcare records
Authors
Katie Harron
Angie Wade
Ruth Gilbert
Berit Muller-Pebody
Harvey Goldstein
Publication date
01-12-2014
Publisher
BioMed Central
Published in
BMC Medical Research Methodology / Issue 1/2014
Electronic ISSN: 1471-2288
DOI
https://doi.org/10.1186/1471-2288-14-36
