Introduction

Administrative claims data offer various advantages for pharmacoepidemiologic research, but limitations are usually acknowledged rather than quantified. We review key findings from simulation studies regarding bias due to misclassification, common sources, and types of misclassification in claims data including recent findings that relate to these, and methods for minimizing and quantifying the impact of misclassification on effect estimates and associated confidence intervals.

Brief Overview of Bias Due to Misclassification

Misclassified Exposures

When an effect exists, simulation studies have shown that absolutely nondifferential misclassification of a binary exposure will, on average, bias the estimate of effect towards the null [1]. Researchers often rely on this knowledge and make the assumption that the observed effect is, at worst, an under-estimate of the true effect given nondifferential misclassification of the exposure. The exceptions to this rule of thumb are less well known. If the exposure is polytomous (i.e., is categorized in more than two levels), the bias from nondifferential misclassification could be in either direction [2]. Nor is bias toward the null guaranteed if the misclassification is related to errors in other variables or if there are other sources of systematic error (e.g., selection bias or confounding) in exposure classification [3]. In the unlikely event that each level of exposure is misclassified beyond what would be expected if exposure had been randomly assigned, the bias could remove or even reverse (i.e., beyond the null) the appearance of an association. Of course, exposure misclassification that differs by outcome status (differential) can bias in any direction.
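The expected attenuation under nondifferential misclassification of a binary exposure can be illustrated with a short calculation on expected counts. The following is a minimal sketch, not an analysis from any cited study: the function name, cohort sizes, risks, sensitivity, and specificity are all hypothetical values chosen for illustration.

```python
def observed_risk_ratio(n_exposed, n_unexposed, risk_exposed, risk_unexposed,
                        sensitivity, specificity):
    """Expected risk ratio after nondifferential misclassification of a binary exposure."""
    cases_exposed = n_exposed * risk_exposed
    cases_unexposed = n_unexposed * risk_unexposed
    false_positive = 1 - specificity
    false_negative = 1 - sensitivity
    # People and cases who end up classified as exposed.
    obs_exp_n = sensitivity * n_exposed + false_positive * n_unexposed
    obs_exp_cases = sensitivity * cases_exposed + false_positive * cases_unexposed
    # People and cases who end up classified as unexposed.
    obs_unexp_n = false_negative * n_exposed + specificity * n_unexposed
    obs_unexp_cases = false_negative * cases_exposed + specificity * cases_unexposed
    return (obs_exp_cases / obs_exp_n) / (obs_unexp_cases / obs_unexp_n)


# Hypothetical cohort (assumed values): true risk ratio = 0.10 / 0.05 = 2.0.
true_rr = observed_risk_ratio(2000, 8000, 0.10, 0.05, sensitivity=1.0, specificity=1.0)
biased_rr = observed_risk_ratio(2000, 8000, 0.10, 0.05, sensitivity=0.8, specificity=0.9)
print(round(true_rr, 2), round(biased_rr, 2))  # 2.0 1.58
```

Because the same sensitivity and specificity are applied to cases and non-cases, the misclassification is nondifferential, and the expected estimate moves from 2.0 toward 1.0. This describes the expectation only; it says nothing about the direction of error in a single study.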

Misclassified Outcomes

In most cases, nondifferential misclassification of a binary outcome will also result in bias toward the null. However, there are some exceptions to this general rule where disease misclassification is not expected to produce bias. Given perfect specificity (i.e., no false positives) and nondifferential misclassification of disease due only to low sensitivity, the risk ratio will be unbiased in expectation; however, the risk difference will be biased toward the null to the degree of the sensitivity (e.g., if the sensitivity is 80 %, the expected risk difference value will be 80 % of the true risk difference). In this situation, odds ratios and rate ratios will also be biased toward the null. If the outcome is rare in all exposure groups, the odds ratio will approximate the risk ratio and the estimate will be unbiased [3]; similarly, if the impact of misclassification on person-time is negligible, the rate ratio will be unbiased [4].
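The special case of perfect specificity can be checked with simple arithmetic. The sketch below uses hypothetical risks (0.10 and 0.05) and the 80 % sensitivity mentioned above; the function name and the second scenario with 98 % specificity are assumptions added for contrast.

```python
def observed_outcome_measures(risk_exposed, risk_unexposed, sensitivity, specificity=1.0):
    """Expected observed risk ratio and risk difference when a binary outcome
    is misclassified nondifferentially with respect to exposure."""
    def observed(risk):
        # Detected true cases plus non-cases falsely flagged as cases.
        return sensitivity * risk + (1 - specificity) * (1 - risk)

    obs_e, obs_u = observed(risk_exposed), observed(risk_unexposed)
    return obs_e / obs_u, obs_e - obs_u


# Hypothetical true risks: RR = 2.0, RD = 0.05.
rr, rd = observed_outcome_measures(0.10, 0.05, sensitivity=0.8)
print(round(rr, 3), round(rd, 3))  # 2.0 0.04

# Same sensitivity, but specificity slightly below 1 (assumed 0.98).
rr_fp, rd_fp = observed_outcome_measures(0.10, 0.05, sensitivity=0.8, specificity=0.98)
print(round(rr_fp, 3), round(rd_fp, 3))  # 1.661 0.039
```

With perfect specificity the sensitivity cancels in the ratio, so the risk ratio stays at 2.0 while the risk difference shrinks to 80 % of its true value. Even a small loss of specificity pulls the risk ratio toward the null in this example.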

Misclassified Confounders

Misclassification of a confounder leads to an estimated measure of effect that is biased toward the crude measure, if the misclassification is nondifferential with regard to both the exposure and the outcome and there are no other sources of error. The direction of the bias or “residual confounding” can be either away from or toward the null [2]. For a binary confounder, independent nondifferential misclassification will cause a bias in the direction of the confounding because the ability to control fully for the confounder has been reduced, resulting in a partially controlled result that lies between the crude and true values. This partial control is limited to situations where there is no qualitative interaction between the exposure and the confounder [5]. The most extreme effects of confounder misclassification are observed when the study exposure variable is weakly associated with the outcome in comparison to a stronger confounder [6, 7].
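The statement that adjustment for a misclassified binary confounder yields a result between the crude and true values can be illustrated with expected counts. This is a sketch under assumed inputs: the population size, confounder prevalence, exposure probabilities, risks, sensitivity, and specificity are hypothetical, and the Mantel-Haenszel risk ratio is used here simply as one convenient adjusted summary.

```python
from itertools import product


def expected_cells(n, p_c, p_e_given_c, base_risk, rr_c, rr_e):
    """Expected (persons, cases) keyed by (confounder, exposure)."""
    cells = {}
    for c, e in product((0, 1), repeat=2):
        p_conf = p_c if c else 1 - p_c
        p_exp = p_e_given_c[c] if e else 1 - p_e_given_c[c]
        persons = n * p_conf * p_exp
        risk = base_risk * (rr_c if c else 1) * (rr_e if e else 1)
        cells[(c, e)] = (persons, persons * risk)
    return cells


def misclassify_confounder(cells, sensitivity, specificity):
    """Reassign confounder level independently of exposure and outcome."""
    out = {key: [0.0, 0.0] for key in cells}
    for (c, e), (persons, cases) in cells.items():
        p_obs1 = sensitivity if c else 1 - specificity
        for c_obs, p in ((1, p_obs1), (0, 1 - p_obs1)):
            out[(c_obs, e)][0] += persons * p
            out[(c_obs, e)][1] += cases * p
    return {key: tuple(val) for key, val in out.items()}


def mantel_haenszel_rr(cells):
    """Mantel-Haenszel risk ratio across the two confounder strata."""
    num = den = 0.0
    for c in (0, 1):
        n1, a = cells[(c, 1)]
        n0, b = cells[(c, 0)]
        num += a * n0 / (n1 + n0)
        den += b * n1 / (n1 + n0)
    return num / den


def crude_rr(cells):
    n1 = sum(cells[(c, 1)][0] for c in (0, 1))
    a = sum(cells[(c, 1)][1] for c in (0, 1))
    n0 = sum(cells[(c, 0)][0] for c in (0, 1))
    b = sum(cells[(c, 0)][1] for c in (0, 1))
    return (a / n1) / (b / n0)


# Hypothetical scenario: true exposure RR = 1.5, confounder RR = 3.0.
truth = expected_cells(100_000, 0.5, {0: 0.2, 1: 0.6}, 0.05, rr_c=3.0, rr_e=1.5)
observed = misclassify_confounder(truth, sensitivity=0.7, specificity=0.8)
print(round(crude_rr(truth), 2),
      round(mantel_haenszel_rr(truth), 2),
      round(mantel_haenszel_rr(observed), 2))  # 2.25 1.5 2.05
```

In this scenario the estimate adjusted for the misclassified confounder (about 2.05) falls between the fully adjusted value (1.5) and the crude value (2.25), consistent with partial control of confounding.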

Expected Bias vs. Bias of a Single Study

The conditions under which the direction of any given bias is predictable are limited and, in practice, are impossible to guarantee. Investigators need to be aware that conventional wisdom regarding the expected bias due to misclassification in various scenarios reflects an average estimated across multiple study repetitions; thus, an estimate from a single study may not follow the direction of bias according to these rules [1, 8]. In a simulation study of nondifferential misclassification, the mean result across many trials was biased toward the null, as expected, but the estimates from the individual trials were biased both away from and toward the null [1]. This highlights the importance of quantifying the impact of misclassification in each study rather than relying on the expected direction of bias.
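The distinction between expected bias and the bias in a single study can be explored with a small simulation. The sketch below is not a reproduction of the simulation in [1]; the cohort sizes, risks, sensitivity, specificity, number of trials, and seed are hypothetical choices, and only the Python standard library is used.

```python
import math
import random

TRUE_RR = 2.0  # assumed true risk ratio (0.20 / 0.10)


def simulate_trial(rng, n_exposed=300, n_unexposed=700, risk_exposed=0.20,
                   risk_unexposed=0.10, sensitivity=0.85, specificity=0.95):
    """Simulate one cohort with nondifferential exposure misclassification."""
    counts = {True: [0, 0], False: [0, 0]}  # observed exposure -> [persons, cases]
    groups = ((True, n_exposed, risk_exposed), (False, n_unexposed, risk_unexposed))
    for truly_exposed, n, risk in groups:
        p_classified_exposed = sensitivity if truly_exposed else 1 - specificity
        for _ in range(n):
            is_case = rng.random() < risk
            observed_exposed = rng.random() < p_classified_exposed
            counts[observed_exposed][0] += 1
            counts[observed_exposed][1] += is_case
    (n1, a), (n0, b) = counts[True], counts[False]
    if min(n1, n0, a, b) == 0:
        return None  # risk ratio undefined for this trial
    return (a / n1) / (b / n0)


def run_simulation(n_trials=500, seed=2024):
    rng = random.Random(seed)
    estimates = [simulate_trial(rng) for _ in range(n_trials)]
    estimates = [rr for rr in estimates if rr is not None]
    geometric_mean = math.exp(sum(math.log(rr) for rr in estimates) / len(estimates))
    share_away_from_null = sum(rr > TRUE_RR for rr in estimates) / len(estimates)
    return geometric_mean, share_away_from_null


geometric_mean, share_away_from_null = run_simulation()
print(f"geometric mean RR: {geometric_mean:.2f}")
print(f"share of trials with RR above the true value: {share_away_from_null:.2f}")
```

Under these assumptions the average estimate is attenuated, yet a non-trivial share of individual trials produces an estimate larger than the true risk ratio, which is the pattern described above.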

Prescription Medications

Arguably, one of the strengths of using administrative claims to evaluate medication effects is the relatively complete nature of data regarding prescription fills. These data derive from insurance claims for medications that are filled by the patient at a community-based pharmacy. These data are generally superior to self-reported medication use (which is susceptible to recall bias) [9, 10]. In some cases, these data are also more accurate than records of physician-ordered prescriptions (which may include medications that are never obtained by the patient) [11, 12]. Nonetheless, there are a variety of circumstances in which these pharmacy claims may not reflect the actual medication exposure of patients.

Non-Users Misclassified as Users

This type of misclassified exposure status includes patients with prescriptions that are filled but never taken, those initially taken and then discontinued, and those taken PRN (as needed) or intermittently. One common approach to minimizing the impact of these misclassified individuals is to require evidence of a second prescription fill within a fixed period of time to increase the likelihood that patients are actually taking the medication [13, 14]. This necessitates starting follow-up at the second fill to avoid introducing immortal time, thus limiting the ability to study short-term effects [15]. We discuss the implications of imperfect identification of when medications are started and stopped in the section on misclassified duration of use below.

Users Misclassified as Non-Users

In the setting of administrative claims data, this type of misclassification occurs when patients pay for prescription medications out of pocket (including $4 generics [16–19]), receive samples [20••], or are hospitalized (as inpatient medications are typically included in the bundled payment). For administrative databases that include only those medications on a formulary (such as in Canada), there is also potentially important misclassification of exposure to specific medications in a class that are not included on the formulary. A recent study set in Canada noted a dramatic increase in the number of reported prescriptions for thiazolidinediones (TZDs) which corresponded to a change in policy providing for an automated prior-authorization process for this diabetes medication, suggesting that perhaps 20 % of prior TZD exposure was misclassified as non-use prior to the policy change [21]. There are also instances in which medications are available both with and without a prescription (e.g., analgesics, proton-pump inhibitors, antihistamines) [22]. Patients who obtain these medications over-the-counter would also be misclassified as non-users according to the insurance claims data.

The scenarios in which differential misclassification would affect users of a medication are less clear, although we can imagine that, e.g., in the US Medicare data, individuals who have more complicated medical conditions are more likely to enter the ‘donut hole’ when they become responsible for all prescription costs. These individuals would be at greater risk of experiencing outcomes such as mortality and hospitalization, and would also be more likely to obtain a prescription from a $4 generic list and pay out of pocket if they did not expect to accrue sufficient additional prescription costs during the remainder of the benefit year to qualify for catastrophic coverage. Thus, the sensitivity with which truly exposed individuals would be correctly classified as exposed might differ by outcome status.

Duration of Use Misclassified

Misclassification of the timing of an event (new use of a therapy, or the occurrence of the outcome) has received little attention, perhaps because much of the literature on misclassification deals with settings in which the data can be represented in the form of a 2 × 2 table. But it is the rare analysis in pharmacoepidemiology that conforms to this structure. More often, the timing of exposures, outcomes, and covariates is complex and the analyst must carefully evaluate the sequencing of these to ensure that the proper temporal relationships between them are maintained.

The new user design is intended to synchronize patients with respect to the duration of the treatment and ensure that covariates have not been affected by prior exposure [23]. In the case where patients receive free medication samples from the prescribing physician for a new prescription, there is an initial period of use that is misclassified as preceding initiation using claims data. Among patients who go on to file a prescription claim, there is selection bias [24] (non-responders and those with early adverse effects would be less likely to fill) and a difference in the duration of exposure compared to true new users. In one recent paper, 13.4 % of ‘new’ users of a branded statin had lab values suggesting that they were prevalent users, whereas there was no indication that those identified as new users of generic statins were prevalent users [20••].

A similar scenario could occur in which patients paid out of pocket for medication while awaiting special authorization. In a simulation study of this form of misclassified person-time, Gamble et al. found substantial bias of the hazard ratio for mortality for new users of metformin versus sulfonylureas as an increasing proportion of users of metformin were misclassified as non-users while awaiting special authorization [25]. There is also evidence suggesting that the days-supply associated with the prescription, which is used to determine periods of continuous use and discontinuation (for an as-treated analysis) is not uniformly recorded. In Ontario, investigators studied the pattern of days-supply for osteoporosis medications and found that those filled for patients in a long-term care facility were substantially shorter than those for community-dwelling patients [26].

In addition, the effects of the treatment often vary considerably with time [27], and by misidentifying the start of therapy the comparison groups may not reflect the same duration of treatment. These problems would be most pronounced when the duration of follow-up is relatively short, the hazard is not constant, and the extent of misclassification differs between the groups. While within-subject study designs such as the case-crossover [28] and self-controlled case series [29, 30] minimize bias due to confounding by time-invariant characteristics and comorbidities, they remain susceptible to bias due to misclassification of exposure.

Clinical Outcomes

The ability to study rare clinical outcomes in a very large, population-based sample is a potential strength of claims data, but likewise a source of concern due to potential misclassification. Outcomes such as death are considered reliable in some data sources while they are only observed when they occur in the hospital in other data sources [31]. Clinical events may be acute and result in hospitalization (such as hip fracture) or chronic with or without specific clinical interventions (such as type II diabetes). The degree of misclassification of these outcomes can vary considerably.

Medical procedures are considered reliable in billing data given the close relationship and regulated nature of billings for procedures and physician payment. ICD-9 procedure codes, used by hospitals to bill for the facility component of charges, are not sufficiently specific in many instances, while CPT codes, used by physicians to bill for their services, are more specific.

The importance of validating clinical outcomes has been appreciated since the early days of studies conducted using health care claims data as a means by which to assure that a highly specific outcome definition was devised. These are routinely based on an algorithm that may include multiple instances of a given diagnosis code on unique service dates, prescription fills, claims from an inpatient setting, and/or specific procedures to maximize the specificity of the outcome [32••].

Misclassification of outcomes can occur differentially by exposure status by virtue of the fact that individuals who receive a prescription may receive more intensive health screenings and monitoring than patients who are not receiving medication or are receiving a different medication. These might include differences in health-seeking behaviors (screening or diagnostic workups for suspected health problems) [33], more frequent lab testing for potential liver or kidney damage if the medication is suspected to increase risk [34], or use of follow-up colonoscopies after selected types of radiation [35, 36]. This would decrease the proportion of individuals who have the outcome who are incorrectly classified as unaffected among patients with the exposure.

Confounders

Misclassification of confounders in the setting of claims data is a significant concern because the absence of a diagnosis or related procedures in claims during a specified time period is taken to indicate the absence of the condition. Patients who do not have healthcare encounters will not generate evidence of their conditions, and those with significant co-morbidity may not have evidence of common, less serious conditions (such as hypertension) when they are under active treatment for something more serious (such as coronary artery bypass grafting, CABG) [37]. Typically, studies using insurance claims data define a baseline period during which individuals must be continuously enrolled [38]. But a recent simulation study suggested that using all available data to define confounders may better control confounding than restricting to a uniform time period [39••]. The robustness of this finding under a variety of conditions is still being established, but it serves as a challenge to reconsider the status quo.

The Special Case of Comparative Effectiveness Research (CER)

Just as questions were being raised about the use of placebo-controlled trials when effective treatment alternatives were available [40], pharmacoepidemiologists began to recognize the value of active comparators in the setting of non-experimental research on medication safety and effectiveness. The comparison of two active agents has made pharmacoepidemiologic studies less susceptible to biases due to confounding by indication, healthy user bias, confounding by frailty, and other sources of unmeasured confounding [41]. In addition, biases due to misclassification of confounders and outcomes (described above) are likely less pronounced with an active comparator. That said, there are several aspects of comparative effectiveness studies which make them particularly susceptible to bias due to misclassification, including the comparison of two active treatments, modest effect sizes that are clinically meaningful, the value of absolute measures of effect (such as the risk difference), and the extreme precision that comes from analyzing large datasets.

Comparing Active Treatments

In studies of comparative effectiveness in which two active treatments are being compared, there are at least three (and possibly more) levels of exposure: non-user, user of medication A, and user of medication B. Misclassification of individuals who were truly exposed to medications A and B would place individuals in the non-user category, not in the other category of exposure. Misclassification of this type could result in estimates that are toward or away from the null, even though there are only two levels of exposure being analyzed.

We present a hypothetical example in Table 1 in which each of the medications increases the risk of the outcome two-fold (RR = 2.0) compared to unexposed individuals. (An Excel version of the spreadsheet is available at http://www.unc.edu/~mfunk to facilitate exploration of alternative scenarios.) We apply nondifferential misclassification of each medication with the unexposed group and assume that there is no misclassification of users between the two medication groups. We observe the classic finding that nondifferential misclassification biases the relative risk (RR) toward the null in two hypothetical ‘studies’ comparing each active medication to a group of non-users. But because the degree of bias toward the null is not uniform across the two studies, the resulting head-to-head comparison of medication A versus medication B is biased away from the null: an apparent 20 % increase in risk (RR = 1.2) whereas the true effect is null. In the example, we applied different degrees of sensitivity and specificity to each medication study, but the bias we observed is not dependent on that. (Using uniform sensitivity and specificity across the medications, the bias increases to a relative risk of 1.4.) Rather, either a difference in the prevalence of the medications in the population or differential specificity of the exposure can lead to bias in the relative effect comparing the medications to each other.

Table 1 Hypothetical example of studies in which individuals exposed to one of two drugs are each compared with non-users, or compared with each other in the presence of nondifferential exposure misclassification.
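The mechanism behind the head-to-head bias can be sketched with expected counts. The inputs below are hypothetical and are not the values used in Table 1 or in the authors' spreadsheet: drug prevalences of 20 % and 5 %, a baseline risk of 5 %, and the same sensitivity (0.8) and false-positive probability (0.05) for both drugs are assumptions chosen to show that unequal prevalence alone is enough to produce a non-null comparison.

```python
def head_to_head(n_total, prev_a, prev_b, base_risk, rr_a, rr_b,
                 sens_a, sens_b, fp_a, fp_b):
    """Expected observed risk ratios when users of drug A or B can only be
    confused with non-users (never with each other)."""
    n_a, n_b = n_total * prev_a, n_total * prev_b
    n_none = n_total - n_a - n_b
    risk_a, risk_b = base_risk * rr_a, base_risk * rr_b

    # Recorded as drug A: true A users captured plus non-users falsely flagged.
    obs_a_n = sens_a * n_a + fp_a * n_none
    obs_a_cases = sens_a * n_a * risk_a + fp_a * n_none * base_risk
    # Recorded as drug B, built the same way.
    obs_b_n = sens_b * n_b + fp_b * n_none
    obs_b_cases = sens_b * n_b * risk_b + fp_b * n_none * base_risk
    # Recorded as non-users: missed users plus correctly classified non-users.
    obs_none_n = (1 - sens_a) * n_a + (1 - sens_b) * n_b + (1 - fp_a - fp_b) * n_none
    obs_none_cases = ((1 - sens_a) * n_a * risk_a + (1 - sens_b) * n_b * risk_b
                      + (1 - fp_a - fp_b) * n_none * base_risk)

    risk_obs_a = obs_a_cases / obs_a_n
    risk_obs_b = obs_b_cases / obs_b_n
    risk_obs_none = obs_none_cases / obs_none_n
    return {
        "A vs non-users": risk_obs_a / risk_obs_none,
        "B vs non-users": risk_obs_b / risk_obs_none,
        "A vs B": risk_obs_a / risk_obs_b,
    }


# Hypothetical inputs: both drugs truly double the risk, so the true A vs B RR is 1.0.
result = head_to_head(100_000, prev_a=0.20, prev_b=0.05, base_risk=0.05,
                      rr_a=2.0, rr_b=2.0, sens_a=0.8, sens_b=0.8, fp_a=0.05, fp_b=0.05)
for label, rr in result.items():
    print(label, round(rr, 2))
# A vs non-users 1.69
# B vs non-users 1.42
# A vs B 1.19
```

Both comparisons with non-users are attenuated, but the less prevalent drug B is diluted more by falsely flagged non-users, so the comparison of A with B departs from the true null value in this example.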

Modest Effect Sizes

Many important differences in safety and effectiveness of active treatments are in the range of 20 to 40 % [42–44]. The potential for bias to obscure a clinically relevant difference (or create the appearance of a difference where there is none) is heightened in this context. Modest effect sizes are particularly susceptible to the effects of residual confounding due to misclassified covariate data. In light of the potential for bias due to exposure misclassification that could be in any direction, this is a setting in which validation studies and quantifying the impact on estimates and uncertainty are particularly important.

Absolute Effect Measures

The choice of effect measures in CER also increases concern about bias due to misclassification. While relative effect measures remain dominant, there is growing recognition that absolute measures are important, particularly in terms of communicating the relevance of the findings to patients [45, 46]. Achieving near perfect specificity in the outcome classification may allow us to claim that the relative effect estimate is unlikely to be considerably biased, but the estimated risks and risk differences will still be under-estimated if the sensitivity is not perfect unless further analysis is used to correct for the non-perfect sensitivity of the outcome definition.

Very Large Study Sizes

Analyses of claims data are powerful and allow us to examine rare outcomes. Very large sample sizes may, however, give the appearance of precision, making a very small increase or decrease (e.g., HR = 1.06, 95 % CI 1.01, 1.11) appear statistically significant, or making a null effect seem to exclude any possibility of a protective or harmful effect (e.g., 95 % CI 0.96, 1.04). In the presence of misclassification, these confidence intervals misrepresent the true uncertainty about the estimate.

Because of the very nature of comparative effectiveness research, quantifying the extent of these errors and adjusting the effect estimates and their confidence intervals is particularly important. Various methods for doing so have been developed and are discussed in the following section.

Quantifying Impact and Adjusting Estimates

In this section, we highlight several methods that can be used to quantify the effect of bias due to misclassification. These methods are summarized in Table 2. Here we focus on understanding how these might be best applied in the setting of treatment effect estimation using claims data.

Table 2 Review of methods for quantifying bias due to misclassification

Simple Bias Analysis

This method is the easiest to implement, but also has the most limited potential for use in the setting of pharmacoepidemiology. It re-allocates the observed, tabled data to the underlying ‘true’ tabled data using point estimates for (possibly differential) sensitivity and specificity or positive and negative predictive values. The corrections can be applied to categorized exposure, outcome, or covariates. Corrections can be implemented based only on expert opinion or estimates from the literature, which is an advantage in the setting where validation data are not available. This approach would be suitable for the analysis of short-term outcomes (such as in-hospital mortality) where all individuals are followed for a consistent period of time, but it is not suitable for outcomes that are partially censored (time-to-event). It does not account for error in the estimation of the sensitivity and specificity in the adjusted effect estimates, and it does not simultaneously control for other covariates. Lash et al. (2009) [8] provide an excellent, in-depth discussion of this approach as well as an Excel spreadsheet.
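The re-allocation step can be sketched for a misclassified binary exposure in a cohort with complete follow-up. This is a minimal illustration using the standard back-calculation from sensitivity and specificity, not the spreadsheet provided by Lash et al.; the function names, the observed counts, and the sensitivity and specificity values are hypothetical.

```python
def correct_counts(exposed, unexposed, sensitivity, specificity):
    """Back-calculate the expected true exposed and unexposed counts
    within one outcome stratum (cases or non-cases)."""
    if sensitivity + specificity <= 1:
        raise ValueError("sensitivity + specificity must exceed 1")
    total = exposed + unexposed
    true_exposed = (exposed - (1 - specificity) * total) / (sensitivity + specificity - 1)
    true_unexposed = total - true_exposed
    if true_exposed < 0 or true_unexposed < 0:
        raise ValueError("bias parameters are incompatible with the observed data")
    return true_exposed, true_unexposed


def corrected_risk_ratio(a, b, c, d, se_cases, sp_cases, se_noncases, sp_noncases):
    """a, b: observed exposed and unexposed cases;
    c, d: observed exposed and unexposed non-cases.
    Separate parameters by outcome status allow differential misclassification."""
    cases_exp, cases_unexp = correct_counts(a, b, se_cases, sp_cases)
    non_exp, non_unexp = correct_counts(c, d, se_noncases, sp_noncases)
    risk_exp = cases_exp / (cases_exp + non_exp)
    risk_unexp = cases_unexp / (cases_unexp + non_unexp)
    return risk_exp / risk_unexp


# Hypothetical observed table; the observed risk ratio is about 1.58.
a, b, c, d = 200, 400, 2200, 7200
observed_rr = (a / (a + c)) / (b / (b + d))
adjusted_rr = corrected_risk_ratio(a, b, c, d, se_cases=0.8, sp_cases=0.9,
                                   se_noncases=0.8, sp_noncases=0.9)
print(round(observed_rr, 2), round(adjusted_rr, 2))  # 1.58 2.0
```

The adjusted estimate depends entirely on the point values supplied for sensitivity and specificity, and the sketch carries none of their uncertainty into the result, which is the limitation noted above.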

Probabilistic Bias Analysis

This approach is essentially an iterative version of the simple bias analysis which uses a distribution of values for sensitivity and specificity (or positive and negative predictive values) combined with a Monte Carlo process to produce a distribution of estimates adjusted for misclassification. The credible intervals from this analysis can reflect the uncertainty around the validation measures. This method can also be applied at the record level so that misclassified exposures can be evaluated while controlling for measured covariates in the setting of a time-to-event outcome [47••], making this an excellent choice for use in pharmacoepidemiology studies. This method is also described in detail by Lash et al. (2009) [8] in Chapter 8, and Fox et al. (2005) [48] provide a SAS macro to facilitate application of this method.
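A summary-level version of this Monte Carlo process can be sketched as follows. It is not the SAS macro of Fox et al. and does not implement the record-level approach; the observed counts, the beta distributions for sensitivity and specificity, the number of draws, and the seed are hypothetical, and the sketch propagates only uncertainty in the bias parameters (a fuller analysis would also incorporate conventional random error).

```python
import random


def corrected_rr(a, b, c, d, se, sp):
    """Back-correct a cohort 2 x 2 table for nondifferential exposure
    misclassification; return None when the draw implies impossible counts."""
    denom = se + sp - 1
    if denom <= 0:
        return None
    cases_exp = (a - (1 - sp) * (a + b)) / denom
    noncases_exp = (c - (1 - sp) * (c + d)) / denom
    cases_unexp = (a + b) - cases_exp
    noncases_unexp = (c + d) - noncases_exp
    if min(cases_exp, noncases_exp, cases_unexp, noncases_unexp) <= 0:
        return None
    risk_exp = cases_exp / (cases_exp + noncases_exp)
    risk_unexp = cases_unexp / (cases_unexp + noncases_unexp)
    return risk_exp / risk_unexp


def probabilistic_bias_analysis(a, b, c, d, se_beta, sp_beta, n_draws=5000, seed=1):
    """Return the 2.5th, 50th, and 97.5th percentiles of the corrected risk ratio."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_draws):
        se = rng.betavariate(*se_beta)  # draw a sensitivity
        sp = rng.betavariate(*sp_beta)  # draw a specificity
        rr = corrected_rr(a, b, c, d, se, sp)
        if rr is not None:  # discard draws incompatible with the data
            estimates.append(rr)
    estimates.sort()

    def percentile(q):
        return estimates[int(q * (len(estimates) - 1))]

    return percentile(0.025), percentile(0.5), percentile(0.975)


# Hypothetical observed table and assumed priors: Beta(80, 20) has mean 0.8
# (sensitivity) and Beta(90, 10) has mean 0.9 (specificity).
lower, median, upper = probabilistic_bias_analysis(
    200, 400, 2200, 7200, se_beta=(80, 20), sp_beta=(90, 10))
print(f"median RR {median:.2f} (simulation interval {lower:.2f}, {upper:.2f})")
```

The width of the resulting interval reflects how much the adjusted estimate moves as sensitivity and specificity vary across their assumed distributions, which a single-value simple bias analysis cannot show.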

Bayesian Bias Analysis

Bayesian bias analysis is similar to the probabilistic bias analysis, but with the addition of prior distributions for all parameters – not just those for misclassification. Like probabilistic bias analysis, random error is reflected in the adjusted effect estimates. In most cases, this method does not out-perform probabilistic bias analysis [49]. The more complex implementation (in terms of software and programming) makes the Bayesian approach less attractive as a general method for application in analyses of claims data, although code for applying this method using BUGS has been published by MacLehose et al. 2009 in an online appendix [50].

Modified Maximum Likelihood

This method uses the full data (rather than tabled data) to fit a modified maximum likelihood that forces the sensitivity and specificity to be less than perfect. This method has been demonstrated with dichotomous and polytomous exposures and outcomes, including outcomes that are Poisson distributed to estimate the rate ratio. This method would be suitable for analyses in which follow-up time varies between individuals (for estimating rates rather than risks) and the hazard is approximately constant. Edwards et al. 2014 provide sample SAS code for this method [51•].

Multiple Imputation for Measurement Error (MIME)

In this approach, the true value of the misclassified variable is treated as partially missing data. The gold standard measures from an internal validation sample are used to fit a model for the imperfect data, and multiple datasets with imputed values for the misclassified variable are created. The effect estimates from analyses of these datasets are then combined to account for the variability introduced through the imputation. This approach would be well suited to analyses of claims data in which the exposure or covariate are misclassified and the outcome has been ascertained during differing amounts of follow-up time. Cole et al. provide SAS code for implementing MIME [52].

Regression Calibration

This method is best suited to the setting in which a continuous variable (exposure or covariate) is measured with error. Regression calibration can take advantage of multiple, imperfect measures of a characteristic (such as blood pressure) in the absence of a single gold standard measure [53, 54]. This approach has been used extensively in the field of nutritional epidemiology, but could be useful in pharmacoepidemiology studies in which lab results are available. A SAS macro is provided by Logan and Spiegelman for the correction of measurement error in the context of logistic regression [55].

Propensity Score Calibration

Propensity score calibration addresses covariate misclassification and measurement error by treating the propensity score as having been estimated with error. By fitting a ‘gold standard’ propensity score using the covariate data that is measured without error (in addition to the data measured with error) in a validation study, one can adjust the error prone propensity score values. Like regression calibration, propensity score calibration requires a surrogacy assumption. Surrogacy is, however, less likely to hold for the propensity score than for a mismeasured covariate [56]. Given the prominence of propensity score analyses in the pharmacoepidemiology field, this is a natural extension of the analytic methods used in many studies. Stürmer et al. provide SAS code in an online Appendix [57].

Challenges

Accessible Methods to Account for Misclassification in Complex Data

The relatively few published papers in which methods for accounting for misclassification have been applied tend to ‘cluster’ around the authors who originally published these methods. Clustering of examples for methods with the original method ‘creator’ on the paper suggests that implementation remains a significant challenge to widespread adoption of these methods. In the context of claims data, hundreds of covariates (many of which are presumably measured with error) related to thousands of individuals pose difficult logistical problems for applying these methods and presenting an integrated view of the effect of misclassification. The use of directed acyclic graphs to examine potential sources of bias due to misclassification and measurement error may allow the analyst to identify the variables of greatest concern (exposure, outcome, and/or specific confounders) so that efforts to quantify their effects can be targeted [58].

In Search of a Gold Standard for Prescription Exposures

There has also been an assumption that prescription claims data were sufficiently reliable that there was little concern for misclassification of exposure. Compared to self-reported data on medication use, which are often collected retrospectively, these data are likely more reliable. But they are not infallible, as shown by Li et al., Lauffenberger et al., and others [16, 20••]. It is not yet clear what the gold-standard measure for prescription medication use should be in light of research showing that administrative claims, physician orders, medical records, pharmacy records, and self-report are all subject to some degree of error.

Methods for Censored Outcomes

In the context of pharmacoepidemiologic analyses, follow-up time is typically censored, which necessitates the use of methods such as Poisson regression, Kaplan-Meier life tables, or Cox proportional hazards regression to estimate the treatment effect. Some of the methods noted here allow the investigator to account for misclassification of the exposure or covariates in this setting. The challenge of possibly misclassified outcomes, including the time at which the outcome occurred, is not as tractable.

Further Development of Methods to Handle Misclassified Person-Time

Analytic methods for addressing misclassified data are not yet able to adjust easily for errors in the timing of treatment initiation. This poses a challenge for studies in which patients may be identified as new users, but are actually prevalent users. Recent work by Ahrens et al. [47••] points to one way forward, but this approach has not yet been applied to the claims setting. Further development of this method would enable more thoughtful investigation of the impact of errors in the identification of treatment initiation and discontinuation which are particularly important given the time-varying nature of medication effects. In addition, valid methods are needed to adjust estimates and confidence intervals from self-controlled study designs given that the outcomes of interest in this setting are typically acute, and therefore, misclassified duration of use would be more problematic.

Conclusions

While it is common practice in pharmacoepidemiology to conduct and report the results of sensitivity analyses that examine the influence of many of the assumptions and decisions made during the design and conduct of the study, we found few examples in the literature of sensitivity analyses that quantified the impact of misclassification. Perhaps the greatest challenge in this area is to acknowledge and then quantify the imperfect nature of claims data in spite of this status quo. Particularly with the rise of comparative effectiveness research, we cannot rely on nondifferential misclassification of the exposure to bias effect estimates toward the null. Rather than speculate about the effects of misclassification, we can and should be quantifying the impact on estimates and the uncertainty around them more accurately than we are currently doing using confidence intervals based on sampling error alone.