Published online Sep 21, 2017.
https://doi.org/10.3348/kjr.2017.18.6.888
Selection and Reporting of Statistical Methods to Assess Reliability of a Diagnostic Test: Conformity to Recommended Methods in a Peer-Reviewed Journal
Abstract
Objective
To evaluate the frequency and adequacy of statistical analyses in a general radiology journal when reporting a reliability analysis for a diagnostic test.
Materials and Methods
Sixty-three studies of diagnostic test accuracy (DTA) and 36 studies reporting reliability analyses published in the Korean Journal of Radiology between 2012 and 2016 were analyzed. Studies were judged using the methodological guidelines of the Radiological Society of North America-Quantitative Imaging Biomarkers Alliance (RSNA-QIBA), and COnsensus-based Standards for the selection of health Measurement INstruments (COSMIN) initiative. DTA studies were evaluated by nine editorial board members of the journal. Reliability studies were evaluated by study reviewers experienced with reliability analysis.
Results
Thirty-one (49.2%) of the 63 DTA studies did not include a reliability analysis when deemed necessary. Among the 36 reliability studies, proper statistical methods were used in all (5/5) studies dealing with dichotomous/nominal data, 46.7% (7/15) of studies dealing with ordinal data, and 95.2% (20/21) of studies dealing with continuous data. Statistical methods were described in sufficient detail regarding weighted kappa in 28.6% (2/7) of studies and regarding the model and assumptions of intraclass correlation coefficient in 35.3% (6/17) and 29.4% (5/17) of studies, respectively. Reliability parameters were used as if they were agreement parameters in 23.1% (3/13) of studies. Reproducibility and repeatability were used incorrectly in 20% (3/15) of studies.
Conclusion
Greater attention to the importance of reporting reliability, more thorough description of the related statistical methods, efforts not to neglect agreement parameters, and better use of relevant terminology are necessary.
INTRODUCTION
In addition to its accuracy, reliability (used in this article as an umbrella term to cover various concepts such as reproducibility, repeatability, and agreement, except when used in the fixed expression “reliability parameter,” which will be further explained later in the Materials and Methods section) is an important performance metric of a diagnostic test (1, 2). The problem of omitting a proper analysis of reliability in diagnostic research studies has previously been recognized (1, 2). However, this issue was still cited as one of the top 10 statistical errors seen in the submissions to one prominent journal in the field of medical imaging in the recent past (3). The lack of familiarity of the investigators and peer reviewers with the statistical tools designed for this purpose was among the main reasons for the suboptimal reporting of reliability analysis in diagnostic research studies (1). To help guide the proper use of the statistical tools for reliability analysis, the Radiological Society of North America-Quantitative Imaging Biomarkers Alliance (RSNA-QIBA) (https://www.rsna.org/QIBA) and the COnsensus-based Standards for the selection of health Measurement INstruments (COSMIN) initiative (http://www.cosmin.nl) have recently provided methodological guides (4, 5, 6). Furthermore, it appears that investigators, and perhaps also journals themselves, might be less attentive to reporting the reliability analysis than to reporting the accuracy analysis. For example, although the Guidelines for Reporting Reliability and Agreement Studies (GRRAS) (7) exist, these do not seem to be as well known or referred to as often as the STAndards for Reporting of Diagnostic accuracy (STARD) (8). According to a study by a general radiology journal, the Korean Journal of Radiology, many more studies reporting diagnostic accuracy than studies reporting reliability were published in the same period (9).
Furthermore, in contrast with multiple secondary research studies analyzing the reporting quality of diagnostic test accuracy (DTA) (9, 10, 11, 12, 13, 14), similar secondary research studies of reliability analyses are scarce.
In this regard, we performed this study to evaluate the frequency of reporting a reliability analysis in DTA studies. In addition, we aimed to assess how appropriately the statistical methods for reliability analysis were selected and reported in published studies using the methodological guides provided by the RSNA-QIBA and COSMIN initiative as the adjudication tool with studies from a general radiology journal as a sample.
MATERIALS AND METHODS
Article Search Strategy and Study Selection
We conducted a search to identify all potentially relevant original research papers from the articles published in a single peer-reviewed journal, the Korean Journal of Radiology, during the 5-year period between January 1, 2012 and December 31, 2016 using the PubMed Medline database. The search terms to find DTA studies were “sensitivity” OR “specificity” OR “accuracy” OR “performance” OR “receiver operating” OR “ROC.” The search terms to find studies that analyzed reliability were “reliability” OR “repeatability” OR “reproducibility” OR “agreement” OR “precision” OR “biomarker.” Retrieved articles were screened for eligibility. Regarding the DTA studies, one reviewer experienced in DTA studies selected eligible articles according to criteria established elsewhere (9), with additional confirmation by another DTA expert in cases of ambiguity. Of the initial 124 candidate articles, 63 articles (15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77) were finally included. Regarding the studies that analyzed reliability, eligible articles were chosen by consensus after review by two of four independent reviewers experienced in the relevant methodology. When the two reviewers disagreed or in cases of ambiguity, a third reviewer experienced in related methodology was invited as an adjudicator. We excluded studies that investigated the agreement between continuous or ordinal outcomes/test results and fixed reference standard results (78, 79, 80). These studies could be viewed as extensions of DTA analysis to non-binary data, which require different statistical analyses (81) than the standard analysis used for reliability, although some published studies seem to have failed to distinguish between them.
Of the initial 71 article candidates, 36 articles (15, 19, 42, 45, 53, 57, 58, 64, 66, 67, 69, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106) were finally included.
Data Extraction for DTA Studies
Diagnostic test accuracy studies were evaluated regarding whether they also analyzed the reliability of the investigated tests/methods and, when reliability was not assessed, whether the reliability analysis was deemed necessary per se. We considered reliability analysis unnecessary if the tests/methods investigated in a DTA study were only a minor component of the study or if their reliability was already well established. The extraction of this information was performed by nine independent editorial board members of the journal (names are listed in the acknowledgment section). Each reviewer was assigned to the articles in his/her area of expertise (two to ten articles per reviewer). When there was doubt, a second reviewer additionally reviewed the article to reach a consensus decision with the original reviewer.
Data Extraction for Reliability Studies
Before data extraction, we first established the recommended statistical methods for the analysis of the reliability of a test/method (Table 1) according to the methodological guides provided by the RSNA-QIBA and COSMIN initiative (4, 6, 107, 108). We then used the table as the reference when evaluating whether the articles conformed to the recommended statistical methods. Each article was evaluated by two of four independent reviewers experienced in the statistical methodology. Disagreements between the two reviewers were adjudicated by two additional reviewers (a biostatistician) both of whom were also experienced in the statistical methodology. The reviewers extracted the data using a predetermined standardized set of questionnaires, which were intended to address the following issues. First, whether authors used the proper statistical methods according to the suggestions that we established for this study (Table 1). Second, whether authors provided a detailed description of the statistical methods. Third, for studies assessing the reliability of a continuous outcome, whether authors distinguished between the “reliability parameter” and “agreement parameter” (Table 1) and used them appropriately with respect to the study purpose and conclusion. Fourth, when the terms “reproducibility” and “repeatability” were used, whether authors used the correct definitions.
Table 1
Recommended Statistical Methods for Analysis of Reliability
The “reliability parameter” is a term that has a specific meaning as defined elsewhere (4, 108), unlike reliability, which is used as a general umbrella term. Reliability parameters, such as the intraclass correlation coefficient (ICC) or concordance correlation coefficient, explain how well the subjects in a study set can be distinguished from each other (108), but they do not show the exact measurement uncertainties. Small measurement uncertainties (as opposed to large measurement uncertainties) would allow for a clear distinction between the subjects, yielding a large reliability parameter score. However, a clear distinction between subjects can also be obtained even with large measurement uncertainties if there are large differences between subjects (statistically referred to as a large between-subject variance). Therefore, although reliability parameters are useful in making a relative comparison between different tests/methods regarding their levels of reliability, i.e., a higher score means greater reliability (109), they are not helpful if one wants to know what specific range of measurement differences should be considered true changes instead of mere measurement uncertainties in a longitudinal follow-up. On the other hand, “agreement parameters” assess exactly how close the results for repeated measurements are (108). Therefore, agreement parameters can be used both for the relative comparison of reliability and assessment of absolute measurement uncertainties. Agreement parameters are needed when investigating a test/method for potential use in a longitudinal follow-up setting.
Repeatability, as defined by RSNA-QIBA, concerns repeated measurements of the same or similar experimental units under identical or near-identical conditions, using the same measurement procedure, same operators, same measuring system, same operating conditions, and same physical location over a short period (5, 6). On the other hand, reproducibility applies to rerunning a measurement in slightly different settings, for example, different locations, operators, scanners, etc. (5, 6).
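The contrast between reliability and agreement parameters described above can be made concrete with a small numerical sketch. The two data sets below are invented for illustration (they are not from any reviewed study), the function names are hypothetical, and the one-way random-effects ICC is used only as one convenient example of a reliability parameter. Both data sets have exactly the same spread between repeated measurements; only the spread between subjects differs.

```python
from math import sqrt
from statistics import mean, variance


def within_subject_sd(data):
    """Agreement-type summary: pooled SD of the repeated measurements."""
    return sqrt(mean([variance(row) for row in data]))


def one_way_icc(data):
    """Reliability-type summary: one-way random-effects ICC, single measurement."""
    k = len(data[0])  # repeats per subject (assumed equal for all subjects)
    msw = mean([variance(row) for row in data])        # within-subject mean square
    msb = k * variance([mean(row) for row in data])    # between-subject mean square
    return (msb - msw) / (msb + (k - 1) * msw)


# Hypothetical repeated measurements: each row is one subject measured twice.
narrow = [[9.0, 11.0], [10.0, 12.0], [11.0, 13.0], [12.0, 14.0], [13.0, 15.0]]
wide = [[0.0, 2.0], [10.0, 12.0], [20.0, 22.0], [30.0, 32.0], [40.0, 42.0]]

for name, data in (("narrow", narrow), ("wide", wide)):
    print(name, round(within_subject_sd(data), 3), round(one_way_icc(data), 3))
```

With these numbers the within-subject SD is about 1.414 in both data sets, whereas the ICC is about 0.429 for the narrow sample and about 0.992 for the wide sample. The measurement uncertainty is identical, yet the reliability parameter changes greatly with the between-subject variance, which is why it cannot replace an agreement parameter in a follow-up setting.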
Statistical Analysis
We obtained the following study outcomes in a descriptive manner using proportions, i.e., the percentage of articles out of all eligible articles, for each of the following outcome categories:
Reporting of reliability along with accuracy.
Use of the recommended statistical methods. We considered a study to satisfy this item if it used at least one method listed in Table 1; further details were not required for this item (for example, explanations of the weighting method for weighted kappa or descriptions of the ICC model and assumption were not considered). The results were obtained for each of three different data types (dichotomous/nominal, ordinal, and continuous data).
Reporting of the weighting method when weighted kappa was used.
Reporting of the model and assumption when ICC was used.
Appropriate use/interpretation of reliability parameters.
Correct use of the terms reproducibility and repeatability.
RESULTS
Reporting of Reliability along with Accuracy
Of the 63 DTA studies (15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77), 32 studies (50.8%) included an analysis of reliability (n = 22) or did not include reliability analysis when the analysis was not necessary (n = 10). Thirty-one articles (49.2%) did not include a reliability analysis in cases where the analysis was deemed necessary.
Selection and Reporting of Statistical Methods to Assess Reliability
The results obtained from the 36 eligible studies (15, 19, 42, 45, 53, 57, 58, 64, 66, 67, 69, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106) are summarized in Table 2.
Table 2
Selection and Reporting of Statistical Methods to Assess Reliability
Of the five studies that reported an analysis of dichotomous/nominal data, four studies used kappa, and one study used both kappa and proportion of agreement.
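For readers less familiar with the two statistics named above, the following is a minimal sketch of the proportion of agreement and Cohen's kappa for two readers. The ratings are invented for illustration, and the function names are hypothetical rather than taken from any of the reviewed studies or from a particular software package.

```python
from collections import Counter


def proportion_agreement(r1, r2):
    """Fraction of cases on which the two readers gave the same rating."""
    return sum(a == b for a, b in zip(r1, r2)) / len(r1)


def cohen_kappa(r1, r2):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(r1)
    po = proportion_agreement(r1, r2)
    c1, c2 = Counter(r1), Counter(r2)
    # Chance agreement expected from each reader's marginal rating frequencies.
    pe = sum(c1[c] * c2[c] for c in set(r1) | set(r2)) / n ** 2
    return (po - pe) / (1 - pe)


# Hypothetical dichotomous readings (1 = positive, 0 = negative) for 10 cases.
reader_a = [1, 1, 1, 1, 0, 0, 0, 0, 1, 0]
reader_b = [1, 1, 1, 0, 0, 0, 0, 1, 1, 0]

print(proportion_agreement(reader_a, reader_b))
print(round(cohen_kappa(reader_a, reader_b), 3))
```

Here the readers agree on 8 of 10 cases (proportion of agreement 0.8), but half of that agreement would be expected by chance given the marginal frequencies, so kappa is 0.6. Reporting both values, as one of the five studies did, shows how much of the raw agreement survives the chance correction.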
Of the 15 studies that reported an analysis of ordinal data, six studies used weighted kappa, and one study used both weighted kappa and ICC, whereas eight studies used kappa without clarifying if they calculated weighted kappa.
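Why the weighting scheme needs to be stated can be seen from a short sketch. The function below is a hypothetical, self-contained implementation of kappa with no weights, linear weights, or quadratic weights; the ordinal scores are invented for illustration and do not come from the reviewed studies.

```python
def weighted_kappa(r1, r2, categories, weights="linear"):
    """Kappa for two raters on an ordered scale with a chosen weighting scheme."""
    k = len(categories)
    idx = {c: i for i, c in enumerate(categories)}
    n = len(r1)

    # Observed joint proportions and each rater's marginal proportions.
    obs = [[0.0] * k for _ in range(k)]
    for a, b in zip(r1, r2):
        obs[idx[a]][idx[b]] += 1 / n
    row = [sum(obs[i]) for i in range(k)]
    col = [sum(obs[i][j] for i in range(k)) for j in range(k)]

    def disagreement(i, j):
        d = abs(i - j) / (k - 1)
        if weights == "unweighted":
            return 0.0 if i == j else 1.0
        if weights == "linear":
            return d
        if weights == "quadratic":
            return d ** 2
        raise ValueError(f"unknown weighting scheme: {weights}")

    observed = sum(disagreement(i, j) * obs[i][j] for i in range(k) for j in range(k))
    expected = sum(disagreement(i, j) * row[i] * col[j] for i in range(k) for j in range(k))
    return 1 - observed / expected


# Hypothetical 4-point ordinal scores; every disagreement is by one category only.
reader_a = [1, 1, 2, 2, 3, 3, 4, 4]
reader_b = [1, 2, 2, 3, 3, 4, 4, 4]

for scheme in ("unweighted", "linear", "quadratic"):
    print(scheme, round(weighted_kappa(reader_a, reader_b, [1, 2, 3, 4], scheme), 3))
```

For these ratings the three schemes give 0.5, 0.7, and 0.85, respectively. The same data therefore yield noticeably different values depending on the weighting, so a reported "kappa" for ordinal data is hard to interpret unless the scheme is stated.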
Of the 21 studies that reported an analysis of continuous data, one study used Pearson's correlation coefficient instead of the recommended methods. The 20 other studies used the recommended methods, including reliability parameters alone (n = 13, 65%), agreement parameters alone (n = 2, 10%), and both reliability and agreement parameters (n = 5, 25%). Of the 17 studies that used ICC, 11 studies (64.7%) did not report the ICC model, and 12 studies (70.6%) did not explain the assumptions made for the ICC.
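The practical consequence of leaving the ICC model unreported can be sketched as follows. The code computes three commonly distinguished single-measurement forms in the Shrout and Fleiss notation: ICC(1,1) (one-way random), ICC(2,1) (two-way random, absolute agreement), and ICC(3,1) (two-way mixed, consistency). Mapping these forms onto what the article calls the "model" and "assumption" is an assumption of this sketch, and both the function name and the measurements are hypothetical.

```python
def icc_forms(x):
    """Single-measurement ICCs from a subjects-by-raters table of measurements."""
    n, k = len(x), len(x[0])
    grand = sum(map(sum, x)) / (n * k)
    row_means = [sum(r) / k for r in x]
    col_means = [sum(r[j] for r in x) / n for j in range(k)]

    # Two-way ANOVA sums of squares: subjects (rows), raters (columns), residual.
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((v - grand) ** 2 for r in x for v in r)
    ss_err = ss_total - ss_rows - ss_cols

    msr = ss_rows / (n - 1)                    # between-subject mean square
    msc = ss_cols / (k - 1)                    # between-rater mean square
    mse = ss_err / ((n - 1) * (k - 1))         # residual mean square
    msw = (ss_cols + ss_err) / (n * (k - 1))   # within-subject mean square

    return {
        "ICC(1,1)": (msr - msw) / (msr + (k - 1) * msw),
        "ICC(2,1)": (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n),
        "ICC(3,1)": (msr - mse) / (msr + (k - 1) * mse),
    }


# Hypothetical data: reader B always measures 10 units higher than reader A.
ratings = [[10.0, 20.0], [20.0, 30.0], [30.0, 40.0], [40.0, 50.0], [50.0, 60.0]]

for form, value in icc_forms(ratings).items():
    print(form, round(value, 3))
```

With a constant offset between the two readers, the consistency form ICC(3,1) equals 1 because the readers rank the subjects identically, whereas ICC(2,1) is about 0.833 and ICC(1,1) about 0.818 because the systematic difference counts against absolute agreement. A reader of a paper cannot tell which of these was reported unless the model and assumption are described.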
Of the 13 studies that used reliability parameters alone, ten studies properly used and interpreted the analysis for the study purpose and conclusion, whereas three studies (23.1%) inappropriately considered the reliability parameters as if they were agreement parameters.
Among the 15 studies that used reproducibility or repeatability, three studies did not use them accurately, with two studies incorrectly using reproducibility instead of repeatability and one study incorrectly using repeatability instead of reproducibility.
DISCUSSION
In our study, approximately half of the DTA studies did not include a reliability analysis when it was deemed necessary. Most of the reliability studies seem to have selected the proper statistical methods for the analysis. However, descriptions of further details of the statistical methods, including the weighting method for weighted kappa and the specific model and assumption for ICC, were generally poor. This study is limited in that we analyzed a single peer-reviewed journal and did not have specific data from other journals. However, according to the current authors' experience, other radiology journals seem to have similar trends. Another notable observation was that studies used reliability parameters more frequently than agreement parameters for analyzing the reliability of continuous data, and a small but notable fraction (23.1%) of studies imprecisely interpreted the reliability parameters. Lastly, the distinction between repeatability and reproducibility was not perfect. These weaknesses found in the published papers indicate the areas requiring improvement in the future.
The importance of reporting reliability along with accuracy needs to be further emphasized because these two parameters are necessary complementary parameters of technical performance and clinical utility for an imaging biomarker (110). It is reassuring that the published studies overall selected the proper methods for reliability analysis. For those investigators who are not familiar with the statistical methods, the table of suggested methods we made for this study (Table 1) could be a useful reference as it succinctly summarizes the well-thought-out methodological guides by the RSNA-QIBA and COSMIN initiative (4, 6, 107, 108). Regarding the suboptimal reporting of the details of the statistical methods, some user-friendly software programs for statistical analysis, which authors frequently cite as having been used, include these details as optional parameters and report them in their output (Fig. 1). Paying closer attention to these features would facilitate reporting them more clearly and would also help investigators to select the most appropriate statistical analysis. The use of agreement parameters, when applicable, should also be encouraged more strongly. It was reported that agreement parameters were often neglected in medical research studies (108), as was also seen in our study. Among these parameters, the repeatability coefficient (RC) is particularly important as it is the smallest detectable change based on the intrinsic technical uncertainties of a quantitative measurement method, and its importance is highlighted by the RSNA-QIBA (6, 108). One of the reasons why the agreement parameters are underutilized compared with reliability parameters may be the lack of readily available user-friendly software programs, except for the Bland-Altman analysis.
In this regard, we have developed a web calculator to compute RC and its 95% confidence interval for two or more repeat measurements of a continuous parameter (available at http://datasharing.aim-
Fig. 1
Display of detailed options associated with statistical tests used for reliability analysis in some user-friendly software programs.
A. Selection of weighting method to calculate weighted kappa with MedCalc Version 17.6 (MedCalc Software BVBA; https://www.medcalc.org). B. Selection of model and assumption to calculate ICC with IBM SPSS Statistics for Windows Version 21 (IBM Corp.). C. Selection of model and assumption to calculate ICC with MedCalc Version 17.6 (MedCalc Software BVBA). This software program does not distinguish between random and fixed effects models. ICC = intraclass correlation coefficient
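As a brief worked note on why the RC discussed above serves as the smallest detectable change, the following sketch gives the commonly used form of the calculation. The notation is introduced here for illustration and is not from the article: Y_{i1} and Y_{i2} denote two repeated measurements of subject i under repeatability conditions, and \sigma_w denotes the within-subject standard deviation. The result assumes normally distributed measurement errors with a constant \sigma_w and no true change between the two measurements.

```latex
\begin{aligned}
Y_{i1} - Y_{i2} &\sim \mathcal{N}\!\left(0,\; 2\sigma_w^{2}\right), \\
\mathrm{RC} &= 1.96\,\sqrt{2\sigma_w^{2}} \;\approx\; 2.77\,\sigma_w .
\end{aligned}
```

Under these assumptions, the absolute difference between two repeated measurements is expected to be smaller than the RC in 95% of cases, so an observed change larger than the RC is unlikely to be explained by measurement uncertainty alone. For example, a within-subject standard deviation of 2 units corresponds to an RC of about 5.5 units.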
This study has limitations. First, the eligible articles were selected from a single journal; therefore, generalizability could be an issue. Nevertheless, the journal, the Korean Journal of Radiology, is a representative general journal in the radiology/medical imaging field, ranked 53rd out of 126 journals in the field according to the 2016 Journal Citation Reports by Clarivate Analytics. Given its rank and coverage of topics, the Korean Journal of Radiology may be a suitable litmus test for journals in general in the radiology/medical imaging field. Second, as we focused on the quality of the reporting of the statistical analysis, our results do not necessarily reflect the overall reporting quality or quality of the research.
In conclusion, the quality of reporting the reliability analysis of a diagnostic test can be improved through greater attention to the importance of reporting the reliability of a test, more thorough description of the related statistical methods, efforts not to neglect agreement parameters, and a clearer distinction between reproducibility and repeatability. Some of the tips discussed in this article, including the software tool to calculate the RC, may be helpful.
This study was supported by a grant from the Korean Health Technology R&D Project, Ministry of Health & Welfare, Republic of Korea (HI17C1862).
Acknowledgments
We thank the following editorial board members of the Korean Journal of Radiology for their help with the literature analysis:
Jung Hwan Baek, MD, PhD (University of Ulsan, Korea), Joon Young Choi, MD, PhD (Sungkyunkwan University, Korea), Boo-Kyung Han, MD, PhD (Sungkyunkwan University, Korea), Chang Hee Lee, MD, PhD (Korea University, Korea), Hyun-Ju Lee, MD, PhD (Seoul National University, Korea), Jeong Min Lee, MD, PhD (Seoul National University, Korea), Won-Jin Moon, MD, PhD (Konkuk University, Korea), Deuk Jae Sung, MD, PhD (Korea University, Korea), Young Cheol Yoon, MD, PhD (Sungkyunkwan University, Korea)
References
- Levine D, Bankier AA, Halpern EF. Submissions to radiology: our top 10 list of statistical errors. Radiology 2009;253:288–290.
- Gallo L, Hua N, Mercuri M, Silveira A, Worster A. Best Evidence in Emergency Medicine (BEEM; beem.ca). Adherence to standards for reporting diagnostic accuracy in emergency medicine research. Acad Emerg Med. 2017 Jun 16; [doi: 10.1111/acem.13233][Epub].
- Grob AT, van der Vaart LR, Withagen MI, van der Vaart CH. The quality of reporting of diagnostic accuracy studies in pelvic floor transperineal three-dimensional ultrasound: a systematic review. Ultrasound Obstet Gynecol. 2016 Dec 21; [doi: 10.1002/uog.17390][Epub].
- Hong PJ, Korevaar DA, McGrath TA, Ziai H, Frank R, Alabousi M, et al. Reporting of imaging diagnostic accuracy studies with focus on MRI subgroup: Adherence to STARD 2015. J Magn Reson Imaging. 2017 Jun 22; [doi: 10.1002/jmri.25797][Epub].
- Lee EK, Choi SH, Yun TJ, Kang KM, Kim TM, Lee SH, et al. Prediction of response to concurrent chemoradiotherapy with temozolomide in glioblastoma: application of immediate post-operative dynamic susceptibility contrast and diffusion-weighted MR imaging. Korean J Radiol 2015;16:1341–1348.
- Luczyńska E, Heinze-Paluchowska S, Dyczek S, Blecharz P, Rys J, Reinfuss M. Contrast-enhanced spectral mammography: comparison with conventional mammography and histopathology in 152 women. Korean J Radiol 2014;15:689–696.
- Kim YP, Kannengiesser S, Paek MY, Kim S, Chung TS, Yoo YH, et al. Differentiation between focal malignant marrow-replacing lesions and benign red marrow deposition of the spine with T2*-corrected fat-signal fraction map using a three-echo volume interpolated breath-hold gradient echo Dixon sequence. Korean J Radiol 2014;15:781–791.
- Lee DH, Lee JM, Klotz E, Kim SJ, Kim KW, Han JK, et al. Detection of recurrent hepatocellular carcinoma in cirrhotic liver after transcatheter arterial chemoembolization: value of quantitative color mapping of the arterial enhancement fraction of the liver. Korean J Radiol 2013;14:51–60.
- Lee GY, Lee JW, Choi SW, Lim HJ, Sun HY, Kang Y, et al. MRI inter-reader and intra-reader reliabilities for assessing injury morphology and posterior ligamentous complex integrity of the spine according to the thoracolumbar injury classification system and severity score. Korean J Radiol 2015;16:889–898.
- Seok JH, Choi HS, Jung SL, Ahn KJ, Kim MJ, Shin YS, et al. Artificial luminal narrowing on contrast-enhanced magnetic resonance angiograms on an occasion of stent-assisted coiling of intracranial aneurysm: in vitro comparison using two different stents with variable imaging parameters. Korean J Radiol 2012;13:550–556.