Skip to main content
Top
Published in: BMC Medical Research Methodology 1/2013

Open Access 01-12-2013 | Research article

A comparison of Cohen’s Kappa and Gwet’s AC1 when calculating inter-rater reliability coefficients: a study conducted with personality disorder samples

Authors: Nahathai Wongpakaran, Tinakon Wongpakaran, Danny Wedding, Kilem L Gwet

Published in: BMC Medical Research Methodology | Issue 1/2013

Login to get access

Abstract

Background

Rater agreement is important in clinical research, and Cohen’s Kappa is a widely used method for assessing inter-rater reliability; however, there are well documented statistical problems associated with the measure. In order to assess its utility, we evaluated it against Gwet’s AC1 and compared the results.

Methods

This study was carried out across 67 patients (56% males) aged 18 to 67, with a mean SD of 44.13 ± 12.68 years. Nine raters (7 psychiatrists, a psychiatry resident and a social worker) participated as interviewers, either for the first or the second interviews, which were held 4 to 6 weeks apart. The interviews were held in order to establish a personality disorder (PD) diagnosis using DSM-IV criteria. Cohen’s Kappa and Gwet’s AC1 were used and the level of agreement between raters was assessed in terms of a simple categorical diagnosis (i.e., the presence or absence of a disorder). Data were also compared with a previous analysis in order to evaluate the effects of trait prevalence.

Results

Gwet’s AC1 was shown to have higher inter-rater reliability coefficients for all the PD criteria, ranging from .752 to 1.000, whereas Cohen’s Kappa ranged from 0 to 1.00. Cohen’s Kappa values were high and close to the percentage of agreement when the prevalence was high, whereas Gwet’s AC1 values appeared not to change much with a change in prevalence, but remained close to the percentage of agreement. For example a Schizoid sample revealed a mean Cohen’s Kappa of .726 and a Gwet’s AC1of .853 , which fell within the different level of agreement according to criteria developed by Landis and Koch, and Altman and Fleiss.

Conclusions

Based on the different formulae used to calculate the level of chance-corrected agreement, Gwet’s AC1 was shown to provide a more stable inter-rater reliability coefficient than Cohen’s Kappa. It was also found to be less affected by prevalence and marginal probability than that of Cohen’s Kappa, and therefore should be considered for use with inter-rater reliability analysis.
Appendix
Available only for authorised users
Literature
1.
go back to reference First MB, Gibbon M, Spitzer RL, Williams JBW, Benjamin LS: Structured Clinical Interview for DSM-IV Axis II Personality Disorder (SCID-II). 1997, Washington, DC: merican Psychiatric Press First MB, Gibbon M, Spitzer RL, Williams JBW, Benjamin LS: Structured Clinical Interview for DSM-IV Axis II Personality Disorder (SCID-II). 1997, Washington, DC: merican Psychiatric Press
2.
go back to reference Cohen J: A coefficient of agreement for nominal scales. Educ Psychol Meas. 1960, 20: 37-46. 10.1177/001316446002000104.CrossRef Cohen J: A coefficient of agreement for nominal scales. Educ Psychol Meas. 1960, 20: 37-46. 10.1177/001316446002000104.CrossRef
3.
go back to reference Cohen J: Weighted kappa: Nominal scale agreement provision for scaled disagreement or partial credit. Psychol Bull. 1968, 70: 213-220.CrossRefPubMed Cohen J: Weighted kappa: Nominal scale agreement provision for scaled disagreement or partial credit. Psychol Bull. 1968, 70: 213-220.CrossRefPubMed
4.
go back to reference Wongpakaran T, Wongpakaran N, Bookkamana P, Boonyanaruthee V, Pinyopornpanish M, Likhitsathian S, Suttajit S, Srisutadsanavong U: Interrater reliability of Thai version of the Structured Clinical Interview for DSM-IV Axis II Personality Disorders (T-SCID II). J Med Assoc Thai. 2012, 95: 264-269.PubMed Wongpakaran T, Wongpakaran N, Bookkamana P, Boonyanaruthee V, Pinyopornpanish M, Likhitsathian S, Suttajit S, Srisutadsanavong U: Interrater reliability of Thai version of the Structured Clinical Interview for DSM-IV Axis II Personality Disorders (T-SCID II). J Med Assoc Thai. 2012, 95: 264-269.PubMed
5.
go back to reference Dreessen L, Arntz A: Short-interval test-retest interrater reliability of the Structured Clinical Interview for DSM-III-R personality disorders (SCID-II) in outpatients. J Pers Disord. 1998, 12: 138-148. 10.1521/pedi.1998.12.2.138.CrossRefPubMed Dreessen L, Arntz A: Short-interval test-retest interrater reliability of the Structured Clinical Interview for DSM-III-R personality disorders (SCID-II) in outpatients. J Pers Disord. 1998, 12: 138-148. 10.1521/pedi.1998.12.2.138.CrossRefPubMed
6.
go back to reference Weertman A, Arntz A, Dreessen L, van Velzen C, Vertommen S: Short-interval test-retest interrater reliability of the Dutch version of the Structured Clinical Interview for DSM-IV personality disorders (SCID-II). J Pers Disord. 2003, 17: 562-567. 10.1521/pedi.17.6.562.25359.CrossRefPubMed Weertman A, Arntz A, Dreessen L, van Velzen C, Vertommen S: Short-interval test-retest interrater reliability of the Dutch version of the Structured Clinical Interview for DSM-IV personality disorders (SCID-II). J Pers Disord. 2003, 17: 562-567. 10.1521/pedi.17.6.562.25359.CrossRefPubMed
7.
go back to reference Cicchetti DV, Feinstein AR: High agreement but low kappa: II. Resolving the paradoxes. J Clin Epidemiol. 1990, 43: 551-558. 10.1016/0895-4356(90)90159-M.CrossRefPubMed Cicchetti DV, Feinstein AR: High agreement but low kappa: II. Resolving the paradoxes. J Clin Epidemiol. 1990, 43: 551-558. 10.1016/0895-4356(90)90159-M.CrossRefPubMed
8.
go back to reference Di Eugenio B, Glass M: The Kappa Statistic: A Second Look. Comput Linguist. 2004, 30: 95-101. 10.1162/089120104773633402.CrossRef Di Eugenio B, Glass M: The Kappa Statistic: A Second Look. Comput Linguist. 2004, 30: 95-101. 10.1162/089120104773633402.CrossRef
9.
go back to reference Gwet KL: Handbook of Inter-Rater Reliability. The Definitive Guide to Measuring the Extent of Agreement Among Raters. 2010, Gaithersburg, MD 20886–2696, USA: Advanced Analytics, LLC, 2 Gwet KL: Handbook of Inter-Rater Reliability. The Definitive Guide to Measuring the Extent of Agreement Among Raters. 2010, Gaithersburg, MD 20886–2696, USA: Advanced Analytics, LLC, 2
10.
go back to reference Gwet KL: Computing inter-rater reliability and its variance in the presence of high agreement. Br J Math Stat Psychol. 2008, 61: 29-48. 10.1348/000711006X126600.CrossRefPubMed Gwet KL: Computing inter-rater reliability and its variance in the presence of high agreement. Br J Math Stat Psychol. 2008, 61: 29-48. 10.1348/000711006X126600.CrossRefPubMed
11.
go back to reference Kittirattanapaiboon P, Khamwongpin M: The Validity of the Mini International Neuropsychiatric Interview (M.I.N.I.)-ThaiVersion. Journal of Mental Health of Thailand. 2005, 13: 126-136. Kittirattanapaiboon P, Khamwongpin M: The Validity of the Mini International Neuropsychiatric Interview (M.I.N.I.)-ThaiVersion. Journal of Mental Health of Thailand. 2005, 13: 126-136.
14.
go back to reference Day FC, Schriger DL, Annals Of Emergency Medicine Journal Club: A consideration of the measurement and reporting of interrater reliability: answers to the July 2009 Journal Club questions. Ann Emerg Med. 2009, 54: 843-853. 10.1016/j.annemergmed.2009.07.013.CrossRefPubMed Day FC, Schriger DL, Annals Of Emergency Medicine Journal Club: A consideration of the measurement and reporting of interrater reliability: answers to the July 2009 Journal Club questions. Ann Emerg Med. 2009, 54: 843-853. 10.1016/j.annemergmed.2009.07.013.CrossRefPubMed
15.
go back to reference Arntz A, van Beijsterveldt B, Hoekstra R, Hofman A, Eussen M, Sallaerts S: The interrater reliability of a Dutch version of the Structured Clinical Interview for DSM-III-R Personality Disorders. Acta Psychiatr Scand. 1992, 85: 394-400. 10.1111/j.1600-0447.1992.tb10326.x.CrossRefPubMed Arntz A, van Beijsterveldt B, Hoekstra R, Hofman A, Eussen M, Sallaerts S: The interrater reliability of a Dutch version of the Structured Clinical Interview for DSM-III-R Personality Disorders. Acta Psychiatr Scand. 1992, 85: 394-400. 10.1111/j.1600-0447.1992.tb10326.x.CrossRefPubMed
16.
go back to reference Lobbestael J, Leurgans M, Arntz A: Inter-rater reliability of the Structured Clinical Interview for DSM-IV Axis I Disorders (SCID I) and Axis II Disorders (SCID II). Clin Psychol Psychother. 2011, 18: 75-79. 10.1002/cpp.693.CrossRefPubMed Lobbestael J, Leurgans M, Arntz A: Inter-rater reliability of the Structured Clinical Interview for DSM-IV Axis I Disorders (SCID I) and Axis II Disorders (SCID II). Clin Psychol Psychother. 2011, 18: 75-79. 10.1002/cpp.693.CrossRefPubMed
17.
go back to reference Kongerslev M, Moran P, Bo S, Simonsen E: Screening for personality disorder in incarcerated adolescent boys: preliminary validation of an adolescent version of the standardised assessment of personality - abbreviated scale (SAPAS-AV). BMC Psychiatry. 2012, 12: 94-10.1186/1471-244X-12-94.CrossRefPubMedPubMedCentral Kongerslev M, Moran P, Bo S, Simonsen E: Screening for personality disorder in incarcerated adolescent boys: preliminary validation of an adolescent version of the standardised assessment of personality - abbreviated scale (SAPAS-AV). BMC Psychiatry. 2012, 12: 94-10.1186/1471-244X-12-94.CrossRefPubMedPubMedCentral
18.
go back to reference Chan YH: Biostatistics 104: correlational analysis. Singapore Med J. 2003, 44: 614-619.PubMed Chan YH: Biostatistics 104: correlational analysis. Singapore Med J. 2003, 44: 614-619.PubMed
19.
go back to reference Hartling L, Bond K, Santaguida PL, Viswanathan M, Dryden DM: Testing a tool for the classification of study designs in systematic reviews of interventions and exposures showed moderate reliability and low accuracy. J Clin Epidemiol. 2011, 64: 861-871. 10.1016/j.jclinepi.2011.01.010.CrossRefPubMed Hartling L, Bond K, Santaguida PL, Viswanathan M, Dryden DM: Testing a tool for the classification of study designs in systematic reviews of interventions and exposures showed moderate reliability and low accuracy. J Clin Epidemiol. 2011, 64: 861-871. 10.1016/j.jclinepi.2011.01.010.CrossRefPubMed
20.
go back to reference Hernaez R, Lazo M, Bonekamp S, Kamel I, Brancati FL, Guallar E, Clark JM: Diagnostic accuracy and reliability of ultrasonography for the detection of fatty liver: a meta-analysis. Hepatology. 2011, 54: 1082-1090.CrossRefPubMedPubMedCentral Hernaez R, Lazo M, Bonekamp S, Kamel I, Brancati FL, Guallar E, Clark JM: Diagnostic accuracy and reliability of ultrasonography for the detection of fatty liver: a meta-analysis. Hepatology. 2011, 54: 1082-1090.CrossRefPubMedPubMedCentral
21.
go back to reference Sheehan DV, Sheehan KH, Shytle RD, Janavs J, Bannon Y, Rogers JE, Milo KM, Stock SL, Wilkinson B: Reliability and validity of the Mini International Neuropsychiatric Interview for Children and Adolescents (MINI-KID). J Clin Psychiatry. 2010, 71: 313-326. 10.4088/JCP.09m05305whi.CrossRefPubMed Sheehan DV, Sheehan KH, Shytle RD, Janavs J, Bannon Y, Rogers JE, Milo KM, Stock SL, Wilkinson B: Reliability and validity of the Mini International Neuropsychiatric Interview for Children and Adolescents (MINI-KID). J Clin Psychiatry. 2010, 71: 313-326. 10.4088/JCP.09m05305whi.CrossRefPubMed
22.
go back to reference Ingenhoven TJ, Duivenvoorden HJ, Brogtrop J, Lindenborn A, van den Brink W, Passchier J: Interrater reliability for Kernberg's structural interview for assessing personality organization. J Pers Disord. 2009, 23: 528-534. 10.1521/pedi.2009.23.5.528.CrossRefPubMed Ingenhoven TJ, Duivenvoorden HJ, Brogtrop J, Lindenborn A, van den Brink W, Passchier J: Interrater reliability for Kernberg's structural interview for assessing personality organization. J Pers Disord. 2009, 23: 528-534. 10.1521/pedi.2009.23.5.528.CrossRefPubMed
23.
go back to reference Øiesvold T, Nivison M, Hansen V, Sørgaard KW, Østensen L, Skre I: Classification of bipolar disorder in psychiatric hospital. A prospective cohort study. BMC Psychiatry. 2012, 12: 13-CrossRefPubMed Øiesvold T, Nivison M, Hansen V, Sørgaard KW, Østensen L, Skre I: Classification of bipolar disorder in psychiatric hospital. A prospective cohort study. BMC Psychiatry. 2012, 12: 13-CrossRefPubMed
24.
go back to reference Clement S, Brohan E, Jeffery D, Henderson C, Hatch SL, Thornicroft G: Development and psychometric properties the Barriers to Access to Care Evaluation scale (BACE) related to people with mental ill health. BMC Psychiatry. 2012, 12: 36-10.1186/1471-244X-12-36.CrossRefPubMedPubMedCentral Clement S, Brohan E, Jeffery D, Henderson C, Hatch SL, Thornicroft G: Development and psychometric properties the Barriers to Access to Care Evaluation scale (BACE) related to people with mental ill health. BMC Psychiatry. 2012, 12: 36-10.1186/1471-244X-12-36.CrossRefPubMedPubMedCentral
25.
go back to reference McCoul ED, Smith TL, Mace JC, Anand VK, Senior BA, Hwang PH, Stankiewicz JA, Tabaee A: Interrater agreement of nasal endoscopy in patients with a prior history of endoscopic sinus surgery. Int Forum Allergy Rhinol. 2012, 2: 453-459. 10.1002/alr.21058.CrossRefPubMedPubMedCentral McCoul ED, Smith TL, Mace JC, Anand VK, Senior BA, Hwang PH, Stankiewicz JA, Tabaee A: Interrater agreement of nasal endoscopy in patients with a prior history of endoscopic sinus surgery. Int Forum Allergy Rhinol. 2012, 2: 453-459. 10.1002/alr.21058.CrossRefPubMedPubMedCentral
26.
go back to reference Ansari NN, Naghdi S, Forogh B, Hasson S, Atashband M, Lashgari E: Development of the Persian version of the Modified Modified Ashworth Scale: translation, adaptation, and examination of interrater and intrarater reliability in patients with poststroke elbow flexor spasticity. Disabil Rehabil. 2012, 34: 1843-1847. 10.3109/09638288.2012.665133.CrossRef Ansari NN, Naghdi S, Forogh B, Hasson S, Atashband M, Lashgari E: Development of the Persian version of the Modified Modified Ashworth Scale: translation, adaptation, and examination of interrater and intrarater reliability in patients with poststroke elbow flexor spasticity. Disabil Rehabil. 2012, 34: 1843-1847. 10.3109/09638288.2012.665133.CrossRef
27.
go back to reference Gisev N, Bell JS, Chen TF: Interrater agreement and interrater reliability: Key concepts, approaches, and applications. Res Social Adm Pharm. In press, Gisev N, Bell JS, Chen TF: Interrater agreement and interrater reliability: Key concepts, approaches, and applications. Res Social Adm Pharm. In press,
28.
go back to reference Petzold A, Altintas A, Andreoni L, Bartos A, Berthele A, Blankenstein MA, Buee L, Castellazzi M, Cepok S, Comabella M: Neurofilament ELISA validation. J Immunol Methods. 2010, 352: 23-31. 10.1016/j.jim.2009.09.014.CrossRefPubMed Petzold A, Altintas A, Andreoni L, Bartos A, Berthele A, Blankenstein MA, Buee L, Castellazzi M, Cepok S, Comabella M: Neurofilament ELISA validation. J Immunol Methods. 2010, 352: 23-31. 10.1016/j.jim.2009.09.014.CrossRefPubMed
29.
go back to reference Yusuff KB, Tayo F: Frequency, types and severity of medication use-related problems among medical outpatients in Nigeria. Int J Clin Pharm. 2011, 33: 558-564. 10.1007/s11096-011-9508-z.CrossRefPubMed Yusuff KB, Tayo F: Frequency, types and severity of medication use-related problems among medical outpatients in Nigeria. Int J Clin Pharm. 2011, 33: 558-564. 10.1007/s11096-011-9508-z.CrossRefPubMed
Metadata
Title
A comparison of Cohen’s Kappa and Gwet’s AC1 when calculating inter-rater reliability coefficients: a study conducted with personality disorder samples
Authors
Nahathai Wongpakaran
Tinakon Wongpakaran
Danny Wedding
Kilem L Gwet
Publication date
01-12-2013
Publisher
BioMed Central
Published in
BMC Medical Research Methodology / Issue 1/2013
Electronic ISSN: 1471-2288
DOI
https://doi.org/10.1186/1471-2288-13-61

Other articles of this Issue 1/2013

BMC Medical Research Methodology 1/2013 Go to the issue