Alzheimer’s disease is a growing public health problem, 1,2 and its prevalence is increasing rapidly with the aging of the baby boomer generation. 3 Early diagnosis and treatment can reduce the burden this increase poses to the health care system and society. 4 However, Alzheimer’s disease is often underrecognized in community clinical practice settings 5–7 because the diagnosis can be difficult 8 and may require specialized training. Without fast and reliable screening instruments, it may be difficult for primary care physicians to identify patients who should be referred for a more comprehensive dementia workup.

The clock drawing test is widely used as a screening test for dementia. Neurologists have used clock-drawing and time-telling tests extensively. 9 Several factors contribute to the test’s popularity, including ease of administration and scoring and its evaluation of multiple cognitive domains, 10,11 such as executive functioning. 11–13 Compared with the Mini-Mental State Examination (MMSE), the clock drawing test is thought to have less educational bias 14 and to be better able to detect cognitive decline due to Alzheimer’s disease and other dementias. 15 The clock drawing test has also been advocated over the MMSE as an office screening test for dementia in community clinics 4 and in acute hospital settings. 16 Furthermore, the clock drawing test is suitable for non-English-speaking populations. 14

There are two general clock drawing test scoring approaches, qualitative and quantitative, with varied scoring systems that emphasize different facets of the clock drawing process. Early quantitative scoring systems were validated to distinguish subjects with moderate or severe Alzheimer’s disease from cognitively healthy comparison subjects and were later adapted for use in mild cognitive impairment and mild Alzheimer’s disease. 17–23 Previous studies of objective clock drawing test rating systems identified Alzheimer’s disease with overall diagnostic accuracy ranging from 59% to 85%. 24 However, such diagnostic accuracy has not been found in mild cognitive impairment cohorts, in which sensitivities ranged from 17% to 92%. 24 In a retrospective study comparing several clock drawing test scoring systems, 24 the scoring system of Mendez et al. 19 was found to be the most accurate in distinguishing demented from nondemented individuals, followed closely by the Consortium to Establish a Registry for Alzheimer’s Disease (CERAD) system. 25

Though diagnostically useful, quantitative clock drawing test rating schemes 18–20,22,23 are rarely used in clinical settings, as they take more time and require trained clinical personnel to score. Moreover, when the clock drawing test is used to identify dementia, qualitative ratings by naive judges may be as accurate as, or more accurate than, many quantitative scoring systems. 24 Although it is widely assumed that dementia specialists are more reliable and valid raters than naive judges, no known studies have evaluated the psychometric properties of clock drawing test ratings made by trained clinicians who use the clock drawing test as part of their regular clinical practice. The present study was performed to determine the interrater reliability, sensitivity, and specificity of qualitative clock drawing test ratings made by clinicians specializing in the assessment of patients with dementia. Two qualitative rating approaches were used: a dichotomous rating of impaired versus nonimpaired and a 0–10 ordinal rating scale. A multidisciplinary consensus conference served as the gold standard for dementia diagnosis in the current study.

METHODS

Participants

Archival data were extracted from the Boston University Alzheimer’s Disease Core Center registry, an institutional review board-approved, National Institute on Aging-funded Alzheimer’s disease registry 26–28 that longitudinally follows older adults with and without memory problems. Participants performed the clock drawing test as part of an annual neurological and neuropsychological examination. All participants were at least 55 years old, were English-speaking community dwellers with no history of major psychiatric or neurological illness or head injury involving loss of consciousness, and had adequate auditory and visual acuity to complete the examination. After the data query, there were 506 eligible participants in the Boston University Alzheimer’s Disease Core Center patient/comparison registry who had been diagnosed by a multidisciplinary consensus team (including at least two board-certified neurologists and two neuropsychologists) on the basis of a clinical interview with the participant and an informant, a medical history review, and neurological and neuropsychological examination results.

Of the 506 participants, 168 were diagnosed as cognitively normal comparison subjects; 39 as cognitively normal comparison subjects with cognitive complaints reported by self or study partner (worried comparison subjects); 88 as patients with “probable” mild cognitive impairment; 29,30 106 as patients with “possible” mild cognitive impairment (no complaint of cognitive decline but objective impairment on one or more primary neuropsychological variables); 55 as patients with probable Alzheimer’s disease; and 50 as patients with possible Alzheimer’s disease. 31 Participants diagnosed as comparison subjects (with or without complaints) were included only if they had a Clinical Dementia Rating of 0, 32 an MMSE score ≥26, 33 and no impairment on any primary neuropsychological test variable (i.e., no scores more than 1.5 standard deviations below normative means). Exclusion criteria included dominant-hand hemiparesis or other central or peripheral motor impairments, or visual acuity impairment, that would preclude completion of the clock drawing test. The current study used data from the participants’ most recent registry visit.

Procedures

Trained psychometricians administered the clock drawing test in a standard way, including command (i.e., “I want you to draw the face of a clock, putting in all the numbers where they should go, and set the hands at 10 after 11”) and copy conditions. 27 Participants were allowed to make corrections and were permitted a maximum of two attempts to draw the clock. Only the command condition data were used for the current study.

For the purpose of this study, 25 command clocks were randomly selected from each of the six diagnostic strata described above, resulting in the inclusion of 150 clocks from 150 different subjects, with 50 clocks from each of the three primary diagnostic groups (i.e., comparison, mild cognitive impairment, Alzheimer’s disease). The clocks were rated independently by four board-certified neurologists and a neurology nurse practitioner, all of whom specialize in dementia. Raters were blinded to participant diagnostic and demographic information. Ratings were made on a binary scale (normal/abnormal) and an ordinal scale (0–10, where 0 signified no impairment and 10 signified complete impairment). The ordinal ratings were examined for potential outliers, and when widely discordant interexaminer ratings (score differences >5; a threshold selected arbitrarily) were found, these clocks were rerated by the original raters. The rerated clock drawing test scores were used in the analyses.
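As an illustration only, the discordance check described above could be implemented as in the following Python sketch (the study itself used SPSS and SAS and published no code); the ratings array and its values are hypothetical, and only the >5-point threshold comes from the text.

import numpy as np

# Hypothetical ordinal ratings: 150 clocks (rows) by 5 raters (columns), scored 0-10.
rng = np.random.default_rng(0)
ratings = rng.integers(0, 11, size=(150, 5))

# Flag clocks whose ratings span more than 5 points between the most lenient and the
# most severe rater, mirroring the arbitrarily selected discordance threshold above.
spread = ratings.max(axis=1) - ratings.min(axis=1)
discordant_clocks = np.where(spread > 5)[0]

print(f"{len(discordant_clocks)} clocks flagged for rerating:", discordant_clocks)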

Statistical Analyses

All analyses were performed with SPSS version 16 (SPSS Inc., Chicago) or SAS version 9.1 (SAS Institute, Cary, N.C.). A one-way analysis of variance (ANOVA) assessed between-group differences in age, education, MMSE score, and Geriatric Depression Scale score. Significant findings were followed up with Tukey-Kramer post hoc tests to determine specific group differences. Gender and racial differences among the groups were analyzed with the chi-square test of independence.
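The analyses above were run in SPSS and SAS. For readers working in open-source tools, a rough equivalent in Python is sketched below; all data values and group sizes are invented for illustration, and statsmodels’ pairwise_tukeyhsd is used as a stand-in for the Tukey-Kramer procedure (it applies the Tukey-Kramer adjustment when group sizes are unequal).

import numpy as np
from scipy import stats
from statsmodels.stats.multicomp import pairwise_tukeyhsd

rng = np.random.default_rng(0)

# Hypothetical values of one demographic variable (e.g., age) per diagnostic group.
comparison = rng.normal(74, 7, 50)
mci = rng.normal(76, 7, 50)
ad = rng.normal(78, 7, 50)

# One-way ANOVA for a between-group difference.
f_stat, p_value = stats.f_oneway(comparison, mci, ad)
print(f"ANOVA: F = {f_stat:.2f}, p = {p_value:.4f}")

# Post hoc pairwise comparisons (Tukey-Kramer adjustment for unequal group sizes).
values = np.concatenate([comparison, mci, ad])
groups = ["comparison"] * 50 + ["MCI"] * 50 + ["AD"] * 50
print(pairwise_tukeyhsd(values, groups))

# Chi-square test of independence for gender by diagnostic group (hypothetical counts).
contingency = np.array([[20, 22, 25],    # male
                        [30, 28, 25]])   # female
chi2, p, dof, _ = stats.chi2_contingency(contingency)
print(f"Chi-square: chi2 = {chi2:.2f}, df = {dof}, p = {p:.4f}")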

Estimates of rater agreement were calculated for both the binary and the ordinal rating scales. For the binary ratings, kappa statistics were computed. 34–36 Because there were five raters, the multiple agreement function in SAS was used for the binary kappa calculations. For the ordinal ratings, Kendall’s intraclass correlation coefficient (ICC) of concordance was computed using the intraclass correlation functions in SAS, and Spearman rank correlations were generated between individual raters. Agreement for the ordinal ratings was also evaluated by calculating the absolute score differences between raters.
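The agreement statistics described above were computed in SAS. One possible way to approximate them outside SAS is sketched below; this is not the study’s code. Fleiss’ kappa stands in for the SAS multiple-rater kappa, Kendall’s coefficient of concordance (W, computed here without a tie correction) stands in for the concordance statistic, and all rating data are simulated.

import numpy as np
from itertools import combinations
from scipy import stats
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

rng = np.random.default_rng(1)

# Simulated ratings: 150 clocks (rows) by 5 clinicians (columns).
binary = rng.integers(0, 2, size=(150, 5))     # 0 = intact, 1 = impaired
ordinal = rng.integers(0, 11, size=(150, 5))   # 0-10 impairment scale

# Multi-rater agreement on the binary ratings (Fleiss' kappa as a stand-in for the
# SAS multi-rater kappa used in the study).
table, _ = aggregate_raters(binary)
print(f"Fleiss' kappa (binary): {fleiss_kappa(table):.2f}")

# Kendall's coefficient of concordance (W) across the five raters' ordinal scores.
n_subjects, n_raters = ordinal.shape
ranks = np.apply_along_axis(stats.rankdata, 0, ordinal)   # rank clocks within each rater
rank_sums = ranks.sum(axis=1)
s = ((rank_sums - rank_sums.mean()) ** 2).sum()
w = 12 * s / (n_raters ** 2 * (n_subjects ** 3 - n_subjects))
print(f"Kendall's W (ordinal): {w:.2f}")

# Pairwise Spearman rank correlations between individual raters.
for i, j in combinations(range(n_raters), 2):
    rho, _ = stats.spearmanr(ordinal[:, i], ordinal[:, j])
    print(f"Raters {i + 1} vs {j + 1}: rho = {rho:.2f}")

# Agreement expressed as the maximum absolute score difference per clock.
abs_diff = ordinal.max(axis=1) - ordinal.min(axis=1)
print(np.bincount(abs_diff, minlength=11))     # counts of clocks at each 0-10 difference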

To examine the diagnostic utility (based on the three primary diagnostic groups) of the clinicians’ ratings, sensitivity, specificity, and positive likelihood ratios were calculated for both the dichotomous and ordinal ratings. For the dichotomous ratings, two methods were used to summarize the five raters’ data. First, we calculated individual sensitivity and specificity statistics for each rater and then created an average for all five raters. We also created a single summary rating of “impaired” versus “intact” based on whether the majority of the raters rated the clock as impaired or intact. For the ordinal ratings, an average was calculated to summarize the five clinicians’ ratings; then sensitivities and specificities were calculated for this average rating against the true diagnostic classification. The ordinal ratings were used to examine the cutoff score that optimized sensitivity and specificity.
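A minimal sketch of how the diagnostic utility summaries described above could be computed; the truth labels, rater calls, and averaged ordinal ratings below are simulated, while the majority rule and the cutoff scan follow the description in the text.

import numpy as np

rng = np.random.default_rng(2)

# Simulated data for one pairwise comparison (e.g., Alzheimer's disease vs. comparison):
# truth = 1 for patients, 0 for comparison subjects.
truth = np.repeat([1, 0], 50)
binary = rng.integers(0, 2, size=(100, 5))     # five raters' impaired(1)/intact(0) calls
mean_ordinal = rng.uniform(0, 10, size=100)    # average of the five clinicians' 0-10 ratings


def sens_spec(pred, truth):
    """Sensitivity, specificity, and positive likelihood ratio for binary predictions."""
    tp = np.sum((pred == 1) & (truth == 1))
    fn = np.sum((pred == 0) & (truth == 1))
    tn = np.sum((pred == 0) & (truth == 0))
    fp = np.sum((pred == 1) & (truth == 0))
    sens = tp / (tp + fn)
    spec = tn / (tn + fp)
    lr_pos = sens / (1 - spec) if spec < 1 else np.inf
    return sens, spec, lr_pos


# Summary 1: average the five raters' individual sensitivities and specificities.
per_rater = np.array([sens_spec(binary[:, r], truth)[:2] for r in range(5)])
print("Mean rater sensitivity, specificity:", per_rater.mean(axis=0))

# Summary 2: a single majority-vote rating (impaired if at least 3 of 5 raters say impaired).
majority = (binary.sum(axis=1) >= 3).astype(int)
print("Majority vote (sens, spec, LR+):", sens_spec(majority, truth))

# Ordinal ratings: scan cutoffs to find the one that best balances sensitivity and specificity.
for cutoff in range(1, 11):
    pred = (mean_ordinal >= cutoff).astype(int)
    sens, spec, lr = sens_spec(pred, truth)
    print(f"cutoff >= {cutoff}: sens = {sens:.2f}, spec = {spec:.2f}, LR+ = {lr:.2f}")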

RESULTS

Details of the demographic and clinical characteristics are shown in Table 1. There were significant between-group differences in age, gender, education, and MMSE score. There were no significant between-group differences in Geriatric Depression Scale score.

TABLE 1. Participant Demographic Characteristics

The clinicians’ interrater reliability was “almost perfect” 34–37 for the ordinal system (ICC=0.92) and “substantial” 37 for the dichotomous system (kappa=0.85). The absolute difference scores between ordinal ratings are presented in Table 2. The five clinicians’ ratings did not differ by more than three ordinal-scale units for 69% of the clocks. An absolute score difference of 5 or less was observed for 91% of the clocks, and only five clocks (3%) had an absolute difference of more than seven units on the ordinal scale. Examples of participant clocks that were rated similarly and dissimilarly are presented in Figure 1. Spearman rank correlations between individual raters ranged from 0.64 to 0.82 for the ordinal scale (Table 3).

TABLE 2. Absolute Difference of Scores for Individual Clock Ratings by Dementia Specialists
FIGURE 1. Concordant and Discordant Clinician Clock Drawing Test Ratings
TABLE 3. Spearman Correlations for Clock Rating (0–10) Between Clinicians

Based on the average of the five clinicians’ dichotomous ratings, the clinicians differentiated comparison and Alzheimer’s disease participants with a sensitivity of 0.75 and a specificity of 0.81. By comparison, the dichotomous rating based on the majority of raters had a sensitivity of 0.84 and a specificity of 0.84. In differentiating comparison subjects from mild cognitive impairment participants, the average sensitivity of the five raters’ dichotomous classifications was 0.47, and the average specificity was 0.81. Using the majority rating yielded a sensitivity of 0.50 and a specificity of 0.84 (Table 4).

TABLE 4. Sensitivity and Specificity of Dichotomous Ratings of Impairment

The sensitivities, specificities, and positive likelihood ratios for several cutoff scores on the ordinal scale are presented in Table 5. For three of the four comparisons (i.e., Alzheimer’s disease versus comparison, mild cognitive impairment versus comparison, and Alzheimer’s disease + mild cognitive impairment versus comparison), a cutoff score of 2 or greater maximized sensitivity and specificity for differentiating diagnostic groups. For differentiating Alzheimer’s disease from mild cognitive impairment, a cutoff score of 4 or greater maximized sensitivity and specificity. As expected, higher ordinal ratings (i.e., more impaired clocks) were associated with a greater likelihood of a diagnosis of either Alzheimer’s disease or mild cognitive impairment, but at the expense of sensitivity.

TABLE 5. Sensitivity and Specificity for Various Cutoffs on the Ordinal Rating Scale (0–10)

DISCUSSION

Our study sought to investigate the interrater reliability of qualitative clock drawing test ratings made by five dementia clinicians at Boston University Medical Center. The clinicians were reliable clock drawing test raters using both dichotomous (impaired versus intact) and ordinal (0–10 impairment scale) ratings. The interrater reliability for the dichotomous system achieved a kappa of 0.85, and the ordinal rating resulted in an intraclass correlation coefficient of 0.92. These statistics represent excellent interrater reliability values and are comparable to those obtained in our recent work comparing several widely used quantitative clock drawing test scoring systems. 27 Our findings demonstrate that in the absence of objective scoring methods, the clock drawing test can be rated reliably across a cognitive severity spectrum by clinicians who specialize in dementia.

Despite these excellent reliability values, there were several individual instances in which clinicians’ ratings were widely disparate. As seen in Table 2, ratings of nine clocks (6%) differed by six or more points on the ordinal scale, even after rerating eliminated errors. Multiple factors may explain why the clinicians applied disparate ratings, including spatial configuration, participant self-corrected errors, and the shape of the clock face, as exemplified in Figure 1. The discrepancy among the raters highlights the difficulty that clinicians face when scoring clocks subjectively.

The present study also examined the accuracy of clinician-rated clock drawing tests in differentiating among cognitively normal, mild cognitive impairment, and Alzheimer’s disease diagnostic categories. Despite the substantial 37 overall agreement between raters, the results demonstrate that the accuracy with which qualitative ratings can differentiate diagnostic group membership was less robust. Although Alzheimer’s disease patients and comparison subjects could be differentiated with a relatively high degree of accuracy, the ratings were considerably less useful in distinguishing mild cognitive impairment from comparison subjects (less sensitive) or Alzheimer’s disease from mild cognitive impairment (less specific). Therefore, while the clock drawing test may be a good screening instrument for Alzheimer’s disease, it may not be a sensitive instrument for screening for mild cognitive impairment, especially if clinicians use a dichotomous rating. When screening for mild cognitive impairment, an abnormal clock drawing test in isolation (based on subjective clinician rating) may result in a large number of false positive or false negative errors. For the diagnosis of mild cognitive impairment, sensitivity and specificity were somewhat improved by using a subjective ordinal rating scale with three or more points rather than the dichotomous scale. We therefore suggest using a 3-point subjective ordinal clock drawing test rating scale (e.g., “normal,” “suspicious,” and “impaired”), rather than a dichotomous rating, to improve detection of mild cognitive impairment.

The clinicians who served as raters for the current study are specialists in diagnosing dementia and work in a tertiary care clinical setting and research center. Therefore, these clinicians may represent a more reliable and diagnostically accurate group than nonspecialists in the community, and their expertise in dementia assessment may limit the extent to which the findings can be generalized to other settings and clinicians. Another limitation is that some clinicians were also members of the consensus team that formulated the original diagnostic impressions for our participant cohort. This overlap raises the possibility that the clinicians may not have been completely blinded to diagnostic group membership for the clocks being reexamined, if they remembered clocks presented at prior consensus conference meetings. However, this overlap would have affected only the diagnostic utility statistics and not the interrater reliability, which was the primary focus of the current study. Finally, we excluded individuals with dementias other than Alzheimer’s disease, individuals with visual impairment, and non-English speakers, which may have increased the diagnostic utility statistics while limiting the generalizability of our study.

Although the clock drawing test has many advantages as a screening instrument in the assessment of patients with suspected dementia, it is often used qualitatively, or subjectively, in clinical settings. As such, the reliability of these qualitative ratings between clinicians is brought into question. This is the first study to investigate the concordance among clock drawing test ratings by dementia specialists. The current study results indicate that dementia specialists can reliably rate clock drawing test performance using two different qualitative rating approaches. In contrast, the findings do not support the use of the clock drawing test as a stand-alone screening instrument, as the classification accuracy statistics presented suggest that in mild cognitive impairment, the clinician ratings may be susceptible to both false positive and false negative errors. However, the clinicians’ ratings had excellent sensitivity and specificity for distinguishing healthy comparison subjects from those with probable and possible mild Alzheimer’s disease. Future studies should compare the reliability and diagnostic accuracy of qualitative methods to empirically validated quantitative scoring systems.

Received August 21, 2008; revised January 5, 2009; accepted January 12, 2009. The authors are affiliated with the Alzheimer’s Disease Center at Boston University School of Medicine in Boston. Address correspondence to Anil K. Nair, M.D., Assistant Professor of Neurology, 715 Albany St., B7800, Boston, MA 02118; [email protected] (e-mail).

Copyright © 2010 American Psychiatric Publishing, Inc.

References

1. Hirtz D, Thurman DJ, Gwinn-Hardy K, et al: How common are the “common” neurologic disorders? Neurology 2007; 68:326–337

2. Brookmeyer R, Johnson E, Ziegler-Grahamm K, et al: Forecasting the global prevalence and burden of Alzheimer’s disease. Alzheimers Dement 2007; 3:S168

3. Alzheimer’s Association: Alzheimer’s disease facts and figures. Alzheimers Dement 2008; 4:110

4. Sager MA, Hermann BP, La Rue A, et al: Screening for dementia in community-based memory clinics. WMJ 2006; 105:25–29

5. Valcour VG, Masaki KH, Curb JD, et al: The detection of dementia in the primary care setting. Arch Intern Med 2000; 160:2964–2968

6. Borson S, Scanlan JM, Watanabe J, et al: Improving identification of cognitive impairment in primary care. Int J Geriatr Psychiatry 2006; 21:349–355

7. Chodosh J, Petitti DB, Elliott M, et al: Physician recognition of cognitive impairment: evaluating the need for improvement. J Am Geriatr Soc 2004; 52:1051–1059

8. Mok W, Chow TW, Zheng L, et al: Clinicopathological concordance of dementia diagnoses by community versus tertiary care clinicians. Am J Alzheimers Dis Other Demen 2004; 19:161–165

9. Eddy JR, Sriram S: Clock-drawing and telling time as diagnostic aids. Neurology 1977; 27:595

10. Cahn-Weiner DA, Williams K, Grace J, et al: Discrimination of dementia with Lewy bodies from Alzheimer’s disease and Parkinson’s disease using the clock drawing test. Cogn Behav Neurol 2003; 16:85–92

11. Royall DR, Mulroy AR, Chiodo LK, et al: Clock drawing is sensitive to executive control: a comparison of six methods. J Gerontol B Psychol Sci Soc Sci 1999; 54:P328–333

12. Juby A, Tench S, Baker V: The value of clock drawing in identifying executive cognitive dysfunction in people with a normal Mini-Mental State Examination score. CMAJ 2002; 167:859–864

13. Lavery LL, Starenchak SM, Flynn WB, et al: The clock drawing test is an independent predictor of incident use of 24-hour care in a retirement community. J Gerontol A Biol Sci Med Sci 2005; 60:928–932

14. Parker C, Philp I: Screening for cognitive impairment among older people in black and minority ethnic groups. Age Ageing 2004; 33:447–452

15. Ferrucci L, Cecchi F, Guralnik JM, et al: Does the clock drawing test predict cognitive decline in older persons independent of the Mini-Mental State Examination? The FINE Study Group. Finland, Italy, The Netherlands Elderly. J Am Geriatr Soc 1996; 44:1326–1331

16. Death J, Douglas A, Kenny RA: Comparison of clock drawing with Mini-Mental State Examination as a screening test in elderly acute hospital admissions. Postgrad Med J 1993; 69:696–700

17. Shulman KI, Shedletsky R, Silver IL: The challenge of time: clock-drawing and cognitive function in the elderly. Int J Geriatr Psychiatry 1986; 1:135–140

18. Sunderland T, Hill JL, Mellow AM, et al: Clock drawing in Alzheimer’s disease: a novel measure of dementia severity. J Am Geriatr Soc 1989; 37:725–729

19. Mendez MF, Ala T, Underwood KL: Development of scoring criteria for the clock drawing task in Alzheimer’s disease. J Am Geriatr Soc 1992; 40:1095–1099

20. Wolf-Klein GP, Silverstone FA, Levy AP, et al: Screening for Alzheimer’s disease by clock drawing. J Am Geriatr Soc 1989; 37:730–734

21. Rouleau I, Salmon DP, Butters N, et al: Quantitative and qualitative analyses of clock drawings in Alzheimer’s and Huntington’s disease. Brain Cogn 1992; 18:70–87

22. Tuokko H, Hadjistavropoulos T, Miller JA, et al: The clock test: a sensitive measure to differentiate normal elderly from those with Alzheimer’s disease. J Am Geriatr Soc 1992; 40:579–584

23. Lam LC, Chiu HF, Ng KO, et al: Clock-face drawing, reading and setting tests in the screening of dementia in Chinese elderly adults. J Gerontol B Psychol Sci Soc Sci 1998; 53:P353–357

24. Scanlan JM, Brush M, Quijano C, et al: Comparing clock tests for dementia screening: naive judgments vs formal systems—what is optimal? Int J Geriatr Psychiatry 2002; 17:14–21

25. Morris JC, Heyman A, Mohs RC, et al: The Consortium to Establish a Registry for Alzheimer’s Disease (CERAD). Part 1: clinical and neuropsychological assessment of Alzheimer’s disease. Neurology 1989; 39:1159–1165

26. Jefferson AL, Wong S, Bolen E, et al: Cognitive correlates of HVOT test performance differ between individuals with mild cognitive impairment and normal controls. Arch Clin Neuropsychol 2006; 21:405–412

27. Hubbard EJ, Santini V, Blankevoort CG, et al: Clock drawing performance in cognitively normal elderly. Arch Clin Neuropsychol 2008; 23:295–327

28. Ashendorf L, Jefferson AL, O’Connor MK, et al: Trail Making Test errors in normal aging, mild cognitive impairment, and dementia. Arch Clin Neuropsychol 2008; 23:129–137

29. Winblad B, Palmer K, Kivipelto M, et al: Mild cognitive impairment: beyond controversies, toward a consensus: report of the International Working Group on Mild Cognitive Impairment. J Intern Med 2004; 256:240–246

30. Petersen RC, Smith GE, Waring SC, et al: Mild cognitive impairment: clinical characterization and outcome. Arch Neurol 1999; 56:303–308

31. McKhann GM, Drachman D, Folstein MF, et al: Clinical diagnosis of Alzheimer’s disease: report of the NINCDS-ADRDA work group. Neurology 1984; 34:939–944

32. Morris JC: The Clinical Dementia Rating (CDR): current version and scoring rules. Neurology 1993; 43:2412–2414

33. Folstein MF, Folstein SE, McHugh PR: “Mini-Mental State”: a practical method for grading the cognitive state of patients for the clinician. J Psychiatr Res 1975; 12:189–198

34. Fleiss JL, Slakter MJ, Fischman SL, et al: Inter-examiner reliability in caries trials. J Dent Res 1979; 58:604–609

35. Fleiss JL: Statistical Methods for Rates and Proportions. New York, John Wiley & Sons, 2003

36. Kendall MG: Rank Correlation Methods. London, Griffin, 1963

37. Landis JR, Koch GG: The measurement of observer agreement for categorical data. Biometrics 1977; 33:159–174