


Review

Inter-rater reliability of case-note audit: a systematic review


Richard Lilford, Alex Edwards, Alan Girling, Timothy Hofer1, Gian Luca Di Tanna, Jane Petty2, Jon Nicholl3
Department of Public Health & Epidemiology, University of Birmingham, Birmingham, UK; 1Department of Internal Medicine, University of Michigan Medical School, Ann Arbor, MI, USA; 2School of Psychology, University of Birmingham, UK; 3Medical Care Research Unit, University of Sheffield, Sheffield, UK

Objective: The quality of clinical care is often assessed by retrospective examination of case-notes (charts, medical records). Our objective was to determine the inter-rater reliability of case-note audit. Methods: We conducted a systematic review of the inter-rater reliability of case-note audit. Analysis was restricted to 26 papers reporting comparisons of two or three raters making independent judgements about the quality of care. Results: Sixty-six separate comparisons were possible, since some papers reported more than one measurement of reliability. Mean kappa values ranged from 0.32 to 0.70. These may be inflated due to publication bias. Measured reliabilities were found to be higher for case-note reviews based on explicit, as opposed to implicit, criteria and for reviews that focused on outcome (including adverse effects) rather than process errors. We found an association between kappa and the prevalence of errors (poor quality care), suggesting alternatives such as tetrachoric and polychoric correlation coefficients be considered to assess inter-rater reliability. Conclusions: Comparative studies should take into account the relationship between kappa and the prevalence of the events being measured.
Journal of Health Services Research & Policy Vol 12 No 3, 2007: 173–180. © The Royal Society of Medicine Press Ltd 2007

Introduction
Improving the quality and safety of health care has become the focus of much scientific, management and policy effort. In order to establish whether or not a change in policy or management practice is effective, it is necessary to develop metrics of quality.1 Similarly, performance management and quality assurance programmes are based on measurements. In each case, the following two questions can be asked:

- Is the measure valid? i.e. does it measure the underlying construct we wish to examine, namely the quality of care?
- Is it reliable? i.e. what is the intra- and inter-observer variation?

This article is concerned with inter-rater reliability.

Richard Lilford PhD, Professor of Clinical Epidemiology, Alex Edwards PhD, Research Fellow, Alan Girling MA, Senior Research Fellow, Gian Luca Di Tanna MPhil, Research Fellow, Department of Public Health & Epidemiology, Jane Petty BSc, Research Assistant, School of Psychology, University of Birmingham, Birmingham B15 2TT, UK; Timothy Hofer MD, Associate Professor, Department of Internal Medicine, University of Michigan Medical School, Ann Arbor, MI, USA; Jon Nicholl MA, Professor of Health Services Research, Medical Care Research Unit, University of Sheffield, Sheffield, UK. Correspondence to: R.J.Lilford@bham.ac.uk

Quality of care may be measured in many ways including direct observation, prospective data collection by staff, the use of simulated patients and evaluation of videotapes of episodes of care.2,3 However, case-notes (referred to as medical records and, in North America, as charts) are the most widely used (and studied) source of information. Review of case-notes is used routinely by the world's largest provider of medical services (Medicare). It formed the basis of the Harvard Medical Practice Study,4 the related Colorado–Utah Study5 and the Quality in Australian Health Care Study.6 It does not disrupt normal patterns of care and can be conducted independently of care-givers to reduce observer bias.7 Given the importance of case-note review in both quality improvement and research, it is important to know how reliable it is.

Goldman8 conducted a review of 12 studies published between 1959 and 1991.4,9-19 We extended and updated that review and sought to introduce a taxonomical layer by examining different types of quality measurement with respect to what is being assessed and how it is being assessed.

First, we consider what is being assessed. We distinguish three types of endpoint. First, quality may be assessed with respect to clinical processes where the correct standard of care is judged to be violated, because the right care was not given (error of omission) or the wrong care was given (error of commission). We will refer to these assessments as measurements of process. Second, quality may be assessed in terms of the occurrence of an adverse event or outcome. Lastly, quality may be assessed in terms of adverse events that can be attributed to clinical process error; these we call measures of causality.

Next, we turn to how assessments are made. We discern two broad methods: explicit (algorithmic) methods, based on highly specified and detailed checklists, and implicit (holistic) methods, based on expert judgement. The latter may be subclassified as unstructured (where there is no predetermined guidance for the reviewer) or structured (where the reviewer is guided to look for certain categories of error or adverse event). We will refer to the [Process, Causality, Adverse Event] axis as Focus and the [Implicit, Explicit] axis as Style. This typology is summarized in Figure 1.

Intuitively, one might hypothesize that the explicit methods would be more reliable than the implicit methods and that outcome measures would be more reliable than either process or causality. We examine these hypotheses.

Methods

Summary measures of reliability


The statistic used for the calculation of reliability can affect the measurement obtained.20 We were constrained in our choice, since the most widely used method was Cohen's kappa,21 either in its weighted or unweighted form. Also, some studies used the intraclass correlation coefficient (ICC) as the measure of inter-rater agreement. These two methods are equivalent for the case where ratings have multiple ordered categories.22 Moreover, an ICC calculated for binary (0/1) data is identical to kappa calculated under the assumption of rater homogeneity (the so-called intraclass kappa).23 Accordingly, we use the term kappa generically, to encompass all versions of Cohen's kappa as well as the few studies that calculate an ICC.

Kappa is affected by the prevalence of the event on which reviewers' judgements are required. An increase in the prevalence of an event being observed will, of itself, generate an increase in kappa until the event rate reaches 50%, following which it will again decline.24 Prevalence is estimated from the marginal distribution of the individual raters' assessments.25-28 Thus, kappa may depend on the overall frequency of error or adverse event in a study, even though the raters' fundamental measurement process does not change. We therefore analysed our data-set with a view to detecting any effect of prevalence on kappa and adjusted for prevalence in the analysis where possible.

In Cohen's original paper,21 reliability is defined in terms of a direct comparison between the judgements of two reviewers. However, more than two reviewers may be involved. For instance, the Harvard Medical Practice Study29 averaged the rating of two reviewers and then calculated reliability between that average rating and an average (consensus) rating obtained independently from a group of experts. Takayanagi et al.30 compared panels of raters of the quality of care. Likewise, Thomas et al.3 compared panels of reviewers. Rubin et al.20 constructed yet another variant: they quoted the reliability of a single review compared with an average of a panel of reviewers. Higher levels of agreement will occur when a measurement is an average over several raters than when individual raters are compared, as evidenced by the Spearman–Brown formula. Thus, including studies such as the Harvard Practice Study or Rubin's method would have resulted in higher (more flattering) measurements of agreement than those found in other studies. Therefore, we restricted ourselves to comparisons of two or three reviewers making separate judgements about the quality of care. (In the event, the Harvard Medical Practice Study data were re-analysed by Localio et al.31 in a way that did allow just two raters to be compared, and this study was included in the analysis.)
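The prevalence effect can be illustrated with a small simulation (ours, not part of the original review). In the sketch below, two hypothetical raters apply the same fixed sensitivity and specificity to a set of case-notes and only the prevalence of poor-quality care changes; the sensitivity, specificity and sample size are illustrative assumptions.

    import numpy as np

    def cohen_kappa(table):
        """Unweighted Cohen's kappa from a k x k agreement table."""
        table = np.asarray(table, dtype=float)
        n = table.sum()
        p_obs = np.trace(table) / n                                     # observed agreement
        p_exp = (table.sum(axis=0) * table.sum(axis=1)).sum() / n**2    # chance agreement
        return (p_obs - p_exp) / (1 - p_exp)

    def simulate_kappa(prevalence, sens=0.8, spec=0.9, n=10_000, seed=0):
        """Two raters with identical sensitivity/specificity judging the same
        notes; only the prevalence of poor-quality care is varied."""
        rng = np.random.default_rng(seed)
        truth = rng.random(n) < prevalence          # True = poor-quality care
        def rate():
            return np.where(truth,
                            rng.random(n) < sens,        # detects real problems
                            rng.random(n) < 1 - spec)    # false alarms on good care
        r1, r2 = rate(), rate()
        table = np.array([[np.sum(~r1 & ~r2), np.sum(~r1 & r2)],
                          [np.sum(r1 & ~r2),  np.sum(r1 & r2)]])
        return cohen_kappa(table)

    for prev in (0.05, 0.20, 0.50):
        print(f"prevalence {prev:.2f}: kappa = {simulate_kappa(prev):.2f}")

Because the raters' error characteristics are held constant, any change in kappa across the three runs reflects prevalence alone, which is the sense in which kappa is prevalence dependent.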

Search strategy
We used the National Library of Medicine (NLM) Gateway facility to search the MEDLINE and SCISEARCH databases using 21 search strings shown in Table 1. As can be seen, many produced massive yields.

Figure 1 Taxonomy of comparison type for studies of inter-rater reliability. Comparisons are classified by Focus (Process Error (Clinical), Causality or Adverse Event) and by Style (Explicit (Algorithmic) or Implicit (Holistic), with implicit review further divided into Structured and Unstructured). Each instance where inter-rater agreement was measured was classified according to focus and then again according to style. Within these categories (focus and style), the material was broken down into the mutually exclusive categories shown. Note that a single paper might contain more than one measurement of inter-rater agreement



Table 1 Productive search strings used for the location of Goldman papers via Gateway, ordered by decreasing productivity. The six most productive strings, whose abstracts were reviewed, are shown with their yields.

String                                             C   Total papers found   Productivity
Kappa care                                         3   66                   4.55
Medical records peer review* care                  3   119                  2.52
Peer review quality medical care records           2   92                   2.17
Outcome and process assessment (MH) peer review    1   61                   1.64
Care quality medical records review*               4   257                  1.56
Medical records (MH) review* care quality          3   219                  1.37

The remaining 15 strings (peer review; medical review* care; care quality medical records; peer medical review* care; medical records review* care; medical records (MH); kappa; outcome and process assessment; care quality medical review*; care quality; medical review; measurement of care; (patient or medical) and records; care; review*) had productivities ranging from 1.10 down to 0.03, with yields of up to 354,651 papers. The column headed C (for coverage) contains the number of Goldman papers found by the string in the first column; cumulatively, the 21 strings located 11 of the 12 Goldman papers. * = wildcard; MH = MeSH term

Scrutinizing the abstracts of all of these to identify relevant papers would have been a huge task. We therefore used Goldman's papers as a benchmark: we selected those strings that were most efficient at uncovering his original 12 references (we were unable to replicate Goldman's method since he did not report his search method explicitly). Table 1 shows the search strings that were found collectively to locate 11 of the 12 Goldman papers we were able to find via Gateway. The most productive search strings are defined as those that gave the highest value of the ratio:

Productivity = 100 x (No. of Goldman papers found) / (Total no. of papers found)

Searches using the most productive strings and their variants were continued until the level of productivity became prohibitively costly of resources. For example, using the string 'peer review quality medical care records', which had a productivity of 2.17 (Table 1), delivered two Goldman papers among 92 hits, which meant that on average 46 papers needed to be inspected before a potentially useful paper was found. On the other hand, use of the string 'care quality', with a productivity of 0.24, required, on average, the inspection of 408 papers for every potentially useful paper found, and this clearly represented an unacceptable level of cost. There was a step decrease in productivity beyond the sixth string (Table 1). We reviewed the abstracts of the papers identified by the most productive six strings. This resulted in identification of 54 papers that appeared promising. We obtained all papers that were available via the Internet or from the University of Birmingham library. These papers were included in our analysis if they contained information about the degree of agreement between reviewers. The object of investigation was a set of case-notes, and the topic of study was quality of care as reflected by process, adverse event or causality. Additional papers were uncovered by examining reference lists in the retrieved articles. In all, 32 eligible papers were found, including nine of the 12 original Goldman papers. We were also previously aware of one further paper31 that we added to our total, yielding 33 papers in all.

Data extraction

The papers were read independently by two investigators (AE and JP) who extracted the data (Appendix A). Where discrepancies occurred, RL read the article and arbitrated. The most frequent point of disagreement was between the unstructured and structured categories of implicit reviews. In some cases, inter-rater agreement had to be calculated from source data, and this was carried out by AE and confirmed by AG.

Results

Excluded papers


Our search yielded 33 papers.2-4,6,9,10,12-14,17-20,24,29-47 Six papers3,12,17,20,29,30 were excluded because they used large numbers of raters. The remaining 27 papers had all used two or three raters. One paper32 analysed 70 items relating to quality of care, but did not measure inter-rater agreement for each of the items and failed to provide information necessary for classification according to style or focus: it gave only an average kappa value across all items (0.9) and the kappa value for the lowest scoring item (0.6). We excluded this paper, leaving 26 papers that yielded comparable data; these are listed in Appendix A.


Statistical methods
In some papers, the assessments entailed the same combination of focus and style; in others, kappas from two or more focus/style combinations were reported. For example, the reliability of focus and style measurements might have been assessed for more than one clinical domain. For analytical purposes, all assessments from the same focus/style typology within a given paper were combined to form a cluster. A nested hierarchical ANOVA (analysis of variance; assessment within cluster within focus x style) was conducted using MINITAB Release 14. The impact of prevalence was explored further by introducing prevalence as a covariate into the ANOVA for kappa.
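The analysis just described was run as a nested hierarchical ANOVA in MINITAB; the sketch below (ours) sets up a broadly analogous analysis in Python as a linear mixed model, with cluster as a random grouping factor and prevalence as an optional covariate. The file name assessments.csv and its column names (kappa, focus, style, cluster, prevalence) are assumptions for illustration, not artefacts of the paper.

    import pandas as pd
    import statsmodels.formula.api as smf

    # One row per assessment: its kappa, focus, style, parent cluster and,
    # where recorded, the prevalence of unsatisfactory ratings.
    df = pd.read_csv("assessments.csv")

    # Focus and style as crossed fixed effects; clusters (assessments sharing a
    # focus/style combination within a paper) as a random effect.
    base = smf.mixedlm("kappa ~ C(focus) * C(style)", data=df, groups="cluster").fit()
    print(base.summary())

    # Repeat with prevalence as a covariate, on the subset where it was recorded.
    adj = smf.mixedlm("kappa ~ C(focus) * C(style) + prevalence",
                      data=df.dropna(subset=["prevalence"]),
                      groups="cluster").fit()
    print(adj.summary())

A mixed model is not identical to the nested ANOVA reported in the paper, but it captures the same idea: assessments are not independent, because several may belong to the same cluster within a paper.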

Inter-rater reliability
Between them, the 26 papers reported kappa values from 66 separate assessments (Appendix A). A summary of the data, cross-classified by focus and style, is presented in Table 2. As hypothesized, concordance between raters appears to be greater for reviews guided by explicit criteria as compared with implicit reviews. The reliability is also higher for reviews that are focused towards outcome rather than process. These results are shown graphically in Figure 2. Just eight papers reported kappas from two or more focus/style combinations, and a total of 39 clusters were formed for the hierarchical analysis. A significant cluster effect was obtained (P = 0.034), with an ICC of 0.38, reflecting the degree of similarity between assessments within a paper. Significant results for both the effect of focus (P = 0.033) and style (P = 0.034) were found, after allowing for the cluster effect. The interaction effect (style x focus) was not significant (P = 0.973). The following conclusions may be drawn:

- kappa tends to be higher for explicit than for implicit reviews;
- kappa tends to decline where greater concern for process is present in the assessment; and
- these effects operate independently of one another.

In order to take into account the fact that differences in the frequency of events may account for systematic changes in kappa values, we examined the effect of event rates by examining each paper to ascertain the proportion of ratings classified as unsatisfactory; we refer to this as prevalence. Prevalence could be ascertained for 36 of the 66 assessments. The overall correlation between prevalence and kappa was 0.44 (P < 0.008), thereby confirming the (expected) positive correlation between kappa value and prevalence. After adjusting for prevalence, the effects of Focus (P = 0.114) and Style (P = 0.112) were no longer significant. The directions of effect are, however, the same as before and these comparisons are of low statistical power, since prevalence was not recorded in all cases. Furthermore, if we combine the two categories of implicit review into one larger category, a significant difference between styles of review (P = 0.024) is found even after allowing for prevalence.

Figure 2 Mean kappa values and two standard error bars by focus and style. Kappa (0.0 to 1.0) is plotted against focus (Adverse Event, Causality, Process) for each style (Explicit, Implicit (structured), Implicit (unstructured))

Table 2 Summary of kappa data classified by focus and style from 66 assessments in 26 papers

                                     Adverse event   Causality     Process       Total
Explicit
  No. of assessments (clusters)      4 (3)           1 (1)         4 (3)         9 (7)
  Cases per assessment: median       31              15            28            25
  Kappa: mean (SD)                   0.70 (0.13)     0.64 (-)      0.55 (0.21)   0.62 (0.17)
Implicit, structured
  No. of assessments (clusters)      7 (5)           9 (6)         21 (9)        37 (20)
  Cases per assessment: median       237             140           95            132
  Kappa: mean (SD)                   0.56 (0.16)     0.39 (0.14)   0.35 (0.19)   0.40 (0.19)
Implicit, unstructured
  No. of assessments (clusters)      3 (2)           7 (5)         10 (5)        20 (12)
  Cases per assessment: median       37              225           171           171
  Kappa: mean (SD)                   0.51 (0.27)     0.40 (0.12)   0.32 (0.18)   0.38 (0.18)
Total
  No. of assessments (clusters)      14 (10)         17 (12)       35 (17)       66 (39)
  Cases per assessment: median       166             140           89            108
  Kappa: mean (SD)                   0.59 (0.18)     0.41 (0.14)   0.37 (0.20)   0.42 (0.20)

Each paper provides one or more assessments, each with a kappa value classified by style and focus. Cases refers to the number of medical records reviewed in each assessment. Means and standard deviations (SDs) for kappa are calculated across assessments. Clusters were formed within the papers (average size of cluster = 66/39 = 1.7) by pooling assessments with the same focus and style


Discussion
Mean values of kappa are higher in studies that evaluate adverse events rather than processes and that use explicit review rather than either type of implicit review. There are three broad reasons that could explain these findings. The first reason is that adverse events and explicit criteria are both more clearly defined and discernible than their alternatives. The second reason is that these are artefacts of changing prevalence, given the use of kappa as a measure of inter-rater agreement. The third reason could be that some other type of confounder is at work; for example, that adverse events are measured in one type of disease and process in another. The data-set we had was far too small to test for the effects of disease or clinical setting on reliability. However, in over half the cases it was possible to control for prevalence. After controlling for prevalence, the statistically significant associations between kappa and both style and focus disappeared. That said, the direction of effect remained unchanged and, in the case of style, the statistical association remained significant if semi-structured and unstructured methods were combined. Tentatively, we think it is reasonable to conclude that the associations that we observed (both for style and focus) are partially, but perhaps not totally, explained by the inherent nature of the tasks.

Intuitively, one might expect explicit review to yield higher reliability since it is based on the reviewer being guided in an overt way by a prescribed set of norms of good practice and then applying these norms in the evaluation of a set of case-notes. Implicit review, on the other hand, relies on the reviewer making judgements through the employment of relatively uncodified knowledge held in his or her mind and perhaps tailored to the circumstances of a specific case. The measurement process for explicit review does not include the development of the criteria for forming the algorithms. The criteria and algorithms that result are taken as fixed, and thus any imprecision from that step is removed from the estimate of reliability. The measurement procedure in implicit review requires the reviewers to form their criteria and apply them, and thus both sources of variability are included in the measurement of reliability.

Although explicit methods appear to yield higher reliability than implicit alternatives, they have the disadvantage of missing elements not taken into account during the prescription of an explicit framework. A way around this might be to adopt a mixed strategy in which a reviewer addresses a set of designed explicit requirements, but is then invited to make any further implicit contribution that might be forthcoming.

The mean kappa values in our review are moderate to good, ranging from 0.32 (implicit review of process) to 0.70 (explicit review of adverse events). It is possible that all these results are somewhat inflated, since studies reporting low reliability may be submitted or accepted for publication less frequently than those with better results.

In Cohen's21 original work, kappa was introduced as a measure of agreement for use with nominal (i.e. unordered) categories. But it is now recognized that there are better alternatives to kappa for ordered categories,48 which include the tetrachoric and polychoric correlation coefficients.49 A major advantage of these measures is that they overcome the problem of sensitivity to prevalence. The association between kappa and the prevalence of error in our review suggests that further consideration be given to using tetrachoric and polychoric correlation coefficients in studies of inter-rater reliability, particularly in comparative studies across different clinical settings. In the meantime, it is worth noting that good care, with low error rates, is likely to be associated with lower inter-rater agreement, because of the correlation between kappa and prevalence.
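As an illustration of the suggested alternative, the sketch below (ours, not from the paper) estimates a tetrachoric correlation from a 2 x 2 agreement table by maximum likelihood. The two raters' binary judgements are modelled as dichotomized bivariate-normal latent variables, so the marginal prevalence is absorbed into the thresholds rather than into the correlation itself; the example table is hypothetical.

    import numpy as np
    from scipy.stats import norm, multivariate_normal
    from scipy.optimize import minimize_scalar

    def tetrachoric(table):
        """Maximum-likelihood tetrachoric correlation for a 2x2 agreement table
        (rows = rater 1, columns = rater 2; first category = satisfactory)."""
        table = np.asarray(table, dtype=float)
        n = table.sum()
        # Thresholds on the latent scales are fixed by the marginal proportions.
        t1 = norm.ppf(table[0, :].sum() / n)
        t2 = norm.ppf(table[:, 0].sum() / n)

        def neg_loglik(rho):
            both_low = multivariate_normal.cdf([t1, t2], mean=[0.0, 0.0],
                                               cov=[[1.0, rho], [rho, 1.0]])
            p1, p2 = norm.cdf(t1), norm.cdf(t2)
            probs = np.array([[both_low, p1 - both_low],
                              [p2 - both_low, 1.0 - p1 - p2 + both_low]])
            probs = np.clip(probs, 1e-12, 1.0)
            return -(table * np.log(probs)).sum()   # multinomial log-likelihood

        return minimize_scalar(neg_loglik, bounds=(-0.999, 0.999),
                               method="bounded").x

    # Hypothetical table: 200 case-notes judged satisfactory/unsatisfactory
    # by two independent raters.
    table = [[150, 15],
             [10, 25]]
    print("tetrachoric correlation:", round(tetrachoric(table), 2))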

Acknowledgements
This study was supported by an MRC Patient Safety Research Grant and the National Health Service R&D grant supporting the Coordinating Centre for Research Methodology.

References
1 Lilford R, Mohammed MA, Spiegelhalter D, Thomson R. Use and misuse of process and outcome data in managing performance of acute medical care: avoiding institutional stigma. Lancet 2004;363:1147-54
2 Michel P, Quenon JL, de Sarasqueta AM, Scemama O. Comparison of three methods for estimating rates of adverse events and rates of preventable adverse events in acute care hospitals. BMJ 2004;328:199
3 Thomas EJ, Lipsitz SR, Studdert DM, Brennan TA. The reliability of medical record review for estimating adverse event rates. Ann Intern Med 2002;136:812-6
4 Brennan TA, Localio RJ, Laird NL. Reliability and validity of judgments concerning adverse events suffered by hospitalized patients. Med Care 1989;27:1148-58
5 Thomas EJ, Studdert DM, Burstin HR, et al. Incidence and types of adverse events and negligent care in Utah and Colorado. Med Care 2000;38:261-71
6 Wilson RM, Runciman WB, Gibberd RW, Harrison BT, Newby L, Hamilton JD. The Quality in Australian Health Care Study. Med J Aust 1995;163:458-71
7 Lilford RJ, Mohammed MA, Braunholtz D, Hofer TP. The measurement of active errors: methodological issues. Qual Saf Health Care 2003;12(Suppl 2):ii8-12
8 Goldman RL. The reliability of peer assessments of quality of care. JAMA 1992;267:958-60
9 Empire State Medical, Scientific and Educational Foundation, Inc. Rochester region perinatal study. Medical review project. N Y State J Med 1967;67:1205-10
10 Bigby J, Dunn J, Goldman L, et al. Assessing the preventability of emergency hospital admissions. A method for evaluating the quality of medical care in a primary care facility. Am J Med 1987;83:1031-6
11 Brook RH. Quality of Care Assessment: A Comparison of Five Methods of Peer Review. Washington, DC: US Department of Health, Education, and Welfare, Public Health Service, Health Resources Administration, Bureau of Health Services Research and Evaluation, 1973 (publication HRA 74-3100)

12 Caplan RA, Posner KL, Cheney FW. Effect of outcome on physician judgments of appropriateness of care. JAMA 1991;265:1957-60
13 Dubois RW, Brook RH. Preventable deaths: who, how often, and why? Ann Intern Med 1988;109:582-9
14 Hastings GE, Sonneborn R, Lee GH, Vick L, Sasmor L. Peer review checklist: reproducibility and validity of a method for evaluating the quality of ambulatory care. Am J Public Health 1980;70:222-8
15 Horn SD, Pozen MW. An interpretation of implicit judgments in chart review. J Community Health 1977;2:251-8
16 Morehead MA, Donaldson RS, Sanderson S, Burt FE. A Study of the Quality of Hospital Care Secured by a Sample of Teamster Family Members in New York City. New York, NY: Columbia University School of Public Health and Administrative Medicine, 1964
17 Posner KL, Sampson PD, Caplan RA, Ward RJ, Cheney FW. Measuring interrater reliability among multiple raters: an example of methods for nominal data. Stat Med 1990;9:1103-15
18 Rosenfeld LS. Quality of medical care in hospitals. Am J Public Health 1957;47:856-65
19 Posner KL, Caplan RA, Cheney FW. Physician agreement in judging clinical performance. Anesthesiology 1991;75(Suppl 3A):A1058
20 Rubin HR, Rogers WH, Kahn KL, Rubenstein LV, Brook RH. Watching the doctor-watchers. How well do peer review organization methods detect hospital care quality problems? JAMA 1992;267:2349-54
21 Cohen J. A coefficient of agreement for nominal scales. Educ Psychol Meas 1960;20:37-46
22 Fleiss JL, Cohen J. The equivalence of weighted kappa and the intraclass correlation coefficient as measures of reliability. Educ Psychol Meas 1973;55:613-9
23 Shoukri MM. Measures of Interobserver Agreement. Boca Raton, FL: CRC Press, Chapman & Hall, 2003
24 Hayward RA, Hofer TP. Estimating hospital deaths due to medical errors: preventability is in the eye of the reviewer. JAMA 2001;286:415-20
25 Fleiss JL, Nee JCM, Landis JR. Large sample variance of kappa in the case of different sets of raters. Psychol Bull 1979;86:974-7
26 Hutchinson TP. Focus on psychometrics. Kappa muddles together two sources of disagreement: tetrachoric correlation is preferable. Res Nurs Health 1993;16:313-6
27 Tanner MA, Young MA. Modelling agreement among raters. J Am Stat Assoc 1985;80:175-80
28 Williamson JM, Manatunga AK. Assessing interrater agreement from dependent data. Biometrics 1997;53:707-14
29 Brennan TA, Leape LL, Laird NM, et al. Incidence of adverse events and negligence in hospitalized patients. Results of the Harvard Medical Practice Study I. N Engl J Med 1991;324:370-6
30 Takayanagi K, Koseki K, Aruga T. Preventable trauma deaths: evaluation by peer review and a guide for quality improvement. Emergency Medical Study Group for Quality. Clin Perform Qual Health Care 1998;6:163-7
31 Localio AR, Weaver SL, Landis JR, et al. Identifying adverse events caused by medical care: degree of physician agreement in a retrospective chart review. Ann Intern Med 1996;125:457-64
32 Rottman SJ, Schriger DL, Charlop G, Salas JH, Lee S. Online medical control versus protocol-based prehospital care. Ann Emerg Med 1997;30:62-8
33 Dobscha SK, Gerrity MS, Corson K, Bahr A, Cuilwik NM. Measuring adherence to depression treatment guidelines in a VA primary care clinic. Gen Hosp Psychiatry 2003;25:230-7
34 Forbes SA, Duncan PW, Zimmerman MK. Review criteria for stroke rehabilitation outcomes. Arch Phys Med Rehabil 1997;78:1112-6
35 Hofer TP, Bernstein SJ, DeMonner S, Hayward RA. Discussion between reviewers does not improve reliability of peer review of hospital quality. Med Care 2000;38:152-61
36 Hofer TP, Asch SM, Hayward RA, et al. Measuring quality of care: is there a role for peer review? BMC Health Serv Res 2004. Available at: http://www.biomedcentral.com/1472-6963/4/9
37 Rubenstein LV, Kahn KL, Reinisch EJ, et al. Changes in quality of care for five diseases measured by implicit review, 1981 to 1986. JAMA 1990;264:1974-9
38 Ashton CM, Kuykendall DH, Johnson ML, Wray NP. An empirical assessment of the validity of explicit and implicit process-of-care criteria for quality assessment. Med Care 1999;37:798-808
39 Baker GR, Norton PG, Flintoft V, et al. The Canadian Adverse Events Study: the incidence of adverse events among hospital patients in Canada. CMAJ 2004;170:1678-86
40 Weingart SN, Davis RB, Palmer RH, et al. Discrepancies between explicit and implicit review: physician and nurse assessments of complications and quality. Health Serv Res 2002;37:483-98
41 Saliba D, Kington R, Buchanan J, et al. Appropriateness of the decision to transfer nursing facility residents to the hospital. J Am Geriatr Soc 2000;48:154-63
42 Lorenzo S, Lang T, Pastor R, et al. Reliability study of the European appropriateness evaluation protocol. Int J Qual Health Care 1999;11:419-24
43 Hayward RA, McMahon LF Jr, Bernard AM. Evaluating the care of general medicine inpatients: how good is implicit review? Ann Intern Med 1993;118:550-6
44 Bair AE, Panacek EA, Wisner DH, Bales R, Sakles JC. Cricothyrotomy: a 5-year experience at one institution. J Emerg Med 2003;24:151-6
45 Smith MA, Atherly AJ, Kane RL, Pacala JT. Peer review of the quality of care. Reliability and sources of variability for outcome and process assessments. JAMA 1997;278:1573-8
46 Camacho LA, Rubin HR. Reliability of medical audit in quality assessment of medical care. Cad Saude Publica 1996;12(Suppl 2):85-93
47 Pearson ML, Lee JL, Chang BL, Elliott M, Kahn KL, Rubenstein LV. Structured implicit review: a new method for monitoring nursing care quality. Med Care 2000;38:1074-91
48 Kraemer HC, Periyakoil VS, Noda A. Kappa coefficients in medical research. Stat Med 2002;21:2109-29
49 Agresti A. Modelling ordered categorical data: recent advances and future challenges. Stat Med 1999;18:2191-207

Appendix A
Papers included in our analysis of the reliability of quality assurance, broken down by the individual assessment, i.e. comparisons of inter-rater reliability according to style, focus, prevalence and number of raters.

One row per assessment gives the focus, style, design (number of reviewers, R, and number of rating classes, C), sample size, kappa and, where available, prevalence, together with explanatory notes. The 26 included papers are: Dubois and Brook (1988); Rosenfeld (1957); Hastings (1980); Bigby (1987); Posner (1991); Brennan (1989); the Rochester study (1967); Bair (2003); Camacho and Rubin (1996); Michel (2004); Rubenstein (1990); Wilson (1995); Ashton (1999); Hofer (2004); Pearson (2000); Saliba (2000); Smith (1997); Weingart (2002); Dobscha (2003); Forbes (1997); Hayward and Hofer (2001); Hayward (1993); Hofer (2000); Lorenzo (1999); Baker (2004); and Localio (1996).

Papers not included in the analysis: Thomas et al.;3 Takayanagi;30 Rottman et al.;32 Rubin et al.;20 Brennan et al.;29 Caplan et al.;12 Posner et al.17

Key: P = process; Ca = process from outcome; AE = outcome; E = explicit; I = implicit; T = structured implicit (semi-implicit); R = number of reviewers; C = classes (see text); ? = exact number of raters unclear
