Kappa coefficient calculation using multiple ratings per subject: a special communication.


The agreement among clinicians on the scores of observational clinical measures is an increasingly important concern in the development of dependable and valid tests used in physical therapy. [1] Developers of check lists and rating scales used for clinical assessment and documentation should demonstrate that items and aggregate indexes can be scored accurately by independent raters. Especially in the early stages of test development, individual items of a clinical test should be subjected to an interobserver reliability analysis. Determination of agreement is important in order to examine the quality of the administrative, scoring, and rating procedures and to make decisions regarding needed revisions in the items for which raters demonstrate poor agreement.

The primary purpose of this special communication is to describe the use of the Kappa statistic for the estimation of interobserver reliability. Kappa is a preferred statistic to estimate interobserver agreement for nominal or ordinal scale data. Some large statistical packages (eg, BMDP statistical software) [2] provide a method to calculate Kappa for instances in which there are two raters per subject. However, this is a limited condition; many projects involve more than two ratings per subject, or not all subjects are rated by the same two observers. We are not aware of any statistical package that calculates Kappa in designs in which there are more than two ratings per subject. The objectives of this communication are to describe the applications of Kappa and to describe a short FORTRAN program that calculates Kappa for conditions of multiple ratings per subject.

Kappa Statistic

The initial consideration in the choice of an appropriate agreement statistic is to determine the type or level of the data. The estimation of the agreement and parallelism of continuous (interval or ratio) data can be achieved by the use of an intraclass correlation coefficient (ICC) via an analysis-of-variance model. [3,4] Krebs describes the different models of the ICC and provides a short program for the determination of the variance components and the error terms needed for the calculation of the ICC. [5] In studies in which the data are categorical (nominal or ordinal), other indexes such as Kappa are needed. [6-8] Many test items and rating scales used in the clinical practice of physical therapy have nominal or ordinal scale properties. Examples include ratings of muscle tone, grades of manual muscle testing, and levels of physical assistance in functional scales.

There are three general approaches to the estimation of the reliability of categorical data. [9] The first approach is to use a descriptive, or proportional, index of agreement. Calculation of the simple percentage of agreement between raters is the most common descriptive method for estimating agreement of categorical data. There are two major limitations in using the percentage of agreement in the estimation of interobserver agreement. The percentage of agreement has a potentially infinite range; therefore, reported values cannot be interpreted within a standard range such as 0.0 to ±1.0. An additional limitation is the lack of attention to the factor of chance agreement. Chance agreement can be quite large if the variability of observations is limited and few categories are used by the raters. For example, ratings of abnormal gait patterns may demonstrate very high percentages of agreement in healthy subjects. This result may be due largely to the small proportions of gait deviations seen in the healthy population. In such instances, we have almost no information on the agreement of the rating of gait deviations. Variability of observations across the entire scale of ratings must occur to adequately assess reliability. Percentages of agreement can be particularly misleading where there is not an even distribution of scale points. Although percentage of agreement has appeal because of its ease of computation and apparent straightforward interpretation, several authors have described its limitations and lack of statistical suitability. [8,10,11]

A second general approach to the estimation of reliability of categorical data is to use coefficients of association. Rank-order correlations and chi-square indexes have also been proposed to examine interobserver reliability. These indexes, however, indicate degrees of association of paired scores, not agreement. The covariance of paired ratings, if systematic error is present, may be very different from actual agreement. Hartman [9] and Hollenbeck [10] have reviewed the numerous indexes of association that have been used in the estimation of reliability.

The most useful approach to the estimation of reliability of categorical data is the application of a correlational-type statistic, such as Kappa, that corrects for chance. Kappa is interpreted as the proportion of agreement among raters after chance agreement has been removed [12] and is expressed symbolically by the following equation:

Kappa = (proportion of observed agreement - chance agreement) / (1 - chance agreement)

Chance agreement is estimated by the proportion of agreements that would be expected if the observers' ratings were completely random. Chance agreement increases as the variability of observed ratings decreases.
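
The calculation for the two-rater condition can be illustrated with a short sketch. The Python below is not part of the original communication; the function name `cohen_kappa` and both 2 x 2 tables are hypothetical, chosen only to show how observed agreement and chance agreement enter the calculation. Chance agreement is estimated here from the product of the two raters' marginal proportions, which is the usual form for two raters.

```python
def cohen_kappa(table):
    """Return (observed, chance, kappa) for a square two-rater agreement table.

    table[i][j] is the number of subjects placed in category i by the first
    rater and in category j by the second rater (hypothetical layout).
    """
    total = sum(sum(row) for row in table)
    k = len(table)
    # Observed agreement: proportion of subjects on the main diagonal.
    observed = sum(table[i][i] for i in range(k)) / total
    # Marginal proportions for each rater.
    row_props = [sum(row) / total for row in table]
    col_props = [sum(table[i][j] for i in range(k)) / total for j in range(k)]
    # Chance agreement: agreement expected if the ratings were independent.
    chance = sum(r * c for r, c in zip(row_props, col_props))
    if chance == 1.0:
        raise ValueError("Kappa is undefined when chance agreement equals 1")
    kappa = (observed - chance) / (1 - chance)
    return observed, chance, kappa


# Hypothetical data: one table with varied ratings, one with nearly all
# subjects rated in the first category by both raters.
balanced = [[20, 5], [10, 15]]
skewed = [[90, 4], [4, 2]]

for name, table in (("balanced", balanced), ("skewed", skewed)):
    observed, chance, kappa = cohen_kappa(table)
    print(f"{name}: observed={observed:.2f} chance={chance:.2f} kappa={kappa:.2f}")
```

In the hypothetical skewed table the raters agree on 92% of the subjects, yet Kappa is only about .29 because chance agreement is already about .89. This is the situation described earlier for ratings with little variability.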

The use of Kappa requires minimal assumptions about the underlying nature of the data. Three data-collection conditions should be met: 1) the subjects to be rated are independent of each other, 2) the raters score the subjects in an independent fashion, and 3) the rating categories are mutually exclusive and exhaustive. [11] The flexibility of different forms of Kappa is also a major advantage. Kappa is appropriate for dichotomous (nominal) and polychotomous (ordinal) data, where there are two or more raters per subject. [12,13] Kappa can be calculated for each scale point or averaged into a generalized Kappa across the entire set of ratings. A weighted Kappa format has also been developed to examine the error patterns of ratings in situations in which certain errors are considered more critical so that the more serious the error, the larger the weight. [14] A complete sampling theory has been devised for Kappa. Standard error estimates can be calculated, thus providing the opportunity to determine whether the statistic is significantly different from chance. Its metric properties are closely linked to the ICC. [8] Theoretical Kappa values range from -1 to +1; however, extreme values are often restricted by reduced variability of the data. A value approximating zero is interpreted as close to chance agreement, whereas values less than zero are interpreted as worse than chance agreement. Values of Kappa must be interpreted in view of the frequent restriction in higher values. Landis and Koch interpreted Kappa values as follows: < .40 = poor, ≥ .40 to < .75 = fair to good, and ≥ .75 = excellent. [7]
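
The weighted form of Kappa can be sketched in the same way. The Python below is an illustration only and is not taken from the original communication. The function name `weighted_kappa`, the default linear weights (the distance between categories), and the example tables are all assumptions; in practice the weights would be chosen to reflect how serious each type of disagreement is judged to be.

```python
def weighted_kappa(table, weights=None):
    """Weighted Kappa for two raters using disagreement weights.

    table[i][j] counts subjects rated i by one rater and j by the other.
    weights[i][j] is the penalty for that disagreement (0 on the diagonal);
    the default |i - j| is an assumption made for this sketch.
    """
    k = len(table)
    total = sum(sum(row) for row in table)
    if weights is None:
        weights = [[abs(i - j) for j in range(k)] for i in range(k)]
    row_props = [sum(row) / total for row in table]
    col_props = [sum(table[i][j] for i in range(k)) / total for j in range(k)]
    # Weighted disagreement actually observed.
    observed_dis = sum(
        weights[i][j] * table[i][j] / total for i in range(k) for j in range(k)
    )
    # Weighted disagreement expected by chance from the marginals.
    expected_dis = sum(
        weights[i][j] * row_props[i] * col_props[j]
        for i in range(k)
        for j in range(k)
    )
    if expected_dis == 0:
        raise ValueError("weighted Kappa is undefined for these marginals")
    return 1 - observed_dis / expected_dis


# Hypothetical three-point ordinal scale: the same number of disagreements,
# placed one category apart (near) or two categories apart (far).
near = [[10, 5, 0], [0, 10, 0], [0, 0, 10]]
far = [[10, 0, 5], [0, 10, 0], [0, 0, 10]]
print(f"near={weighted_kappa(near):.2f} far={weighted_kappa(far):.2f}")
```

With these hypothetical tables, the disagreements that fall two categories apart lower the coefficient more than the same number of disagreements one category apart, which is the behavior the weighting is meant to capture.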

Application of Kappa

Rehabilitation medicine researchers have used Kappa to examine the agreement of raters in numerous situations. Plewis and Bax discussed many of the "abuses" that have been present in the developmental medicine literature, and they advocated the use of Kappa for reliability studies. [15] Sheikh promoted the use of Kappa to determine the reliability of disability scales. [16] Haley et al used Kappa to present the findings of the interobserver and intraobserver reliability of items on an infant movement assessment tool. [17] Although Kappa has many appropriate applications, it has certain limitations. For instance, Kappa treats all of the raters similarly so that when one or more of the ratings are considered to be standard, other procedures may be more appropriate. [18,19]

An example of the use of Kappa for a two-rater condition, adapted from a study on infant movement assessment, [17] is presented in Table 1. Agreements between the two raters are located in the main diagonal. The disagreements are located in the off-diagonal cells. A strong advantage of the Kappa format is the ease of examining not only the degree of error but also the pattern of errors. In this example, the Kappa value of .51 is acceptable. Note how the percentage of agreement and the Kappa coefficient differ and how chance agreement is factored into the calculation of the Kappa coefficient.

Use of the FORTRAN Program

Fleiss described an adaptation and calculation procedure for Kappa to be used with multiple ratings per subject in which ratings can be performed by the same set of raters or by different raters. [12] Individual Kappa values are calculated for each scale value and then combined into an overall Kappa coefficient. When the number of ratings per subject is equal, approximate standard errors can be calculated and used to compute a Z statistic. With a normal distribution, the Z statistic can be used to test the null hypothesis that Kappa is equal to zero. A FORTRAN program was written as an alternative to the hand-calculation procedure. The output of the computer program provides a generalized Kappa value for the entire item, an estimate of standard error, and a Z score. To verify the accuracy of the FORTRAN program calculations, we used data from an example provided by Fleiss involving five ratings for 10 subjects (Tab. 2). [12] The FORTRAN program requires a FORTRAN compiler and can be modified to include various numbers of subjects and ratings per subject. The results we obtained with this modified program* are identical to the results described by Fleiss. [12]
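
The multiple-rating procedure can be sketched as follows. This Python is not the authors' FORTRAN program, and the data are hypothetical rather than the values of Table 2. The function name `fleiss_kappa` is an assumption, as is the particular large-sample standard error used (the null-hypothesis form commonly associated with Fleiss and colleagues); whether the original program used exactly this expression is not stated in the text. The sketch assumes an equal number of ratings for every subject.

```python
from math import sqrt


def fleiss_kappa(counts):
    """Kappa for multiple ratings per subject (equal number of ratings).

    counts[i][j] is the number of ratings that placed subject i in
    category j. Returns (per_category, overall, standard_error, z).
    """
    n = len(counts)            # number of subjects
    m = sum(counts[0])         # ratings per subject
    if m < 2 or any(sum(row) != m for row in counts):
        raise ValueError("each subject needs the same number (>= 2) of ratings")
    k = len(counts[0])
    # Overall proportion of ratings falling in each category.
    p = [sum(row[j] for row in counts) / (n * m) for j in range(k)]
    pq = [pj * (1 - pj) for pj in p]
    total_pq = sum(pq)
    if total_pq == 0:
        raise ValueError("Kappa is undefined when only one category is used")
    # Individual Kappa for each scale value (None if the value was never used).
    per_category = []
    for j in range(k):
        if pq[j] == 0:
            per_category.append(None)
            continue
        disagreements = sum(row[j] * (m - row[j]) for row in counts)
        per_category.append(1 - disagreements / (n * m * (m - 1) * pq[j]))
    # Overall Kappa: weighted average of the individual values.
    overall = sum(
        w * kj for w, kj in zip(pq, per_category) if kj is not None
    ) / total_pq
    # Approximate variance under the null hypothesis that Kappa is zero.
    variance = 2 * (
        total_pq ** 2 - sum(pj * (1 - pj) * (1 - 2 * pj) for pj in p)
    ) / (n * m * (m - 1) * total_pq ** 2)
    se = sqrt(variance)
    return per_category, overall, se, overall / se


# Hypothetical data: 6 subjects, 3 ratings each, 3 scale values.
ratings = [[3, 0, 0], [0, 3, 0], [0, 0, 3], [2, 1, 0], [0, 2, 1], [3, 0, 0]]
per_category, kappa, se, z = fleiss_kappa(ratings)
print(f"kappa={kappa:.2f} se={se:.2f} z={z:.2f}")
```

Each row of the input holds the tally of ratings for one subject, so subjects on whom the raters disagreed appear as rows with more than one nonzero entry.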

To demonstrate the application of Kappa, we used this program to assess the interobserver reliability of three independent raters on a new motor performance measure, the Tufts Assessment of Motor Performance. [20] Table 3 shows the summary output obtained from the raw data for 40 subjects rated on one item of a seven-point scale by each of three raters. The data output is formatted so that subjects who do not receive the same rating by all three raters are easily identified, as is the pattern of errors. For this item, the majority of responses occurred for rating 6; however, there was sufficient variability to examine the entire scale. In this example, Kappa (K) was very high (K = .92) and significant (Z = 17.02, p < .001).

Summary

This special communication presents background information on the Kappa coefficient and discusses the use of a FORTRAN program to calculate Kappa coefficients for data sets that involve multiple ratings per subject. Kappa has many desirable statistical properties, and its greatest advantage is the incorporation of chance agreement into its calculation. Clinicians and researchers in physical therapy are encouraged to use Kappa coefficients to determine the dependability and accuracy of their clinical rating scales and individual test items when nominal or ordinal data are involved.

Acknowledgements

Appreciation is extended to Carla DiScala, PhD, and Bruce Gans, MD, for their support and technical review of this manuscript.

* A diskette of the modified FORTRAN program is available from Dr Haley.

References

[1] Campbell SK: On the importance of being earnest about measurement, or how can we be sure that what we know is true? Phys Ther 67:1831-1833, 1987

[2] Dixon WJ, Engelman L, Frane JW, et al: BMDP Statistical Software, Berkeley, CA, University of California Press, 1985, pp 143-206

[3] Bartko JJ: On various intraclass correlation reliability coefficients. Psychol Bull 83:762-765, 1976

[4] Shrout PE, Fleiss JL: Intraclass correlations: Uses in assessing rater reliability. Psychol Bull 86:420-428, 1979

[5] Krebs DE: Intraclass correlation coefficients: Use and calculation. Phys Ther 64:1581-1589, 1984

[6] Cohen J: A coefficient of agreement for nominal scales. Educational and Psychological Measurement 20:37-46, 1960

[7] Landis JR, Koch GG: The measurement of observer agreement for categorical data. Biometrics 33:159-174, 1977

[8] Bartko JJ, Carpenter WT: On the methods and theory of reliability. J Nerv Ment Dis 163:307-317, 1976

[9] Hartman DP: Considerations in the choice of interobserver reliability estimates. J Appl Behav Anal 10:103-116, 1977

[10] Hollenbeck AR: Problems of reliability in observational research. In Sackett GP (ed): Observing Behavior: Data Collection and Analysis Methods. Baltimore, MD, University Park Press, 1978, vol 2, pp 429-443

[11] Soeken KL, Prescott PA: Issues in the use of Kappa to estimate reliability. Med Care 24:733-741, 1986

[12] Fleiss JL: The measurement of interrater agreement. In Fleiss JL: Statistical Methods for Rates and Proportions. New York, NY, John Wiley & Sons Inc, 1981, pp 212-236

[13] Conger AJ: Integration and generalization of Kappa for multiple raters. Psychol Bull 88:322-328, 1980

[14] Cohen J: Weighted Kappa: Nominal scales agreement with provision for scaled disagreement or partial credit. Psychol Bull 70:213-220, 1968

[15] Plewis I, Bax M: The uses and abuses of reliability measures in developmental medicine. Dev Med Child Neurol 24:388-389, 1982

[16] Sheikh K: Disability scales: Assessment of reliability. Arch Phys Med Rehabil 67:245-249, 1986

[17] Haley SM, Harris SR, Tada WL, et al: Item reliability of the Movement Assessment of Infants. Physical and Occupational Therapy in Pediatrics 6(1):21-39, 1986

[18] Williams GW: Comparing the joint agreement of several raters with another rater. Biometrics 32:619-627, 1976

[19] Wackerly DD, McClave JT, Rao PV: Measuring nominal scale agreement between a judge and a known standard. Psychometrika 43:213-223, 1978

[20] Gans BM, Haley SM, Hallenborg SC, et al: Description and interobserver reliability of the Tufts Assessment of Motor Performance. Am J Phys Med Rehabil 67:202-210, 1988

S Haley, PhD, PT, is Acting Director, Research and Training Center in Rehabilitation and Childhood Trauma, Tufts University School of Medicine, New England Medical Center Hospitals, Boston, MA 02111 (USA).

JS Osberg, MA, is a research analyst in the Department of Rehabilitation Medicine, New England Medical Center Hospitals.

This research was supported in part by Grant No G00830042 from the National Institute on Disability and Rehabilitation Research, US Department of Education, and in part by Grant No RR-00054 from the General Clinical Research Centers Program, Division of Research Resources, National Institutes of Health. Data analysis was performed at the Clinical Study Unit's computing facilities.

This article was submitted January 24, 1989; was with the authors for revision for four weeks; and was accepted May 22, 1989.
COPYRIGHT 1989 American Physical Therapy Association, Inc.
No portion of this article can be reproduced without the express written permission from the copyright holder.
Copyright 1989, Gale Group. All rights reserved. Gale Group is a Thomson Corporation Company.

Author: Osberg, J. Scott
Publication: Physical Therapy
Date: Nov 1, 1989