On "Test-retest reliability and minimal detectable change on balance ..." Steffen T, Seney M. Phys Ther. 2008;88:733-746.

Translating reliability coefficients into clinically meaningful representations of measurement error is a necessary and important step when the goal is to link clinical research to clinical practice. The study by Steffen and Seney (1) investigates the reliability of several balance and ambulation tests and converts the obtained coefficients into minimal detectable change (MDC) estimates. The authors apply Shrout and Fleiss (2) type 3,k intraclass correlation coefficients (ICC) to quantify relative reliability and, from these estimates, they calculate the standard error of measurement (SEM) to quantify measurement error in the same units as the original measurement. For some of the balance and ambulation tests, 2 trials were performed on each of 2 occasions (eg, Timed "Up & Go" Test [TUG]); for other tests (eg, Six-Minute Walk Test [6MWT]), a single measurement was performed on each of 2 occasions. In the former case, the authors reported a type 3,2 ICC; in the latter case, they presented a type 3,1 ICC.
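As a rough illustration of the conversion described above, the sketch below uses the commonly cited formulas SEM = SD x sqrt(1 - ICC) and MDC95 = 1.96 x sqrt(2) x SEM. Whether Steffen and Seney used exactly these constants is an assumption here, and the standard deviation and ICC in the example are hypothetical values chosen only to make the arithmetic easy to follow.

```python
import math


def sem_from_icc(sd: float, icc: float) -> float:
    """Standard error of measurement: SD * sqrt(1 - ICC)."""
    if not 0.0 <= icc <= 1.0:
        raise ValueError("icc must lie in [0, 1]")
    return sd * math.sqrt(1.0 - icc)


def mdc(sem: float, z: float = 1.96) -> float:
    """Minimal detectable change: z * sqrt(2) * SEM (z = 1.96 for 95%)."""
    return z * math.sqrt(2.0) * sem


# Hypothetical inputs (not taken from the article): SD = 10 units, ICC = .96.
example_sem = sem_from_icc(sd=10.0, icc=0.96)
example_mdc95 = mdc(example_sem)
print(f"SEM = {example_sem:.2f}, MDC95 = {example_mdc95:.2f}")
```

Because the SEM depends on sqrt(1 - ICC), a small drop in a high ICC can produce a noticeable rise in the MDC, which is why the choice of ICC type matters for the MDC estimates discussed in this letter.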
The authors' rationale for applying the type 3,k ICC was "The ICC(3,k) was used instead of the Pearson correlation coefficient (r) for test retest reliability because it assesses rating reliability by comparing the variability of different ratings of the same subject with the total variation across all ratings and all subjects." (1)(pp740-741) In fact, the type 3,1 ICC provides an estimate of reliability similar to the Pearson r because neither coefficient accounts for a systematic difference in scores between the replicate measures (eg, either trials or occasions in Steffen and Seney's study). Presumably, in a test-retest reliability study, one is interested in both systematic and random errors, and, if this is true, the type 2,k ICC is the better choice because it includes both sources of variance in the reliability coefficient calculation. When the systematic error is zero, the type 2,k and 3,k ICCs provide identical estimates of reliability.
However, when systematic error is present, as in the case of Steffen and Seney's 6MWT data, the type 2,k ICC will be less than the type 3,k ICC.

My second reflection addresses the use of the Shrout and Fleiss classification system in situations where 2 or more facets exist, such as for the TUG data. Here, the facets are trials and occasions. A dilemma occurs when attempting to interpret the meaning of the type 3,2 ICC reported by Steffen and Seney. It is not clear if the second digit (2) refers to 2 trials, 2 occasions, or 2 trials performed on each of 2 occasions (ie, a total of 4 measurements). I propose that a generalizability (3) approach to the analysis has the potential to provide a clearer picture of the sources of variance, their magnitude, and the relative merits of averaging over either trials or occasions, or both.

To illustrate the points raised above, I have generated synthetic data for the TUG. Paralleling the design of Steffen and Seney, the synthetic data represent 2 TUG trials performed on each of 2 occasions for 10 persons. The data presented in Table 1 were contrived to illustrate a systematic difference between occasions, but no systematic difference between trials. Table 2 reports the mean scores for trials and occasions. Of interest is that the trial means averaged over occasions are almost identical; however, the occasion means differ. Stated another way, a systematic difference exists between occasions, but not between trials averaged over occasions. Table 3 displays Shrout and Fleiss type 2,1 and type 3,1 ICCs obtained by performing randomized block analysis of variance (ANOVA).
Negative variance estimates were set to zero for all analyses. Pearson r values also are reported in this table. That the inter-trial type 2,1 and 3,1 ICCs are identical to 2 decimal places reflects the similarity of trial means shown in Table 2. By contrast, the inter-occasion means shown in Table 2 differed, and this systematic difference is not reflected in the type 3,1 ICC or in the Pearson r. Accordingly, the type 3,1 ICC is greater than the type 2,1 ICC because the variance due to occasion is greater than zero.

The following section illustrates a generalizability analysis that includes both trials and occasions in a single analysis. I applied a 3-way random effects ANOVA to generalize beyond the persons, trials, and occasions composing the study sample. The ANOVA and variance components were calculated using MINITAB statistical software *, and the results appear in Table 4. Once again, negative variance estimates were set to zero. Inspection of the variance components reveals the following important findings: (1) there is a large variance among persons, and this is desirable, (2) the variance between trials averaged over occasions is zero (this reflects the near identical means reported in Table 2), (3) there is a relatively large variance due to occasions (this reflects the difference in occasion means reported in Table 2), (4) the person by occasion (P x O) variance is substantially greater than the person by trial (P x T) variance (this suggests that averaging over occasion will have a greater effect than averaging over trials), and (5) the residual error is relatively small compared with the person variance.

Equation 1: [mathematical expression not reproducible in ASCII]

Equation 2: [mathematical expression not reproducible in ASCII]
The variance components reported in Table 4 can be applied to calculate generalizability coefficients that represent inter-trial and inter-occasion reliability. They also can be used to examine the distinct effect of averaging over trials, occasions, or both. The theoretical inter-trial reliability (generalizability) for a single trial is obtained by substituting the variance components into Equation 1 and by setting [n.sub.t] and [n.sub.o] to 1. The obtained value is .97, and this is analogous to the Shrout and Fleiss type 2,1 inter-trial ICCs of .96 reported in Table 3. The inter-trial reliability for an average of 2 trials can be obtained by setting [n.sub.t] to 2 and [n.sub.o] to 1. This yields an inter-trial reliability of .98, which is analogous to a Shrout and Fleiss type 2,2 ICC.

When the goal is to draw inferences about the change status of a person, as is the case when MDC is applied, the inter-occasion reliability (generalizability) coefficient is of interest. It is calculated by applying Equation 2. The theoretical inter-occasion reliability for a single trial is obtained by substituting the variance components into Equation 2 and by setting [n.sub.t] and [n.sub.o] to 1. This gives an inter-occasion reliability of .74, which is the average of the 2 inter-occasion type 2,1 ICCs (.76 and .72) reported in Table 3. The inter-occasion reliability for a single trial performed on each of 2 occasions is obtained by setting [n.sub.t] to 1 and [n.sub.o] to 2. This yields an inter-occasion reliability of .85. Finally, one can examine the inter-occasion reliability for the average of 2 trials on each of 2 occasions. This is accomplished by setting [n.sub.t] to 2 and [n.sub.o] to 2 in Equation 2. A value of .86 is obtained, and, to my knowledge, there is no equivalent Shrout and Fleiss coding scheme to represent this combination.
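Equation 2 is not reproduced in this copy of the letter, so the sketch below rests on an assumption: it uses a standard absolute-error coefficient for a fully crossed person x trial x occasion random design, in which each error component is divided by the number of trials, occasions, or both over which scores are averaged. This assumed form is not confirmed as the author's equation, although with the Table 4 variance components it returns the three inter-occasion values quoted above. No attempt is made to reconstruct Equation 1. The function name and dictionary keys are illustrative only.

```python
def inter_occasion_coefficient(vc: dict, n_t: int, n_o: int) -> float:
    """Assumed form of Equation 2 (not confirmed by the source).

    Person variance divided by person variance plus all error components,
    each scaled by the number of trials (n_t) and occasions (n_o) averaged.
    """
    error = (
        vc["t"] / n_t
        + vc["o"] / n_o
        + vc["pt"] / n_t
        + vc["po"] / n_o
        + (vc["to"] + vc["e"]) / (n_t * n_o)
    )
    return vc["p"] / (vc["p"] + error)


# Variance components as reported in Table 4.
table4 = {"p": 54.66, "t": 0.0, "o": 10.00, "pt": 0.21,
          "po": 6.88, "to": 0.0, "e": 2.16}

for n_t, n_o in [(1, 1), (1, 2), (2, 2)]:
    value = inter_occasion_coefficient(table4, n_t, n_o)
    print(f"n_t={n_t}, n_o={n_o}: {value:.2f}")
```

Comparing the (1, 2) and (2, 2) rows shows why averaging over occasions helps far more than averaging over trials in these data: the occasion-related components are much larger than the trial-related ones.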
* Minitab Inc, Quality Plaza, 1829 Pine Hall Rd, State College, PA 16801-3008.

Paul W Stratford

PW Stratford, PT, MSc, is Professor, School of Rehabilitation Science, McMaster University, Hamilton, Ontario, Canada.

This letter was posted as a Rapid Response on June 3, 2008, at www.ptjournal.org.

References

(1) Steffen T, Seney M. Test-retest reliability and minimal detectable change on balance and ambulation tests, the 36-Item Short-Form Health Survey, and the Unified Parkinson Disease Rating Scale in people with parkinsonism. Phys Ther. 2008;88:733-746.

(2) Shrout PE, Fleiss JL. Intraclass correlation: uses in assessing rater reliability. Psychol Bull. 1979;86:420-428.

(3) Brennan RL. Elements of Generalizability Theory. Iowa City, Iowa: ACT Publications; 1983.

[DOI: 10.2522/ptj.2008.88.7.888]

Author Response

Using a type 2,1 intraclass correlation coefficient (ICC) rather than a type 3,1 ICC changed the ICCs for 13 of the 24 tests by less than one hundredth of a point. An ICC(2,1) increased the reliability coefficients for the Berg Balance Scale and the Sharpened Romberg Test with eyes open, reducing the minimal detectable change (MDC) scores by 1 point each. ICC(2,1) decreased the remaining ICCs by one hundredth of a point, which increased the MDC scores of 6 tests by 1 point; 2 showed no change, and the Six-Minute Walk Test (6MWT) increased to 86 meters.

Dr Stratford was sent the gait speed data to utilize his suggested ICC(2,2) formula for tests that incorporated averaged scores. This ICC formula is not available in the SPSS software we utilized. The analysis did not change the ICC values or the MDCs for the gait speed tests. Our article states that gait speed is the strongest gait outcome variable in the population with parkinsonism, and Stratford's analysis supports this.

We understand Stratford's suggestion on ICCs that test-retest reliability should always use the ICC(2,k) formula. However, the article by Shrout and Fleiss (1) did not suggest an ICC formula for test-retest reliability, and changing the ICC formula had little effect on our study. Considering the same rater performed the same test each session, the formula for intrarater reliability ICC(3,k) was used. We appreciate Stratford's correction that arose from our report of the 6MWT being the only test to demonstrate a small learning effect.
The incorrect use of the ICC formula can affect test-retest reliability when a systematic error occurs. The Table reports ICC(3,k), ICC(2,k), and minimal detectable change values using a 95% confidence interval (MDC95) for all the tests. Eleven MDC95 values had no change, 6 decreased, and 7 increased utilizing ICC(2,k) rather than ICC(3,k).

Teresa M Steffen and Megan Seney

TM Steffen, PT, PhD, is Professor in Physical Therapy at Concordia University Wisconsin, Mequon, WI.

This letter was posted as a Rapid Response on June 3, 2008, at www.ptjournal.org.

Reference

(1) Shrout PE, Fleiss JL. Intraclass correlation: uses in assessing rater reliability. Psychol Bull. 1979;86:420-428.

[DOI: 10.2522/ptj.2008.88.7.890]

Table 1.
Synthetic Timed "Up & Go" Data (s)

             Occasion 1           Occasion 2
Person       Trial 1   Trial 2    Trial 1   Trial 2
Person 1     26.7      25.2       27.6      25.8
Person 2     4.6       6.9        7.6       7.1
Person 3     8.7       6.1        12.5      15.9
Person 4     18.1      19.1       26.1      28.5
Person 5     11.1      8.0        16.6      14.7
Person 6     20.7      24.0       20.4      22.6
Person 7     16.4      16.8       15.4      18.9
Person 8     4.3       6.4        16.0      14.2
Person 9     13.8      12.6       16.0      17.8
Person 10    25.7      24.8       34.5      34.6
Mean         15.0      15.0       19.3      20.0

Table 2.
Trial and Occasion Means

             Order
             1        2
Trial        17.1     17.4
Occasion     15.0     19.6

Table 3.
Type 2,1 and 3,1 Inter-trial and Inter-occasion Intraclass Correlation Coefficients (ICC)

Inter-trial reliability       Occasion 1   Occasion 2
  Type 2,1 ICC                .96          .96
  Type 3,1 ICC                .96          .96
  Pearson r                   .96          .96

Inter-occasion reliability    Trial 1      Trial 2
  Type 2,1 ICC                .76          .72
  Type 3,1 ICC                .86          .85
  Pearson r                   .86          .85
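
The inter-occasion entries for Trial 1 in Table 3 can be checked with a short sketch based on the Shrout and Fleiss mean-square formulas: ICC(3,1) = (BMS - EMS) / (BMS + (k - 1)EMS) and ICC(2,1) = (BMS - EMS) / (BMS + (k - 1)EMS + k(JMS - EMS)/n), where BMS, JMS, and EMS are the between-person, between-occasion, and residual mean squares. The function name is illustrative, and this is a minimal sketch, not the software procedure used for the letter.

```python
def icc_two_way(scores):
    """Return (ICC(2,1), ICC(3,1)) from a persons x replicates table."""
    n, k = len(scores), len(scores[0])
    grand = sum(map(sum, scores)) / (n * k)
    row_means = [sum(row) / k for row in scores]
    col_means = [sum(row[j] for row in scores) / n for j in range(k)]

    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)

    bms = ss_rows / (n - 1)                       # between persons
    jms = ss_cols / (k - 1)                       # between replicates
    ems = (ss_total - ss_rows - ss_cols) / ((n - 1) * (k - 1))

    icc31 = (bms - ems) / (bms + (k - 1) * ems)
    icc21 = (bms - ems) / (bms + (k - 1) * ems + k * (jms - ems) / n)
    return icc21, icc31


# Trial 1 scores from Table 1: [Occasion 1, Occasion 2] for each person.
trial1 = [[26.7, 27.6], [4.6, 7.6], [8.7, 12.5], [18.1, 26.1], [11.1, 16.6],
          [20.7, 20.4], [16.4, 15.4], [4.3, 16.0], [13.8, 16.0], [25.7, 34.5]]

icc21, icc31 = icc_two_way(trial1)
print(f"ICC(2,1) = {icc21:.2f}, ICC(3,1) = {icc31:.2f}")  # .76 and .86
```

The only difference between the two coefficients is the k(JMS - EMS)/n term in the denominator, which carries the systematic difference between occasions; when that term is zero the two coefficients coincide.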

Table 4.
Analysis of Variance and Variance Components

Source          Sum of Squares   Degrees of Freedom   Mean Square   Variance Component ([rho.sup.2])
Person (P)      2114.88          9                    234.99        54.66
Trials (T)      1.30             1                    1.30          0
Occasion (O)    215.30           1                    215.30        10.00
P x T (pt)      23.17            9                    2.58          0.21
P x O (po)      143.23           9                    15.92         6.88
T x O (to)      1.44             1                    1.44          0
Error (e)       19.40            9                    2.16          2.16
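
The variance components in Table 4 can be reproduced from the Table 1 data with the sketch below. It assumes the usual expected-mean-square solutions for a fully crossed 3-way random effects design with one observation per cell, and it sets negative estimates to zero as described in the letter. The letter's own analysis was run in MINITAB; this is an independent illustration, and the variable names are not from the source.

```python
# Table 1 data: one entry per person, ((O1 T1, O1 T2), (O2 T1, O2 T2)).
data = [
    ((26.7, 25.2), (27.6, 25.8)), ((4.6, 6.9), (7.6, 7.1)),
    ((8.7, 6.1), (12.5, 15.9)), ((18.1, 19.1), (26.1, 28.5)),
    ((11.1, 8.0), (16.6, 14.7)), ((20.7, 24.0), (20.4, 22.6)),
    ((16.4, 16.8), (15.4, 18.9)), ((4.3, 6.4), (16.0, 14.2)),
    ((13.8, 12.6), (16.0, 17.8)), ((25.7, 24.8), (34.5, 34.6)),
]
n_p, n_o, n_t = len(data), 2, 2
P, O, T = range(n_p), range(n_o), range(n_t)


def mean(values):
    values = list(values)
    return sum(values) / len(values)


grand = mean(data[p][o][t] for p in P for o in O for t in T)
m_p = [mean(data[p][o][t] for o in O for t in T) for p in P]
m_o = [mean(data[p][o][t] for p in P for t in T) for o in O]
m_t = [mean(data[p][o][t] for p in P for o in O) for t in T]
m_po = [[mean(data[p][o]) for o in O] for p in P]
m_pt = [[mean(data[p][o][t] for o in O) for t in T] for p in P]
m_to = [[mean(data[p][o][t] for p in P) for o in O] for t in T]

# Sums of squares for main effects and 2-way interactions.
ss = {
    "p": n_o * n_t * sum((m - grand) ** 2 for m in m_p),
    "o": n_p * n_t * sum((m - grand) ** 2 for m in m_o),
    "t": n_p * n_o * sum((m - grand) ** 2 for m in m_t),
    "po": n_t * sum((m_po[p][o] - m_p[p] - m_o[o] + grand) ** 2
                    for p in P for o in O),
    "pt": n_o * sum((m_pt[p][t] - m_p[p] - m_t[t] + grand) ** 2
                    for p in P for t in T),
    "to": n_p * sum((m_to[t][o] - m_t[t] - m_o[o] + grand) ** 2
                    for t in T for o in O),
}
ss_total = sum((data[p][o][t] - grand) ** 2 for p in P for o in O for t in T)
ss["e"] = ss_total - sum(ss.values())  # residual (p x t x o) term

df = {"p": n_p - 1, "o": n_o - 1, "t": n_t - 1}
df.update(po=df["p"] * df["o"], pt=df["p"] * df["t"], to=df["t"] * df["o"])
df["e"] = df["p"] * df["t"] * df["o"]
ms = {k: ss[k] / df[k] for k in ss}


def clip(v):
    """Set negative variance estimates to zero, as in the letter."""
    return max(v, 0.0)


vc = {
    "e": ms["e"],
    "pt": clip((ms["pt"] - ms["e"]) / n_o),
    "po": clip((ms["po"] - ms["e"]) / n_t),
    "to": clip((ms["to"] - ms["e"]) / n_p),
    "p": clip((ms["p"] - ms["pt"] - ms["po"] + ms["e"]) / (n_o * n_t)),
    "t": clip((ms["t"] - ms["pt"] - ms["to"] + ms["e"]) / (n_p * n_o)),
    "o": clip((ms["o"] - ms["po"] - ms["to"] + ms["e"]) / (n_p * n_t)),
}
for name in ("p", "t", "o", "pt", "po", "to", "e"):
    print(f"{name:>2}: SS = {ss[name]:8.2f}  variance = {vc[name]:6.2f}")
```

The trial and trial x occasion estimates come out negative before clipping, which is why Table 4 reports them as 0.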

Table.
Intraclass Correlation Coefficients (ICC) for Test-Retest Reliability and Minimal Detectable Change Scores Utilizing a 95% Confidence Interval (MDC95) for Functional Tests, a Quality-of-Life Measure, and Disease Severity Rating Scale in People With Parkinsonism (a)

Test Performed                                         ICC(3,k)  [MDC.sub.95]  ICC(2,k)  [MDC.sub.95]
Balance tests
  Berg Balance Scale (b) (0-56 points)                 .94       5             .95       4
  Activities-specific Balance Confidence Scale (b) (%) .94       13            .94       13
  Functional Reach Test (c) (cm)
    Forward                                            .73       9             .72       9
    Backward                                           .67       7             .67       7
  Romberg Test (b) (s)
    Eyes open                                          .86       10            .86       10
    Eyes closed                                        .84       19            .85       19
  Sharpened Romberg Test (b) (s)
    Eyes open                                          .70       39            .71       38
    Eyes closed                                        .91       19            .90       19
Mobility tests
  Six-Minute Walk Test (b) (m)                         .96       82            .95       86
  Timed "Up & Go" Test (c) (s)                         .85       11            .85       11
  Gait speed (c) (m/s)
    Comfortable                                        .96       .18           .96       .18
    Fast                                               .97       .25           .97       .25
SF-36 (b) (0-100 points)
  Physical Functioning                                 .80       28            .80       29
  Role-Physical                                        .85       45            .85       44
  Bodily Pain                                          .89       25            .89       24
  General Health                                       .85       28            .84       29
  Vitality                                             .88       19            .87       20
  Social Functioning                                   .71       29            .70       30
  Role-Emotional                                       .84       45            .83       46
  Mental Health                                        .83       19            .83       18
UPDRS (b) (points)
  Mentation, Behavior, and Mood (0-16)                 .89       2             .89       2
  Activities of Daily Living (0-52)                    .93       4             .93       4
  Motor Examination (0-108)                            .89       11            .89       10
  Total Score (0-176)                                  .91       13            .90       14

(a) SF-36=36-Item Short Form Health Survey, UPDRS=Unified Parkinson Disease Rating Scale.
(b) ICC: 3,1 and 2,1.
(c) ICC: 3,2 and 2,2.
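
The summary in the Author Response (11 MDC95 values unchanged, 6 decreased, 7 increased) can be checked against the Table with a simple tally. The sketch below transcribes the MDC95 columns by hand, so the row labels are abbreviated and any transcription slip would be an error in the sketch, not in the Table.

```python
# (test, MDC95 using ICC(3,k), MDC95 using ICC(2,k)), transcribed from the Table.
mdc95_rows = [
    ("Berg Balance Scale", 5, 4),
    ("ABC Scale", 13, 13),
    ("Functional Reach, forward", 9, 9),
    ("Functional Reach, backward", 7, 7),
    ("Romberg, eyes open", 10, 10),
    ("Romberg, eyes closed", 19, 19),
    ("Sharpened Romberg, eyes open", 39, 38),
    ("Sharpened Romberg, eyes closed", 19, 19),
    ("Six-Minute Walk Test", 82, 86),
    ("Timed Up & Go", 11, 11),
    ("Gait speed, comfortable", 0.18, 0.18),
    ("Gait speed, fast", 0.25, 0.25),
    ("SF-36 Physical Functioning", 28, 29),
    ("SF-36 Role-Physical", 45, 44),
    ("SF-36 Bodily Pain", 25, 24),
    ("SF-36 General Health", 28, 29),
    ("SF-36 Vitality", 19, 20),
    ("SF-36 Social Functioning", 29, 30),
    ("SF-36 Role-Emotional", 45, 46),
    ("SF-36 Mental Health", 19, 18),
    ("UPDRS Mentation, Behavior, and Mood", 2, 2),
    ("UPDRS Activities of Daily Living", 4, 4),
    ("UPDRS Motor Examination", 11, 10),
    ("UPDRS Total Score", 13, 14),
]

tally = {"no change": 0, "decreased": 0, "increased": 0}
for _, with_icc3, with_icc2 in mdc95_rows:
    if with_icc2 == with_icc3:
        tally["no change"] += 1
    elif with_icc2 < with_icc3:
        tally["decreased"] += 1
    else:
        tally["increased"] += 1

print(tally)  # {'no change': 11, 'decreased': 6, 'increased': 7}
```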