The application of generalizability theory to reliability assessment: an illustration using isometric force measurements.


Key Words: Analysis of variance, Dynamometry, Generalizability, Reliability, Standard error of measurement

Use of Measuring Devices

In the last decade in the fields of physical therapy and rehabilitation medicine, the use of measuring devices for the assessment of functions such as joint mobility and muscle performance has increased. Various types of measuring devices have been commercially introduced. This has stimulated a growing interest in the subject of reliability of these types of measurements and the statistical techniques involved.[1-4] Studies on this subject show a number of interpretations of the concept of reliability and of the statistics used (standard deviation,[5] coefficient of variation,[6-9] Pearson Product-Moment Correlation Coefficient,[1,2,10] intraclass correlation coefficient [ICC],[3,11-13] generalizability coefficient,[14,15] and standard error of measurement [SEM][4,16]). When a physical therapist wants to use a measurement, selecting information on reliability and reliability indexes that is appropriate for his or her clinical context is a problem. This article describes a practical method for detecting potential sources of measurement error and for estimating reliability in relation to the clinical use of a measuring device.

Reliability in Clinical Practice

A prerequisite for a measuring device is that measurements obtained with it are reliable. Reliability refers to the relative absence of measurement errors and was defined for the purpose of this study as consistency of measurement results.[16-18] In this article, this concept of reliability is interpreted for applications of measuring devices in clinical practice. An important feature of clinical measurements is that they are performed and interpreted on individual patients, not on groups. For example, for measurements of muscle force production on patients, the following clinical issues arise:

1. During the course of therapy, a physical therapist evaluates a patient's progress by force measurements. Actual and previous measurements by the same therapist on the same patient are compared. What should the magnitude of change be to conclude that the patient's muscle force production has improved?

2. During the course of therapy, a patient is referred from one physical therapist to another, for example, when a patient is discharged from a hospital and follow-up is provided in a private practice. Measurements by the second therapist are compared with those obtained by the first therapist to assess the patient's progress. What should the magnitude of change in measured muscle force be to conclude that the patient's muscle force production has improved?

Three important points emerge from these examples. First, the physical therapist focuses on the change in an individual patient during treatment and is not necessarily interested in differentiating among patients with respect to muscle force production. The therapist, therefore, is concerned about the reliability of individual measurement results. The index of reliability used must allow interpretation on the individual level.[4,16,17,19] Second, measurement error can arise from multiple sources,[20-22] such as variation among measurement sessions or among therapists. By assessing the relative magnitudes of various components of error, important sources of measurement error can be detected and actions can be taken to control them. Third, reliability has to be assessed for each specific application of a measurement. From this point of view, reliability is not an absolute quality of a measurement, but is dependent on the way a measurement will be interpreted.[13,16,23]

Index of Reliability

In the majority of studies on measurements in physical therapy, reliability is assessed by a coefficient, such as a Pearson Product-Moment Correlation Coefficient or an ICC.[1-3,10-12] It is accepted in the literature that a Pearson correlation coefficient is not appropriate for assessing reliability. An ICC is more appropriate, because in this index systematic variability is also treated as error.[3,13-16,24] The intraclass correlation coefficient is defined as a ratio of variance of interest over total variance (composed of variance of interest and error variance).[12,13,25,26] It is important to note that in studies on reliability of measurements in physical therapy using the ICC, variance among patients is often considered as the variance of interest.[3,11,12] For assessing reliability with respect to changes in individual patients, however, the magnitude of the between-patient variance is not

Instead of focusing on the ratio of variances (ICC), for measurements on individuals it is more informative to calculate one component of the ratio: the error variance. In this way, each measurement result can be considered with an error term.[4,16,17,19] The amount of measurement error can also be expressed as the SEM, which is derived by taking the square root of the error variance.[20,28,29] Similar to the ICC, the error variance and SEM include both random and systematic components of measurement error.[30] An advantage of the SEM is that it is expressed in the metric unit of the measurements.

Assuming that measurement errors are distributed normally, confidence intervals can be calculated based on the SEM.[20,28,31] A confidence interval expresses the expected distribution of error around a measurement

Furthermore, confidence intervals for the difference between measurement results are calculated. This type of confidence interval indicates the smallest detectable difference between measurement results that represents a real (non-error) change in performance.[28,32,33] This index permits therapists to interpret differences among measurement results meaningfully.[16,21,32]
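The relations among error variance, SEM, and the two types of confidence interval can be sketched in a few lines of Python. This is a minimal illustration, not the computation used in this study: the function names are hypothetical, the multiplier 1.96 assumes a 95% interval under normally distributed errors, and the factor √2 for a difference assumes that the two measurement results carry independent errors of equal variance. The numerical values are invented for the example.

```python
import math


def sem_from_error_variance(error_variance):
    """SEM is the square root of the error variance."""
    return math.sqrt(error_variance)


def confidence_interval(score, sem, z=1.96):
    """Interval around a single measurement result (assumes normal errors)."""
    return (score - z * sem, score + z * sem)


def smallest_detectable_difference(sem, z=1.96):
    """Smallest difference between two results that exceeds measurement error.

    Assumption of this sketch: both results have independent errors with the
    same SEM, so the error of their difference is sqrt(2) * SEM.
    """
    return z * math.sqrt(2.0) * sem


# Hypothetical values: error variance of 25 (N·m)^2, measured moment of 100 N·m.
sem = sem_from_error_variance(25.0)
interval = confidence_interval(100.0, sem)
sdd = smallest_detectable_difference(sem)
print(sem, interval, sdd)
```

Under these assumptions, a change between two measurement results smaller than the value returned by `smallest_detectable_difference` would not be distinguishable from measurement error at the chosen confidence level.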

Method for Assessing Reliability

To calculate the error variance and the SEM, different methods are available. In classical test theory, the measurement error is estimated as a single component of the measurement. Whether this component refers to intratester or intertester reliability depends on the design of the reliability study. Other aspects of the measurement procedure that influence the measurement error, however, are not assessed in classical test theory. This theory does not allow partition of the measurement error into different sources of error. In addition, the measurement error cannot be generalized to studies with slightly different measurement designs. A broader and more flexible notion of reliability underlies another approach known as generalizability theory. Following this approach, which is based on analysis of variance (ANOVA), multiple sources of measurement error are recognized and estimated.[22] Using the generalizability approach, reliability can be assessed and tailored to the proposed applications of a measuring device.[20-23,34,35]

Because generalizability theory is unfamiliar to many practitioners in physical therapy, some features of this approach are discussed. The method is illustrated by presenting an example on isometric knee extension force measurements with a hand-held dynamometer.

Features of Generalizability Theory

Generalizability theory distinguishes between generalizability (G) and decision (D) studies. A G study refers to the research phase, in which a measuring device or procedure is tested. In this phase, potential sources of measurement error are identified as factors and their interactions. The levels of the factors are called conditions. For example, if two therapists obtain force measurements, therapist (t) is a factor with two conditions: therapist A and therapist B. The magnitude of variance components corresponding to the sources of measurement error is estimated by an ANOVA. A D study applies to the subsequent practical use of measurements. In this type of study, measurements are taken for the purpose of making a decision, for example, when a physical therapist has to determine whether a subject's muscle force has improved. For a particular D study, the measurement error is assessed from variance components estimated in a prior G study. The G-study design should therefore be compatible with the intended D-study design and interpretation.[20-23]

G Study

Variance Components. By means of generalizability theory, multiple sources of error are estimated, such as occasion (o), therapist (t), and repetition (r). For that purpose, generalizability theory uses the factorial ANOVA model. In a G study, measurements are taken under a specific set of measurement conditions. Interest is focused on the generalization of results of a particular G study to measurement results taken under different conditions (other occasions, other therapists, other repetitions). Therefore, for each factor and interaction, associated variance components (denoted as σ²) are estimated, using a random-effects ANOVA. For corresponding statistical terms and formulas, one is referred to the literature.[20,21,36-38]

An example of an estimated variance component is σ²(t), the error attributable to the therapist performing a measurement. This variance component can be interpreted as the amount of variation among therapists in their measurement results. Comparably, the interaction component σ²(st) implies the amount of variation between combinations of subjects (s) and therapists. This component reflects the fact that not all subjects perform the best for a specific therapist.

Apart from the error variance, which is related to the factors defined in the design, some residual error (e) will be present. This error term reflects, in part, nonsystematic or random error sources that are mostly unknown.[36] Residual error may arise, for example, from inaccuracy of the measuring device according to the technical specifications or from some disturbance of the measurements, as might occur when somebody enters an examination room in the middle of a test. Residual error also includes systematic influences from factors not explicitly included or controlled for in the G study.[36] For example, motivation may differ from one subject to the other. The residual error term cannot be calculated separately, but is completely confounded within the last interaction component σ²(sotr).[20-22]

In a G study, estimated variance components are themselves subject to sampling variability. This variability is estimated as standard errors for estimated variance components. The magnitude of these standard errors depends on the standard deviation and the number of measurement conditions used for a factor. In general, the smaller the number of measurement conditions, the larger the standard error for estimated variance components.[19-21]

Another effect of sampling error is that negative estimates of variance components may be derived by the equation method used. Because negative variances are conceptually impossible, several authors[20,21,36] discuss alternative equation methods to solve this problem. One approach, according to Cronbach et al,[20] is to set negative estimates to zero.
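To make the estimation of variance components concrete, the following Python sketch treats the simplest crossed case: subjects (s) by therapists (t) with one score per cell, so that the subject-therapist interaction is confounded with residual error. It is a reduced illustration under stated assumptions, not the four-factor analysis of this article. The function name and the data are hypothetical; the estimators are the usual expected-mean-square solutions for a two-way random-effects design, and negative estimates are set to zero as described above.

```python
def variance_components_s_by_t(scores):
    """Estimate variance components for a crossed s x t random-effects design.

    scores: list of rows, one row per subject, one column per therapist,
            a single observation per cell.
    Returns estimates for 's', 't' and 'st,e' (interaction confounded with
    residual error). Negative estimates are set to zero.
    """
    n_s = len(scores)
    n_t = len(scores[0])
    grand = sum(sum(row) for row in scores) / (n_s * n_t)
    row_means = [sum(row) / n_t for row in scores]
    col_means = [sum(row[j] for row in scores) / n_s for j in range(n_t)]

    # Mean squares for subjects, therapists, and the residual (st,e) term.
    ms_s = n_t * sum((m - grand) ** 2 for m in row_means) / (n_s - 1)
    ms_t = n_s * sum((m - grand) ** 2 for m in col_means) / (n_t - 1)
    ss_res = sum(
        (scores[i][j] - row_means[i] - col_means[j] + grand) ** 2
        for i in range(n_s)
        for j in range(n_t)
    )
    ms_st = ss_res / ((n_s - 1) * (n_t - 1))

    # Solve the expected-mean-square equations for the components.
    raw = {
        "s": (ms_s - ms_st) / n_t,
        "t": (ms_t - ms_st) / n_s,
        "st,e": ms_st,
    }
    # Negative estimates are conceptually impossible and are set to zero.
    return {name: max(value, 0.0) for name, value in raw.items()}


# Hypothetical scores: 3 subjects (rows) measured by 2 therapists (columns).
example = variance_components_s_by_t([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
print(example)
```

In a design with more factors, the same logic applies to every main effect and interaction, which is why dedicated software is normally used for the computation.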

D Study

Assessing measurement error. Before conducting a decision study (D study), the magnitude of measurement error that a decision-maker (eg, a physical therapist) must take into account can be assessed from the variance components estimated in a G study. Therefore, those G-study components are selected that contribute to the total error variance in a planned D study. Whether specific components are regarded as contributing to the error variance depends on the way in which D-study measurement results will be interpreted. In this respect, two issues are important:

1. Is the type of decision made an absolute or a comparative one? An absolute decision occurs when the measurement results of a subject, independent of the performance of other subjects, are being considered.[23,39] A comparative decision, in contrast, focuses on the relative ordering of a number of individuals.[14,22] In applications of muscle force measurements at the individual level, the type of decision is absolute. Measurements interpreted in this way are called domain-referenced measurements.[20,21] In absolute decisions, error is caused by the main effects of measurement factors as well as by their interaction effects. Therefore, all variance components, excluding the variance of interest, contribute to the error variance denoted by σ²(δ).[20,21]

2. Is the decision-maker interested in generalizing to other measurement conditions or only to those that appear in the D study? With regard to this consideration, random and fixed measurement factors are distinguished.[14,21,23,38] Both types of factors apply to the example of isometric force measurements given in this article.

A factor is random if the measurement conditions in a D study are envisaged as a random sample from all possible conditions of interest to the therapist.[37,38] The therapist intends to generalize to measurement conditions other than those included in the D study. The factor "therapist," for example, is random if a therapist wants to compare his or her measurements on a subject with measurements that have been obtained for the same subject by another therapist. In this comparison, the variance components attributable to the therapist (σ²[t]) and his or her interaction with the subject (σ²[st]) contribute to the error variance.

A factor is fixed if only the conditions of the factor in the D study are of interest for a decision. The therapist is not interested in generalizing beyond the conditions that appear in the D study.[23,38] The associated variance component and its interaction with the subject do not contribute to the total error variance.[21] For example, if a subject has been measured repeatedly by the same therapist, a comparison of these measurement results is not affected by variance among therapists (σ²[t]), nor by subject-therapist interaction (σ²[st]).
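The selection of components for an absolute decision can be sketched as a simple rule in Python. The function name and the single-letter labels are hypothetical conveniences. The sketch drops the subject component (the variance of interest) and any component made up only of the subject and fixed factors, which reproduces the statement above that σ²(t) and σ²(st) do not contribute when the therapist is fixed. Retaining higher-order interactions that combine the fixed factor with a random one (for example ot or sot) is an assumption of this sketch; the text above does not address them explicitly.

```python
ALL_COMPONENTS = [
    "s", "o", "t", "r",
    "so", "st", "sr", "ot", "or", "tr",
    "sot", "sor", "str", "otr",
    "sotr",
]


def error_component_names(component_names, fixed_facets=(), object_of_measurement="s"):
    """Return the components that contribute to the absolute error variance.

    A component is excluded when, apart from the object of measurement, it
    contains no factor at all (pure subject variance) or only fixed factors.
    """
    fixed = set(fixed_facets)
    selected = []
    for name in component_names:
        facets = set(name) - {object_of_measurement}
        if not facets:
            continue  # variance of interest, not error
        if facets <= fixed:
            continue  # only fixed factors involved: no generalization needed
        selected.append(name)
    return selected


# Therapist random: comparison across therapists.
all_random = error_component_names(ALL_COMPONENTS)
# Therapist fixed: the same therapist performs every measurement.
therapist_fixed = error_component_names(ALL_COMPONENTS, fixed_facets=("t",))
print(all_random)
print(therapist_fixed)
```

With all factors random, every component except the subject component is counted as error; with the therapist fixed, the therapist component and the subject-therapist interaction are removed from that list.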

Reducing measurement error. In a clinical application of a measurement (D study), minimization of measurement error is desirable. Steps can be taken to decrease the magnitude of variance components by improving the standardization of measurement protocols or the instruction to therapists or subjects. Another method is to reduce the contribution of a variance component to the total error variance by increasing the number of measurement conditions (n) and then using the mean score over n conditions as the subject's measurement result. Compared with the error variance of a single score, the error variance of a mean score is reduced by a factor of n.[36] In a multiple-factor design, this method can be applied to one or several measurement factors. The corresponding variance components and interaction components are divided by the number of conditions over which the mean is calculated.[20,21,25] For example, if a subject's measurement result is the mean over three repetitions of a strength measurement, the variance component related to repetition (σ²[r]) and all interaction components that include the factor repetition (σ²[sr], σ²[or], σ²[tr], σ²[sor], and so on) are divided by 3.
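The effect of averaging over several conditions can be illustrated with a short Python sketch. The function name and the variance values are hypothetical and are not results of this study. Each error component is divided by the product of the numbers of conditions of the factors it contains, which is the usual convention when a mean is taken over more than one factor; all measurement factors are treated as random here.

```python
from math import prod


def absolute_error_variance(components, n_conditions, object_of_measurement="s"):
    """Sum the error components of a mean score.

    components:   mapping from component label (e.g. 'sr') to its variance
    n_conditions: mapping from factor letter to the number of conditions
                  averaged over in the D study (missing factors count as 1)
    """
    total = 0.0
    for name, variance in components.items():
        facets = [f for f in name if f != object_of_measurement]
        if not facets:
            continue  # pure subject variance is not error
        divisor = prod(n_conditions.get(f, 1) for f in facets)
        total += variance / divisor
    return total


# Hypothetical variance components for a subject-by-repetition design.
hypothetical = {"s": 200.0, "r": 3.0, "sr": 12.0}
single_score = absolute_error_variance(hypothetical, {"r": 1})
mean_of_three = absolute_error_variance(hypothetical, {"r": 3})
print(single_score, mean_of_three)
```

In this invented example, taking the mean of three repetitions reduces the error variance from 15 to 5, because both error components contain the factor repetition.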

Illustration of G and D Studies

In this section, a pilot study on measuring isometric knee extension force with a hand-held dynamometer is presented to illustrate an application of generalizability theory.

Illustration of G Study

Design. Maximal isometric knee extension force at a knee angle of 25 degrees of flexion was tested on a sample of healthy women using a hand-held dynamometer. At separate testing sessions 1 hour apart, two physical therapists performed the measurements. On a second occasion 1 week later, a time period in which no change in muscular force was expected, there were two more test sessions. On the second occasion, the order of testing by the therapists was reversed. After careful explanation to the subjects and one or two test contractions, three repetitions were measured during each session. This study was conducted as a G study. The design is a completely random-effects design, in which the object of measurement is the subject (n_s = 10) and the measurement factors are occasion (n_o = 2), therapist (n_t = 2), and repetition (n_r = 3). Each subject is measured under all measurement conditions, denoted as a crossed four-way s × o × t × r design.

Subjects. Ten healthy female subjects with no history of knee disorders participated in the study. All subjects gave their informed consent to participate in the study. The mean age of the subjects was 29.5 years (SD = 7.1, range = 23-47). Their mean body weight was 66.0 kg (SD = 5.6, range = 59-76), and their mean body length was 1.73 m (SD=0.06, range = 1.64-1.84).

Measuring device. A functional prototype of a hand-held dynamometer was used. This device was composed of a force transducer, an electrogoniometer, and a computer. The force transducer was a commercially available device,(*) modified with an analogue electrical output of the force signal (accuracy: 0.5 kgf). The goniometer was a potentiogoniometer developed in our laboratory (accuracy: 1°). Both devices were interfaced to an A/D converter(†) with a sampling rate of 100 Hz. Data were processed on an MS-DOS Personal Computer,(‡) using an application of the Kinesiologic Measure and Analysis System (KIMAS). KIMAS is a computer program written in ASYST(§) and developed at the Kinesiologic Laboratory, Department of Rehabilitation, Free University Hospital (Amsterdam, the Netherlands).

Standardized posture and test protocol. For each subject, body weight and body length were determined. Marks were placed on the skin to correspond with the hip joint (greater trochanter), the knee joint (joint line on the lateral side), and the ankle joint (distal end of the lateral malleolus). These marks were used to calibrate the electrogoniometer at 25 degrees of knee flexion, using a large standard goniometer. Leg length was measured as the distance between the marks on the knee and ankle joints. Total leg length was measured from the mark on the knee joint to the sole of the foot.

Maximal isometric knee extension tests were performed at a knee angle of 25 degrees of flexion. The right side of all subjects was measured. The subjects assumed a sitting position, with back and thighs supported and stabilized. The hip angle was 80 degrees of flexion. The starting knee angle (25° of flexion) was adjusted by numerical feedback from the PC monitor. The force device was positioned, perpendicular to the leg, on a mark at 80% of the leg length distally from the knee.

The tests were performed as "make" tests, in which the dynamometer was held stationary by the therapist while the subject exerted a maximal force against it.[1,2,40] The test protocol was standardized according to the procedure proposed by Caldwell et al.[41] After a build-up phase of 2 seconds, the subject was required to maintain a steady maximal exertion for 3 seconds.[42] During testing, this process was controlled by the therapist by counting, with the aid of a metronome (69 beats per minute). From the maintained maximal force level, the mean maximal force (F) was determined over 1 second with least force variation. Figure 1 shows schematically the required force exertion over time.
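One way to determine a mean force over the steadiest 1-second interval is sketched below in Python. This is not the KIMAS implementation, which is not described here. The function name is hypothetical, "least force variation" is interpreted as the smallest standard deviation within a sliding window, and the default of 100 samples per second follows the sampling rate reported for the A/D converter. The function searches whatever samples it is given, so in practice only the maintained phase of the contraction would be passed in.

```python
from statistics import mean, pstdev


def steadiest_window_mean(force_samples, sampling_rate_hz=100, window_seconds=1.0):
    """Mean force over the window with the smallest standard deviation.

    force_samples:    sequence of force values sampled at a constant rate
    sampling_rate_hz: samples per second (100 Hz assumed by default)
    window_seconds:   length of the window over which the mean is taken
    """
    window = int(round(sampling_rate_hz * window_seconds))
    if window < 1 or len(force_samples) < window:
        raise ValueError("not enough samples for one full window")

    # Slide the window one sample at a time and keep the steadiest position.
    best_start = min(
        range(len(force_samples) - window + 1),
        key=lambda start: pstdev(force_samples[start:start + window]),
    )
    return mean(force_samples[best_start:best_start + window])


# Synthetic signal: rising phase, a perfectly steady second, then a noisy second.
rising = [float(i) for i in range(100)]
steady = [250.0] * 100
noisy = [240.0 if i % 2 == 0 else 260.0 for i in range(100)]
result = steadiest_window_mean(rising + steady + noisy)
print(result)
```

The brute-force search recomputes the standard deviation for every window position, which is adequate for a few seconds of data at 100 Hz.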

A correctly performed "make" test requires sufficient stabilization force by the therapist. Measurements of the relatively powerful knee extension muscles of healthy women may be limited by this requirement.[3,9,43,44] To accommodate for this problem, an extra weight of 3.94 kg was attached under the subject's foot.

Each subject performed three maximum contractions, which were separated by a 30-second rest interval. At the end of the session, the subject left the testing position, and all measuring devices and marks were removed.

Calculation of net muscular moment about the knee. For each measurement repetition, the net muscular knee moment (M_K) was calculated from the mean maximal force measured (F) and the distance between the force device and the knee joint (a). This net moment was corrected for gravity on the leg and foot (W_L&F), as well as for gravity on the extra weight (W_EW), using data from Dempster[45] and previously published methods.[46-48] A schematic representation of the test situation with external forces, lever arms, and net muscular knee moment, as well as the equations used to calculate the net muscular moment about the knee, is given in Figure 2.

Data processing and analysis. Data analysis was conducted on the corrected net muscular moments of 10 subjects, with 12 measurements for each subject. After checking for ANOVA assumptions,[37,38] an ANOVA for a random-effects design was carried out. Unbiased estimates of the variance components were obtained from the mean squares.[20,21,23] Analysis was performed using a PC version of the GENOVA program,(∥) which was especially developed for generalizability analysis by Crick and Brennan.[49] Variance components were calculated for subject (s), occasion (o), therapist (t), and repetition (r); the two-way interaction for subject and occasion (so), subject and therapist (st), subject and repetition (sr), occasion and therapist (ot), occasion and repetition (or), and therapist and repetition (tr); the three-way interaction for subject, occasion, and therapist (sot), for subject, occasion, and repetition (sor), for subject, therapist, and repetition (str), and for occasion, therapist, and repetition (otr); and the four-way interaction for subject, occasion, therapist, and repetition, confounded with the residual random error e (sotr,e).
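The set of effects in a fully crossed design follows mechanically from the list of factors, as the Python sketch below shows. The function name is hypothetical; the factor letters and the numbers of conditions are those of the design described above.

```python
from itertools import combinations


def enumerate_effects(facets=("s", "o", "t", "r")):
    """List every main effect and interaction of a fully crossed design."""
    effects = []
    for order in range(1, len(facets) + 1):
        for combo in combinations(facets, order):
            effects.append("".join(combo))
    return effects


effects = enumerate_effects()
print(effects)
print(len(effects))

# Measurements per subject: occasions x therapists x repetitions.
n_o, n_t, n_r = 2, 2, 3
measurements_per_subject = n_o * n_t * n_r
print(measurements_per_subject)
```

For four factors this gives 15 effects, the last of which (sotr) is confounded with the residual error, and the product of the numbers of conditions gives the 12 measurements per subject.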

According to the approach of Cronbach et al[20] and Brennan,[21] negative estimates of variance components, which were relatively small, were set to zero. For each estimate of a variance component, the standard error was calculated, using the number of measurement conditions in this G study (n_o, n_t, n_r).[20,21]

From the magnitude of the estimated variance components, important sources of measurement error were determined. In this example, the factor of subject (s) was not a measurement condition but the object of measurement. Therefore, this factor was not a source of measurement error.

Results. In Table 1, mean values and standard deviations of the moments about the knee over three repetitions are listed for each subject and occasion-therapist combination. The magnitude of the moments measured in this G study varied from 57.2 to 126.6 N·m. In Table 2, the results of the ANOVA are given, as well as the estimates of variance components for all factors and interactions. Table 2 indicates that the variance components for therapist (t) and repetition (r) were negligible. This implies that the mean muscular moment was not different from one therapist to another or from one repetition to another. The rather large subject-therapist component shows that some subjects produced the largest moments when they were measured by therapist 1, whereas others performed better with therapist 2.

[TABULAR DATA 1 & 2 OMITTED]

The main effect for occasions (o) was very small. A summation of the two-way subject-occasion interaction effect (so) and the three-way interaction effects including the factor of occasion (sot, sor, otr) yielded a variance of 68.06. This relatively large summated variance implies that some subjects achieved larger moments during the first occasion, whereas others achieved larger moments the second time. Furthermore, the therapist and repetition on which these larger moments were achieved changed from one occasion to another. The four-way interaction term was confounded with residual error and therefore is not interpretable. The value of this component was relatively large (35.14).

In conclusion, neither a general learning effect over occasions nor a marked repetition or therapist effect was found. Therefore, in this study, the main effects of occasion, therapist, and repetition were unimportant sources of measurement error. The measurement variability, however, was scattered over some two- and three-way interaction components and the residual component, indicating that these interaction effects were important sources of measurement error. Standard errors of the estimates of variance components are given in the last column of Table 2.

Standard errors associated [...] studies, some hypothetical applications (D studies) of the dynamometer were designed. The subjects in this study were healthy individuals and therefore the results cannot be generalized to any applications on patients. The absolute error variance was calculated for two types of hypothetical D studies, based on the estimated variance components from the G study. The first type refers to applications in which a subject will be measured by one physical therapist only (example A). The second type refers to applications in which the physical therapist will change over measurement occasions (example B). In example A, one is interested in comparing a therapist's measurements of a subject over occasions and repetitions. Measurement results will not be compared with those of any other therapist. In this example, mixed designs are appropriate, with the factor of therapist fixed and the factors of occasion and repetition random. In example B, measurement results of different therapists will be compared. A therapist is seen as a random sample among all physical therapists who could apply strength measurements. The factor of therapist, therefore, is random. Combined with the random factors of occasion and repetition, this leads to random-effects designs with three random measurement factors.

To study how much the measurement error is reduced by increasing the number of measurement conditions and using mean scores as a subject's measurement result, the measurement error was assessed for several D-study designs in both examples. In the first design, a single score is used as the measurement result. Four other D-study designs were created by varying the number of measurement conditions: 1, 3, or 5 repetitions; 1 or 2 occasions; and 1 or 2 therapists. In these designs, measurement results were mean scores.

Calculation of reliability indexes. For the mixed designs, the error variance was computed as the sum of all variance components that included at least one random factor (occasion, repetition). For the random-effects designs, the error variance is the sum of all variance components, excluding the variance between subjects.[35] For each design, the SEM, denoted as σ(Δ), was calculated as the square root of the absolute error variance. The corresponding 95% confidence interval for a measurement result was calculated as ±1.96 × σ(Δ). With respect to the difference between two independently obtained measurement results of a subject, the 95% interval is ±1.96 × √2 × σ(Δ). That is, a change is statistically significant at the .05 level when the absolute difference between two measurement results is larger than 1.96 × √2 × σ(Δ),[28,32] indicated as the smallest detectable difference.
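The rule for the error variance can be sketched in a few lines of Python. This is only an illustration of the procedure described above, not the authors' GENOVA computation. The variance components below are hypothetical placeholders (Table 2 is not reproduced here), and the function names, effect labels, and the convention of dividing each component by the number of conditions of the facets it contains are assumptions made for the sketch.

```python
from math import prod, sqrt

# Hypothetical variance components, keyed by effect label.
# "s" = subject (object of measurement), "o" = occasion,
# "t" = therapist, "r" = repetition; "sotr" is the residual.
# These numbers are invented for illustration only.
COMPONENTS = {
    "s": 100.0, "o": 0.0, "t": 0.0, "r": 0.0,
    "so": 8.0, "st": 12.0, "sr": 0.0, "ot": 0.0, "or": 0.0, "tr": 0.0,
    "sot": 6.0, "sor": 3.0, "str": 0.0, "otr": 0.0,
    "sotr": 18.0,
}


def absolute_error_variance(components, n, fixed=frozenset(), obj="s"):
    """Sum the components that count as absolute error in a D study.

    components: effect label -> estimated variance component
    n:          facet letter -> number of conditions averaged over
    fixed:      facets treated as fixed (empty for a random-effects design)
    """
    total = 0.0
    for effect, variance in components.items():
        facets = set(effect) - {obj}
        random_facets = facets - set(fixed)
        if not random_facets:
            # The subject component itself, or an effect that contains
            # no random facet: not counted as measurement error.
            continue
        # Taking a mean over n conditions of a facet divides every
        # component containing that facet by n.
        total += variance / prod(n[f] for f in facets)
    return total


def sem(components, n, fixed=frozenset()):
    """Standard error of measurement: square root of the error variance."""
    return sqrt(absolute_error_variance(components, n, fixed))


single = {"o": 1, "t": 1, "r": 1}        # single score
three_reps = {"o": 1, "t": 1, "r": 3}    # mean over three repetitions

random_single = absolute_error_variance(COMPONENTS, single)
mixed_single = absolute_error_variance(COMPONENTS, single, fixed={"t"})
random_three = absolute_error_variance(COMPONENTS, three_reps)
mixed_three = absolute_error_variance(COMPONENTS, three_reps, fixed={"t"})
```

With the therapist facet fixed, the subject-therapist component drops out of the error term, which is why the mixed-design error variance is never larger than the random-effects one for the same numbers of conditions.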

Results. The SEMs that must be considered in the hypothetical D studies are given in Table 3. Values are given for measurements by one fixed therapist (mixed designs) and for applications with a change of therapists between measurements (random-effects designs). The SEM that must be taken into account varies between different measurement designs. For measurement results based on single scores by one therapist at one occasion and one repetition (design 1), the SEM for mixed designs is 10.5 N·m and the SEM for random-effects designs is 11.8 N·m. By taking mean scores over 3 or 5 repetitions by one therapist at one occasion as the measurement result (designs 2 and 3), the SEM is decreased to 8.8 and 8.4 N·m for mixed designs and to 10.3 and 10.0 N·m for random-effects designs. For more complex designs, the SEM is smaller. For example, for mean scores over two therapists and three repetitions (design 4), the SEM is 7.6 N·m for mixed designs and 8.5 N·m for random-effects designs. For those clinical settings in which only one physical therapist is available, but a patient can be measured during two occasions between which no treatment effect is expected, the SEM for the mean score over three repetitions (design 5) is 6.2 N·m for mixed designs and 8.2 N·m for random-effects designs. To facilitate the interpretation of measurement results in clinical practice, the 95% confidence interval of a measurement result and the smallest detectable difference for each design are also given in Table 3. The smallest detectable differences show that in hypothetical applications of the measurement on healthy female subjects, only changes larger than 17.2 to 29.0 N·m (mixed designs) and 22.8 to 32.6 N·m (random-effects designs) can be interpreted as real changes in muscle force production.

[TABULAR DATA 3 OMITTED]
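As a rough arithmetic check, the 95% interval and the smallest detectable difference can be recomputed from the SEMs quoted in the text for designs 1 and 5. The sketch below assumes only the two formulas given in the method (1.96 × SEM and 1.96 × √2 × SEM); the function and variable names are illustrative and not taken from the article.

```python
from math import sqrt

Z_95 = 1.96  # two-sided 95% normal quantile used in the article


def ci_halfwidth(sem):
    """Half-width of the 95% confidence interval for one measurement result."""
    return Z_95 * sem


def smallest_detectable_difference(sem):
    """Smallest difference between two independent results that is
    significant at the .05 level."""
    return Z_95 * sqrt(2) * sem


# SEMs in N·m as quoted in the text for designs 1 and 5.
REPORTED_SEM = {
    ("design 1", "mixed"): 10.5,
    ("design 1", "random"): 11.8,
    ("design 5", "mixed"): 6.2,
    ("design 5", "random"): 8.2,
}

for (design, model), value in REPORTED_SEM.items():
    print(
        f"{design} ({model}): SEM {value:.1f}, "
        f"95% CI +/-{ci_halfwidth(value):.1f}, "
        f"SDD {smallest_detectable_difference(value):.1f} N·m"
    )
```

The recomputed smallest detectable differences fall within about 0.1 N·m of the ranges quoted above; the small discrepancies are presumably due to the SEMs having been rounded to one decimal before this recalculation.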

Discussion

Evaluation of the Method Presented

In this article, the concept of reliability has been approached by a method that was developed as an extension of classical test theory, using elements from an ANOVA. Generalizability theory has already been applied successfully in psychological and educational test research.[18,20,23,25,35,39] What is the importance of generalizability theory with respect to measurements in physical therapy, for researchers and for practitioners?

Researchers

For a researcher, this approach provides a practical tool for assessing reliability. Important sources of measurement error can be determined and accounted for. For example, from the pilot study it is concluded that the subject-therapist combination is an important source of measurement error, as well as the interactions including the factor of occasion. The first error source can possibly be reduced by better standardization of the measurement protocols with respect to the instruction of patients and how test performance is conducted by therapists. A possible approach for diminishing the various occasion interaction effects is to introduce an extra test session for subjects to become accustomed to the measurements and the measurement situation. When, in spite of such measures or standardization, the reduction of measurement error is still insufficient, further reduction can be realized by adjusting the measurement design. In this way, the residual measurement error can also be reduced. For example, it can be effective to increase the number of repetitions, therapists, or measurement occasions and to take the mean score as the measurement result, as is shown in the illustrative example in this article. A researcher must realize, however, that in clinical practice, it is time consuming to add an extra test session or to increase the number of measurement conditions considerably. Repeated force measurements over three repetitions seem reasonable, but measurements at more than two occasions or by more than two therapists do not seem very practical. For example, subjects may change their ability to produce force.

Furthermore, reliability can be assessed for specific applications of the measurements. From the values calculated in the example, it appears that the appropriate error of measurement can vary considerably for different applications, implying different D-study designs. Researchers who apply generalizability theory must therefore report the magnitudes of the variance components as the results of a reliability study. This offers a physical therapist the opportunity to calculate the error of measurement appropriate to his or her clinical situation.

Practitioners

For a physical therapist who wants to evaluate a patient's progress, the indexes of reliability presented in this study can be informative, especially the smallest detectable difference. From the smallest detectable difference, a therapist knows what differences need to be measured in order to conclude that real change has occurred rather than measurement error. For clinical application of measurements, the smallest detectable difference must be small enough to detect clinically important changes during the course of therapy. It can be questioned whether the rather large values found for the smallest detectable difference in the illustrative example (Tab. 3) meet these requirements satisfactorily. This issue will not be discussed in detail, however, because the hand-held dynamometer used was a prototype. Furthermore, the measurements were taken on healthy subjects and cannot be generalized to patients.

Compared with ICCs

Both the approach presented in this report and the ICCs are based on generalizability theory. They differ, however, with respect to the index of reliability used, that is, the SEM and related indexes versus correlation coefficients. Another difference is that the types of ICCs cited in the literature are focused on one-factor and two-factor designs,[3,11-13,50] whereas the method described in this report is general. Furthermore, the faulty interpretation of variance between patients as variance of interest in many ICC applications has already been pointed out in the introduction.

Problem: Unstable Estimates

In the literature,[32,36] problems in estimating variance components were characterized as the "Achilles heel" of generalizability theory. It is especially with small sample sizes that estimates of variance components are unstable and may even be negative.[36] With respect to the last point, the calculation procedure used in our study gives unbiased estimates, whereas negative estimates are set to zero.[20,21,36] A problem is caused by the variability of estimated variance components, which is expressed as standard errors of the estimates. From the large values depicted in Table 2, it can be concluded that the variance components estimated in the pilot study are quite unstable. This instability is caused by the very small sample sizes in the G study, for the subjects measured (n_s = 10) as well as for the measurement conditions (n_o = 2, n_t = 2, n_r = 3). To attain more stable estimates of variance components and thus of the total error variance, larger samples are needed in a reliability study. This can possibly be achieved by integrating results across several studies.[35,51]

Future Studies

In future studies, we will use the generalizability approach described in this article to assess the reliability of force measurements with an adapted prototype of the hand-held dynamometer. The measurements will be conducted in different clinical settings on patients with impaired knee extension or knee flexion force production.

Conclusions

With respect to force measurements from individual patients, the SEM and indexes derived from the SEM are practical clinical measures for expressing reliability. Generalizability theory is a powerful tool for estimating the magnitude of multiple sources of measurement error and assessing the reliability of measurements for specific applications of the measurements. Prerequisites for a reliability study (G study) are that sources of measurement error are included as factors in the measurement design and that the number of conditions for each factor is large enough to attain stable estimates of variance components.

Acknowledgments

We thank Caroline Doorenbosch and Tanneke Vogelaar for carrying out the measurements and the 10 women for participating as subjects in this study. We thank Renny Wiegerink for her contribution to the experimental setup and data analysis and Dick Bezemer, Paul Diegenbach, Rients Rozendal, Chris Rumke, Denhard de Smit, and Cees van der Vleuten for their comments on the manuscript.

References

[1] Bohannon RW. Test-retest reliability of hand-held dynamometry during a single session of strength assessment. Phys Ther. 1986;66:206-209.
[2] Bohannon RW, Andrews AW. Interrater reliability of hand-held dynamometry. Phys Ther. 1987;67:931-933.
[3] Riddle DL, Finucane SD, Rothstein JM, Walker ML. Intrasession and intersession reliability of hand-held dynamometer measurements taken on brain-damaged patients. Phys Ther. 1989;69:182-194.
[4] Stratford PW. Reliability: consistency or differentiating among subjects? Phys Ther. 1989;69:299-300. Letter to the editor.
[5] Van der Ploeg RJO, Oosterhuis HJGH, Reuvekamp J. Measuring muscle strength. J Neurol. 1984;231:200-203.
[6] Agre JC. Quantification of muscle function. In: Halpern AS, Fuhrer MJ, eds. Functional Assessment in Rehabilitation. Baltimore, Md: Paul H Brookes Publishing Co; 1984:117-130.
[7] Agre JC, Magness JL, Hull SZ, et al. Strength testing with a portable dynamometer: reliability for upper and lower extremities. Arch Phys Med Rehabil. 1987;68:454-458.
[8] Bovens AMPM, Van Baak MA, Vrencken JGPM, et al. Variability and reliability of joint measurements. Am J Sports Med. 1990;18:58-63.
[9] Wiles CM, Karni Y. The measurement of muscle strength in patients with peripheral neuromuscular disorders. J Neurol Neurosurg Psychiatry. 1983;46:1006-1013.
[10] Wadsworth CT, Krishnan R, Sear M, et al. Intrarater reliability of manual muscle testing and hand-held dynametric muscle testing. Phys Ther. 1987;67:1342-1347.
[11] Bohannon RW, Saunders N. Hand-held dynamometry: a single trial may be adequate for measuring muscle strength in healthy individuals. Physiotherapy Canada. 1990;42:6-9.
[12] Krebs DE. Intraclass correlation coefficients: use and calculation. Phys Ther. 1984;64:1581-1589. Computer communication.
[13] Shrout PE, Fleiss JL. Intraclass correlations: uses in assessing rater reliability. Psychol Bull. 1979;86:420-428.
[14] Evans WJ, Cayten CG, Green PA. Determining the generalizability of rating scales in clinical settings. Med Care. 1981;19:1211-1220.
[15] Stratford PW, Norman GR, McIntosh JM. Generalizability of grip strength measurements in patients with tennis elbow. Phys Ther. 1989;69:276-281.
[16] Rothstein JM. Measurement and clinical practice: theory and application. In: Rothstein JM, ed. Measurement in Physical Therapy. New York, NY: Churchill Livingstone Inc; 1985:1-46.
[17] Kerlinger FN. Foundations of Behavioral Research. 2nd ed. New York, NY: Holt, Rinehart and Winston, Inc; 1973.
[18] Mitchell SK. Interobserver agreement, reliability, and generalizability of data collected in observational studies. Psychol Bull. 1979;86:376-390.
[19] Brussock CM, Haley SM, Munsat TL, Bernhardt DB. Measurement of isometric force in children with and without Duchenne's muscular dystrophy. Phys Ther. 1992;72:105-114.
[20] Cronbach LJ, Gleser GC, Nanda H, Rajaratnam N. The Dependability of Behavioral Measurements: Theory of Generalizability for Scores and Profiles. New York, NY: John Wiley & Sons Inc; 1972.
[21] Brennan RL. Elements of Generalizability Theory. Iowa City, Iowa: American College Testing Program; 1983.
[22] Shavelson RJ, Webb NM, Rowley GL. Generalizability theory. Am Psychol. 1989;44:922-932.
[23] Crocker L, Algina J. Introduction to Classical and Modern Test Theory. New York, NY: Holt, Rinehart and Winston, Inc; 1986.
[24] Bartko JJ. The intraclass correlation coefficient as a measure of reliability. Psych Rep. 1966;19:3-11.
[25] Cardinet J, Tourneur Y, Allal L. The symmetry of generalizability theory: applications to educational measurement. J Educ Measurement. 1976;13:119-135.
[26] Lahey MA, Downey RG, Saal FE. Intraclass correlations: There's more than meets the eye. Psychol Bull. 1983;93:586-595.
[27] Guyatt G, Walter S, Norman G. Measuring change over time: assessing the usefulness of evaluative instruments. J Chronic Dis. 1987;40:171-178.
[28] McNemar Q. Psychological Statistics. 3rd ed. New York, NY: John Wiley & Sons Inc; 1962:145-158.
[29] Ghiselli EE, Campbell JP, Zedeck S. Measurement Theory for the Behavioral Sciences. San Francisco, Calif: WH Freeman & Co Publishers; 1981.
[30] Stratford PW. Reliability of a peak knee extensor and flexor torque protocol: a study of post ACL reconstructed knees. Physiotherapy Canada. 1991;43(4):27-30.
[31] Bland JM, Altman DG. Statistical methods for assessing agreement between two methods of clinical measurement. Lancet. February 1986:307-310.
[32] Hartmann DP. Considerations in the choice of interobserver reliability estimates. J Appl Behav Anal. 1977;10:103-116.
[33] Spijkerman DCM, Snijders CJ, Stijnen T, Lankhorst GJ. Standardization of grip strength measurements: effects on repeatability and peak force. Scand J Rehabil Med. 1991;23:203-206.
[34] Thorndike RL. Applied Psychometrics. Boston, Mass: Houghton Mifflin Co; 1982.
[35] Brennan RL. Applications of generalizability theory. In: Berk RA, ed. Criterion-referenced Measurement: The State of the Art. Baltimore, Md: Johns Hopkins University Press; 1980:186-232.
[36] Shavelson RJ, Webb NM. Generalizability Theory: A Primer. Newbury Park, Calif: Sage Publications Inc; 1991.
[37] Searle SR. Linear Models. New York, NY: John Wiley & Sons Inc; 1971.
[38] Anderson VL, McLean RA. Design of Experiments: A Realistic Approach. New York, NY: Marcel Dekker Inc; 1974.
[39] Van der Vleuten CPM, Swanson DB. Assessment of clinical skills with standardized patients: state of the art. Teaching and Learning in Medicine. 1990;2:58-76.
[40] Smidt GL, Rogers MW. Factors contributing to the regulation and clinical assessment of muscular strength. Phys Ther. 1982;62:1283-1290.
[41] Caldwell LS, Chaffin DB, Dukes-Dobos FN, et al. A proposed standard procedure for static muscle strength testing. Am Ind Hyg Assoc J. 1974;35:201-206.
[42] Kroemer KHE, Marras WS. Towards an objective assessment of the maximal voluntary contraction component in routine muscle strength measurements. Eur J Appl Physiol. 1980;45:1-9.
[43] Hosking GP, Bhat US, Dubowitz V, Edwards RHT. Measurements of muscle strength and performance in children with normal and diseased muscle. Arch Dis Child. 1976;51:957-963.
[44] Mayhew TP, Rothstein JM. Measurement of muscle performance with instruments. In: Rothstein JM, ed. Measurement in Physical Therapy. New York, NY: Churchill Livingstone Inc; 1985:57-102.
[45] Dempster WT. Space Requirements of the Seated Operator. WADC TR 55-159. Dayton, Ohio: Wright Patterson Air Force Base; 1955.
[46] Van der Leeuw GHF, Stam HJ, Huster R. Correction for gravity in isokinetic dynamometry of the knee extensors. J Rehabil Sci. 1988;1:40-44.
[47] Miller DI, Nelson RC. Biomechanics of Sport: A Research Approach. Philadelphia, Pa: Lea & Febiger; 1973.
[48] Winter DA. Biomechanics of Human Movement. New York, NY: John Wiley & Sons Inc; 1979.
[49] Crick JE, Brennan RL. Manual for GENOVA: A Generalized Analysis of Variance System. Iowa City, Iowa: American College Testing Program; 1983.
[50] Krebs DE. Declare your ICC type. Phys Ther. 1986;66:1431. Letter to the editor.
[51] Stalenhoef-Halling BF, van der Vleuten CPM, Jaspers TAM, Fiolet JFBM. The feasibility, acceptability and reliability of open-ended questions. In: Bender W, Hiemstra RJ, Scherpbier AJJA, Zwierstra RP, eds. Teaching and Assessing Clinical Competence. Groningen, the Netherlands: Boekwerk Publishers; 1990:552-557.

Commentaries

Following are two commentaries on "The Application of Generalizability Theory to Reliability Assessment: An Illustration Using Isometric Force Measurements."

Measurement has become a popular topic in the physical therapy literature over the last several years. Recently, a few researchers have begun to probe the validity of physical therapy measurements for various purposes. Most published reports, however, have concerned the reliability of data obtained with various measurement procedures. The sophistication of these reliability articles has evolved over the years from descriptive reports with little or no statistical analysis to reports that used Pearson Product-Moment Correlation Coefficients. More recently, the intraclass correlation coefficient (ICC) has become the standard for reliability analyses. Shrout and Fleiss have delineated six formulas for the ICC,[1] and we have been urged to declare which of the formulas we have used in our analyses.[2] We have witnessed debate about the appropriateness of the different formulas for different testing conditions.[3,4] The evolutionary process continues with this report by Roebroeck et al as they encourage us to consider using generalizability (G) theory to report the consistency of our measurement procedures. I congratulate the authors on their efforts to introduce a method that is not commonly used in evaluation of physical therapy measurements. I am aware of only one other investigator who has used G theory to analyze methods of measurement in physical therapy.[5,6] Roebroeck et al present a concise and clear discussion of a complex topic.

An article such as this one serves two purposes. The first is to instruct the reader about a previously unfamiliar method of analysis, and the second is to highlight the strengths and demonstrate the meaningfulness of the new method for potential users. To progress from an awareness of the existence of a method to being able to use it, readers must know the mathematical formulas for estimating the different variance components. The variance estimates are calculated from the appropriate mean squares (MS) and the numbers of conditions (n) of the particular facets (subject [s], occasion [o], therapist [t], or repetition [r]) and the residual error (e). For example, the formula for estimating the subject × occasion variance in this study was (MS_so − MS_sot − MS_sor + MS_sotr,e) / (n_t n_r). In other words, the variances of higher-order interactions that include subject and occasion are separated from the variance of interest; the residual variance is added, and the result is averaged over the number of conditions in the other facets. These estimates are not intuitively obvious but are necessary both to understand the partitioning of the sources of variance and to be able to use the method.
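The verbal recipe for the subject × occasion component (subtract the mean squares of the two three-way interactions that contain subject and occasion, add back the residual mean square, divide by the number of therapists times the number of repetitions) can be written as a small function. This is a minimal sketch that assumes a fully crossed random-effects design; the mean squares used in the example are hypothetical, the function name is illustrative, and truncating negative estimates at zero follows the convention the article describes for its own analysis.

```python
def variance_component_so(ms_so, ms_sot, ms_sor, ms_residual, n_t, n_r,
                          truncate_negative=True):
    """Estimate the subject x occasion variance component from mean squares.

    ms_so:       mean square for subject x occasion
    ms_sot:      mean square for subject x occasion x therapist
    ms_sor:      mean square for subject x occasion x repetition
    ms_residual: mean square for the four-way term confounded with error
    n_t, n_r:    numbers of therapists and repetitions in the G study
    """
    estimate = (ms_so - ms_sot - ms_sor + ms_residual) / (n_t * n_r)
    if truncate_negative and estimate < 0:
        # Negative estimates can occur with small samples; they are
        # commonly set to zero.
        return 0.0
    return estimate


# Hypothetical mean squares, with n_t = 2 and n_r = 3 as in the G study.
example = variance_component_so(200.0, 80.0, 60.0, 30.0, n_t=2, n_r=3)
negative_case = variance_component_so(50.0, 80.0, 60.0, 30.0, n_t=2, n_r=3)
```

With these invented inputs the first call gives (200 − 80 − 60 + 30) / 6 = 15, and the second call shows a raw estimate below zero being truncated.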

The readers of this journal are more familiar with the use of reliability coefficients, such as the ICC, to represent the consistency of data. To set G theory in a familiar context, G theory can be viewed not as a departure from the ICC but rather as an extension of the ICC. Both methods involve calculation of an analysis of variance (ANOVA) and a coefficient to represent reliability of data. The ANOVA from which an ICC is drawn typically identifies differences in measurements caused by repeated occasions or different judges; the ICC represents the reliability of the data. In G theory, more factors are included in the ANOVA, and G coefficients, which are ICCs, represent the dependability of the data for relative and absolute decisions.[7] A side-by-side comparison of the appropriate ICCs for interrater and intrarater reliability and the corresponding G coefficients would be enlightening. The primary advantage of G theory is its provision of additional and more specific information. G theory analyzes additional sources of error and indicates which sources of error are influential, thus pointing the way to ideas about correcting them.

The major advantage of G theory can also be a weakness. The more facets the investigator includes in a G study, the more complex it becomes. Consider a G study of goniometry in which the investigator included not only multiple subjects, therapists, occasions, and repetitions but also different goniometers. In this study, there would be 5 main effects, 10 two-way interactions, 10 three-way interactions, 5 four-way interactions, and 1 residual error term. Some of these sources of variance become so meaningless that they may as well be included with the unexplained (residual) error. Imagine trying to explain why subjects had different ranges of motion on different repetitions when measured by different therapists using different goniometers.
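
The counts quoted above follow from simple combinatorics, which the short sketch below reproduces. It assumes a fully crossed design with one observation per cell, so that the highest-order interaction cannot be separated from residual error; the function name is hypothetical.

```python
from math import comb


def count_effects(n_facets):
    """Count the ANOVA terms of each order in a fully crossed design.

    Keys 1..n_facets-1 give the number of main effects (1), two-way
    interactions (2), and so on. The single highest-order interaction is
    confounded with error and reported as 'residual'.
    """
    counts = {order: comb(n_facets, order) for order in range(1, n_facets)}
    counts["residual"] = 1
    return counts


# Subjects, therapists, occasions, repetitions, and goniometers: 5 facets.
effects = count_effects(5)
print(effects)  # {1: 5, 2: 10, 3: 10, 4: 5, 'residual': 1}
```

Adding a sixth facet would raise the total from 31 to 63 terms, which illustrates how quickly the design becomes hard to interpret.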

The real usefulness of reports of measurement validity and reliability is to help the reader improve the measure and make decisions based on the measure. The authors made some suggestions of how to use the information to improve the measurement procedure. For example, if the subject x therapist variance is large, better standardization in patient instruction and test conduct is necessary. If the occasion interactions are large, appropriate strategies include providing practice sessions and averaging measurements across sessions. For readers to appreciate the usefulness of the method, more of this kind of discussion is necessary. By way of additional example, if the subject x therapist interaction is very small, different therapists can measure different subjects with little risk of erroneous comparisons between and within subjects. Similarly, if there is a large therapist variance, one therapist ought to do all measurements for any comparative purpose. These and other possible outcomes should be discussed in very practical language.

I was especially happy to see the authors include the discussion of the standard error of measurement and the smallest detectable difference in measurements. I would like to see more instruction and dialogue on these concepts in our profession. To make decisions using measurements, practitioners must recognize that part of any measurement is measurement error. Unfortunately, many clinicians do not understand the concept of measurement error. My recent research suggested that clinicians have difficulty defining the concept of measurement error and are aware of estimates of measurement error for only certain measurements. The therapists in my study were not adept at determining how to apply the concept of measurement error in the context of a decision.[8]

G theory is an analytical method that may be difficult for both practicing clinicians and researchers to understand. Even if they have the statistical experience, most clinicians will not go to the effort of calculating all the variance estimates necessary to determine the relative importance of potential sources of error. Many will not see that the method would lead to a way of improving both their measures and their ability to make good decisions based on their measures. Clinicians need practical guidelines of what to do and when and how to do it. The challenge for statisticians is to transmit the information in a manner that clinicians find practical and comprehensible.

Roebroeck et al have added an important component to our evolving understanding of measurement theory. Other physical therapists who are interested in promoting appropriate use of our measures should become familiar with the method and enter into the dialogue about its use. Karen W Hayes, PhD, PT, Assistant Professor of Physical Therapy, Programs in Physical Therapy, Northwestern University Medical School, 345 E Superior St, Room 1323, Chicago, IL 60611

References

[1] Shrout PE, Fleiss JL. Intraclass correlations: uses in assessing rater reliability. Psychol Bull. 1979;86:420-428.
[2] Krebs DE. Declare your ICC type. Phys Ther. 1986;66:1431. Letter to the editor.
[3] Bohannon RW. Commentary on "Intrasession and intersession reliability of hand-held dynamometer measurements taken on brain-injured patients." Phys Ther. 1989;69:190-192.
[4] Riddle DL, Finucane SD, Rothstein M, Walker ML. Author response to commentary on "Intrasession and intersession reliability of hand-held dynamometer measurements taken on brain-injured patients." Phys Ther. 1989;69:192-194.
[5] Stratford PW. Efficiency analysis of two written short-answer student evaluation formats. Phys Ther. 1988;68:1546-1549.
[6] Stratford PW, Norman GR, McIntosh JM. Generalizability of grip strength measurements in patients with tennis elbow. Phys Ther. 1989;69:276-281.
[7] Shavelson RJ, Webb NM, Burstein L. Measurement of teaching. In: Wittrock MC, ed. Handbook of Research on Teaching. 3rd ed. New York, NY: Macmillan Publishing Co; 1986:50-91.
[8] Hayes KW. The effect of awareness of measurement error on physical therapists' confidence in their decisions. Phys Ther. 1992;72:515-531.

The quantitative analysis of error in measurement cannot be optimally accomplished by the use of a single statistic. The reliability research literature in physical therapy, however, has been dominated by proportional, unit-free indexes of error, such as Pearson's Product-Moment Correlation Coefficient (r) in earlier work and the intraclass correlation coefficient (ICC) more recently. The Standards for Educational and Psychological Testing,[1] jointly issued by the relevant authorities in the psychological and educational professions, where testing theory originated, places much greater emphasis on a range of statistics, particularly the standard error of measurement (SEM). We therefore strongly endorse Roebroeck et al when they demonstrate the value of the SEM for quantifying reliability. Our position is also that the uses of reliability information must be examined before a choice is made about the appropriate indexes, although we suggest in concert with the philosophy displayed in other authoritative sources[1] that a wider range of statistics be examined in studies investigating reliability. We agree with Roebroeck et al that metric indexes of error such as the SEM are required when making decisions about individuals, a potentially very frequent use for reliability information in physical therapy, which is probably frustrated at present by the dearth of metric data on errors of measurement. We also agree that the factorial designs of generalizability (G) studies provide valuable additional information that studies focusing on only one aspect do not provide.

In some respects, however, we would like to extend the arguments put forth by Roebroeck et al. Reliability data sometimes are used not to make individual decisions, but to compare different techniques for measuring the same variable (eg, torque) or to contrast independent studies of the same measurement method. In these cases, proportional coefficients of reliability such as the ICC have also dominated the literature, even though metric indexes of error are valuable for these situations. A fundamental problem of ratio indexes such as the ICC is that error of measurement and true variability are expressed relatively. An ICC is a ratio of individual variability to total variability, the latter containing the error component. The range of genuine differences in any attribute is determined by who is sampled in a study. Thus, it is possible to obtain from two studies different impressions of reliability from correlational statistics such as the ICC when the range of true score variation changes, even though the error of measurement in metric units remains the same. Metric indicators of error, such as the SEM, do not have this problem. Not being proportional, the SEM is only determined by the physical causes of error and the statistical uncertainty of estimating that error. An ICC could be identical in two studies in which the measurement error is different, simply because one study has included a larger range of true score variation by including subjects with greater individual differences. Similarly, proportional indexes can be misleading when used to compare different tests. For example, Molczyk et al[2] tested knee flexion and extension. They concluded that because the ICCs for flexion scores were generally lower than for extension scores, measurement of flexion was less reliable. When we calculated SEMs from ICCs and standard deviations provided by Molczyk et al, we found (JL Keating, TA Matyas; unpublished research; 1993) that measurement error associated with flexion tests was generally lower than for extension tests, contrary to the conclusion of the authors.
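
The kind of recalculation described above can be sketched with the conventional relation SEM = SD × √(1 − ICC), where SD is the standard deviation of the observed scores. All numbers below are hypothetical and are not the values reported by Molczyk et al; they only illustrate how a lower ICC can coexist with a smaller metric error.

```python
from math import sqrt


def sem_from_icc(sd, icc):
    """Standard error of measurement from the observed-score SD and an ICC."""
    return sd * sqrt(1.0 - icc)


# Hypothetical summary statistics (same units as the measurement, eg, N·m).
flexion_sem = sem_from_icc(sd=8.0, icc=0.75)     # lower ICC, narrow score range
extension_sem = sem_from_icc(sd=20.0, icc=0.91)  # higher ICC, wide score range

print(f"flexion SEM:   {flexion_sem:.1f}")
print(f"extension SEM: {extension_sem:.1f}")
```

With these illustrative inputs the test with the lower ICC has the smaller SEM, because its between-subject spread is much narrower.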

Roebroeck et al did not discuss the separation of random and systematic error. The factorial designs of G studies contain information about both random and systematic error, as Roebroeck et al note, and we agree that both should be included in estimates of overall error. It is also important, however, to note the different nature of random and systematic errors. Systematic error has known magnitude and direction, and thus may be handled by subtraction. Random error cannot be subtracted specifically. It can only be used to estimate an interval of uncertainty. Once systematic error has been subtracted, estimates of random error can be used to specify how much change must be observed before a genuine change can be inferred. Knowledge of systematic error can improve the estimation of the true value, whereas random error can only help outline the interval within which the true value is likely to exist. Furthermore, techniques for reducing random error may affect systematic error. For example, random error may be diminished by increasing the number of assessment trials and averaging, as noted by Roebroeck et al. Increased numbers of trials, however, could increase systematic error, due to phenomena such as fatigue, warm-up effects, or practice effects. We therefore believe that statistical analysis should also provide separate estimates of systematic and random error. Although G studies (and other reliability study designs) enable separate estimates, these studies are rarely reported in the physical therapy literature.
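
One simple way to obtain separate estimates, sketched here for a two-occasion test-retest design, is to take the mean of the paired differences as the systematic component and the standard deviation of the differences divided by √2 as the random error of a single measurement. This is a common convention assumed for illustration, not the procedure of Roebroeck et al, and the data are invented.

```python
from math import sqrt
from statistics import mean, stdev


def systematic_and_random_error(first, second):
    """Split test-retest disagreement into a systematic and a random part.

    Returns (bias, sem): bias is the mean change from the first to the second
    occasion; sem is the random error of one measurement, assuming equal and
    independent error on both occasions.
    """
    diffs = [b - a for a, b in zip(first, second)]
    bias = mean(diffs)
    sem = stdev(diffs) / sqrt(2)
    return bias, sem


# Hypothetical knee extension moments (N·m) for five subjects on two occasions.
occasion_1 = [40, 52, 47, 60, 55]
occasion_2 = [43, 54, 51, 62, 60]

bias, sem = systematic_and_random_error(occasion_1, occasion_2)
print(f"systematic error: {bias:.2f}, random error (SEM): {sem:.2f}")
```

The bias could be subtracted from future second-occasion scores, whereas the SEM can only be used to place an interval of uncertainty around a score.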

Approaches based on analysis of variance, such as the error estimates described by Roebroeck et al, or indeed ICCs, achieve some computational elegance in allowing a single statistic to summarize data from a sample of assessors, occasions, or other dimensions of error and in combining several sources of error (assessors, occasions, and so on). In this respect, they are more elegant than bivariate statistics, such as the Pearson correlation coefficient, which requires calculation of mean correlations across the set of therapist pairs or occasion pairs. In rejecting the utility of the Pearson correlation coefficient for this inelegance or for failing to be sensitive to the effects of systematic error, however, it is important not to overlook other useful features of regression models.

Regression models provide several statistics besides the Pearson correlation coefficient, including the intercept and slope of the line of best fit, and a scattergram. Although Pearson's correlation coefficient is insensitive to systematic error, the regression model as a whole is not. If the intercept on the Y-axis deviates significantly from zero when the slope is unity, the model is signaling systematic error. For example, in a test-retest reliability analysis, a significantly positive intercept would indicate that occasion Y has a tendency to yield higher values than occasion X. If the slope is greater than unity, the regression model is signifying that scale units in Y are not being used in the same way as in X. For example, if this occurred in an interobserver reliability study, the suggestion is that observer Y has expanded the units of the scale relative to observer X. Thus, both slope and intercept variations from the theoretically expected values of 1 and 0, respectively, are very informative. The information provided by slope variation is not readily derived from the analysis of variance.
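
The point about the correlation coefficient hiding systematic error can be illustrated with an ordinary least-squares fit. The sketch below uses invented test-retest data in which every retest value is exactly 5 units higher; the helper name is hypothetical, and it returns point estimates only, without the significance tests a real analysis would need.

```python
from math import sqrt
from statistics import mean


def fit_line(x, y):
    """Ordinary least-squares slope, intercept, and Pearson r for y on x."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    r = sxy / sqrt(sxx * syy)
    return slope, intercept, r


# Hypothetical scores: occasion Y is systematically 5 units above occasion X.
occasion_x = [20.0, 30.0, 40.0, 50.0, 60.0]
occasion_y = [25.0, 35.0, 45.0, 55.0, 65.0]

slope, intercept, r = fit_line(occasion_x, occasion_y)
print(f"slope={slope:.2f}, intercept={intercept:.2f}, r={r:.2f}")
```

Here r is perfect even though the two occasions never agree; the disagreement shows up only in the intercept.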

Similarly, scattergrams may show a lack of uniformity in deviations from the line of best fit. For example, a fan shape in the scattergram indicates that error of measurement increases as a function of the magnitude of the variable. In muscle performance testing, for instance, it is possible that performance at high forces may be more variable than at low forces. In such situations, equations relating the magnitude of the error to the scale value would be helpful and can be derived.

In addition, the scattergram and regression model can be examined for departures from linearity. In an interobserver reliability study, for example, a positively accelerating curve suggests that observer Y is using the measurement scale differently from observer X, such that scale units are shrunk relative to observer X's scale at low values, but expanded at higher values. Such insights can be particularly valuable when developing measurement methods or determining the nature of interobserver disagreement.

Of course, combining the data from many observers or many occasions is awkward when using bivariate analysis, although averages can be obtained for all regression statistics. This particularity of the bivariate regression approach, however, can also be useful. For example, exploration of regression models for different therapist pairs may reveal classifiable variations in these models. If two types of model are suggested by the data, the observers who share the similar model can be evaluated for common features, which can be revealing about the nature of error or can allow more specific and accurate estimates of reliability. Such fine-grain explorations are not encouraged by the integrative framework in which ICCs and generalizability statistics have been presented.

It is also important to note that the relationship between the G and decision (D) studies needs to be carefully understood for extrapolation to be valid. Roebroeck et al argue that some facets of variance can be eliminated or modified to allow only those variances that apply in the D study to determine the appropriate standard error. This is true provided the design change from the G to the D study has not introduced a systematic bias in the error. In some D-study situations, exclusion of some factors (eg, variation of occasions, but not therapists) could mean that the error estimates from the G study cannot be generalized, even though adjusted. This is because the D-study conditions have systematically (not randomly) varied from those of the G study. For example, in muscle testing, a D study involving only one clinician (typically) may require the subject to perform fewer repetitions than did the G study, which involved more therapists, or occasions, or both. If increasing repetitions produces a fatigue effect, the variability of performance may increase with apparent increases in the size of the error. Alternatively, increased repetitions may lead to a warm-up or a skill-acquisition effect in performance, which could reduce the variability of performance in later trials. In either case, the overall error estimate from the G study would not be comparable to error in the fewer number of trials conducted in the D study.

Ensuring the comparability of the G- and D-study situations requires that we have adequate descriptions of test procedures and experimental designs. We believe that examination of the literature will demonstrate that testing protocols are often poorly reported, imposing strong handicaps on consumers' ability to decide that the conditions of a reliability study sufficiently resemble those intended for clinical use. In this sense, a better understanding of G theory should assist editors in their effort to maintain a high standard of protocol reporting.

In conclusion, we endorse the overall thrust of the report by Roebroeck et al. As physical therapy has properly become sensitized in recent years to the importance of reliability data, Roebroeck et al provide a valuable tutorial on relevant issues. Metric indexes of error (SEMs, confidence intervals) are essential. Factorial analysis of error can be valuable. Integrative expression of systematic and random error is important. However, so is separate specification of both error types. The limitations of Pearson's correlation coefficient should not blind explorers of reliability to the insights that can be provided by regression analysis more broadly. Finally, it is important to carefully examine the comparability of the G- and D-study conditions, and in this respect fully detailed reports of methods are a fundamental requirement. In applauding the particular contribution of Roebroeck et al, we consider it important to warn against the possibility of replacing one panacea with another. Thomas A Matyas, PhD, Reader, Department of Behavioural Health Sciences, La Trobe University, Bundoora 3083, Melbourne, Victoria, Australia. Jennifer L Keating, GradDipManipTher, Doctoral Research Scholar, Department of Behavioural Health Sciences, La Trobe University. Kenneth M Greenwood, PhD, Senior Lecturer, Department of Behavioural Health Sciences, La Trobe University

References

[1] American Educational Research Association (AERA), American Psychological Association (APA), and National Council on Measurement in Education (NCME). Standards for Educational and Psychological Testing. Washington, DC: American Psychological Association; 1985.
[2] Molczyk L, Thigpen LK, Eickhoff JA, et al. Reliability of testing the knee extensors and flexors in healthy adult women using a Cybex II isokinetic dynamometer. Journal of Orthopaedic and Sports Physical Therapy. 1991;14:37-41.

Author Responses

We greatly appreciate the thorough commentaries to our article by Dr Hayes and Dr Matyas and colleagues. We find it very encouraging to discuss the concept of reliability of measurements in physical therapy, based on their valuable comments.

Measurements in physical therapy may serve two types of purposes. The first type refers to research, aiming at the assessment of differences between groups of patients. The second type refers to clinical practice, in which measurements are used for clinical decision making on an individual patient. We agree with Matyas et al that comparing different measurement techniques or contrasting reliability data from different studies can be assigned to the second type. Correlation coefficients, such as intraclass correlation coefficients (ICCs), are suitable for the first purpose, but are not appropriate for the second purpose. It is the latter purpose we are interested in, and we share Dr Hayes' and Dr Matyas and colleagues' concern that the current information on reliability is not appropriate for interpretation in clinical practice. In fact, the lack of measurement information that is directly applicable for use by clinicians gave rise to our work on reliability. We started this study with the following situations in mind: Physical therapists use measurements on patients to make decisions such as "Has a patient really improved when a net knee moment of 40 N·m was measured at the first occasion and a moment of 60 N·m was measured at the second occasion, after some weeks of therapy?" or "Has the patient at the second occasion enough knee extension moment to safely start weight-bearing gait exercises, for which the criterion is 40 N·m?" Or, in cases in which a decision will have a significant consequence for a patient, a physical therapist wants to know "How many measurements must I perform to detect with 95% certainty a change of 20 N·m?" These questions cannot be answered with the knowledge of just an ICC, not even if it has the impressive value of .98.

Because information on reliability of measurements is standardly provided as ICCs, a lack of understanding of the concept of measurement error among clinicians, as Dr Hayes found in a recent study,[1] does not surprise us. By presenting measurement information in a manner that corresponds better to clinical situations such as those described above, the understanding and application of the concept of measurement error in clinical decisions will increase. The smallest detectable differences (SDDs) in the first and third examples above and the confidence interval of an observed measurement in the second example provide the physical therapist with a decisive answer. These indexes can easily be derived from the standard error of measurement (SEM).
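
As a rough sketch of how such indexes follow from the SEM, the code below assumes the conventional normal-theory definitions: a 95% confidence interval of ±1.96 × SEM around one observed score, and an SDD of 1.96 × √2 × SEM for the difference between two scores. The SEM of 12 N·m is invented, the function names are hypothetical, and the sample-size helper assumes that every source of error is sampled afresh for each repeated measurement, which will not hold for every design.

```python
from math import ceil, sqrt

Z_95 = 1.96  # two-sided 95% normal quantile


def confidence_interval(observed, sem):
    """95% confidence interval around a single observed measurement."""
    return observed - Z_95 * sem, observed + Z_95 * sem


def smallest_detectable_difference(sem, n_measurements=1):
    """SDD between two scores, each the mean of n_measurements."""
    return Z_95 * sqrt(2) * sem / sqrt(n_measurements)


def measurements_needed(sem, target_change):
    """Smallest n for which the SDD does not exceed target_change."""
    return ceil((Z_95 * sqrt(2) * sem / target_change) ** 2)


sem = 12.0  # hypothetical SEM in N·m
low, high = confidence_interval(60.0, sem)
print(f"95% CI around 60 N·m: {low:.1f} to {high:.1f}")
print(f"SDD, single measurements: {smallest_detectable_difference(sem):.1f} N·m")
print(f"measurements per occasion to detect 20 N·m: {measurements_needed(sem, 20.0)}")
```

Under these assumptions a single pair of measurements could not confirm a 20 N·m change, whereas averaging a few measurements per occasion could.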

With respect to the methods used in reliability analysis, we agree with Dr Hayes on the correspondence of both generalizability (G) theory and the calculation of ICCs. To put it more strongly, Shrout and Fleiss[2] stated in the introduction of their article that the context in which they discuss six forms of the ICC is "a special case of the one-facet generalizability study (G study) discussed by Cronbach, Gleser, Nanda, and Rajaratnam.[3]" Both methods are based on analysis of variance (ANOVA), providing the calculation of mean squares and their derived reliability coefficients. The formulas presented by Shrout and Fleiss for the coefficient ρ, the population value of the ICC, are exactly the same as the corresponding G coefficients calculated in G theory.[2,3] In our article, however, we encourage researchers to assess variance components from the ANOVA mean squares directly. In our opinion, the variance components are more informative to clinicians than are reliability coefficients (ICCs). Furthermore, in this approach, the misleading introduction of between-subject variability is avoided, as Dr Matyas and colleagues pointed out. Dr Hayes raised the problem that G theory may be difficult to understand for both practicing clinicians and researchers. In our opinion, the complexity of the mathematical formulas used in this method is comparable to that of the ICC formulas, and the real problem is that in physical therapy we are not familiar with this method. An important advantage is that variance components correspond better to the awareness that several sources of measurement error influence measurements in clinical practice.
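
The correspondence can be checked numerically for the one-facet case. The sketch below computes the Shrout and Fleiss ICC(2,1) and ICC(3,1) directly from mean squares and compares them with the ratio of variance components for absolute and relative decisions, respectively. The mean squares are hypothetical, and the variable names (BMS, JMS, EMS for between-subjects, between-judges, and error mean squares) follow the usual Shrout and Fleiss notation.

```python
def icc_2_1(bms, jms, ems, n, k):
    """Shrout and Fleiss ICC(2,1): n subjects, k judges, judges random."""
    return (bms - ems) / (bms + (k - 1) * ems + k * (jms - ems) / n)


def icc_3_1(bms, ems, k):
    """Shrout and Fleiss ICC(3,1): consistency across k fixed judges."""
    return (bms - ems) / (bms + (k - 1) * ems)


def g_coefficients(bms, jms, ems, n, k):
    """Relative and absolute G coefficients from estimated variance components."""
    var_s = (bms - ems) / k   # subjects
    var_j = (jms - ems) / n   # judges
    var_e = ems               # residual
    relative = var_s / (var_s + var_e)
    absolute = var_s / (var_s + var_j + var_e)
    return relative, absolute


# Hypothetical mean squares: 20 subjects, 2 judges.
bms, jms, ems, n, k = 110.0, 30.0, 10.0, 20, 2
relative_g, absolute_g = g_coefficients(bms, jms, ems, n, k)

print(f"ICC(2,1)={icc_2_1(bms, jms, ems, n, k):.4f}  absolute G={absolute_g:.4f}")
print(f"ICC(3,1)={icc_3_1(bms, ems, k):.4f}  relative G={relative_g:.4f}")
```

The coefficients agree because both routes start from the same mean squares; the variance-component route simply makes the individual error sources visible along the way.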

Researchers (ie, G-study designers) should select appropriate factors guided by their clinical relevance. In this respect, we endorse Dr Matyas and colleagues' point on the importance of the comparability of the G and decision (D) situations. With respect to Dr Hayes' example, if the use of different goniometers at different occasions is common practice, this might be an important factor to distinguish; if not, it is useless to include it. It must be kept in mind that more complex G-study designs not only introduce higher-order interactions but also increase the number of G-study measurements needed to obtain reliable estimates of the variance components. On the other hand, a more complex design may be advantageous to separate more main effects and two-way interactions, and the variance attributable to these effects is certainly interpretable. In these designs, higher-order interactions become meaningless and can be included in the residual error.

Using the concept of SEM, calculated from a set of variance components, clinicians will be forced to realize which error sources will influence their measurements in clinical practice, dependent on the design they use and the decision they make. It makes a difference whether the decision is made against a previous measurement or against a reference level. Besides this awareness, the approach offers clinicians the possibility of manipulating the amount of measurement error by repeating the measurements of certain factors and averaging them. In this way, one can gain confidence in the assessment at the price of increased effort (ie, performing more measurements). The optimal solution depends on the situation.
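
The trade-off between effort and confidence can be sketched as follows. The example assumes, purely for illustration, that only two error sources matter in the D study: a subject x occasion component and a residual component, with scores averaged over n_o occasions and n_r repetitions per occasion. The variance values are invented and the function is not taken from the article.

```python
from math import sqrt


def d_study_sem(var_so, var_res, n_o=1, n_r=1):
    """SEM of a score averaged over n_o occasions and n_r repetitions each.

    Each variance component is divided by the number of conditions over
    which it is averaged in the D study.
    """
    return sqrt(var_so / n_o + var_res / (n_o * n_r))


# Hypothetical variance components (N·m squared).
var_so, var_res = 16.0, 20.0

single = d_study_sem(var_so, var_res)                     # 1 occasion, 1 repetition
more_reps = d_study_sem(var_so, var_res, n_o=1, n_r=4)    # 4 measurements in total
two_sessions = d_study_sem(var_so, var_res, n_o=2, n_r=2) # 4 measurements in total

print(f"single measurement:          {single:.2f}")
print(f"4 repetitions, 1 occasion:   {more_reps:.2f}")
print(f"2 repetitions x 2 occasions: {two_sessions:.2f}")
```

With these numbers the same four measurements reduce error more when spread over two occasions, because repetitions within one occasion cannot average out occasion-related error.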

As Matyas and colleagues mentioned, we did not separate systematic error from the total amount of measurement error, although G theory enables separate estimates. In fact, systematic error attributable to a factor can be assessed from a G study by considering a factor as fixed. This is only of interest, however, when it is not appropriate or not possible to generalize beyond the G-study conditions of the factor. This would be the case if, for example, the two therapists involved in the G study are specifically selected because they will perform all future D-study measurements at our department. In that case, the factor therapist can be considered fixed in the G study, thus allowing the assessment of a systematic effect of therapist. Future D-study measurements of both therapists can be corrected for this effect. Whether correction for a systematic effect of occasion or repetition is appropriate is a matter of discussion. Because it does not seem obvious to restrict generalization to the specific occasions or the specific repetitions of the G study, these factors were handled as random factors in our study. Furthermore, we agree with Matyas et al that regression analysis may yield a more specific insight into the pattern of a factor effect, in addition to the information provided by variance components on main effects and interaction effects.

It is important to acquire more experience with the application of G theory to measurements in physical therapy. Extension to analysis of data from patients is needed. Special attention may be paid to refining the method and to improving its practicability for clinicians.

Marij E Roebroeck
Jaap Harlaar
Gustaaf J Lankhorst, MD, PhD

References

[1] Hayes KW. The effect of awareness of measurement error on physical therapists' confidence in their decisions. Phys Ther. 1992;72:515-531.
[2] Shrout PE, Fleiss JL. Intraclass correlations: uses in assessing rater reliability. Psychol Bull. 1979;86:420-428.
[3] Cronbach LJ, Gleser GC, Nanda H, Rajaratnam N. The Dependability of Behavioral Measurements: Theory of Generalizability for Scores and Profiles. New York, NY: John Wiley & Sons Inc; 1972.

(*) Nicholas Manual Muscle Tester (MM , Lafayette Instrument Co, PO Box 5729, 3700 Sagamore Pkwy N, Lafayette, IN 47903.
([dagger]) Labmaster DMA, Scientific Solutions Inc, 6225 Cochran Rd, Solon, OH 44139.
([double dagger]) Olivetti M240, Olivetti, 77 Via Jervis, 10015 Ivrea, Italy.
([section]) ASYST Software Technologies Inc, 100 Corporate Woods, Rochester, NY 14623.
([parallel]) Information on GENOVA/PC program available from JE Crick, National Board of Medical Exa 3930 Chestnut St, Philadelphia, PA 19104; information on GENOVA/MAIN frame program available from RL Brennan, American College Testing Program, PO Box 168, Iowa City, IA 52240.

ME Roebroeck is Research Scientist, Department of Rehabilitation Medicine, Free University Hospital, PO Box 7057, 1007 MB Amsterdam, the Netherlands. Address all correspondence to Ms Roebroeck.

J Harlaar is Head, Kinesiologic Laboratory, Department of Rehabilitation Medicine, Free University Hospital.

GJ Lankhorst, MD, PhD, is Professor, Department of Rehabilitation Medicine, Free University Hospital.

This study was approved by the Review Committee of Free University Hospital.

This study was supported by StiPT, Executive Agency for Technology Policy, Medical Technology Unit, the Netherlands.

This article was submitted November 9, 1992, and was accepted February 2, 1993.
COPYRIGHT 1993 American Physical Therapy Association, Inc.
No portion of this article can be reproduced without the express written permission from the copyright holder.
Copyright 1993, Gale Group. All rights reserved. Gale Group is a Thomson Corporation Company.

Article Details
Title Annotation:includes commentaries and author response
Author:Greenwood, Kenneth M.
Publication:Physical Therapy
Date:Jun 1, 1993
Words:11536