Health status measures: strategies and analytic methods for assessing change scores.

Measurements taken during initial assessments, followed by a series of measurements taken at subsequent points in time, are often used by therapists to demonstrate an effect of treatment. The interval between assessments may be several minutes, such as before and after a treatment session, or several weeks or months in the case of patients with more chronic conditions. Regardless of the interval between assessments, clinicians need to detect true and meaningful changes in patients. In recognition of this need, recent clinical research has placed a greater emphasis on a measure's capacity to detect change.[1-6] Unlike traditional reliability and validity study designs that have appeared in physical therapy literature and curricula, detailed methodology for assessing change is relatively new to the health sciences literature. Perhaps the single most important factor responsible for this direction is the shift in focus of outcome assessment from impairment measures to those of health status (eg, functional status, disability, health-related quality of life).

Functional status measures are classified as being either generic or disease-specific (sometimes referred to as "condition-specific") measures. Generic measures are designed to be applicable across a broad spectrum of diseases, conditions, and demographic and cultural subgroups.[7] A strength of generic measures is that they allow a comparison of health status among different disease and population groups. Such comparisons are of particular importance to health policy analysts.
Disease- or condition-specific measures are intended to assess disability and clinically important change in disability within a specific group (eg, persons with low back pain).[7]

Although valid standards exist for measures of impairment such as strength, fitness, and range of motion, their presence may be quixotic for attributes such as functional status and health-related quality of life. We believe that the greatest challenge facing clinical researchers investigating functional status measures is determining a measure's capacity to detect clinically meaningful change, because there is not a single agreed-upon criterion standard of change. The term "gold standard" is sometimes used to describe a measure that provides a representation as accurate as any previously provided. Accordingly, lesser standards such as those used to assess change in functional status over time should be thought of as being silver, bronze, tin, or lead in nature.

Researchers have generally used two approaches to determine the magnitude of changes in health status measures that is necessary to establish clinical meaningfulness.[8] Lydick and Epstein[8] suggest that interpretations of the clinical meaningfulness of health status change scores are either distribution-based or anchor-based. Distribution-based interpretations require the use of statistical distributions to make decisions about the clinical relevance of changes in health status. These interpretations are also called "normative-based decisions." Anchor-based interpretations require changes in health status measures to be compared with other clinical changes such as a patient global rating of health status.[9] These interpretations are also called "criterion-based decisions" because a standard or criterion is used for the judgment.
Anchor-based interpretations of change scores are generally preferred because the change score relates to a more clearly understood clinical phenomenon (such as a patient rating of extent of improvement) rather than to the statistical significance of the change score.[8]

Interpretation of the meaningfulness of change scores is just one of many issues that must be considered when reading the literature that examines the usefulness of health status measures. This article reviews some of the more common terminology, designs, and analyses used to evaluate a measure's capacity to assess meaningful change in the health status of patients. Representative articles are used to illustrate the various designs and analyses used to quantify a measure's capacity to detect meaningful change.

Terminology

The ultimate goal when assessing change is to distinguish among patients or groups of patients whose health status has improved, deteriorated, and remained stable. Unfortunately, the terminology associated with this goal is not uniform. For example, the terms "sensitivity to change" (not to be confused with the term "sensitivity" used within the context of diagnostic test methodology) and "responsiveness" are used frequently by those interested in assessing change.[10-12] In some instances, however, these terms have been used in association with study designs and analyses that provide a strong evaluation of valid change, whereas in other reports the terms refer to designs that assess change only. Meenan et al,[10] for example, used the term "sensitivity to change" to describe a rigorous design and analysis strategy aimed at assessing valid change.
Their analysis attempted to differentiate between patients whose health status was likely to change an important amount (ie, those treated with the previously proven intervention) and patients whose health status was not likely to change (ie, those who received the placebo). Liang et al,[11] however, used the term "sensitivity to change" in a study design and analysis that allowed only the assessment of change.

Kirshner and Guyatt[12] have created the term "evaluative measure" to describe a measure intended to assess change. Specifically, these authors suggest that an effective evaluative measure should possess three properties: (1) the measure should have low intrasubject variation in subjects whose health status is stable (reliability), (2) the change detected by the measure should be consistent with an external standard of change (validity), and (3) the measure should be able to detect clinically important change (responsiveness).

A further illustration of the terminology dilemma is provided in an article by Kopec et al, who used the term "responsiveness" as follows: "Two aspects of this property are 1) ability to detect true changes in a single group of patients and 2) ability to discriminate between groups differing in the amount or direction of change."[13](p343) It is interesting to note that Kopec and colleagues' source citation
is Kirshner and Guyatt's article.[12] What Kopec et al have described is not limited to responsiveness, but rather covers responsiveness and validity, two properties of an evaluative measure.

A different perspective has been offered by Hays and Hadorn[14] and Williams and Naylor.[15] These authors suggest that responsiveness is a component of validity rather than a distinct entity.[14,15] Their argument focuses on the notion that validity exists to the extent that a measure assesses what it purports to measure and therefore has some use for making judgments. Within the context of assessing change, a valid measure must be able to detect a clinically important change. Thus, responsiveness is viewed as an aspect of validity.

These examples illustrate the need to look beyond the terminology to the study design and analysis in order to distinguish the extent to which a measure is capable of assessing change. Within the framework of our article, we have adopted the view of Hays and Hadorn[14] and Williams and Naylor[15]: We believe that responsiveness is one component of validity. Subsequent sections in this article will suggest that there is a hierarchy of study designs available to assess change over time.

Study Designs

The methodological rigor of studies used to assess change varies. Designs that provide a strong construct for change are capable of distinguishing the direction and degree to which the health status of patients has changed. In this section, representative study designs and the constructs that form the foundation of these designs are reviewed. The constructs are divided into single-group studies, illustrated in designs 1a and 1b, and multiple-group studies, shown in designs 2a, 2b, and 2c. Illustrations of these designs are presented in Table 1.

Single-Group Designs

The most simplistic
approach is the before-after design shown in design 1a.[16] In brief, patients who are expected to undergo a change in health status are measured at two points in time. The interval between assessments and any interventions applied during this period serve as constructs for change. The extent to which the measured values differ between assessments represents the measure's ability to detect change. There are two major limitations associated with this design. First, if no change is detected between the measurements, it is unclear whether the measure was unable to detect a change or whether the construct was at fault and the patients did not undergo the anticipated change. The second limitation is that this design does not allow for the assessment of a measure's performance on patients whose health status is stable (eg, Does it reflect stability?).
Design 1b improves on design 1a in that it attempts to examine both stability and change. Measurements are taken at three or more points in time, and hypotheses are formed about the amount of change between measurements.[1] For example, when three points in time are used and the period between time 1 (T1) and time 2 (T2) is less than that between T2 and time 3 (T3), it could be hypothesized that the amount of change between T1 and T2 would be less than the amount of change between T2 and T3. A limitation of this design is that stability is often assessed over a shorter period than the change assessment (ie, T2-T3). Accordingly, this design may underestimate the magnitude of random variability that occurs over longer periods in patients whose health status is truly stable.

Multiple-Group Designs

These designs improve on designs 1a and 1b because they provide a more refined estimate of change and stability over the same time period. Common to the following designs is a situation in which the health status of patients is expected to change to varying degrees (eg, the health status of some patients is expected to undergo an important change, whereas the health status of other patients is not expected to change). To maximize statistical efficiency, it is important that the numbers of patients in the groups being contrasted are approximately equal and representative of the groups being compared. In each case, a measure's capacity to detect change is tested by determining its ability to differentiate between patients whose health status has changed to varying degrees. Each of the ensuing designs has unique strengths and limitations. For simplicity, only two groups are presented pictorially
in the multiple-group designs shown in Table 1.

[TABULAR DATA OMITTED]

Design 2a requires that a previously proven effective intervention be available.[10] Initially, measurements are taken on all patients; subsequently, patients are randomly assigned to receive either the previously proven intervention or a placebo intervention. The interventions are applied for an appropriate period, and follow-up measurements are taken at a common point in time. The construct associated with this design states that patients receiving the proven intervention will display more improvement than patients receiving the placebo intervention. Accordingly, the more adept a measure is at discriminating between the two groups, the greater its ability to assess change. Obvious limitations associated with this approach are the availability of a proven intervention and the ethical concerns associated with denying patients a proven intervention.

Design 2b, a variation of design 2a, makes use of the known natural or clinical histories of the condition of interest.[4,5] Central to this design is the availability of two or more cohorts of subjects whose health status, based on prior evidence, is expected to undergo different amounts of change. For example, consider the anticipated change over a 1-month period for patients who have acute versus chronic low back pain. Knowledge of how the health status of these patient groups is likely to change can be used to form the following construct: Patients with acute back pain will demonstrate greater improvement over the study period than will patients with chronic back pain.
Accordingly, the extent to which a measure can distinguish a difference in the change scores between these two groups is an indication of its ability to assess change.

Design 2c is similar to design 1a; however, in this example, an external standard of change is available.[3,5,17] Accordingly, after the follow-up measurement, the external standard is applied to classify the extent to which the patients' health status has changed. As with the previous two designs, the more adept a measure is at assessing change, the greater will be its ability to differentiate among the change scores between the groups established by the external standard. The principal limitation of this design is identifying an adequate criterion standard of change. Many studies using this approach have based the criterion standard on a combination of clinicians' and patients' global ratings of change. The validity of this technique may be compromised for two reasons. First, patients complete both the functional status measure and the global rating. Thus, the criterion measure is not independent of the functional status measure being assessed. Second, evidence suggests that patients have difficulty recalling their initial state, on which an estimate of change is based.[18]

Analyses

Analyses used to assess change over time are shown in Table 2. To aid in the clarification of these analyses, the results from two data sets are considered. The first data set, summarized in Table 3, represents actual Roland-Morris Questionnaire (RMQ)[19] and Jan van Breemen Function Questionnaire (JVBF)[20] scores obtained from 77 patients with low back pain (all patients with complete data) who participated in an outcome measure study.[3](*) The RMQ is a 24-item self-administered questionnaire.
Items are scored 1 point if checked by the patient and 0 points if left blank.[19] Thus, scores can vary from 0 (no disability) to 24 (significant disability). The JVBF consists of nine questions, each scored on an 11-point scale (0-10).[20] A score of 90 represents no disability, whereas a score of 0 represents significant disability. In brief, these questionnaires were administered at the time of the initial patient assessment and following 4 to 6 weeks of physical therapy. The intent of the study was to evaluate the properties of the measures and not the effectiveness of therapy. Accordingly, the natural history of low back pain, the interval between assessments, and the application of physical therapy served as constructs for improvement.

In addition to administering the two disability questionnaires at the 4- to 6-week follow-up, two 15-point (-7 to +7) global ratings of change and the importance of change were also completed independently by clinicians and patients (Appendix).[3] An estimate of true and meaningful change was obtained by averaging the clinicians' and patients' ratings. For illustrative purposes, the global ratings of change are coded four ways: (1) as a raw average of clinicians' and patients' ratings, (2) as the raw average collapsed into six levels of change, (3) as the raw average dichotomized into two levels of change using a cut point of 5 on the global rating scale, and (4) as the raw average dichotomized into two levels of change using a cut point of 3 on the global rating scale. The data in Table 4 are hypothetical.
[TABULAR DATA OMITTED]

Observations Concerning the Change Coefficients

A number of change coefficients are defined in Table 2, and the values for these coefficients for the two data sets in Tables 3 and 4 are presented in Table 5. Several points are worth highlighting. First, the effect size (ES),[21] standardized response mean (SRM),[11] and paired t value evaluate change only. They cannot be used to evaluate a measure's ability to distinguish among varying degrees of change. When applied to the same sample, the SRM and paired t value provide the same information. The magnitude of the t value is influenced by the number of subjects taking part in the study; however, the SRM is not influenced by sample size. For this reason, we believe that the SRM is the preferred of the two measures. A comparison of ES and SRM, shown in Table 5, indicates a difference in the rankings of the measures. The ES indicates that the JVBF is the more responsive measure, whereas the SRM suggests the opposite is true. Although these two indices often yield the same ranking of measures, this example demonstrates that this is not always the case.

For the data reported in this article, the global rating represented the criterion standard of change. Occasionally, when the criterion measure is assessed on ordinal or interval scales, researchers elect to dichotomize
the results to represent those patients whose health status has changed an important amount and those whose health status has not changed. The reason for this action is that clinical decisions are often dichotomous in nature (eg, the patient's health status has changed an important amount; therefore, I will continue to treat as planned, or the patient's health status has not changed an important amount and I will change the treatment approach). In order to dichotomize the data, a cut point (ie, a selected point on the criterion scale that separates patients whose health status has changed an important amount from those whose health status has not changed) must be chosen. The choice of cut point is often arbitrary, and the results for Guyatt's Responsiveness Index (GRI)[22] (design 2c) indicate the potential impact this choice can have on the results. The results presented in Table 5 show that for a global rating cut point of 3, the JVBF appears to be the better measure (JVBF GRI = 2.97 versus RMQ GRI = 2.03); however, when a cut point of 5 is applied, the RMQ appears to be the superior choice (JVBF GRI = 1.49 versus RMQ GRI = 2.31). This conundrum can be avoided by selecting an analysis, such as correlation, that preserves a higher level of measurement (ie, ordinal or interval) on the criterion standard. In the event that the data are required to be dichotomized, we would argue that a cut point that produces findings similar to those of the correlation analysis is favored.
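The behavior of these coefficients can be illustrated with a minimal sketch. Because Table 2 is not reproduced here, the formulas below are assumptions based on the usual definitions: ES as mean change divided by the standard deviation of baseline scores, SRM as mean change divided by the standard deviation of change scores, and GRI as mean change in patients classified as changed divided by the standard deviation of change in patients classified as stable. The data, the function names, and the convention that a global rating at or above the cut point counts as "changed" are all hypothetical.

```python
import math
import statistics

def effect_size(baseline, change):
    """ES (assumed definition): mean change / SD of baseline scores."""
    return statistics.mean(change) / statistics.stdev(baseline)

def standardized_response_mean(change):
    """SRM (assumed definition): mean change / SD of change scores."""
    return statistics.mean(change) / statistics.stdev(change)

def paired_t(change):
    """Paired t value: mean change / standard error of the mean change."""
    n = len(change)
    return statistics.mean(change) / (statistics.stdev(change) / math.sqrt(n))

def guyatt_index(change, global_rating, cut_point):
    """GRI (assumed definition): mean change in the 'changed' group divided by
    the SD of change in the 'stable' group, with groups formed by dichotomizing
    the global rating at cut_point (>= cut_point counts as changed)."""
    changed = [c for c, g in zip(change, global_rating) if g >= cut_point]
    stable = [c for c, g in zip(change, global_rating) if g < cut_point]
    return statistics.mean(changed) / statistics.stdev(stable)

# Hypothetical scores on a scale where lower values mean less disability,
# so improvement is baseline minus follow-up.
baseline = [14, 10, 18, 9, 12, 16, 8, 11, 15, 13]
follow_up = [6, 8, 9, 8, 5, 7, 9, 9, 6, 11]
change = [b - f for b, f in zip(baseline, follow_up)]
global_rating = [6, 3, 7, 1, 5, 6, 2, 4, 6, 0]  # hypothetical averaged ratings

es = effect_size(baseline, change)
srm = standardized_response_mean(change)
t_value = paired_t(change)
gri_cut3 = guyatt_index(change, global_rating, 3)
gri_cut5 = guyatt_index(change, global_rating, 5)

print(f"ES = {es:.2f}, SRM = {srm:.2f}, paired t = {t_value:.2f}")
print(f"GRI at cut point 3 = {gri_cut3:.2f}, at cut point 5 = {gri_cut5:.2f}")
```

Under these definitions the paired t value equals the SRM multiplied by the square root of the sample size, which is why the two carry the same information while only t grows with the number of subjects. The two GRI values differ for the same change scores, showing how the choice of cut point alone can move the index.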
The magnitude of the GRI is also influenced by other factors. For example, it is common for the within-patient variability on a measure in patients whose health status is stable (represented by the standard deviation of the difference between T1 and T2 scores) to increase as the duration between assessments increases. The numerator of the GRI, like the indexes described, also is dependent on the construct for change. Thus, an estimate of change based on a design in which all patients are believed to undergo an important change in health status is likely to be greater than that obtained for design 2a, where many but not all patients are expected to undergo an important change in health status. Two potential limitations of the GRI are (1) it does not take into account systematic change that may occur in patients whose health status is stable, and (2) it does not consider variability in the change group.[23]

The t test for independent sample means, analysis of variance (ANOVA) of change scores, Norman's S_repeat, Norman's S_ancova,[24] and receiver operating characteristic (ROC) curve[25] analyses represent competing analytic strategies for evaluating stronger constructs of change. Like the GRI, these statistics are all influenced by the choice of study design, the construct for change, and the global rating cut point, when applicable.
Moreover, the probability values associated with the t test and ANOVA of change scores are influenced by sample size. When two groups are being compared, the ANOVA of change scores yields the same results as the t test for independent sample means; however, the former test accommodates the comparison of more than two groups (eg, deterioration, no change, improvement). Norman coefficients are intraclass correlations, and they can be interpreted as the proportion of variance due to true change. Both Norman coefficients can accommodate more than two groups. Norman's S_repeat also is capable of handling more than two points in time. Norman's S_ancova is appropriate only when there is no group-by-covariate interaction (ie, usually design 2a only). The ROC curve analysis has two advantages: (1) a simple procedure exists for statistically comparing two competing measures (eg, RMQ and JVBF), and (2) the most efficient cut point for making decisions using the measure (eg, RMQ) to differentiate patients whose health status has changed an important amount from those whose health status has not changed can be identified by choosing the measure's score that produced the ROC curve data point closest to the upper left-hand corner of the curve.

[TABULAR DATA OMITTED]

In summary, the magnitude of any given coefficient
is dependent on the study design and the patient sample. For example, the magnitude of the SRM would be greater in a sample where all patients are expected to undergo an important change compared with a sample where many but not all patients are expected to undergo an important change. The magnitude of the GRI also is likely to differ for designs 1b and 2c, principally due to an increase in the magnitude of random within-patient variability (the denominator of the responsiveness index) over the extended interval in design 2c. Moreover, indices of change generated from designs 2b and 2c are likely to be greater than those produced from design 2a. The reason for this presumption is that the responses of the two groups being contrasted in designs 2b and 2c are likely to differ to a greater extent than those being compared in design 2a.

Comparing Measures

When the goal of a study is to identify which of several competing measures (eg, RMQ and JVBF) is more effective at assessing change, one challenge facing researchers is to determine the extent to which any observed difference between the measures being compared is likely to be due to chance. Accordingly, hypotheses need to be formed and tested. For example, given that a study is designed to identify whether a difference exists in the ability of the RMQ and JVBF to assess change, the following hypotheses could be formed:

Null hypothesis: There will be no difference in the ability of the RMQ and JVBF to assess change.
Alternate hypothesis: There will be a difference in the ability of the RMQ and JVBF to assess change.

Moreover, researchers are required to define what constitutes clinically and statistically significant differences. Like beauty, clinical significance is often unique to the beholder; however, most researchers would agree that patients, practitioners, economics, and feasibility all play defining roles. Statistical significance, in contrast, is not particular to each measurement; it describes the extent to which an observed difference between measures is likely to be due to chance. Although detailing the step-by-step analytic methods used to test for statistical significance is beyond the scope of this article, the remainder of this section references three statistical procedures that test for differences among measures.

Jackknife procedure applied to the SRM.(+) This procedure allows a person to estimate the population SRMs for the measures of interest. In brief, pseudovalues are obtained by systematically dropping one subject's data from the analysis, computing the SRM, replacing the subject's data, dropping the next subject's data from the analysis, and repeating the process until estimates have been obtained with every subject's data having been dropped once.[11] The pseudovalues are used to estimate the population SRM mean and variance. Using these quantities, a t test can be used to compare measures.[11]

ROC curve comparisons. Hanley and McNeil[26,27] have reported that z values can be calculated to compare the differences in ROC curve areas.
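The jackknife procedure for the SRM described above can be sketched as follows. This is an illustration under stated assumptions rather than the cited authors' exact method: pseudovalues use the standard jackknife form (n times the full-sample SRM minus n - 1 times the leave-one-out SRM), and the comparison is a paired t test on pseudovalue differences, which presumes both measures were collected on the same patients in the same order with improvement coded as positive change. The data and function names are hypothetical.

```python
import math
import statistics

def srm(change):
    """Standardized response mean: mean change / SD of change scores."""
    return statistics.mean(change) / statistics.stdev(change)

def jackknife_pseudovalues(change):
    """Drop each subject in turn, recompute the SRM, and convert each
    leave-one-out estimate to a pseudovalue (standard jackknife form)."""
    n = len(change)
    full = srm(change)
    pseudo = []
    for i in range(n):
        leave_one_out = change[:i] + change[i + 1:]
        pseudo.append(n * full - (n - 1) * srm(leave_one_out))
    return pseudo

def jackknife_estimate(change):
    """Jackknife estimate of the SRM and its standard error."""
    pseudo = jackknife_pseudovalues(change)
    n = len(pseudo)
    return statistics.mean(pseudo), statistics.stdev(pseudo) / math.sqrt(n)

def compare_srms(change_a, change_b):
    """Paired t statistic on pseudovalue differences (assumes the same
    patients in the same order); returns (t, degrees of freedom)."""
    diffs = [a - b for a, b in zip(jackknife_pseudovalues(change_a),
                                   jackknife_pseudovalues(change_b))]
    n = len(diffs)
    t = statistics.mean(diffs) / (statistics.stdev(diffs) / math.sqrt(n))
    return t, n - 1

# Hypothetical change scores for two measures on the same 10 patients.
change_rmq = [8, 2, 9, 1, 7, 9, -1, 2, 9, 2]
change_jvbf = [20, 5, 30, -4, 12, 25, 3, 10, 18, 6]

est_rmq, se_rmq = jackknife_estimate(change_rmq)
est_jvbf, se_jvbf = jackknife_estimate(change_jvbf)
t_stat, df = compare_srms(change_rmq, change_jvbf)

print(f"RMQ SRM (jackknife) = {est_rmq:.2f} (SE {se_rmq:.2f})")
print(f"JVBF SRM (jackknife) = {est_jvbf:.2f} (SE {se_jvbf:.2f})")
print(f"t = {t_stat:.2f} on {df} degrees of freedom")
```

The resulting t statistic would be referred to a t distribution with n - 1 degrees of freedom; with samples as small as this hypothetical one, the jackknife variance estimate is itself unstable, so the sketch is meant only to show the mechanics.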
In brief, the numerator of the z expression is the difference in areas under the ROC curves for the measures of interest, and the denominator is composed of the standard error of the difference in areas. The expression for the standard error varies depending on whether the data for the two measures were collected on the same cases (ie, dependent) or on different cases (ie, independent).

Comparison of correlation coefficients. This approach simply compares the magnitudes of the correlation coefficients. When the data for the two measures are independent, Fisher's z transformation and test can be applied.[28] When the data for the measures are dependent, however, the test described by Williams[29,30] is recommended.

Observations Related to Head-to-Head Comparisons

A summary of comparisons between the RMQ and JVBF using the methods outlined is presented in Table 6. A review of this table indicates that the results are dependent on the design, summary statistic, and choice of cut point. For these data, the result of the ROC curve analysis using a global rating cut point of 5 is consistent with the correlation result. Finally, neither the ROC curve analysis using a cut point of 3 nor the test of SRMs shows any statistical difference between measures. Again, these results demonstrate that the choice of design, analysis, and cut point, where applicable, all have a substantial impact on the results.

[TABULAR DATA OMITTED]

Research Designs Used to Assess Change: Examples From the Literature

The intent of this section is to briefly review one example of each study design.
This review is not intended to be an exhaustive summary of all the literature in which sensitivity to change has been addressed; rather, articles have been chosen to demonstrate a design and analysis feature. The majority of examples come from measures used to assess change of health status in patients with low back pain, principally because there is a greater volume of work to draw from in this area.

Fairbank and colleagues[31] examined the ability of the Oswestry Questionnaire to detect change in health status over time in patients with low back pain. The sample of 25 patients was selected based on the authors' contention that the combination of treatment and the passage of time would result in improvement in the patients' health status. The questionnaire was administered at the initial visit and was readministered following 3 weeks of treatment (Tab. 7). This approach represents a type 1a design. Change was evaluated by performing a paired t test to determine whether pretreatment scores differed significantly from posttreatment scores. Posttreatment scores were found to be lower than pretreatment scores, and the authors concluded that the questionnaire could be used to detect change in health status in patients with low back pain. Given that the construct for change was that the patients' health status would improve, the extent to which the scale is capable of demonstrating the absence of change in patients whose health status is stable cannot be determined from this study.

Stratford et al,[1] using a 1b design, examined the sensitivity to change of a function visual analog scale and pain-free function questionnaire on 32 patients with lateral epicondylitis.
Both measures were applied at the initial visit (T1), days later (T2), 6 weeks later or sooner if clinical change had occurred (T3), and 4 days following T3 (T4) (Tab. 7). In keeping with a 1b design, it was postulated that a greater change would occur between T2 and T3 than between T1 and T2 or between T3 and T4. A repeated-measures ANOVA was performed to examine the magnitude of the variance components due to replicate measures and that of change. The analysis indicated that the variance due to replicate scores was much lower than the variance due to change. This finding supports the validity of both measures to assess change over time.

A type 2a design, in which patients were randomly assigned to receive a therapy of known effectiveness or a placebo intervention, is illustrated in a study by Meenan et al[10] (Tab. 7) that evaluated the sensitivity to change of the Arthritis Impact Measurement Scale (AIMS). Subjects with rheumatoid arthritis were randomly assigned to three treatment groups: effective (injectable
gold, n = 54), alternate treatment (oral gold, n = 64), and placebo (n = 43). The AIMS, joint tenderness, joint swelling, and grip strength were assessed at T1 and T2. Change scores were calculated for all measures, and ANOVAs of change scores and correlations of change scores were then computed. The results of this study showed that AIMS scores improved more for the injectable gold group than for the placebo group. Based on these results, the authors concluded that the AIMS is sensitive to change in this patient group.

[TABULAR DATA OMITTED]

Follick and associates,[32] using a 2b design, examined the capacity of the Sickness Impact Profile (SIP) to detect change in patients with chronic low back pain. The construct for change was that patients receiving treatment would improve more than patients on the waiting list not receiving treatment. The interval between initial and follow-up SIP administrations was 6 months.
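A construct of this kind, in which one group is expected to change more than another, comes down to a between-group comparison of change scores. The sketch below shows one way this might be done with a pooled-variance t statistic. It is an illustration only: the scores are invented, the function names are hypothetical, and it assumes that higher scores mean poorer health, so that baseline minus follow-up is positive when a patient improves.

```python
import math
from statistics import mean, variance


def change_scores(baseline, follow_up):
    """Per-patient change: baseline minus follow-up.

    Assumes higher scores mean poorer health, so a positive
    change score indicates improvement.
    """
    return [b - f for b, f in zip(baseline, follow_up)]


def pooled_t(changes_a, changes_b):
    """Two-sample t statistic (equal variances assumed) and its df."""
    n_a, n_b = len(changes_a), len(changes_b)
    pooled_var = (
        (n_a - 1) * variance(changes_a) + (n_b - 1) * variance(changes_b)
    ) / (n_a + n_b - 2)
    se = math.sqrt(pooled_var * (1.0 / n_a + 1.0 / n_b))
    return (mean(changes_a) - mean(changes_b)) / se, n_a + n_b - 2


# Invented scores for three treated and three waiting-list patients.
treated = change_scores([20, 22, 24], [16, 16, 16])
waiting = change_scores([20, 21, 22], [19, 19, 19])
t_stat, df = pooled_t(treated, waiting)
```

The t statistic would then be referred to a t distribution with the returned degrees of freedom. With more than two groups, as in the three-arm trial described earlier, a one-way ANOVA of the change scores plays the same role.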
Analysis of variance and t tests were used to determine whether the changes in SIP scores for the treatment group were significantly different from the changes in SIP scores for the control group. In accordance with the initial construct, the treatment group's change score was shown to be greater than that of the control group, suggesting that the SIP is able to detect change in the health status of patients with chronic low back pain.

Examples of the 2c design and correlation and ROC curve analyses are provided by Deyo and colleagues.[17,25,33] Deyo and Diehl[17,33] investigated the SIP's ability to assess change in health status in patients with mechanical low back pain. The SIP change scores were correlated with patient and clinician assessments of global change in patient status. The construct for change was based on the notion that if both clinician and patient believed that the patient's health status changed, then a clinically meaningful change in health status had occurred. Agreement on independent assessments by the clinician and patient served as the criterion standard for meaningful change in health status. A six-point global rating-of-change scale was collapsed to a three-point scale (better, worse, same). Patients judged to be "the same" and "better" on the global scale demonstrated similar improvements in SIP scores, whereas patients judged to be worse demonstrated SIP scores that were consistent with deterioration.
The authors commented that these results may reflect that patients and clinicians were rating lack of improvement on some basis other than function or that the SIP is measuring functional change better than the patient or clinician. The low numbers in the "same" (n=6) and "worse" (n=2) categories make it difficult to identify whether these results were due to a problem in sensitivity of the global rating scale in defining clinically important change over time. These results highlight the need for careful selection of a construct for change when attempting to measure the capability of a measure to assess stable or deteriorating conditions. Moreover, maximum statistical efficiency is achieved when the numbers of patients in the groups being compared are equal.

In a subsequent study,[25] Deyo and Centor assessed change in the SIP and RMQ on patients with acute low back pain using ROC curves. Patients completed the SIP and the RMQ before and after a 3-week period during which patients received varying amounts of bed rest. The construct for change in health status was that the majority of patients with acute back pain show improvement over a 3-week period, and ROC curve analysis was used to evaluate the results. As in the study described previously,[17,33] Deyo and Centor used a clinician/patient consensus of change as the criterion measure. The areas under the curves for the SIP total score, the SIP physical dimension, and the RMQ were all greater than chance; however, they did not differ from each other. The statistical comparisons of the areas under the ROC curves were based on the works of Hanley and McNeil.[26,27]

Summary

The intent of this article has been to provide a summary of various design and analytic methods for assessing and quantifying change over time.
We suggest that a hierarchy of designs and analytic methods exists for assessing change in health status. Finally, when designing a study to evaluate change in health status, we believe that the following points should be considered: (1) select a strong study design such as design 2a or 2b; (2) choose an analytic method that is adept at formally testing the validity of change (eg, correlation coefficient, ROC curves, Norman's statistic); (3) when subgroups of patients are being compared, such as patients whose health status has changed an important amount and patients whose health status has not changed, try to create subgroups of approximately equal sample size; (4) select an analysis that takes into account that the data are not independent when head-to-head comparisons of health status measures are performed on the same subjects; and (5) estimate the required sample size before initiating the study.

(+) In a previous report,[16] Liang and colleagues compared competing measures by creating a ratio $(t_{\text{measure 1}}/t_{\text{measure 2}})^2$, which was labeled "relative efficiency." We have not included this index in this section, as no formal hypothesis testing was described.

References

[1] Stratford P, Levy D, Gauldie S, et al. Extensor carpi radialis tendonitis: a validation of selected outcome measures. Physiotherapy Canada. 1987;39:250-255.
[2] Shields RK, Enloe LJ, Evans RE, et al. Reliability, validity, and responsiveness of functional tests in patients with total joint replacement. Phys Ther. 1995;75:169-179.
[3] Stratford PW, Binkley J, Solomon P, et al. Assessing change over time in patients with low back pain. Phys Ther. 1994;74:528-533.
[4] Di Fabio RP, MacKey G, Holte JB. Disability and functional status in patients with low back pain receiving workers' compensation: a descriptive study with implications for the efficacy of physical therapy. Phys Ther. 1995;75:180-193.
[5] Boyce WF, Gowland C, Rosenbaum PL, et al. The Gross Motor Performance Measure: validity and responsiveness of a measure of quality of movement. Phys Ther. 1995;75:603-615.
[6] Stratford PW, Gill C, Westaway M, Binkley J. Assessing disability and change on individual patients: a report of a patient specific measure. Physiotherapy Canada. 1995;47:258-263.
[7] Patrick DL, Deyo RA. Generic and disease-specific measures in assessing health status and quality of life. Med Care. 1989;27:S217-S232.
[8] Lydick E, Epstein RS. Interpretation of quality of life changes. Qual Life Res. 1993;2:221-226.
[9] Jaeschke R, Singer J, Guyatt GH. Measurement of health status: ascertaining the minimal clinically important difference. Control Clin Trials. 1989;10:407-415.
[10] Meenan RF, Anderson JJ, Kazis LE, et al. Outcome assessment in clinical trials: evidence for the sensitivity of a health status measure. Arthritis Rheum. 1984;27:1344-1352.
[11] Liang MH, Fossel AH, Larson MG. Comparison of five health status instruments for orthopedic evaluation. Med Care. 1990;28:632-642.
[12] Kirshner B, Guyatt G. A methodological framework for assessing health indices. J Chronic Dis. 1985;38:27-36.
[13] Kopec JA, Esdaile JM, Abrahamowicz M, et al. The Quebec back pain disability scale: measurement properties. Spine. 1995;20:341-352.
[14] Hays RD, Hadorn D. Responsiveness to change: an aspect of validity, not a separate dimension. Qual Life Res. 1992;1:73-75.
[15] Williams JI, Naylor CD. How should health status measures be assessed? Cautionary notes on procrustean frameworks. J Clin Epidemiol. 1992;45:1347-1351.
[16] Liang MH, Cullen KE, Schwartz JA. Comparative measurement efficiency and sensitivity of five health status instruments for arthritis research. Arthritis Rheum. 1985;28:542-547.
[17] Deyo RA. Comparative validity of the Sickness Impact Profile and shorter scales for functional assessment in low-back pain. Spine. 1986;11:951-954.
[18] Streiner DL, Norman GR. Health Measurement Scales: A Practical Guide to Their Development and Use. 2nd ed. New York, NY: Oxford University Press Inc; 1995:164.
[19] Roland M, Morris R. A study of the natural history of back pain, part I: development of a reliable and sensitive measure of disability in low-back pain. Spine. 1983;8:141-144.
[20] Lankhorst GJ, van de Stadt RJ, Vogelaar TW, et al. Objectivity and repeatability of measurements in low back pain. Scand J Rehabil Med. 1982;14:21-26.
[21] Kazis LE, Anderson JJ, Meenan RF. Effect size for interpreting changes in health status. Med Care. 1989;27:S178-S189.
[22] Guyatt G, Walker S, Norman G. Measuring change over time: assessing the usefulness of evaluative instruments. J Chronic Dis. 1987;40:171-178.
[23] Tuley MR, Mulrow CD, McMahan CA. Estimating and testing an index of responsiveness and the relationship of the index to power. J Clin Epidemiol. 1991;44:417-421.
[24] Norman GR. Issues in the use of change scores in randomized trials. J Clin Epidemiol. 1989;42:1097-1105.
[25] Deyo RA, Centor RM. Assessing the responsiveness of functional scales to clinical change: an analogy to diagnostic test performance. J Chronic Dis. 1986;39:897-906.
[26] Hanley JA, McNeil BJ. The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology. 1982;143:29-36.
[27] Hanley JA, McNeil BJ. A method of comparing the areas under receiver operating characteristic curves derived from the same cases. Radiology. 1983;148:839-843.
[28] Kleinbaum DG, Kupper LL, Muller KE. Applied Regression Analysis and Other Multivariable Methods. 2nd ed. Boston, Mass: PWS-Kent Publishing Co; 1988:91.
[29] Williams EJ. The comparison of regression variables. Journal of the Royal Statistical Society, Series B. 1959;21:396-399.
[30] Steiger JH. Tests for comparing elements of a correlation matrix. Psychol Bull. 1980;87:245-251.
[31] Fairbank JCT, Davies JB, Couper J, O'Brien JP. The Oswestry low back pain disability questionnaire. Physiotherapy. 1980;66:271-273.
[32] Deyo RA, Diehl AK. Measuring physical and psychological function in patients with low back pain. Spine. 1983;8:635-642.

PW Stratford, PT, is Assistant Professor, School of Occupational Therapy and Physiotherapy, and Associate Member, Department of Clinical Epidemiology and Biostatistics, McMaster University, Hamilton, Ontario, Canada. Address all correspondence to Mr Stratford at Faculty of Health Sciences, School of Rehabilitation Science, McMaster University, OT/PT Building T-16, 1280 Main St W, Hamilton, Ontario, Canada L8S 4K1 (stratfor@mcmaster.ca).

JM Binkley, PT, COMP, is Director of Clinical Research, Rehab Management Systems Inc, Dahlonega, GA 30597, and Assistant Clinical Professor, McMaster University.

DL Riddle, PT, is Associate Professor, Department of Physical Therapy, Medical College of Virginia, Virginia Commonwealth University, Richmond, VA, 23298-0224.