Reliability of the PEDro scale for rating quality of randomized controlled trials. (Research Report).

Systematic reviews of randomized controlled trials (RCTs) are considered by some authors (1-3) to constitute the best single source of information about the effectiveness of health care interventions. Most systematic reviews involve assessment of the quality of the RCTs being reviewed because there is evidence that low-quality studies provide biased estimates of treatment effectiveness. For example, RCTs that are not blinded (4,5) or do not use concealed allocation (4-6) tend to show greater effects of intervention than RCTs with these features. Systematic reviews may exclude low-quality studies from the analysis (eg, systematic review by Herbert and Gabriel (7)), or they may weight the findings of low-quality studies less heavily in the analysis (eg, systematic reviews by van Tulder et al, (8) van Poppel et al, (9) and Berghmans et al (10)). Consequently, the method of quality assessment can affect the conclusions of reviews that use quantitative methods (eg, meta-analysis) (11) or qualitative methods (eg, the levels of evidence approach) (12) to summarize the results. As an illustration, Colle and colleagues (12) have shown in a re-analysis of the RCTs included in the Cochrane review of exercise for low back pain (13) that the conclusions of the review changed substantially when different scales were used to rate the RCTs.
The original conclusion was that there was "conflicting evidence" on the effectiveness of exercise therapy versus inactive treatment, but the conclusion changed to "strong evidence" that exercise is more effective than inactive treatment when the Beckerman scale, rather than the original scale, was used to rate RCT quality. (12)

The methodological quality of an RCT also may be of interest outside the context of a systematic review. Researchers planning or reporting an RCT, journal reviewers considering a manuscript reporting an RCT, or clinicians judging whether RCTs have relevance to their practice all may need to consider the issue of methodological quality.

Although there are numerous definitions of RCT methodological quality, we prefer the following definition: "the likelihood of the trial design to generate unbiased results that are sufficiently precise and allow replication in clinical practice." (3(p651)) Verhagen and colleagues (3) described 2 approaches to quality assessment of RCTs. The first approach focuses on the presence or absence of key methodological components, such as randomization [...] was developed by the CONSORT group in order to improve the quality of reports of RCTs. (14)

One issue that has received little consideration is the reliability of the assessments of RCT quality.
In a 1998 review, 21 scales of trial quality were described, and only 12 scales had any evidence about reliability. In a review published in 2001, (3) the authors studied about 60 scales and reported that the reliability of the resultant assessments for most of the scales was unknown. The reliability of the assessments obtained with what we believe is the most widely used scale (the Jadad scale used, for example, by the Philadelphia Panel (15)) is in dispute. (16) Two groups (16,17) reported low reliability of assessments obtained with the Jadad scale, whereas another 2 groups (18,19) reported high reliability. The study by Jadad et al (19) probably provides an overly optimistic view of reliability because the authors excluded one assessor's results on the grounds that "he recorded the scores incorrectly and it was impossible to determine to which report each score referred." (19(p5))

For researchers considering which scale they should use in a systematic review, there is an additional problem. In most systematic reviews, it is not the rating of an individual that is used but rather a consensus rating from 2 or more assessors. Therefore, the reliability estimate that is most important for people conducting systematic reviews is the reliability of consensus ratings, not the reliability of an individual rating. We are unaware of any study that has evaluated the reliability of consensus ratings.

In this article, we report on 2 studies that evaluated the reliability of data obtained with the PEDro scale.
The scale is called the PEDro scale because it was initially developed to rate quality of RCTs on PEDro, the Physiotherapy Evidence Database (www.pedro.fhs.usyd.edu.au). The PEDro scale is an 11-item scale designed for rating methodological quality of RCTs (the scale is presented in the Figure, and operational definitions for each scale item are given in the Appendix). Each satisfied item (except for item 1, which, unlike other scale items, pertains to external validity) contributes one point to the total PEDro score (range=0-10 points). The scale has been used to rate the quality of over 3,000 RCTs in the PEDro database (20) and in several systematic reviews. (7,21,22) The scale is based on the list developed by Verhagen et al (23) using the Delphi consensus technique. There is evidence for discriminative validity for 3 of the scale items: randomization, (24) concealed allocation, (4,6,24) and blinding. (4) The other items are reported to have face validity (23) but are yet to be validated by other means.

As the developers of the PEDro database, we faced a dilemma when planning the database: what scale should we use to rate the RCTs to be archived in PEDro? In the end, we chose what we believe is a conservative path and based the PEDro list on the 2 scales that had been developed by formal scale development techniques. (3) Because the items in the 3-item Jadad scale and the 9-item Delphi list are all contained in the 11-item PEDro scale, it is possible to generate "Jadad," "Delphi," and "PEDro" scores from the PEDro database.
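The scoring rule described above (one point for each satisfied item, with item 1 excluded from the total) can be illustrated with a minimal sketch. The function name and the dictionary layout are hypothetical and are not part of the PEDro database; only the rule itself comes from the text.

```python
from typing import Dict

# Items 2-11 contribute to the total score; item 1 (eligibility criteria)
# is rated but not counted because it pertains to external validity.
SCORED_ITEMS = range(2, 12)


def pedro_total(ratings: Dict[int, bool]) -> int:
    """Return the total PEDro score (0-10) from item ratings.

    `ratings` maps item number (1-11) to True when the item is satisfied.
    This layout is an illustrative assumption.
    """
    missing = [item for item in range(1, 12) if item not in ratings]
    if missing:
        raise ValueError(f"missing ratings for items: {missing}")
    return sum(1 for item in SCORED_ITEMS if ratings[item])


# Example: a trial satisfying items 1, 2, 4, 8, 10, and 11.
example = {item: item in {1, 2, 4, 8, 10, 11} for item in range(1, 12)}
total = pedro_total(example)  # item 1 is ignored in the sum
```

Keeping the per-item ratings rather than only the total is what allows other scores to be derived from the same record.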
If we had chosen to use only the 3-item Jadad scale, we would not have had this versatility. In addition, for each RCT in the PEDro database, we record and display which of the 11 PEDro items are satisfied, an approach that accommodates those who view quality as the presence or absence of components such as randomization.

Because we regularly use the PEDro scale to rate RCTs in the database and to rate RCTs for the systematic reviews we conduct, we were interested in the reliability of assessments obtained with the PEDro scale. Additionally, the reliability of assessments obtained with the PEDro scale is likely to be of interest to the large number of people who have used the PEDro database. In this article, we report on 2 studies that investigated the interrater reliability of ratings of each of the 11 items on the PEDro scale and the total (summed) PEDro score. Interrater reliability was evaluated for individual ratings and consensus ratings.

Method

We conducted 2 studies. In both studies, the reports of RCTs were rated for methodological quality.

In study 1, we randomly selected 25 RCTs (using the random number function in Microsoft Excel (*)) from the English-language RCTs in the PEDro database, and we created a new set of ratings for the reliability analysis. One of the selected RCTs was published in the 1970s, 11 RCTs were published in the 1980s, and 13 RCTs were published in the 1990s. Nine RCTs were coded as relevant to the musculoskeletal subdiscipline, 4 as relevant to neurology, 2 as relevant to the cardiothoracic subdiscipline, 2 as relevant to continence and women's health, 2 as relevant to gerontology, 2 as relevant to orthopedics, and 2 as relevant to sports, and no appropriate category was assigned to 2 RCTs (see Moseley et al (25) for definitions). Each RCT was independently rated by 11 raters who were aware that they were participating in a reliability study.

In both studies, raters were volunteer physical therapists who had been trained in the use of the scale. For training, raters rated a series of 5 practice RCTs and were given feedback on their performance using criterion ratings that we generated, as well as justification of the rating for each item. Instructions for obtaining the training package are on the PEDro Web site (www.pedro.fhs.usyd.edu.au). Additional feedback was obtained via e-mail correspondence with the third author (RDH). Raters then had to pass a rating accuracy test using a separate set of RCTs. Each rater independently rated 5 test RCTs using the 11-item PEDro scale (ie, a total of 55 ratings), and these ratings were submitted via e-mail and compared with criterion ratings that we generated. Those who scored 51/55 or more correct ratings (ie, <10% errors) were considered able to rate RCTs, whereas those who scored from 46/55 to 50/55 received further feedback before they were considered able to rate trials (raters scoring 45/55 or less were excluded from further rating unless they passed a subsequent accuracy test using another set of 5 RCTs). The cutoff of <10% errors as evidence of ability to rate RCTs is the consensus opinion of the PEDro developers and was not empirically derived.

In study 2, we examined the reliability of ratings on a larger sample of RCTs, and we examined the reliability of ratings made by a panel of 2 or 3 raters (ie, consensus ratings). For this study, 120 English-language RCTs with consensus ratings were randomly selected from the PEDro database (using the random number function in Microsoft Excel). One of the retrieved RCTs was published in the 1960s, 5 were published in the 1970s, 29 were published in the 1980s, 73 were published in the 1990s, and 12 were published in the 2000s. Forty-eight RCTs were coded as relevant to the musculoskeletal subdiscipline, 22 as relevant to cardiothoracics, 15 as relevant to gerontology, 9 as relevant to orthopedics, 8 as relevant to neurology, 6 as relevant to continence and women's health, 4 as relevant to pediatrics, 4 as not being relevant to a specific subdiscipline, 3 as relevant to ergonomics, and 1 as relevant to sports.
Each RCT had previously been independently rated by 2 raters, and where the ratings for any scale item in any RCT disagreed, a third (consensus) rater arbitrated. These existing ratings, together with a new set of independent ratings created for the study, were used in the reliability analysis. Consensus ratings were performed by 4 of the authors (CGM, CS, RDH, and AMM) and 2 research assistants who developed the PEDro scale and maintain the PEDro database. Raters were asked to specify where in the report of an RCT each criterion was described as being fulfilled, and these sheets were made available to the consensus raters.

Many of the disagreements in the use of the scale seem to arise when one rater misses the description in the text of inclusion of the methodological feature. This can arise if the report of an RCT is poorly organized (eg, describing use of an intention-to-treat analysis in the discussion section) or the article is old and attends to a methodological feature (eg, concealed allocation) but does not use the specific term because it was not yet in common use. As an illustration, Doull and colleagues' 1931 RCT (26) described a process that would achieve concealed allocation, but the term "concealed allocation" would not be coined for many decades. The final rating (that agreed on by the first 2 raters or assigned by the third rater) will be referred to as the "consensus rating."
The 120 RCTs were assessed by 25 raters who each rated from 1 to 56 RCTs (X̄=13.8). A third rater was required for at least one scale item in all except 24 RCTs. All of these ratings were conducted as part of the normal process of maintaining the PEDro database. The raters were not aware that the reliability of ratings would be evaluated.

Subsequently, the 120 RCTs in study 2 were re-rated by a different set of raters. Again, each RCT was rated twice, and where necessary a third rater arbitrated. Seven raters each rated from 8 to 60 trials (X̄=45). A third assessment was required for at least one scale item in all except 27 RCTs.

The reliability of dichotomous judgments for each item was evaluated with a generalized kappa statistic using the multirater kappa utility. (†) In addition, the base rate for a positive response and the percent of agreement were calculated. The reliability of individual ratings was evaluated in study 1 by comparing the ratings from all 11 raters and in study 2 by comparing all 4 individual ratings (the first 2 ratings from each of the 2 sets of raters).
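The three item-level quantities named above (generalized kappa, base rate, and percent agreement) can be sketched for dichotomous ratings as follows. This sketch assumes Fleiss' formulation of multirater kappa and defines percent agreement as the proportion of agreeing rater pairs; the source does not state which formulas the multirater kappa utility implements, so both choices are assumptions, and the function name and data are hypothetical.

```python
import math
from typing import List, Tuple


def multirater_agreement(ratings: List[List[int]]) -> Tuple[float, float, float]:
    """Return (kappa, base_rate, pairwise_agreement) for one scale item.

    `ratings` holds one row per RCT; each row lists the yes/no (1/0)
    ratings given by the raters. Every RCT must have the same number
    of ratings (at least 2). Kappa follows Fleiss' formulation.
    """
    n_raters = len(ratings[0])
    if n_raters < 2 or any(len(row) != n_raters for row in ratings):
        raise ValueError("each RCT needs the same number (>= 2) of ratings")

    pairs_per_rct = n_raters * (n_raters - 1)
    agreement_sum = 0.0
    yes_total = 0
    for row in ratings:
        yes = sum(row)
        no = n_raters - yes
        # Proportion of rater pairs that agree on this RCT.
        agreement_sum += (yes * (yes - 1) + no * (no - 1)) / pairs_per_rct
        yes_total += yes

    observed = agreement_sum / len(ratings)
    base_rate = yes_total / (len(ratings) * n_raters)
    expected = base_rate ** 2 + (1 - base_rate) ** 2  # chance agreement
    if math.isclose(expected, 1.0):
        return float("nan"), base_rate, observed  # kappa undefined
    kappa = (observed - expected) / (1 - expected)
    return kappa, base_rate, observed


# Made-up ratings of one item for 4 RCTs by 4 raters (illustration only).
demo = [[1, 1, 1, 1], [1, 1, 1, 0], [0, 0, 0, 0], [1, 1, 0, 0]]
kappa, base_rate, agreement = multirater_agreement(demo)
```

Reporting the base rate alongside kappa matters because chance agreement rises as the base rate approaches 0% or 100%, which lowers kappa even when raw agreement is high.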
The reliability of consensus ratings was evaluated in study 2 by comparing the first and second sets of consensus ratings.

The reliability of the total PEDro score (obtained by summing "yes" responses to items 2-11) was evaluated using type 1,1 intraclass correlation coefficients (ICCs) with the ICCSF1A.SPS macro in SPSS 10.0 (SPSS for Windows, Release 10.0.5 (‡)). In study 1, each of the 11 raters rated each RCT, so the 2,1 form of the ICC statistic would normally be used. In study 2, pairs of raters were drawn from a larger panel, so not all raters rated each RCT. Thus, in study 2, the 1,1 form of the ICC statistic was used. We believed it was more appropriate to use the same ICC model for each study (because this facilitates comparison across studies); therefore, we used the 1,1 model in both studies. The consequence of such a choice is that we potentially slightly underestimated the reliability of ratings in study 1. In addition, we determined the percentage of close agreement of ratings within 2 points on the total PEDro scale for all ratings and the standard error of the measurement for the consensus ratings only.

Reliability estimates lie along a continuum.
Although kappa and ICC values are continuous data, we believe that physical therapists collapse these continuous data into discrete categories. Because measurements of reliability tend to be categorized in this way, we have chosen to describe the level of reliability for the kappa values using categories suggested by Landis and Koch (27) (≥.81="almost perfect," .61-.80="substantial," .41-.60="moderate," .21-.40="fair," .00-.20="slight," and <.00="poor") and for ICC values using those suggested by Fleiss (28) (>.75="excellent" reliability, .40-.75="fair" to "good" reliability, and <.40="poor" reliability). The categories provide a description of the level of reliability that some readers may find useful. They should not be used to make a judgment as to whether the level of reliability is acceptable or not. Such a decision would require a consideration of how the data will be used.

Results

The reliability of ratings of individual scale items is shown in Tables 1 and 2. With the exception of "random allocation," "therapist blinding," and "intention-to-treat analysis," similar estimates of interrater reliability by individual raters were obtained in study 1 (Tab. 1) and study 2 (Tab. 2). From the study with the larger sample (ie, study 2), kappa values for individual scale items ranged from .36 to .80. The reliability of ratings for the PEDro scale item "groups similar at baseline" was "fair." The reliability of ratings for the PEDro scale items "eligibility criteria specified," "point measures and variability data," "random allocation," "less than 15% dropouts," "between-group statistical comparisons," and "intention-to-treat analysis" was "moderate." The reliability of ratings for all other scale items was "substantial."

The reliability of consensus ratings (ie, ratings made by a panel of 2 or 3 raters) ranged from .50 to .79 for individual scale items.
The items "groups similar at baseline," "point measures and variability data," and "intention-to-treat analysis" demonstrated "moderate" reliability, whereas the other 8 scale items demonstrated "substantial" reliability. With the consensus ratings, 5 of the 11 items ("eligibility criteria specified," "random allocation," "groups similar at baseline," "less than 15% dropouts," and "between-group statistical comparisons") achieved reliability in a higher benchmark than was achieved for individual ratings. For example, for item 1 ("eligibility criteria specified") individual ratings had "moderate" reliability, whereas consensus ratings achieved "substantial" reliability. For the remaining 6 items, the reliability was within the same benchmark for individual and consensus ratings.

The ICCs for interrater reliability of the total PEDro scores for individual raters were .55 (95% confidence interval [CI]=.41, .72) for study 1 and .56 (95% CI=.47, .65) for study 2. The ICC for consensus ratings was slightly higher at .68 (95% CI=.57, .76). These findings suggest that the total PEDro score can be assessed with "fair" to "good" reliability.

In study 1, assessments by individual raters of the total PEDro score agreed exactly on 35% of occasions, differed by 1 point or less on 78% of occasions, and differed by 2 points or less 93% of the time. In study 2, exact agreement occurred on 35% of occasions, and individual raters differed by 1 point or less on 78% of occasions and by 2 points or less 94% of the time. Consensus scores were in exact agreement 46% of the time, differed by 1 point or less 85% of the time, and differed by 2 points or less 99% of the time. The standard error of the measurement for the consensus ratings was 0.70 unit.
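As an illustration of how a type 1,1 ICC and a standard error of the measurement can be obtained from total scores, the following minimal sketch uses the one-way random-effects analysis of variance. It is not the ICCSF1A.SPS macro. Taking the standard error of the measurement as the square root of the within-trial mean square is one common choice and is an assumption here, because the text does not state which formula was used. The function name and the scores are hypothetical.

```python
import math
from typing import List, Tuple


def icc_1_1(scores: List[List[float]]) -> Tuple[float, float]:
    """Return (ICC(1,1), SEM) from a one-way random-effects ANOVA.

    `scores` holds one row per RCT; each row lists the total scores
    assigned to that RCT (same number of ratings, at least 2, per RCT).
    SEM is estimated as sqrt(within-trial mean square), an assumption.
    """
    n = len(scores)
    k = len(scores[0])
    if n < 2 or k < 2 or any(len(row) != k for row in scores):
        raise ValueError("need >= 2 RCTs, each with the same number (>= 2) of ratings")

    grand_mean = sum(sum(row) for row in scores) / (n * k)
    row_means = [sum(row) / k for row in scores]

    # Between-trial and within-trial mean squares.
    ms_between = k * sum((m - grand_mean) ** 2 for m in row_means) / (n - 1)
    ms_within = sum(
        (x - m) ** 2 for row, m in zip(scores, row_means) for x in row
    ) / (n * (k - 1))

    icc = (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)
    return icc, math.sqrt(ms_within)


# Made-up pairs of total scores for 4 RCTs (illustration only).
demo_scores = [[5, 6], [7, 7], [3, 4], [8, 6]]
icc, sem = icc_1_1(demo_scores)
```

Because the one-way model treats all rater differences as error, it suits designs such as study 2, where different rater pairs rated different trials.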
Discussion

The main findings of our studies were that the reliability of ratings of individual PEDro scale items varied from "fair" to "substantial," or from "moderate" to "substantial" when rated by panels of raters, and the reliability for the total PEDro score was "fair" to "good."

The reliability for some PEDro scale items was only "fair" or "moderate." The item "groups similar at baseline" was the only one that had "fair" reliability, and the reliability for this item improved to "moderate" when consensus ratings were used. Rating this item requires a decision as to whether groups of subjects in an RCT were similar on key prognostic indicators prior to the intervention. It is likely that this decision is influenced by the rater's knowledge of the condition being treated and how strictly the term "similar" is interpreted. The other 2 items with "moderate" reliability in consensus ratings were "point measures and variability data" and "intention-to-treat analysis." The appearance here of "point measures and variability data" is a little surprising because the presence or otherwise of such measures should be relatively easy to establish. The majority of RCTs in the PEDro database do not include an intention-to-treat analysis. (25) Where an intention-to-treat analysis has been undertaken, it is often not explicitly stated, so accurate rating of this item requires careful reading of the text to establish whether this had occurred. Our impression is that intention-to-treat analysis is better reported in more recent articles.
The reliability we observed for the total PEDro score for panels of raters (ICC=.68, 95% CI=.57, .76) is similar to that reported by Berard et al (29) for the Chalmers scale (ICC=.66, 95% CI=.55, .79), by Jadad et al (19) for the Jadad scale (ICC=.59, 95% CI=.46, .74), and by Verhagen et al (30) for the Maastricht list (ICC=.77, 95% CI=.64, .89) but not as high as the reliability reported by Oremus et al (18) for the Jadad scale (ICC=.90). Our reliability for individual items is difficult to benchmark because only Clark et al (17) provided reliability estimates for each item using the Jadad scale, and the items in that scale are not sufficiently similar to items in the PEDro scale to allow meaningful comparison.

For a number of the scale items, the base rate was either very high or very low. When interpreting the kappa values for these items, readers need to be aware of the behavior of kappa values. When the prevalence (or base rate) is either very high or very low, it is possible to have high agreement but a low kappa value, and this characteristic of the kappa statistic is sometimes called the "base rate problem." (31) This characteristic is not unique to the kappa statistic but also occurs, for example, with the ICC statistic when rating a homogeneous sample. Spitznagel and Helzer (31) viewed this influence of the base rate on the kappa value as undesirable, whereas Shrout et al (32) contended that this behavior is entirely appropriate and "represents the real problem of making distinctions in increasingly homogeneous populations." (32(p175)) They stated that "a major strength of κ is precisely that it does weigh disagreements more when the base rate approaches 0% or 100%."
(32(p174)) Our opinion on this issue is closer to Shrout and colleagues' position, (32) and so we would defend the use of the kappa statistic in our study. We believe that the important issue is not a low base rate but the scenario where a data set has an artificially low base rate that is not representative of the population. In such a situation, both sides of the base rate problem debate would agree that the estimates of reliability provided by the kappa statistic are misleading. In both studies, we randomly selected a sample of trials from the population of trials on the PEDro database. Not surprisingly, the base rates in the 2 samples were very similar to the base rate for the population (see Moseley et al (25)). Accordingly, we believe that the use of the kappa statistic was justified in our studies and did not produce misleading inferences about reliability of ratings for items on the PEDro scale.

An understanding of the error associated with the PEDro scale can be used to guide the conduct of a systematic review that uses a minimum PEDro score as an inclusion criterion. In our studies, we noted that repeated PEDro consensus scores were within one point on 85% of occasions and within 2 points on 99% of occasions. We believe it is sensible to conduct a sensitivity analysis to see how the conclusions of a systematic review are affected by varying the PEDro cutoff. For example, in Maher's review of workplace interventions to prevent low back pain, (22) reducing the PEDro cutoff from the original strict PEDro cutoff of 6 to a less strict cutoff of 5 (or even 4) did not change the conclusion that there was strong evidence that braces are ineffective in preventing low back pain. Readers should have more confidence in the conclusion of a review that is unaffected by changing the quality cutoff.

The precision of the PEDro scale also should be considered by users of the PEDro database.
None of the scale items had perfect reliability for the consensus ratings (consensus ratings are displayed on the PEDro database); thus, users need to understand that the PEDro scores contain some error. Readers who use the total score to distinguish between low- and high-quality RCTs need to recall that the standard error of the measurement for total scores is 0.70 unit and consider this when comparing 2 studies. Based on this standard error of the measurement, a difference of 1 unit in the PEDro scores of 2 studies provides 68% confidence that the 2 studies truly had different PEDro scores, a difference of 2 units provides 96% confidence that the 2 studies truly had different PEDro scores, and a difference of 3 units provides 99% confidence that the 2 studies truly had different PEDro scores.

Conclusion

The results of our studies indicate that the reliability of the total PEDro score, based on consensus judgments, is acceptable. The scale appears to have sufficient reliability for use in systematic reviews of physical therapy RCTs.
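The confidence levels quoted in the Discussion for score differences of 1, 2, and 3 units can be approximated from the standard error of the measurement of 0.70 unit. The sketch below assumes that measurement errors in the 2 scores are independent and normally distributed, so that the standard error of a difference is the square root of 2 times the standard error of the measurement. The text does not spell out its derivation, so this is one plausible reading, and the function name is hypothetical.

```python
import math


def confidence_scores_differ(difference: float, sem: float) -> float:
    """Two-sided confidence that 2 observed scores reflect a true difference.

    Assumes independent, normally distributed measurement errors, so the
    standard error of the difference between 2 scores is sqrt(2) * sem.
    """
    if sem <= 0:
        raise ValueError("sem must be positive")
    se_difference = math.sqrt(2) * sem
    z = abs(difference) / se_difference
    # For a standard normal variable, P(|Z| < z) = erf(z / sqrt(2)).
    return math.erf(z / math.sqrt(2))


SEM_CONSENSUS = 0.70  # standard error of the measurement reported in Results

confidence_by_difference = {
    units: confidence_scores_differ(units, SEM_CONSENSUS) for units in (1, 2, 3)
}
```

Under these assumptions the computed values fall close to the 68%, 96%, and 99% figures given in the text; small discrepancies would be expected if the original calculation rounded the standard error of the difference to 1 unit.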
Appendix.
Operational Definitions for the 11 PEDro Criteria
Criterion Operational Definition
All criteria Points are awarded only when a criterion is
clearly satisfied. If, on a literal reading of
the trial report, it is possible that a
criterion was not satisfied, a point should not
be awarded for that criterion.
Criterion 1 This criterion is satisfied if the report
describes the source of subjects and a list of
criteria used to determine who was eligible to
participate in the study.
Criterion 2 A study is considered to have used random
allocation if the report states that allocation
was random. The precise method of randomization
need not be specified. Procedures such as coin
tossing and dice rolling should be considered
random. Quasi-randomization allocation
procedures such as allocation by hospital
record number or birth date, or alternation, do
not satisfy this criterion.
Criterion 3 Concealed allocation means that the person who
determined if a subject was eligible for
inclusion in the trial was unaware, when this
decision was made, of which group the subject
would be allocated to. A point is awarded for
this criterion, even if it is not stated that
allocation was concealed, when the report
states that allocation was by sealed opaque
envelopes or that allocation involved
contacting the holder of the allocation
schedule who was "off-site."
Criterion 4 At a minimum, in studies of therapeutic
interventions, the report must describe at
least one measure of the severity of the
condition being treated and at least one
(different) key outcome measure at baseline.
The rater must be satisfied that the groups'
outcomes would not be expected to differ, on
the basis of baseline differences in prognostic
variables alone, by a clinically significant
amount. This criterion is satisfied even if
only baseline data of subjects completing the
study are presented.
Criteria 4, 7-11 Key outcomes are those outcomes that provide the
primary measure of the effectiveness (or lack
of effectiveness) of the therapy. In most
studies, more than one variable is used as an
outcome measure.
Criteria 5-7 Blinding means the person in question (subject,
therapist, or assessor) did not know which
group the subject had been allocated to. In
addition, subjects and therapists are only
considered to be "blind" if it could be
expected that they would have been unable to
distinguish between the treatments applied to
different groups. In trials in which key
outcomes are self-reported (eg, visual analog
scale, pain diary), the assessor is considered
to be blind if the subject was blind.
Criterion 8 This criterion is satisfied only if the report
explicitly states both the number of subjects
initially allocated to groups and the number
of subjects from whom key outcome measurements
were obtained. In trials in which outcomes are
measured at several points in time, a key
outcome must have been measured in more than
85% of subjects at one of those points in time.
Criterion 9 An intention-to-treat analysis means that, where
subjects did not receive treatment (or the
control condition) as allocated and where
measures of outcomes were available, the
analysis was performed as if subjects received
the treatment (or control condition) they were
allocated to. This criterion is satisfied,
even if there is no mention of analysis by
intention to treat, if the report explicitly
states that all subjects received treatment or
control conditions as allocated.
Criterion 10 A between-group statistical comparison involves
statistical comparison of one group with
another. Depending on the design of the study,
this may involve comparison of 2 or more
treatments or comparison of treatment with a
control condition. The analysis may be a simple
comparison of outcomes measured after the
treatment was administered or a comparison of
the change in one group with the change in
another (when a factorial analysis of variance
has been used to analyze the data, the latter
is often reported as a group x time
interaction). The comparison may be in the
form of hypothesis testing (which provides a P
value, describing the probability that the
groups differed only by
chance) or in the form of an estimate (eg, the
mean or median difference, a difference in
proportions, number needed to treat, a relative
risk or hazard ratio) and its confidence
interval.
Criterion 11 A point measure is a measure of the size of the
treatment effect. The treatment effect may be
described as a difference in group outcomes or
as the outcome in (each of) all groups.
Measures of variability include standard
deviations, standard errors, confidence
intervals, interquartile ranges (or other
quartile ranges), and ranges. Point measures
and/or measures of variability may be provided
graphically (eg, standard deviations may be given
as error bars in a figure) as long as it is clear
what is being graphed (eg, as long as it is clear
whether error bars represent standard deviations
or standard errors). Where outcomes are
categorical, this criterion is considered to have
been met if the number of subjects in each
category is given for each group.
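The arithmetic part of Criterion 8 can be sketched as follows. This is only an illustration under stated assumptions: the function and parameter names are hypothetical, and the sketch does not capture the requirement that the report explicitly state both numbers, which remains a judgment for the rater.

```python
from typing import Sequence


def criterion_8_satisfied(n_allocated: int, n_measured_per_timepoint: Sequence[int]) -> bool:
    """Return True if a key outcome was obtained from more than 85% of the
    subjects initially allocated, at one or more measurement points."""
    if n_allocated <= 0:
        raise ValueError("n_allocated must be positive")
    # Integer arithmetic avoids floating-point issues at exactly 85%.
    return any(n * 100 > 85 * n_allocated for n in n_measured_per_timepoint)
```

For example, a trial that allocated 100 subjects and measured a key outcome in 90 subjects at one time point and 70 at another would satisfy the threshold, whereas follow-up of exactly 85 subjects would not, because the criterion requires more than 85%.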
Figure.
PEDro Scale items. Each satisfied item (except the first item)
contributes 1 point to the total PEDro score (range=0-10 points).
Operational definitions of each item are given in the Appendix.
1. Eligibility criteria were specified
2. Subjects were randomly allocated to groups (in a crossover
study, subjects were randomly allocated an order in which
treatments were received)
3. Allocation was concealed
4. The groups were similar at baseline regarding the most
important prognostic indicators
5. There was blinding of all subjects
6. There was blinding of all therapists who administered the
therapy
7. There was blinding of all assessors who measured at least
one key outcome
8. Measurements of at least one key outcome were obtained
from more than 85% of the subjects initially allocated to
groups
9. All subjects for whom outcome measurements were available
received the treatment or control condition as allocated,
or where this was not the case, data for at least one key
outcome were analyzed by "intention to treat"
10. The results of between-group statistical comparisons are
reported for at least one key outcome
11. The study provides both point measurements and
measurements of variability for at least one key outcome
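The scoring rule in the Figure caption (each satisfied item except the first contributes 1 point) can be written as a short sketch. The function name and the dictionary representation of the ratings are assumptions made for illustration only.

```python
from typing import Mapping


def pedro_total_score(items_satisfied: Mapping[int, bool]) -> int:
    """Total PEDro score (0-10): items 2-11 count, item 1 is rated but not counted."""
    missing = [item for item in range(1, 12) if item not in items_satisfied]
    if missing:
        raise ValueError(f"missing ratings for items: {missing}")
    return sum(1 for item in range(2, 12) if items_satisfied[item])


if __name__ == "__main__":
    # Hypothetical ratings: eligibility criteria, random allocation, baseline
    # similarity, and the 2 reporting items (10 and 11) are satisfied.
    ratings = {item: item in {1, 2, 4, 10, 11} for item in range(1, 12)}
    print(pedro_total_score(ratings))
```

In the hypothetical example, item 1 is satisfied but does not add to the total, so only items 2, 4, 10, and 11 are counted.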
Table 1.
Estimates of Reliability From Study 1 for Each of the 11 Items of the
PEDro Scale
                                           Base       % of        Kappa
PEDro Scale Item                           Rate (a)   Agreement   (SE)
1. Eligibility criteria specified          70.2       81.5        .56 (.06)
2. Random allocation                       95.6       92.7        .13 (.27)
3. Concealed allocation                     9.8       93.3        .62 (.17)
4. Groups similar at baseline              55.3       70.3        .40 (.03)
5. Subject blinding                        18.9       89.5        .66 (.10)
6. Therapist blinding                       3.3       95.8        .33 (.32)
7. Assessor blinding                       29.1       88.9        .73 (.06)
8. Less than 15% dropouts                  62.2       72.9        .42 (.04)
9. Intention-to-treat analysis              8.7       86.0        .12 (.18)
10. Between-group statistical comparisons  84.0       89.7        .62 (.12)
11. Point measures and variability data    78.2       86.0        .59 (.09)
(a) Base rate for a "yes" response.
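The gap between percentage agreement and kappa for items with extreme base rates (for example, item 2) can be illustrated with the usual kappa formula, in which observed agreement is corrected for the agreement expected by chance. The sketch below makes a simplifying assumption that is not taken from the report: both raters are assumed to have the same base rate for a "yes" response, so chance agreement depends only on the tabulated base rate. The reported values were presumably computed from the full rating tables, which are not reproduced here. The function name is hypothetical.

```python
def kappa_from_base_rate(base_rate: float, observed_agreement: float) -> float:
    """Kappa for a yes/no item, assuming both raters share the same base rate.

    base_rate and observed_agreement are proportions between 0 and 1.
    """
    # Chance agreement: both raters say "yes" or both say "no".
    chance_agreement = base_rate ** 2 + (1 - base_rate) ** 2
    if chance_agreement >= 1:
        raise ValueError("kappa is undefined when the base rate is 0 or 1")
    return (observed_agreement - chance_agreement) / (1 - chance_agreement)


if __name__ == "__main__":
    # Item 2 (random allocation): base rate 95.6%, agreement 92.7%
    print(round(kappa_from_base_rate(0.956, 0.927), 2))
    # Item 1 (eligibility criteria): base rate 70.2%, agreement 81.5%
    print(round(kappa_from_base_rate(0.702, 0.815), 2))
```

With the Table 1 inputs this simplification gives about .13 for item 2 and about .56 for item 1. It shows how a base rate near 100% leaves little room for agreement beyond chance, even when raw agreement exceeds 90%.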
Table 2.
Estimates of Reliability From Study 2 for Each of the 11 Items of the
PEDro Scale (a)
Individual Ratings
                                           Base       % of        Kappa
PEDro Scale Item                           Rate (b)   Agreement   (SE)
1. Eligibility criteria specified          73.1       76.8        .41 (.06)
2. Random allocation                       95.2       95.1        .47 (.20)
3. Concealed allocation                    19.6       88.6        .64 (.08)
4. Groups similar at baseline              61.7       69.7        .36 (.04)
5. Subject blinding                         8.3       94.2        .62 (.14)
6. Therapist blinding                       4.8       98.2        .80 (.20)
7. Assessor blinding                       39.2       84.7        .68 (.04)
8. Less than 15% dropouts                  62.5       75.0        .47 (.04)
9. Intention-to-treat analysis             15.2       86.5        .48 (.10)
10. Between-group statistical comparisons  91.3       91.9        .50 (.14)
11. Point measures and variability data    84.0       85.1        .45 (.09)
Consensus Ratings
                                           Base       % of        Kappa
PEDro Scale Item                           Rate       Agreement   (SE)
1. Eligibility criteria specified          71.7       85.0        .63 (.11)
2. Random allocation                       95.8       98.3        .79 (.31)
3. Concealed allocation                    18.8       90.8        .70 (.14)
4. Groups similar at baseline              62.5       76.7        .50 (.10)
5. Subject blinding                         5.8       96.7        .70 (.26)
6. Therapist blinding                       4.2       98.3        .79 (.31)
7. Assessor blinding                       41.7       90.0        .79 (.09)
8. Less than 15% dropouts                  65.8       85.0        .67 (.10)
9. Intention-to-treat analysis             14.6       89.2        .57 (.16)
10. Between-group statistical comparisons  92.9       95.8        .68 (.23)
11. Point measures and variability data    87.5       90.0        .54 (.17)
(a) Study 2 provided estimates of both individual and consensus
ratings.
(b) Base rate for a "yes" response.
* Microsoft Corp, One Microsoft Way, Redmond, WA 98052-6399.
† Christopher N Chapman, University of Tulsa.
‡ SPSS Inc, 233 Wacker
References
(1) National Health and Medical Research Council. How to Use the Evidence: Assessment and Application of Scientific Evidence. Canberra, Australian Capital Territory, Australia: Biotext; 2000.
(2) Moher D, Cook D, Eastwood S ... randomized controlled trials: the QUOROM statement. Lancet. 1999;354(9193):1896-1900.
(3) Verhagen AP, de Vet HCW, de Bie RA, et al. The art of quality assessment of RCTs included in systematic reviews. J Clin Epidemiol. 2001;54:651-654.
(4) Schulz K, Chalmers I, Hayes R, Altman D. Empirical evidence of bias: dimensions of methodological quality associated with estimates of treatment effects in controlled trials. JAMA. 1995;273:408-412.
(5) Egger M, Bartlett C, Holenstein F, Sterne J. How important are comprehensive literature searches and the assessment of trial quality in systematic reviews? Empirical study. Health Technol Assess. 2003;7:1-76.
(6) Moher D, Pham B, Cook D, et al. Does quality of reports of randomised trials affect estimates of intervention efficacy reported in meta-analyses? Lancet. 1998;352(9128):609-613.
(7) Herbert R, Gabriel M. Effects of stretching before and after exercising on muscle soreness and risk of injury: systematic review. BMJ. 2002;325:468-472.
(8) van Tulder MW, Cherkin DC, Berman B, et al. The effectiveness of acupuncture in the management of acute and chronic low back pain: a systematic review within the framework of the Cochrane Collaboration Back Review Group. Spine. 1999;24:1113-1123.
(9) van Poppel MN, Koes BW, van der Ploeg T, et al. Lumbar supports and education for the prevention of low back pain in industry: a randomized controlled trial. JAMA. 1998;279:1789-1794.
(10) Berghmans LC, Hendriks HJ, Bo K, et al. Conservative treatment of stress urinary incontinence in women: a systematic review of randomized clinical trials. Br J Urol. 1998;82:181-191.
(11) Juni P, Witschi A, Bloch R, Egger M. The hazards of scoring the quality of clinical trials for meta-analysis. JAMA. 1999;282:1054-1060.
(12) Colle F, Rannou F, Revel M, et al. Impact of quality scales on levels of evidence inferred from a systematic review of exercise therapy and low back pain. Arch Phys Med Rehabil. 2002;83:1745-1752.
(13) van Tulder MW, Malmivaara A, Esmail R, Koes BW. Exercise therapy for low back pain. The Cochrane Library. 2003; issue 1.
(14) Moher D, Schulz K, Altman D. The CONSORT statement: revised recommendations for improving the quality of reports of parallel-group randomised trials. Lancet. 2001;357(9263):1191-1194.
(15) Philadelphia Panel Evidence-Based Clinical Practice Guidelines on Selected Rehabilitation Interventions: Overview and Methodology. Phys Ther. 2001;81:1629-1640.
(16) Bhandari M, Richards RR, Sprague S, Schemitsch EH. Quality in the reporting of randomized trials in surgery: is the Jadad scale reliable? Control Clin Trials. 2001;22:687-688.
(17) Clark HD, Wells GA, Huet C, et al. Assessing the quality of randomized trials: reliability of the Jadad scale. Control Clin Trials. 1999;20:448-452.
(18) Oremus M, Wolfson C, Perrault A, et al. Interrater reliability of the modified Jadad quality scale for systematic reviews of Alzheimer's disease drug trials. Dement Geriatr Cogn Disord. 2001;12:232-236.
(19) Jadad A, Moore A, Carroll D, et al. Assessing the quality of reports of randomized clinical trials: is blinding necessary? Control Clin Trials. 1996;17:1-12.
(20) Sherrington C, Herbert RD, Maher CG, Moseley AM. PEDro: a database of randomised trials and systematic reviews in physiotherapy. Man Ther. 2000;5:223-226.
(21) Ferreira M, Ferreira P, Latimer J, et al. Does spinal manipulative therapy help people with chronic low back pain? Australian Journal of Physiotherapy. 2002;48:277-284.
(22) Maher CG. A systematic review of workplace interventions to prevent low back pain. Australian Journal of Physiotherapy. 2000;46:259-269.
(23) Verhagen AP, de Vet HCW, de Bie RA, et al. The Delphi List: a criteria list for quality assessment of randomized clinical trials for conducting systematic reviews developed by Delphi consensus. J Clin Epidemiol. 1998;51:1235-1241.
(24) Kunz R, Oxman A. The unpredictability paradox: review of empirical comparisons of randomised and non-randomised clinical trials. BMJ. 1998;317:1185-1190.
(25) Moseley AM, Herbert RD, Sherrington C, Maher CG. Evidence for physiotherapy practice: a survey of the Physiotherapy Evidence Database (PEDro). Australian Journal of Physiotherapy. 2002;48:43-49.
(26) Doull J, Hardy M, Clark J, Herman N. The effect of irradiation with ultra-violet light on the frequency of attacks of upper respiratory disease (common colds). Am J Hyg. 1931;13:460-477.
(27) Landis J, Koch G. The measurement of observer agreement for categorical data. Biometrics. 1977;33:159-174.
(28) Fleiss JL. The Design and Analysis of Clinical Experiments. New York, NY: John Wiley
(29) Berard A, Andreu N, Tetrault JP, et al. Reliability of Chalmers' scale to assess quality in meta-analyses on pharmacological treatments for osteoporosis. Ann Epidemiol. 2000;10:498-503.
(30) Verhagen AP, De Vet HCW, De Bie RA, et al. Balneotherapy and quality assessment: interobserver reliability of the Maastricht criteria list for blinded quality assessment. J Clin Epidemiol. 1998;51:335-341.
(31) Spitznagel E, Helzer J. A proposed solution to the base rate problem in the Kappa statistic. Arch Gen Psychiatry. 1985;42:725-728.
(32) Shrout P, Spitzer R, Fleiss JL. Quantification of agreement in psychiatric diagnosis revisited. Arch Gen Psychiatry. 1987;44:172-177.
CG Maher, PT, PhD, is Associate Professor, School of Physiotherapy, Faculty of Health Sciences, The University of Sydney, PO Box 170, Lidcombe, New South Wales 1825, Australia (C.Maher@fhs.usyd.edu.au). Address all correspondence to Dr Maher. C Sherrington, PT, PhD, is Research Officer, Prince of Wales Medical Research Institute, University of New South Wales, Sydney, New South Wales, Australia. RD Herbert, PT, PhD, is Senior Lecturer, School of Physiotherapy, The University of Sydney. AM Moseley, PT, PhD, is Lecturer, Rehabilitation Studies Unit, Department of Medicine, The University of Sydney. M Elkins, PT, M-HSc, is Research Physiotherapist, Department of Respiratory Medicine, Royal Prince Alfred Hospital, Camperdown, New South Wales, Australia. The authors are Directors of the Centre for Evidence-Based Physiotherapy. All authors provided concept/idea/research design, writing, data collection and analysis, project management, fund procurement, subjects, facilities/equipment, institutional liaisons, and consultation (including review of manuscript before submission). The study was partially funded by the Centre for Evidence-Based Physiotherapy's financial supporters: Motor Accidents Authority of New South Wales, Australia; Physiotherapists Registration Board of New South Wales, Australia; NRMA Insurance, Australia; New South Wales Department of Health, Australia. This article was received January 24, 2003, and was accepted March 25, 2003.