Estimates of quality and reliability with the physiotherapy evidence-based database scale to assess the methodology of randomized controlled trials of pharmacological and nonpharmacological interventions.

An assessment of the methodological quality of individual randomized controlled trials (RCTs) included in meta-analyses and systematic reviews is commonly undertaken; this process is intended to identify potential sources of bias that may compromise both the internal validity and the external validity of a study. (1) Despite the continuing debate over the relative merits of this endeavor and in the absence of a gold standard, there has been a proliferation of various scales and checklists intended to evaluate key components of trial quality. Most scales typically award a series of points when study criteria are met. Theoretically, higher overall scores indicate studies with better methodological quality, which in turn yield estimates of intervention effects that are closer to the true results.
The score that an individual RCT receives as a result of this process may determine its inclusion in the review or its weighting in the pooled results. Often, the final quality score that an RCT receives is based on consensus ratings from 2 or more study abstractors. For this reason, regardless of the quality assessment tool chosen, good agreement between raters must be established. Most quality assessment tools provide standardized administration guidelines to ensure uniform application; however, the scores awarded by abstractors depend markedly on the level of methodological detail described in each study. When reporting lacks clarity, individual interpretation may differ between abstractors, affecting the consistency of agreement and thereby reducing reliability. Disagreements typically are resolved by third-party review or through arbitration between reviewers. Surprisingly, although many scales are in use, few estimates of reliability have been published. The Physiotherapy Evidence-Based Database (PEDro) Scale, developed by the Centre for Evidence-Based Physiotherapy, is an example of one such quality assessment scale. (2) The scale is based largely on the Delphi List (3) and was developed to assess the methodological quality of RCTs specifically pertaining to physical therapy interventions that were included in the database.
The interrater reliability of the PEDro Scale was assessed previously in only a single trial, and no studies assessing the reliability of the Delphi List have been published. By use of the kappa statistic for pair-wise comparisons, reliability estimates determined with the PEDro tool for 2 raters assessing 120 RCTs were found to range from .50 to .79 after consensus was achieved (.36 to .80 before consensus). The intraclass correlation coefficient (ICC) for total scores was .56 (95% confidence interval [CI] = .47-.65) for ratings by individual raters. The percentage of agreement ranged from 70% to 98%. (4)

In an effort to determine which tools had been used previously to assess the methodological quality of published reviews, we surveyed 10 randomly selected reviews that evaluated physical therapy interventions from the Cochrane Database of Systematic Reviews. We found a wide range of approaches used to assess the methodological quality of individual RCTs. (5-14) Most reviews used a qualitative, checklist approach, whereby individual methodological components were noted to be present or absent, but a total score was not determined. The number of quality items assessed for adequacy ranged from 1 to 10 and most frequently included randomization, masking, intention-to-treat analysis, and accounting for dropouts. In 1 case, although 10 individual items had been summed, the authors noted that the purpose was to gain an overall impression of quality, and the data were not used for quantitative purposes. (8) Three reviews used previously validated tools to assess methodological quality. The Jadad Scale (15) was used in 2 reviews, (13,14) and the PEDro Scale was used in the third. (9) Two reviews used the Delphi List, another previously validated tool, with modifications. (5,6) Two reviews quantitatively assessed only whether concealed allocation had been adequately described, although they included a more comprehensive list of quality indicators. (7,8) None of these reviews reported estimates of reliability between raters.

We previously used the PEDro Scale to assess the methodological quality of 272 RCTs that were included in a systematic review of the stroke rehabilitation literature. (16) In addition to physical and rehabilitation therapies (n = 215), many of the therapies assessed in this review were pharmacological or surgical (n = 57). The methodological quality of pharmacological trials included in this review was found to be significantly higher than that of nonpharmacological trials when the PEDro Scale was used (mean ± SD = 6.77 ± 1.3 versus 5.53 ± 1.3; P < .0001).
(17) The difference in quality between study types was largely attributable to the inherent difficulty of designing single-blind studies (ie, those in which participants are not aware of their group assignments) for nonpharmacological interventions, although double-blind studies (ie, outcome assessors are not aware of group assignments) also were less frequent for nonpharmacological interventions. As a means of formulating final conclusions in this review, only studies that achieved a PEDro score of 6 or greater were used when there was an abundance of evidence. Although an assessment of the reliability of the PEDro Scale was not included in this review and could not be conducted retrospectively, it was of interest to establish whether reliability estimates would vary depending on intervention type (pharmacological versus nonpharmacological). Therefore, the purposes of this study were to assess how well 2 examiners agreed on specific items when using this scale (reliability), to determine whether the reliability and methodological quality differ between pharmacological and nonpharmacological studies, and to identify what aspects of RCTs tend to detract from their quality because these aspects are not incorporated into the study's design or they are not reported or stated clearly.
We anticipated that there would be good agreement between study types (pharmacological and nonpharmacological) for individual PEDro items and that the composite scores for pharmacological trials would once again be higher than those for nonpharmacological trials because of the inability in the latter to keep subjects unaware of group assignments (masking). Discrepancies attributable to interpretation and differences in scoring patterns between study types are discussed, with an emphasis on highlighting the practical considerations encountered with the PEDro Scale under typical use.

Method

Article Selection

This study was 1 component of a master's thesis, the objective of which was to compare differences in effect sizes reported between trials that used double blinding and those that did not. In order to examine this contrast, studies that used pharmacological approaches (which were more frequently masked studies) and those that used nonpharmacological approaches (which, because of the nature of the interventions provided, were more frequently not masked studies) to treat the same medical condition were sought. Therefore, the inclusion criteria for selecting the meta-analyses were predefined by one of the authors (SKB). Previously published Cochrane Collaboration meta-analyses that evaluated both intervention approaches (pharmacological and nonpharmacological) for the same medical condition were retrieved, and the methodology of the trials was assessed with the PEDro Scale. Only 3 medical complication comparisons that were the subjects of both pharmacological and nonpharmacological investigations emerged: antidepressant treatments versus cognitive behavioral therapy for bulimia nervosa, excitatory [...] acid antagonists [...] opioid [...] versus surgery for stroke, and calcium supplementation versus exercise therapy for osteoarthritis.

PEDro Scale

The PEDro Scale consists of 10 criteria assessing the quality of study components related to internal validity (2) (Tab. 1). Each item receives either a "yes" or a "no" score. The maximum score that a study can receive is 10. The PEDro score allocates up to 3 points for the level of masking achieved (eg, masking of subject, therapist, and assessor), 2 points for randomization procedures (random allocation, concealment of allocation), 3 points for the reporting of appropriate data (baseline characteristics, between-group comparisons, and point and range estimates of efficacy), and 1 point each for analysis of data (intention-to-treat analysis) and adequacy of follow-up. For the purposes of this review, follow-up (criterion 7) was considered adequate if all of the originally randomized participants were accounted for at the end of the study. This interpretation differs from that described by the PEDro Scale, which defines adequacy as the measurement of the main outcome in more than 85% of the participants.
We modified this criterion because we believed that substantial bias could be introduced through imbalanced dropout rates between groups, even though 85% or more of the original participants were analyzed. (18)

The methodology of each study was scored by 2 experienced, independent raters who were familiar with the PEDro tool and who were well matched in terms of education and knowledge in research methodology (NCF and SKB), although neither had formal training in the application of the PEDro tool. Both raters were unaware of each other's results until all of the studies were assessed, at which point discrepancies were identified and discussed. Discrepancies were classified as "error" or "interpretation." Errors were resolved easily when it was evident that 1 of the abstractors had simply missed the reference to an item in the original article, and consensus was easy to achieve. Interpretation discrepancies occurred when the abstractors interpreted the presence or absence of an item differently because of its presentation in the original article. Items of disagreement and reasons for discrepancies were recorded and tabulated.

Statistical Analysis

Both mean (± SD) and median (interquartile range) composite PEDro scores, achieved after consensus was reached, were analyzed.
Differences in median scores between pharmacological and nonpharmacological studies were analyzed by use of the Mann-Whitney U test. Differences in proportions of studies meeting criteria between intervention types (nonpharmacological and pharmacological) were evaluated by use of the chi-square statistic with a continuity correction. The Cohen kappa statistic assessing pair-wise comparisons was used to estimate the interrater reliability of each of the 10 PEDro items for all intervention arms. The kappa statistic is a popular chance-corrected measure of agreement between 2 raters assessing a nominal-level variable. (19) The kappa statistic generally ranges from 0 to 1.00 (negative values can arise when observed agreement is lower than expected agreement), and a higher value is indicative of better reliability. Agreement between data abstractors on total composite PEDro scores was assessed by use of the ICC with a 2-way mixed-effects model (with the absolute agreement definition). In addition to scores for all studies combined, the kappa and ICC scores were derived for pharmacological and nonpharmacological interventions. SPSS version 12 was used for all analyses. A P value of less than .05 was considered statistically significant.

Results

Descriptive Statistics

Eighty-one RCTs from 6 Cochrane reviews were retrieved.
(20-25) Two trials included both drug versus placebo and therapy versus placebo arms as part of the trial design, resulting in a total of 83 scoring opportunities; 34 of these assessed pharmacological interventions, and 49 assessed nonpharmacological interventions. The publication dates for the individual RCTs ranged from 1961 to 2002.

Quality

The percentages of all studies that met criteria for each of the PEDro items after consensus was reached are shown in Table 2. The final PEDro scores, achieved after consensus was reached, ranged from a low of 2 (1.2%) to a high of 10 (1.2%). The most frequently occurring intermediate scores were 5 (25.3%), 6 (25.3%), and 7 (28.9%). Seven studies achieved a score of 8 (8.4%), and no studies achieved a score of 9. The average total PEDro scores were 5.94 (SD = 1.43) for all studies combined, 6.88 (SD = 1.2) for pharmacological studies, and 5.29 (SD = 1.26) for nonpharmacological studies. The median score for pharmacological studies was significantly higher than that for nonpharmacological studies (7 versus 5, U = 249.5, P < .0001). A higher percentage of drug studies than of nondrug studies met PEDro criteria for masking, adequacy of follow-up, and intention-to-treat analysis, whereas nondrug studies more frequently met criteria for concealed allocation, baseline comparability, and the inclusion of point estimates. Trials evaluating pharmacological interventions were more frequently masked trials with regard to both subjects and outcome assessors than were trials evaluating nonpharmacological interventions. The differences in proportions were statistically significant (97.1% versus 0%, P < .0001, for masking of subjects and 85.3% versus 32.7%, P < .0001, for masking of assessors).

Reliability

Regardless of study type, there was 100% agreement between raters for the PEDro Scale items randomization and reporting of between-group comparisons, whereas the poorest agreement was found for concealed allocation and baseline comparability.
The kappa scores for all studies and the breakdown for drug and nondrug studies are shown in Table 3. Kappa scores varied from a low of .452 for concealed allocation among drug trials to perfect agreement (1.00) for randomization and reporting of results from between-group comparisons. Because of the inherent limitations of the statistical test, the kappa score was 0 for 3 of the PEDro items, despite a high percentage of agreement, and the kappa score was small and negative for 1 item, despite a high percentage of agreement (the Appendix shows examples of these 2 phenomena). The ICCs associated with the cumulative PEDro score were .91 (95% CI = .83-.94) for all studies, .89 (95% CI = .78-.95) for pharmacological studies, and .91 (95% CI = .84-.95) for nonpharmacological studies.

Discussion

Quality

Regardless of intervention type (drug versus nondrug), at least 75% of the trials met the criteria for random allocation, baseline comparability, between-group comparisons, adequacy of follow-up, and the inclusion of point estimates and measures of variability. Less than 30% of the trials fulfilled the criteria for concealed allocation or intention-to-treat analysis. Paradoxically, both of these components of trial design have been shown to be the most important in reducing bias. (26) These results are consistent with those that we previously reported. (17) Studies of either intervention type were infrequently awarded points for masking of therapist, as often there was no mention of a therapist, regardless of masking status. In the absence of reporting, a point could not be awarded. Although the absence of reporting does not mean that it did not occur, it is a limitation of all tools, which rely exclusively on the examination of the written publication.
Moseley et al (2) also assessed the percentages of studies meeting PEDro criteria by evaluating 2,376 RCTs within the PEDro database. Our results are remarkably similar to theirs, with a few exceptions, which may have been attributable to either different inclusion criteria for the studies or the correctness of the ratings. Moseley et al (2) reported that 94% of studies fulfilled the criteria for randomization, whereas we included only studies that were clearly randomized and excluded quasi-randomized or controlled trials. Higher percentages of studies included in the present review met the criteria for baseline comparability (84% versus approximately 65%) and for between-group comparisons (100% versus 89%).

Differences in Quality Between Pharmacological and Nonpharmacological Studies

The cumulative PEDro scores of RCTs evaluating drug interventions were significantly higher than those of RCTs evaluating therapy interventions, although drug studies did not consistently outperform therapy trials on an item-by-item basis. The percentages of nonpharmacological studies that met criteria for concealed allocation, baseline comparability, and the inclusion of point estimates were slightly higher, although the differences were small and no statistical tests of significance were performed. As predicted, the greatest difference in scores between intervention types was for subject masking, in which virtually all drug trials succeeded, whereas none of the therapy trials did.
An unexpected finding was the difference between study types in the area of masked assessments; only a small percentage (33%) of nondrug trials succeeded in the masking of outcome assessors. Moseley et al (2) also reported that a small percentage of trials used masked assessments in evaluations of physical therapies. However, the difference that we report with respect to study type is not easily explained. Although the difficulties with masking of subjects to group assignments in nonpharmacological trials are obvious, the obstacles to ensuring masked assessments are less so. A possible explanation for the shortcoming of therapy trials could be a lack of resources (eg, additional personnel were not available to carry out masked assessments), as these trials may have been conducted in a research setting rather than in a clinical setting. There was no more than a 5% difference between intervention types in the number of studies that met criteria for any of the other 8 PEDro items.

Estimates of Reliability

Although there is no consensus as to what constitutes a "good" or "acceptable" kappa score, for agreement that is less than 100%, guidelines interpreting the strength of agreement have been published. (27-29) With the use of any 1 of these 3 published guidelines, our agreement ranged from substantial or good to perfect for each of the 10 pair-wise comparisons. These estimates of reliability are consistent with those in 1 other published report. (4) To date, we are not aware of other evaluations of the reliability of the PEDro tool. In many cases, scoring discrepancies arose from ambiguity in reporting, as it was unclear whether criteria had been satisfied on the basis of what was explicitly stated. The extent to which a literal translation of the eligibility criteria is adhered to will affect the consistency of the agreement. For example, in 1 case, the term "placebo" was used, although the word "blinding" or "masking" was not. In another case, the authors reported that they attempted to keep assessors unaware of group assignments. Disagreement also arose when details of the study methodology appeared outside of the Method section.

Although the value of assessing differences in baseline prognostic factors in RCTs has been the subject of debate, (30) it was the second largest source of scoring disagreement. There were 3 cases in which 1 of the abstractors believed that too few clinically important baseline variables had been assessed for the equality of traits. On 2 occasions, a point was not awarded when there appeared to be a significant difference for a variable thought to be important, on the basis of either the results of a significance test or their own judgment. In these cases, abstractors had to make an educated guess as to the potential for bias arising from the imbalance. Subject area knowledge and expertise were influential in scoring under these conditions and resulted in the second lowest kappa score. There were also 3 disagreements as to whether criteria had been fulfilled for the reporting of point estimates and measures of variability. When improvements over time between intervention groups are reported, point estimates are not applicable and can lead to uncertainty in scoring.

The high ICC for the composite PEDro scores also suggests good agreement, although this test does not consider the way in which the final scores were reached. It is possible for 2 raters to reach similar scores without achieving consistency on an item-by-item basis.
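The point that matching totals can hide item-level disagreement can be illustrated with a small Python sketch. The two rating vectors below are hypothetical (invented for illustration, not taken from the study); each lists yes (1) or no (0) for the 10 PEDro items of a single trial.

```python
# Hypothetical item-level ratings for one trial (1 = criterion met, 0 = not met).
# These values are illustrative only and do not come from the study data.
rater_1 = [1, 1, 1, 0, 0, 1, 1, 0, 1, 0]
rater_2 = [1, 0, 1, 1, 0, 0, 1, 1, 1, 0]

# Composite PEDro score for each rater: the number of criteria scored "yes".
total_1 = sum(rater_1)
total_2 = sum(rater_2)

# Number of items on which the two raters gave the same answer.
items_agreed = sum(r1 == r2 for r1, r2 in zip(rater_1, rater_2))

print("Composite scores:", total_1, total_2)
print("Items agreed:", items_agreed, "of", len(rater_1))
```

In this made-up case both composite scores are 6, yet the raters agree on only 6 of the 10 items, which is the kind of discrepancy that an analysis of total scores alone would not reveal.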
Estimates of Reliability Between Intervention Classes (Drug and Nondrug)

In general, there was better agreement between raters for the nonpharmacological studies than for the pharmacological studies, although the differences were small. The item that caused the largest number of disagreements between intervention types was concealed allocation (κ = .452 for drug studies versus κ = .788 for nondrug studies). This item also represented the criterion met least often (Tab. 2). One possible explanation for the poor agreement was that although it was generally clear for nonpharmacological studies that there had been no attempt to ensure that allocation had been concealed, pharmacological studies more frequently had attempted to achieve this goal, although ambiguous language and incomplete descriptions of processes resulted in disagreements. This finding was particularly true for multicenter trials, in which the term "concealed allocation" often was not used, although one rater thought that it could be inferred, because centralized assignment usually is associated with this trial design. Adequately concealed randomization procedures, such as the use of opaque, sequentially numbered envelopes or off-site randomization, ensure that the investigators have no foreknowledge of subject group assignments and reduce bias by minimizing the possibility that the randomization schedule can be subverted. Although this concept seems straightforward, Schulz and Grimes [...]
The mathematical limitations of the kappa statistic were evident for several cases in which the kappa value was 0, despite a high percentage of raw agreement. This situation occurred when the product of 1 of the marginal totals was 0 and obviously remained 0 after it was divided by "n." The kappa value also took on a negative and nonsensical value in 2 cases in which, again, there was a high percentage of raw agreement. This result occurred because the value for expected agreement was greater than that for observed agreement. Examples of these calculations are provided in the Appendix for clarification. As far as we are aware, there is no agreed-upon solution to this dilemma (eg, adding 0.5 to each cell, in a manner similar to a Yates correction of a chi-square statistic).

Limitations of the study are that the "true" PEDro scores remain unknown, and consensus agreement on scale items does not necessarily mean that the raters were correct. In this respect, no claim can be made about the validity of the tool. However, we have likely successfully simulated a practical situation faced by many representative users of the tool attempting to score studies for potential inclusion in systematic reviews. Although the kappa value has statistical limitations, it is a commonly used statistical tool that many clinicians are comfortable using.

Conclusion

Evaluating the methodological quality of a clinical trial is often difficult because of a lack of reporting clarity, poor organization of the report, or the author's failure to include salient details. In the present study with the PEDro tool, 2 raters unanimously agreed on the reporting of 2 trial components--randomization and whether between-group comparisons had been reported--for 81 RCTs. The poorest agreement was found for concealed allocation and baseline comparability. Therefore, there appears to be greater clarity of reporting for certain components of study methodology than for others.
Although many scales and checklists are in use, there is no consensus as to which one(s), if any, can distinguish definitively between well and poorly conducted trials. However, certain components of trial methodology, such as randomization, concealment of allocation, masking, and intention-to-treat analysis, are known to influence the validity of results. Therefore, it is imperative that these items, which are most commonly associated with the potential for bias, be reported in the methodology section of a trial with transparent, unambiguous language, so that physical therapists are better equipped to identify studies that are more likely to yield valid results.
Appendix.
Sample Calculations Demonstrating the Limitations of the Kappa
Statistic
There were 3 PEDro items for which the kappa value was 0, despite a
high percentage of agreement: masking of subjects, masking of
therapists, and point estimates and variability. This phenomenon can
occur when 1 of the products of the marginal totals equals 0 (when 0
is present in 1 or more of the cells of the corresponding 2x2 table).
The result is a numerator that also equals 0.
A calculation of an example (masking of subjects in pharmacological
studies) is presented for clarification.
1. 2 x 2 table
                     Assessor 2
Assessor 1     Yes       No        Total
Yes            33 (a)    1 (b)     34
No             0 (c)     0 (d)     0
Total          33        1         34
2. Raw agreement:
(a + d)/(a + b + c + d) = (33 + 0)/(33 + 1 + 0 + 0) = .9706 (97%)
3. Kappa value: (observed agreement - expected agreement)/
(1 - expected agreement)
Where observed agreement = (a + d)/(a + b + c + d) = .9706
and expected agreement =
[(a + b)(a + c) + (c + d)(b + d)]/(a + b + c + d)^2
= [(34)(33) + (0)(1)]/34^2 = .9706
Kappa = (0.9706 - 0.9706)/(1 - 0.9706) = 0
The kappa statistic for PEDro item 10 (point estimates and
variability) was negative, despite good agreement. This result
occurred because the value for expected agreement was greater than
that for observed agreement.
A calculation of an example (point estimates and variability in all
83 studies) is presented for clarification.
1. 2 x 2 table
                     Assessor 2
Assessor 1     Yes       No        Total
Yes            80        2         82
No             1         0         1
Total          81        2         83
2. Raw (observed) agreement: (80 + 0)/83 = .9639
3. Kappa value, with expected agreement =
[(82)(81) + (1)(2)]/83^2 = .9644:
(0.9639 - 0.9644)/(1 - 0.9644) = -.016
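The two sample calculations can be checked with a short script. The following is only a minimal sketch, assuming the standard Cohen kappa formula for a 2 x 2 table; the function name cohen_kappa_2x2 and its return layout are illustrative and are not part of the original analysis (which the footnote indicates used SPSS).

```python
def cohen_kappa_2x2(a, b, c, d):
    """Return (observed, expected, kappa) for a 2 x 2 agreement table.

    a = both raters "yes", d = both raters "no",
    b and c = the two kinds of disagreement (hypothetical helper).
    """
    n = a + b + c + d
    observed = (a + d) / n
    # Chance agreement: sum of the products of matching marginal totals,
    # divided by n squared.
    expected = ((a + b) * (a + c) + (c + d) * (b + d)) / n ** 2
    if expected == 1:
        # Kappa is undefined when chance agreement is already perfect.
        return observed, expected, float("nan")
    kappa = (observed - expected) / (1 - expected)
    return observed, expected, kappa


if __name__ == "__main__":
    examples = {
        "kappa of 0 (first table)": (33, 1, 0, 0),
        "negative kappa (second table)": (80, 2, 1, 0),
    }
    for label, cells in examples.items():
        obs, exp, kappa = cohen_kappa_2x2(*cells)
        print(f"{label}: observed={obs:.4f} expected={exp:.4f} kappa={kappa:.3f}")
```

Run on the two tables above, the sketch reproduces the reported kappa values of 0 and -.016. In the first table an empty row makes one marginal product 0, so expected agreement equals observed agreement; in the second, expected agreement slightly exceeds observed agreement.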
This article was received July 28, 2005, and was accepted January 16, 2006.
References
(1) Verhagen AP, de Vet HC, de Bie RA, et al. The art of quality assessment of RCTs included in systematic reviews. J Clin Epidemiol. 2001;54:651-654.
(2) Moseley AM, Herbert RD, Sherrington C, Maher CG. Evidence for physiotherapy practice: a survey of the Physiotherapy Evidence Database (PEDro). Aust J Physiother. 2002;48:43-49.
(3) Verhagen AP, de Vet HC, de Bie RA, et al. The Delphi list: a criteria list for quality assessment of randomized clinical trials for conducting systematic reviews developed by Delphi consensus. J Clin Epidemiol. 1998;51:1235-1241.
(4) Maher CG, Sherrington C, Herbert RD, et al. Reliability of the PEDro scale for rating quality of randomized controlled trials. Phys Ther. 2003;83:713-721.
(5) Verhagen AP, Scholten-Peeters GG, de Bie RA, Bierma-Zeinstra SM. Conservative treatments for whiplash. Cochrane Database Syst Rev. 2004;(1):CD003338.
(6) Van den Ende CH, Vliet Vlieland TP, Munneke M, Hazes JM. Dynamic exercise therapy in rheumatoid arthritis: a systematic review. Br J Rheumatol. 1998;37:677-687.
(7) Green S, Buchbinder R, Hetrick S. Physiotherapy interventions for shoulder pain. Cochrane Database Syst Rev. 2003;(2):CD004258.
(8) Handoll HH, Sherrington C, Parker MJ. Mobilisation strategies after hip fracture surgery in adults. Cochrane Database Syst Rev. 2004;(4):CD001704.
(9) Ada L, Foongchomcheay A, Canning C. Supportive devices for preventing and treating subluxation of the shoulder after stroke. Cochrane Database Syst Rev. 2005;(1):CD003863.
(10) Pollock A, Baer G, Pomeroy V, Langhorne P. Physiotherapy treatment approaches for the recovery of postural control and lower limb function following stroke. Cochrane Database Syst Rev. 2003;(2):CD001920.
(11) Hayden J, Tulder M, Malmivaara A, Koes B. Exercise therapy for treatment of non-specific low back pain. Cochrane Database Syst Rev. 2005;(3):CD000335.
(12) Dagfinrud H, Kvien TK, Hagen KB. Physiotherapy interventions for ankylosing spondylitis. Cochrane Database Syst Rev. 2004;(4):CD002822.
(13) Barclay-Goddard R, Stevenson T, Poluha W, et al. Force platform feedback for standing balance training after stroke. Cochrane Database Syst Rev. 2004;(4):CD004129.
(14) Milne S, Brosseau L, Robinson V, et al. Continuous passive motion following total knee arthroplasty. Cochrane Database Syst Rev. 2003;(2):CD004260.
(15) Jadad AR, Moore RA, Carroll D, et al. Assessing the quality of reports of randomized clinical trials: is blinding necessary? Control Clin Trials. 1996;17:1-12.
(16) Teasell RW, Foley NC, Bhogal SK, Speechley MR. An evidence-based review of stroke rehabilitation. Top Stroke Rehabil. 2003;10:29-58.
(17) Bhogal SK, Teasell RW, Foley NC, Speechley MR. Quality of the stroke rehabilitation research. Top Stroke Rehabil. 2003;10:8-28.
(18) Bhogal SK, Teasell RW, Foley NC, Speechley MR. The PEDro scale provides a more comprehensive measure of methodological quality than the Jadad scale in stroke rehabilitation literature. J Clin Epidemiol. 2005;58:668-673.
(19) Cohen JA. A coefficient of agreement for nominal scales. Educ Psychol Meas. 1960;20:37-46.
(20) Shea B, Wells G, Cranney A, et al. Calcium supplementation on bone loss in postmenopausal women. Cochrane Database Syst Rev. 2003;(4):CD004526.
(21) Muir KW, Lees KR. Excitatory amino acid antagonists for acute stroke. Cochrane Database Syst Rev. 2003;(3):CD001244.
(22) Bacaltchuk J, Hay P. Antidepressants versus placebo for people with bulimia nervosa. Cochrane Database Syst Rev. 2003;(4):CD003391.
(23) Prasad K, Shrivastava A. Surgery for primary supratentorial intracerebral haemorrhage. Cochrane Database Syst Rev. 2000;(2):CD000200.
(24) Hay PJ, Bacaltchuk J. Psychotherapy for bulimia nervosa and binging. Cochrane Database Syst Rev. 2003;(1):CD000562.
(25) Bonaiuti D, Shea B, Iovine R, et al. Exercise for preventing and treating osteoporosis in postmenopausal women. Cochrane Database Syst Rev. 2002;(3):CD000333.
(26) Kunz R, Oxman AD. The unpredictability paradox: review of empirical comparisons of randomised and non-randomised clinical trials. BMJ. 1998;317:1185-1190.
(27) Fleiss JL. Statistical Methods for Rates and Proportions. 2nd ed. New York, NY: John Wiley
(28) Altman DG. Practical Statistics for Medical Research. London, England: Chapman & Hall; 1990:403-409.
(29) Landis JR, Koch GG. The measurement of observer agreement for categorical data. Biometrics. 1977;33:159-174.
(30) Schulz KF, Grimes DA. Allocation concealment in randomised trials: defending against deciphering. Lancet. 2002;359:614-618.
* SPSS Inc, 233 S Wacker
NC Foley, MSc (Candidate), is Research Associate, Department of Physical Medicine and Rehabilitation, Parkwood Hospital, St Joseph's Health Care London, London, Ontario, Canada. Address all correspondence to Ms Foley at 801 Commissioner's Rd East, London, Ontario, Canada N6C 5J1 (norine.foley@sjhc.london.on.ca). SK Bhogal, MSc, is PhD Candidate, Department of Epidemiology and Biostatistics, McGill University, Montreal, Quebec, Canada. RW Teasell, MD, FRCPC, is Professor and Chair/Chief, Department of Physical Medicine and Rehabilitation, Parkwood Hospital, St Joseph's Health Care London and the University of Western Ontario, London, Ontario, Canada. Y Bureau, PhD, is Statistical Consultant, Imaging Program, Lawson Health Research Institute, London, Ontario, Canada. MR Speechley, PhD, is Associate Professor, Department of Epidemiology and Biostatistics, Faculty of Medicine and Dentistry, Schulich School of Medicine and Dentistry, University of Western Ontario. This study was a modification from one component of a master's thesis completed by Sanjit Bhogal at the University of Western Ontario in the Department of Epidemiology and Biostatistics (2004). Ms Foley and Sanjit Bhogal were both involved in concept/idea/research design, writing, and data collection and analysis. Dr Bureau and Dr Speechley were consultants on the project. Dr Teasell procured funds and was a consultant. This project was funded by the Canadian Stroke Network and the Heart & Stroke Foundation of Ontario.
Table 1.
Physiotherapy Evidence-Based Database Scale (2,a)
Item Description
1 Participants were randomly allocated to groups (in a
crossover study, participants were randomly allocated to
the order in which the interventions were received).
2 Allocation was concealed.
3 The groups were similar at baseline with regard to the most
important prognostic indicators.
4 There was blinding of all participants (as per wording of
original guidelines).
5 There was blinding of all therapists who administered the
intervention.
6 There was blinding of all assessors who measured at least
one key outcome.
7 The follow-up of all participants entered into the trial was
adequate.
8 All participants for whom outcome measures were
available received the intervention or control condition as
allocated or, when this was not the case, data for at
least one key outcome were analyzed on the basis of
"intention to treat."
9 The results of between-group statistical comparisons were
reported for at least one key outcome.
10 The study provided both point measures and measures of
variability for at least one key outcome.
(a) Adapted and reprinted from Maher CG, Sherrington C, Herbert
RD, et al. Reliability of the PEDro scale for rating quality of
randomized controlled trials. Phys Ther. 2003;83:713-721, with
permission of the American Physical Therapy Association.
Table 2.
Percentages of Studies That Met Physiotherapy Evidence-Based
Database (PEDro) Criteria
                                              % (No.) of Studies
                                          Pharmaco-   Nonpharmaco-  Chi-
                              All         logical     logical       Square
PEDro Scale Item              (N = 83)    (n = 34)    (n = 49)      Value
Random allocation             100 (83)    100 (34)    100 (49)      0.0
Concealed allocation          15.7 (13)   14.7 (5)    16.3 (8)      0.0
Baseline comparability        81.9 (68)   79.4 (27)   83.7 (41)     0.043
Between-group comparison      100 (83)    100 (34)    100 (49)      0.0
Masking of participants       39.8 (33)   97.1 (33)   0             75.0
Masking of therapists or
  intervention administrators 1.2 (1)     2.9 (1)     0             0.034
Masking of assessors          54.2 (45)   85.3 (29)   32.7 (16)     20.4
Adequacy of follow-up         84.3 (70)   88.2 (30)   81.6 (40)     0.257
Intention-to-treat analysis   26.5 (22)   29.4 (10)   24.5 (12)     0.061
Point estimates and measures
  of variability              95.2 (79)   94.1 (32)   95.9 (47)     0.0
Table 3.
Mean (SE) Kappa Scores for Individual Components of the Physiotherapy
Evidence-Based Database (PEDro) Scale
                             Mean (SE) Kappa Score for the Following Studies:
PEDro Scale Item             Pharmacological  Nonpharmacological  All
Randomization                1.00             1.00                1.00
Concealed allocation         .452 (.175)      .788 (.116)         .631 (.106)
Baseline comparability       .679 (.0167)     .766 (.127)         .729 (.103)
Between-group comparisons    1.00             1.00                1.00
Masking of subjects          33/34 (a)        1.00                .975 (.025)
Masking of therapists        30/34 (a)        1.00                79/83 (a)
Masking of assessors         1.00             .826 (.082)         .902 (.048)
Adequacy of follow-up        .841 (.155)      .851 (.13)          .849 (.085)
Intention-to-treat analysis  .837 (.110)      .946 (.053)         .903 (.055)
Point estimates and
  variability                33/34 (a)        48/49 (b)           81/83 (b)
(a) The kappa score was 0.
(b) The kappa score was negative.