Comparing results of systematic reviews: parallel reviews of research on repeated reading.
Education and related services are relying increasingly on empirically supported treatments (ESTs), which have been shown to improve student outcomes through rigorous research. Many organizations have developed review systems with guidelines for judging the quality of studies and identifying ESTs. However, little explicit attention has been paid to issues of validity of these review systems. In this study, we used the criteria developed by Horner and colleagues (2005), Gersten and colleagues (2005), and the What Works Clearinghouse (WWC, 2008; Kratochwill et al., 2010) to evaluate the research base on repeated reading. The corpus of literature reviewed was derived from previous narrative literature reviews and meta-analyses that concluded that repeated reading was an effective intervention for improving reading fluency. However, the review systems employed in this study resulted in the conclusion that repeated reading did not have enough high quality research support to be considered an EST. The current reviews relied on strict criteria for the quality of each individual study, whereas the previous reviews and meta-analyses included studies with a wider range of quality. These results demonstrate that systematic reviews that strictly appraise the quality of studies and reject those not meeting standards can be substantially more conservative than other scientific review methods. The finding that these different review methods (narrative, meta-analysis, and systematic) can produce diverging recommendations raises issues of validity for practice recommendations.
Widespread concern about the level of academic achievement of American students persists, especially in the areas of reading and mathematics. Year after year, fewer than half of students in the U.S. are deemed "proficient" on national standardized tests in math and reading (Lee, Grigg, & Dion, 2007; Perie, Grigg, & Donahue, 2005). This is a complex problem demanding multifaceted solutions. However, one key aspect of any solution must focus on the specific programs and interventions adopted and implemented in schools. In the past, selection of educational treatments has too frequently been driven by ideology, superficial novelty, and marketing hype rather than empirically demonstrated effectiveness (Slavin, 1989, 2002, 2008). Numerous educators have identified a substantial gap between what is known about effective educational practices and what is actually implemented in schools (Carnine, 1997; Fuchs & Fuchs, 1996; Gersten, Vaughn, Deshler, & Schiller, 1997; Gottfredson & Gottfredson, 2001; Greenwood & Abbott, 2001; Vaughn, Moody, & Schumm, 1998). Major recent federal education legislation such as the No Child Left Behind Act (NCLB, 2003), Reading First (NCLB, 2003), and the Individuals with Disabilities Education Improvement Act (IDEIA, 2004) has attempted to promote, and even mandate, the use of practices that are supported by scientific research. Thus, the movement toward using empirically supported treatments to enhance evidence-based education is at the forefront of education as never before (Detrich, Keyworth, & States, 2007; Slavin, 2008).
The success of evidence-based education is critically dependent on educators' ability to identify treatments that have sufficient research support to suggest that they will be effective if widely disseminated and well implemented--that is, empirically supported treatments (ESTs). Therefore, the process and standards for identifying ESTs are crucial to the entire enterprise of evidence-based education. There is widespread consensus that ESTs should be identified through a particular type of scientific literature review. These systematic reviews use explicit and replicable methods to (a) search for relevant studies, (b) screen studies for relevance and general methodological features, (c) determine the methodological adequacy of each study, (d) summarize the outcomes from individual studies, and (e) describe the nature and strength of evidence related to the treatment. Numerous organizations have developed review systems that specify more detailed procedures, guidelines, standards, and criteria for each of these steps in the review process (e.g., Gersten et al., 2005; Horner et al., 2005; Kratochwill & Stoiber, 2002; www.bestevidence.org; www.campbellcollaboration.org). These review systems have been used to produce many reviews. The What Works Clearinghouse (WWC, 2008) has posted reviews of interventions on its website for beginning reading, English language learners, elementary math, middle school math, character education, and dropout prevention. Other organizations such as the Best Evidence Encyclopedia (BEE; www.bestevidence.org) have also produced reviews of interventions relevant to broadly defined educational outcomes.
In addition, systematic reviews with these features have been published as stand-alone articles in professional journals (e.g., Bellini & Akullian, 2007; Browder, Wakeman, Spooner, Ahlgrim-Delzell, & Algozzine, 2006; Stenhoff & Lignugaris/Kraft, 2007) and the journal Evidence-Based Practice Briefs is devoted to publishing short summaries of systematic reviews.
However, the process of evaluating the reliability and validity of these review systems in education has just begun (e.g., Briggs, 2008; Confrey, 2006; Green & Skukauskaite, 2008; Schoenfeld, 2006; Slavin, 2008; Slocum, Detrich, & Spencer, 2012 [this issue]; Wendt & Miller, 2012 [this issue]). One important source of validity evidence is the degree to which the results of these review systems correspond with the results of other methods for reviewing scientific research. The type of systematic review that we described above is not the only approach to reviewing research literature as a basis for making practical recommendations. The traditional approach to this task is the narrative review, in which the reviewer describes and discusses various research studies that in his or her expert opinion are most relevant to the topic. Based on this discussion, the reviewer draws conclusions about the treatments that he or she believes are most effective. A more recent approach is the meta-analysis, in which the reviewer systematically searches for all relevant literature, describes features of the studies, and derives effect size statistics that describe the overall effects of treatments. Systematic reviews (as described above) are the most recent approach to reviewing scientific research and identifying recommended treatments.
This paper focuses on the degree to which results from various types of literature reviews correspond with one another -- that is, criterion evidence of validity. Two important questions regarding this type of evidence are: (a) to what degree do results from systematic reviews correspond with results from other types of scientific literature reviews, and (b) to what degree do results from systematic reviews based on one review system correspond with results from other reviews that use different review systems (i.e., different procedures and standards)? Initial reports of correlations among review systems are sobering. For example, Briggs (2008) examined ratings given to elementary-level mathematics programs by two prominent organizations that perform systematic reviews. He found that when WWC and BEE reviewed the same set of programs, their ratings had a correlation of only .57. Although this is only one example, it clearly suggests that careful examination of the validity of review systems is warranted.
In order to assess the degree to which two review systems converge with each other and with traditional narrative and meta-analytic literature reviews, we selected an instructional practice that has been endorsed by numerous previous narrative and meta-analytic reviews. Repeated reading is an intervention designed to increase reading fluency, and in turn, reading comprehension. The most basic form of repeated reading "consists of rereading a short, meaningful passage several times until a satisfactory level of fluency is reached" (Samuels, 1979, p. 404). Dozens of studies have evaluated the efficacy of repeated reading (with numerous variations on the basic procedure) for improving various aspects of reading fluency and comprehension. Six major reviews of research on repeated reading have been published (Chard, Ketterlin-Geller, Baker, Doabler, & Apichatabutra, 2009; Chard, Vaughn & Tyler, 2002; Dowhower, 1989; Meyer & Felton, 1999; National Institute of Child Health and Human Development [NICHD], 2000; Therrien, 2004).
Dowhower (1989) conducted a narrative review of ten studies on repeated reading. In general, she reported whether or not the intervention in each study resulted in more improvement for the treatment group than the control group, and summarized the effects. She did not evaluate the methodological quality of the studies. She concluded, "We have the research evidence to show that repeated reading procedures produce gains in speed and accuracy, result in better phrasing and expression, and enhance recall and understanding for both good and poor readers" (p. 506).
Meyer and Felton (1999) also narratively summarized ten studies on repeated reading and five studies on single word and phrase fluency training, without addressing the methodological quality of studies. The authors reported whether the students in the repeated reading group showed greater improvement than the students in the other groups. The authors concluded:
In spite of many limitations such as length of intervention, early studies provided positive evidence for the efficacy of fluency training. Later research helped define variables such as reader skill level and characteristics, type of RR [repeated reading] technique, number of passages read, and length of practice to be considered. (Meyer & Felton, 1999, p. 297)
Although they noted limitations in the literature, they ultimately identified repeated reading as efficacious and recommended three curricula that used repeated reading to improve fluency.
The National Reading Panel (NICHD, 2000) analyzed 50 studies on "repeated and guided repeated oral reading" (p. 3-11). Their meta-analysis included effect sizes comparing treatment and no-treatment control groups. Studies were selected according to the general methods of the National Reading Panel which required: (a) screening for relevance of the topic, (b) careful description of participants, interventions, methods, and outcome measures, and (c) publication in a peer reviewed journal. In addition, the experimental studies of repeated and guided oral reading were included only if they reported outcomes at pre- and posttest for a treatment group and a control group. Based on outcome measures, mean weighted effect sizes favoring the repeated reading groups were .55 for accuracy, .44 for fluency, and .35 for comprehension. The panel concluded, "Guided repeated oral reading and repeated reading provide students with practice that substantially improves word recognition, fluency, and--to a lesser extent--reading comprehension" (NICHD, 2000, p. 3-3). They also concluded that the patterns found in the single case and other studies supported the results of the meta-analysis. However, the broad definition of repeated reading and "guided repeated oral reading" included methods such as neurological impress (Hollingsworth, 1970, 1978) and paired reading (Morgan, 1976; Morgan & Lyon, 1979; Shany & Biemiller, 1995) in which text was not read repeatedly (i.e., more than once).
Chard, Vaughn and Tyler (2002) reported a meta-analysis of published and unpublished literature on interventions for helping elementary students with learning disabilities build reading fluency. They included 24 studies, 21 of which examined the effects of repeated reading on the reading fluency, accuracy, and/or comprehension of students with learning disabilities. The studies included group experimental, single case, single group, and case study designs, with no explicit characterization of methodological quality. For studies that reported enough information, the authors calculated a standardized mean difference effect size (Cohen's d) for each relevant comparison. For the other studies, the results were summarized in a narrative. On fluency outcomes, they found average effect sizes of d = .68 for repeated reading without a model, and d = .71 for interventions that included multiple features plus repeated reading. The authors concluded, "In general, the findings from this synthesis suggested that repeated reading interventions for students with LD are associated with improvements in reading rate, accuracy, and comprehension" (Chard et al., 2002, p. 402).
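The standardized mean difference reported in these meta-analyses divides the difference between treatment and control group means by a pooled standard deviation. The sketch below shows one common formulation of Cohen's d (the function name and the example numbers are illustrative only; individual meta-analyses vary in the exact pooling formula they use):

```python
import math

def cohens_d(mean_t, sd_t, n_t, mean_c, sd_c, n_c):
    """Standardized mean difference between a treatment and a control group,
    using the pooled standard deviation as the denominator."""
    pooled_sd = math.sqrt(((n_t - 1) * sd_t**2 + (n_c - 1) * sd_c**2)
                          / (n_t + n_c - 2))
    return (mean_t - mean_c) / pooled_sd

# Hypothetical example: treatment group reads 95 words correct per minute
# (SD 12, n = 20) vs. a control group at 87 (SD 13, n = 20).
d = cohens_d(95, 12, 20, 87, 13, 20)  # d is approximately 0.64
```

By the conventional benchmarks, the fluency effect sizes reported above (d = .68 and d = .71) would be described as medium to large.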
Therrien (2004) conducted a meta-analysis of 18 studies on repeated reading to determine the "essential instructional components of repeated reading and the effect of repeated reading on reading fluency and comprehension" (p. 252). Therrien (2004) included only studies that were experimental, reported quantitative information, and included enough information from which to calculate an effect size. No other methodological quality criteria were reported. The meta-analysis provided a great deal of information on the differential effects of the various components of repeated reading (e.g., repeated reading with a model vs. repeated reading without a model) on reading fluency and comprehension. In addition, Therrien (2004) separated these effects into results for "nontransfer" (assessment using a previously practiced text) and "transfer" (assessment using a previously unread passage) of reading skills. For transfer measures, the mean effect sizes for growth were .50 for fluency and .25 for comprehension. Therrien (2004) concluded that, "results from this analysis ... confirmed previous findings that repeated reading improves students' reading fluency and comprehension" (p. 258).
Based on these narrative and meta-analytic literature reviews, repeated reading has been accepted as a research-supported practice for almost 20 years. The authors of these reviews included a broad range of research to provide the most available information on the treatment. However, this effort resulted in certain limitations that may decrease our confidence in the authors' conclusions. For example, these reviews included minimal or no criteria for evaluating the methodological quality upon which the conclusions are based. This inclusive strategy has the potential advantage of providing information from a large sample of research on the treatment and reduces the problem of excluding potentially valuable studies. However, including lower quality studies may produce erroneous conclusions based on critically flawed research.
More recently, Chard et al. (2009) conducted a systematic review of the literature on repeated reading with students with learning disabilities using an adaptation of Gersten et al.'s (2005) and Horner et al.'s (2005) review systems. They found that much of the literature included in previous narrative and meta-analytic reviews did not meet their methodological standards, and that based only on the methodologically acceptable studies, the treatment would not be considered empirically supported for improving reading fluency of students with learning disabilities. This outcome suggests that results from systematic reviews may not converge with those from narrative reviews and meta-analyses. Importantly though, Chard et al. (2009) focused their review on students with learning disabilities. Thus, one explanation for divergent results could be that relatively few studies have been conducted with children with learning disabilities, but a similar review of the entire body of research on repeated reading might support the practice. In addition, Chard and colleagues (2009) assessed the literature base using the criteria from Gersten et al. (2005) and Horner et al. (2005). Using other criteria may result in different conclusions about the empirical support for repeated reading.
Given a large body of literature on repeated reading, five literature reviews concluding that the practice is effective, and one systematic review concluding that it is not empirically supported, the goal of the current review is to evaluate this research base using review systems based on (a) Gersten et al.'s (2005) quality indicators for group research, (b) Horner et al.'s (2005) quality indicators for single case research, (c) WWC (2008) standards for group research, and (d) WWC standards for single case research (Kratochwill et al., 2010), and to determine whether the results from these evaluations converge with results from previous literature reviews. We addressed the following research questions:
1. Do systematic reviews using Gersten et al.'s (2005), Horner et al.'s (2005), and WWC systems yield conclusions about repeated reading similar to those of previous literature reviews?
2. Are the results of systematic reviews using Gersten et al.'s (2005) and Horner et al.'s (2005) systems similar to those using WWC standards?
Search for Relevant Studies
To evaluate the degree to which results from different review systems converge, we established a common set of primary studies to be submitted to these review processes. In order to ensure that the systematic reviews would be comparable to previous reviews, our common set of studies included the primary studies cited in one or more of three recent literature reviews on repeated reading (i.e., Chard et al., 2002; NICHD, 2000; Therrien, 2004). Seventy-four articles were identified from these previous literature reviews. The National Reading Panel's 14 "immediate effects" (nontransfer) studies (NICHD, 2000) were excluded, since the current review focuses on transfer studies alone. Three studies were excluded because they were unpublished manuscripts or dissertations (Monda, 1989; Stout, 1997; Sutton, 1991). One study was excluded (Reutzel & Hollingsworth, 1993) because it was a duplicate report of the same data set as another study that was included (Eldredge, Reutzel, & Hollingsworth, 1996). One dissertation (Cohen, 1988) was replaced with a published article that appeared to be the same study (Cohen, Torgesen, & Torgesen, 1988). This process resulted in 56 published articles, three of which contained two separate studies (Deutsch-Smith, 1979; Faulkner & Levy, 1999; Levy, Abello, & Lysynchuk, 1997) for a total of 59 studies. A full reference list of articles from the literature reviews is available from the first author.
Screening for Relevant Studies
Articles were screened to eliminate studies that were not relevant to the review. WWC uses an explicit system for screening studies that includes the following categories: 1) Relevant Topic, 2) Relevant Timeframe, 3) Relevant Intervention, 4) Relevant Sample, 5) Relevant Outcome, 6) Adequate Outcome Measure, 7) Adequate Statistics Reported (WWC, 2008; see Table 1). We constructed a screening protocol that included these categories and was based on the model of the WWC Beginning Reading protocol (WWC, 2007d).
Table 1
Screening Criteria for Group and Single Case Designs

Relevant Topic. Group studies: study must be about reading and include an intervention. Single case: study must be about reading and include an intervention.

Relevant Timeframe. Group studies: published within 20 years of review.* Single case: published within 20 years of review.*

Relevant Intervention. Group studies: participants read connected text more than one time, in English; include comparison or control group. Single case: participants read connected text more than one time, in English; include baseline and treatment phases.

Relevant Sample. Group studies: students in grades K-12. Single case: students in grades K-12.

Relevant Outcome. Group studies: at least one reading outcome with adequate face validity. Single case: at least one reading outcome with adequate face validity.

Adequate Measures. Group studies: name outcome test (if standardized); measures are relevant to topic; minimum reliability; measure reading skill on unpracticed passage. Single case: measures are relevant to topic; measure reading skill on unpracticed passage.

Adequate Reporting. Group studies: group design with pre- and post-measures; inferential statistics used; means, standard deviations, and group sizes reported. Single case: single case reported as unit of intervention and analysis; outcome variable data reported repeatedly over each phase.

* Relevant timeframe criterion was coded but older studies were not excluded so we could examine the effect of this criterion.
The relevant topic criterion included only studies with reading outcomes and that evaluated the effects of an intervention rather than student characteristics or an assessment tool. The relevant timeframe for a WWC review includes only studies published within the 20 years prior to the review. To test if this criterion influenced the outcome of this review, we first screened studies with the other six screening criteria. Studies that would have been screened out only on the "relevant timeframe" criterion were retained in the next steps of the review process (assessing methodological quality, summarizing outcomes for individual studies, and describing the strength of evidence for a treatment). This process allowed us to determine whether the timeframe criterion arbitrarily screened out studies that, because of publication date alone, would otherwise qualify as having adequate methodological quality. A relevant intervention was defined as one that included at least two oral readings of a given passage by the student (one of these may also have been used as an assessment). Two criteria were added to the relevant intervention definition that were clearly used by WWC to generate the intervention and topic reports (see WWC, 2006, 2007a), but that were not explicitly listed in the general criteria (WWC, 2008) or the beginning reading protocol (WWC, 2007d). WWC excludes studies in which the intervention being evaluated is combined with another intervention (e.g., an intervention that includes a repeated reading component and a self-questioning strategy). In addition, WWC does not use studies in which two versions of the same intervention were compared (e.g., repeated reading with a model vs. repeated reading without a model) as evidence for the intervention. These studies were screened out. Optional features of relevant interventions included modeling by an adult or peer, error correction procedures, cueing, and rewards for engagement or improvement.
A relevant sample was defined as one that included students in kindergarten through 12th grade. Relevant outcomes included reading fluency, comprehension, accuracy, prosody, and/or general reading achievement. Adequate outcome measures required a transfer passage--that is, a passage that had not been practiced during the intervention. Adequate statistics reported included means, standard deviations, and numbers of students in each group. Gersten et al.'s (2005) criteria do not include a screening procedure; however, some screening procedure is necessary to identify studies that are relevant to the review. Therefore, we used the screening criteria described above to select studies to be evaluated based on the Gersten et al. (2005) standards.
Neither WWC's single case design technical documentation (Kratochwill et al., 2010) nor Horner et al.'s (2005) criteria include specific screening procedures for single-case designs. We used the following criteria from the group screening process: 1) Relevant Topic; 2) Relevant Timeframe; 3) Relevant Intervention; 4) Relevant Sample; 5) Relevant Outcome; 6) Adequate Outcome Measure (WWC, 2008; see Table 1). In place of the group criterion "Adequate Statistics Reported," we used the WWC single case design documentation (Kratochwill et al., 2010) to develop a seventh criterion, "Single Case Design," to determine if the study included adequate individual case data. To meet the criterion, the individual case had to be the unit of intervention and the unit of data analysis, be measured repeatedly over time, and include a baseline phase. Studies that failed to meet one or more of the screening criteria (except relevant timeframe) were eliminated from the pool and not reviewed further. As with group studies, the relevant timeframe criterion was to be applied after the quality of studies was determined to explore whether the timeframe criterion screened out studies that would have met the methodological quality criteria.
We developed four detailed protocols to guide the remaining steps in the review process: evaluation for methodological quality, summarization of outcomes, and finally, determination of the nature and strength of evidence related to repeated readings.
Gersten Quality Indicators: Group Research
Determining methodological adequacy. The quality indicators for group experimental and quasi-experimental research (Gersten et al., 2005) were used to develop a review protocol for evaluating the methodological quality of the group studies (see Table 2). The protocol listed each of the criteria and summarized the description of each criterion. Several authors have constructed Likert-type scales for each of these quality indicators; however, they have sometimes found it difficult to establish adequate interrater agreement using these scales (Baker, Chard, Ketterlin-Geller, Apichatabutra, & Doabler, 2009; Chard et al., 2009; Jitendra, Burgess, & Gajria, 2011). Due to this difficulty, we divided each category into more specific questions which could be answered as yes, no, or not applicable (adapted from Maggin and Chafouleas' method). To operationalize the categories, additional specifications had to be made to Gersten et al.'s (2005) general criteria. For example, Gersten et al. (2005) included a category on describing participants adequately. We divided that category into seven more specific criteria, each addressing whether or not a particular participant attribute was described.
Table 2
Standards of Methodological Quality of Group Studies Based on Gersten et al. (2005) and WWC Standards (2008)

Participant Description. Based on Gersten et al.: documentation of disability status; gender, race, English status, economic status; equivalent characteristics across groups. Based on WWC Standards: no criteria beyond screening criteria.

Interventionist Description. Based on Gersten et al.: interventionists described and equivalent across groups. Based on WWC Standards: no criteria.

Study Design, Randomization. Based on Gersten et al.: randomization not required. Based on WWC Standards: randomized control trial (RCT) or quasi-experimental design (QED).

Pretest Group Equivalence. Based on Gersten et al.: groups equivalent (d [less than or equal to] .25 difference) at pretest on at least one measure. Based on WWC Standards: groups equivalent (d [less than or equal to] .25 difference) at pretest only if RCT with high attrition or QED.

Attrition. Based on Gersten et al.: overall attrition [less than or equal to] 30%; differential attrition [less than or equal to] 10% across groups. Based on WWC Standards: low overall and differential attrition based on bias model (WWC, 2008, p. 14).

Intervention. Based on Gersten et al.: clearly described procedures, materials, time, comparison condition; details of comparison condition documented; audio or video excerpts to describe intervention. Based on WWC Standards: no confounds with intervention (e.g., no teacher/group confound).

Intervention Fidelity. Based on Gersten et al.: intervention fidelity described and assessed; fidelity measures include quality features. Based on WWC Standards: no criteria.

Outcome Measures. Based on Gersten et al.: multiple measures; measures delivered at appropriate times; multiple reliability statistics as appropriate; data collectors blind and equally familiar; maintenance assessment; evidence of criterion and construct validity. Based on WWC Standards: no criteria beyond screening criteria.

Data Analysis. Based on Gersten et al.: analysis linked to research questions; unit of assignment and analysis aligned; rationale for analyses; adjustment for misalignment; inferential statistics and effect sizes reported; clear, coherent report of results. Based on WWC Standards: criteria used for summarizing study outcomes and adjusting significance and effect sizes for pretest differences d > .25.
Note: Standards in bold are considered "essential" and those in plain text are "desirable."
Summarizing study outcomes. Gersten et al. (2005) defined a "high quality" study as one that meets all but one of the "Essential Quality Indicators" and at least four of the "Desirable Quality Indicators." "Acceptable quality" studies meet all but one of the "Essential Quality Indicators" and at least one of the "Desirable Quality Indicators" (Gersten et al., 2005).
Describing the strength of evidence for a treatment. For an intervention to be considered "evidence-based" according to Gersten et al. (2005), a minimum of four acceptable quality studies or two high quality studies must support the intervention; and the weighted effect size must be significantly greater than zero. For a practice to be considered "promising," the same number and quality of studies are required, but there must be a "20% confidence interval for the weighted effect size that is greater than zero" (Gersten et al., 2005, p. 162). These standards were applied as written.
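The classification rules in the two preceding paragraphs reduce to simple counting logic. The sketch below is our reading of the Gersten et al. (2005) criteria, not an official implementation; in particular, the assumption that high quality studies also count toward the four acceptable quality studies is ours:

```python
def gersten_study_quality(essentials_met, essentials_total, desirables_met):
    """Classify one study per Gersten et al. (2005): a study may miss at most
    one Essential Quality Indicator; its tier then depends on Desirables."""
    if essentials_met < essentials_total - 1:
        return "unacceptable"
    if desirables_met >= 4:
        return "high quality"
    if desirables_met >= 1:
        return "acceptable quality"
    return "unacceptable"

def gersten_evidence_based(study_ratings, weighted_es_sig_gt_zero):
    """'Evidence-based' requires >= 4 acceptable quality or >= 2 high quality
    studies AND a weighted effect size significantly greater than zero."""
    high = sum(r == "high quality" for r in study_ratings)
    acceptable = sum(r in ("high quality", "acceptable quality")
                     for r in study_ratings)
    return (high >= 2 or acceptable >= 4) and weighted_es_sig_gt_zero
```

The "promising" designation would replace the significance condition with the 20% confidence-interval criterion quoted above.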
Horner Quality Indicators: Single Case Research
Determining methodological adequacy. Horner et al. (2005) identified 21 quality indicators for high quality single case studies. These criteria were used to develop a review protocol for evaluating the single case studies. We adapted the Horner et al. criteria to be assessed through a series of dichotomous (yes/no) questions, as described above for the Gersten et al. (2005) criteria. Additional specifications to the criteria were made, such as requiring that treatment fidelity be measured in at least 20% of sessions, with at least one measure of treatment fidelity per phase, and that treatment fidelity reach at least 90%, if quantified as a percentage of steps completed accurately. The specifications were added to increase the objectivity of the criteria. No explicit criteria are noted for determining minimum methodological adequacy; therefore, we required that all 21 quality indicators be met for a study to have minimally acceptable methodological adequacy.
Summarizing study outcomes. Horner et al. (2005) did not present specific criteria for summarizing individual study outcomes. We established the criteria that experimental control had to be demonstrated (based on visual inspection) and that results had to favor the repeated reading treatment across the majority of cases in the study.
Describing the strength of evidence for a treatment. According to Horner et al. (2005), an intervention can be considered empirically supported when:
(a) a minimum of five single case studies that meet minimally acceptable methodological criteria and document experimental control have been published in peer-reviewed journals, (b) the studies are conducted by at least three different researchers across at least three different geographical locations, and (c) the five or more studies include a total of at least 20 participants (p. 176).
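These three thresholds lend themselves to a direct check. A sketch follows; the representation of each study as a record with researcher, location, and participant count is our own, and "acceptable" here stands in for meeting the minimally acceptable methodological criteria, documenting experimental control, and being published in a peer-reviewed journal:

```python
def horner_empirically_supported(studies):
    """Check the Horner et al. (2005) thresholds for an empirically supported
    treatment. Each study is a dict with keys: 'acceptable' (bool),
    'researcher', 'location', and 'n_participants'."""
    acceptable = [s for s in studies if s["acceptable"]]
    return (len(acceptable) >= 5                                    # (a) five studies
            and len({s["researcher"] for s in acceptable}) >= 3     # (b) three researchers
            and len({s["location"] for s in acceptable}) >= 3       # (b) three locations
            and sum(s["n_participants"] for s in acceptable) >= 20) # (c) twenty participants
```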
WWC Evidence Standards, Group Designs
Determining methodological adequacy. A review protocol was developed for the WWC evidence standards based on the WWC general criteria (WWC, 2008) and modeled on the beginning reading protocol (WWC, 2007d; see Table 2). The WWC Procedures and Standards Handbook (WWC, 2008) includes specific criteria for rating the methodological quality of studies based on whether random assignment is used (i.e., randomized control trials versus quasi-experimental designs), problems with overall and differential attrition, confounds with the intervention (e.g., one teacher teaching the intervention group and another teaching the comparison group), and pretest equivalence. We followed these standards as written.
Summarizing study outcomes. Based on the ratings for methodological quality, each study was rated as (a) meets evidence standards, (b) meets evidence standards with reservations, or (c) does not meet evidence standards. Randomized control trials without attrition problems were rated as meeting evidence standards (regardless of pretest equivalence). Randomized control trials with attrition problems that showed pretest equivalence and quasi-experimental designs without attrition problems that showed pretest equivalence were rated as meeting evidence standards with reservations. Randomized control trials with confounds or attrition problems and lack of pretest equivalence and quasi-experimental designs with attrition problems or lack of pretest equivalence were rated as not meeting evidence standards.
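The rating logic in the preceding paragraph can be expressed as a single decision function. This is a sketch of how we applied the WWC (2008) standards, with confounds reduced to a single flag rather than the fuller treatment in the handbook:

```python
def wwc_group_rating(is_rct, attrition_problem, pretest_equivalent,
                     confound=False):
    """Rate a group-design study per the WWC (2008) evidence standards as
    applied in this review."""
    if confound:
        return "does not meet evidence standards"
    if is_rct and not attrition_problem:
        # Pretest equivalence is not required for an RCT without attrition.
        return "meets evidence standards"
    # RCTs with attrition, and QEDs without attrition, must show equivalence.
    if pretest_equivalent and (is_rct or not attrition_problem):
        return "meets evidence standards with reservations"
    return "does not meet evidence standards"
```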
Describing the strength of evidence for a treatment. After each study was rated, the WWC Intervention Rating Scheme (WWC, 2008) was used to determine whether repeated reading could be considered an empirically supported treatment. WWC gives each intervention one of six ratings based on the number of studies meeting evidence standards, the effect size of the comparisons, and the statistical significance of the comparisons. The ratings are: "positive effects, potentially positive effects, mixed effects, no discernible effects, potentially negative effects, or negative effects" (WWC, 2008, pp. 22-23).
WWC Evidence Standards, Single Case Designs.
Determining methodological adequacy. WWC (Kratochwill et al., 2010) has established criteria for evaluating the evidence presented in single case designs. These criteria include two stages for determining methodological quality. First, studies are evaluated for the quality of design (adequate measurement of the dependent variable, systematic manipulation of the independent variable, sufficient number of contrasts between intervention and comparison conditions at different times, and sufficient number of data points per phase). Each study is categorized as (a) meets evidence standards, (b) meets evidence standards with reservations, or (c) does not meet evidence standards. Only studies that meet evidence standards (with or without reservations) are evaluated further. Second, studies are evaluated for whether they demonstrate a causal relation. A causal relation is established through visual analysis of the data with respect to level, trend, variability, immediacy of effect, proportion of data overlap between phases, and consistency across cases. The patterns of these effects are used to summarize study outcomes.
Summarizing study outcomes. Studies are rated as showing (a) strong evidence if they include at least three demonstrations of an effect with no non-effects, (b) moderate evidence if they include at least three demonstrations of an effect and at least one demonstration of a non-effect, or (c) no evidence if they fail to show three demonstrations of an effect (Kratochwill et al., 2010).
Describing the strength of evidence for a treatment. Studies that meet evidence standards (with or without reservations) are then used to determine whether the treatment is supported by adequate evidence. Kratochwill et al. (2010, p. 21) suggest that a treatment is adequately supported when at least five studies that meet evidence standards (with or without reservations), conducted by at least three different research teams in at least three different geographic locations and including at least 20 cases across studies, support the treatment.
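The "5-3-20" threshold described above can be expressed as a simple check. The list-of-dicts representation below is our own assumption for illustration, not part of either review system:

```python
def meets_5_3_20_rule(studies):
    """Check the threshold of Horner et al. (2005) / Kratochwill et al. (2010):
    at least 5 acceptable studies, conducted by at least 3 research teams in
    at least 3 geographic locations, totaling at least 20 cases.

    `studies` is a hypothetical list of dicts, each with the keys
    'team', 'location', and 'n_cases', already screened for quality.
    """
    return (
        len(studies) >= 5
        and len({s["team"] for s in studies}) >= 3
        and len({s["location"] for s in studies}) >= 3
        and sum(s["n_cases"] for s in studies) >= 20
    )
```

Under this sketch, five quality studies by only two teams, or by three teams totaling fewer than 20 cases, would fail the rule.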
Screening for Relevant Studies
From the original pool of 59 studies derived from previous reviews, 48 studies (81%) failed to meet the screening criteria (see Figure 1).
Group design studies. Of the 41 group design studies in our original review set, 31 did not meet the screening criteria: 0% failed on the relevant topic criterion, 68% failed on the relevant intervention criterion, 3% failed the relevant sample criterion, 0% failed the relevant outcome criterion, 58% failed the criterion of relevant measures, and 45% failed the criteria of statistical reporting (see Table 3). These percentages sum to more than 100% because some studies failed on more than one criterion. Studies failed a criterion either because they did not report enough information or because the information they reported did not meet the criterion. For the relevant intervention and relevant sample criteria, most studies failed because the reported information did not correspond to the criterion. For the relevant measures and statistical reporting criteria, studies tended to fail because they did not report adequate information. Most studies failed to meet multiple criteria. Ten studies met the screening criteria and were coded further.
Table 3
Summary of Group Design Studies That Did Not Meet Screening Criteria (n = 31)

Study                     Year  Topic  Intervention  Sample  Outcome  Measures  Statistical Reporting
Bryant et al.             2000  met    failed        met     met      failed    met
Cohen et al.              1988  met    failed        met     met      failed    failed
Dixon-Krauss              1995  met    failed        met     met      failed    failed
Dowhower                  1987  met    failed        met     met      met       met
Faulkner & Levy, Study 1  1999  met    failed        met     met      failed    met
Faulkner & Levy, Study 2  1999  met    failed        failed  met      failed    met
Fuchs et al.              1997  met    failed        met     met      met       met
Herman                    1985  met    met           met     met      failed    met
Hollingsworth             1970  met    failed        met     met      met       failed
Hollingsworth             1978  met    failed        met     met      met       failed
Labbo & Teale             1990  met    met           met     met      met       failed
Levy et al., Study 1      1997  met    met           met     met      failed    met
Levy et al., Study 2      1997  met    met           met     met      failed    met
Lindsay et al.            1985  met    failed        met     met      met       failed
Lorenz & Vockell          1979  met    failed        met     met      met       failed
Marston et al.            1995  met    failed        met     met      met       met
Mercer et al.             2000  met    failed        met     met      met       met
Miller et al.             1986  met    failed        met     met      failed    met
Moseley                   1993  met    failed        met     met      failed    failed
O'Shea et al.             1984  met    failed        met     met      failed    failed
O'Shea et al.             1985  met    failed        met     met      failed    failed
O'Shea et al.             1987  met    met           met     met      failed    met
Rashotte & Torgesen       1985  met    met           met     met      met       failed
Rasinski                  1990  met    met           met     met      failed    met
Rasinski et al.           1994  met    met           met     met      failed    met
Shany & Biemiller         1995  met    failed        met     met      met       met
Sindelar et al.           1990  met    met           met     met      failed    met
Stoddard et al.           1993  met    failed        met     met      failed    met
Van Bon et al.            1991  met    failed        met     met      failed    failed
Winter                    1986  met    failed        met     met      met       failed
Winter                    1988  met    failed        met     met      met       failed

Studies failing (count)         0/31   21/31         1/31    0/31     18/31     14/31
Studies failing (%)             0%     68%           3%      0%       58%       45%
To assess the reliability of coding based on the WWC Screening, three doctoral students were trained to use the review tools. Raters scored three initial studies until they achieved at least 90% agreement with the first author. Agreement between each rater and the first author was calculated for each rating category as the number of agreements divided by the sum of agreements and disagreements. Interrater agreement was assessed on the screening of 28 group studies (68% of group studies). The mean interrater agreement across group study screening categories was 96.43%, with a range of 82-100%. Agreement for the final rating of studies (i.e., whether a study should be screened out or in based on the previously coded categories) was 92.86%.
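The agreement statistic described above (agreements divided by the sum of agreements and disagreements) can be sketched as follows; the function name and input format are illustrative, not taken from the study:

```python
def percent_agreement(rater_a, rater_b):
    """Point-by-point interrater agreement: number of agreements divided by
    the sum of agreements and disagreements, expressed as a percentage.

    rater_a, rater_b -- parallel lists of category codes from two raters.
    """
    if len(rater_a) != len(rater_b):
        raise ValueError("Raters must code the same set of items")
    agreements = sum(a == b for a, b in zip(rater_a, rater_b))
    return 100.0 * agreements / len(rater_a)
```

For instance, if two raters agree on 26 of 28 screening decisions, agreement is 100 * 26 / 28, or about 92.86%, matching the overall screening figure reported above.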
Single case design studies. The original set of research included 18 single case studies; 17 of these failed to meet the screening criteria. Of those that failed, 0% failed the relevant topic criterion, 35% did not meet the relevant intervention criterion, 0% failed the relevant sample or outcome criteria, 59% did not meet the relevant measures criterion, and 47% did not meet the criteria for being an adequate single case design (see Table 4). These failures occurred primarily because the reported information did not meet the screening criteria, rather than because information was missing. One study (Swain & Allinder, 1996) met the screening criteria and was coded further.
Table 4
Single Case Design Studies That Did Not Meet Screening Criteria (n = 17)

Study                    Year  Topic  Intervention  Sample  Outcome  Measures  Single Case Design
Blum et al.              1995  met    met           met     met      failed    failed
Carver & Hoffman         1981  met    met           met     met      met       failed
Daly & Martens           1994  met    met           met     met      failed    met
Deutsch Smith, Study 1   1979  met    failed        met     met      failed    failed
Deutsch Smith, Study 2   1979  met    met           met     met      failed    met
Gilbert et al.           1986  met    met           met     met      failed    met
Kamps et al.             1994  met    met           met     met      failed    met
Langford et al.          1974  met    met           met     met      met       failed
Law & Kratochwill        1993  met    failed        met     met      met       failed
Lovitt & Hansen          1976  met    met           met     met      failed    met
Mefferd & Pettegrew      1997  met    met           met     met      met       failed
Morgan                   1976  met    failed        met     met      met       failed
Morgan & Lyon            1979  met    failed        met     met      met       failed
Rose                     1984  met    met           met     met      failed    met
Rose & Beattie           1986  met    failed        met     met      failed    met
Tingstrom et al.         1995  met    failed        met     met      met       met
Weinstein & Cooke        1992  met    met           met     met      failed    met

Studies failing (count)        0/17   6/17          0/17    0/17     10/17     8/17
Studies failing (%)            0%     35%           0%      0%       59%       47%
Interrater agreement was assessed on the screening of 13 single case studies (72% of single case studies). Mean interrater agreement on single case screening criteria was 95.73% (range, 92-100%). Overall agreement on whether a study should be screened in or out was 100%.
Determining Methodological Quality
The ten group studies that met screening criteria were reviewed with two protocols: one based on the Gersten quality indicators for group studies and one based on the WWC evidence standards for group designs (WWC, 2008). The one single case study that met screening criteria was also reviewed with two protocols: one based on the Horner quality indicators for single case studies and one based on the WWC evidence standards for single case studies.
Group design studies. Ten group studies were evaluated with the protocol based on the Gersten quality indicators; none of the studies met the criteria to be considered high or acceptable quality (see Table 5). All studies failed to adequately describe their participants (i.e., they missed one or more criteria in this category), and failed to conduct data analyses as Gersten et al. (2005) specified. Most studies (7 of 10) failed one or more of the criteria regarding group equivalence and failed one or more of the intervention fidelity criteria. All but one study met all of the intervention description criteria. All studies met the outcome measure criteria. Interrater reliability was assessed on six of the ten group studies. The mean interrater agreement across categories was 91.98% (range, 83-100%). On the overall rating of each study (i.e., not acceptable, acceptable or high quality), there was 100% agreement.
Table 5
Evaluation of Screened Studies with Gersten et al. (2005) Group Quality Indicators (n = 10)

                                Essential Quality Indicators                                          Desirable
Study               Year  Participants'  Group        Intervention  Intervention  Outcome   Data      Indicators
                          Description    Equivalence  Description   Fidelity      Measures  Analysis  Met
Conte & Humphreys   1989  failed         failed       met           met           met       failed    4
Eldredge            1990  failed         failed       met           failed        met       failed    2
Eldredge et al.     1996  failed         failed       met           failed        met       failed    3
Homan et al.        1993  failed         failed       failed        failed        met       failed    3
Mathes & Fuchs      1993  failed         failed       met           met           met       failed    4
Simmons et al.      1994  failed         met          met           failed        met       failed    4
Simmons et al.      1995  failed         met          met           failed        met       failed    5
Thomas & Clapp      1989  failed         failed       met           failed        met       failed    3
Vaughn et al.       2000  failed         met          met           met           met       failed    4
Young et al.        1996  failed         failed       met           failed        met       failed    3

Note. The first six rating columns are essential quality indicators; the final column is the number of desirable quality indicators met.
Of the 10 group studies evaluated with the protocol based on the WWC evidence standards, 4 met these standards. Three of these studies met evidence standards (Conte & Humphreys, 1989; Homan et al., 1993; Young et al., 1996), and one study met evidence standards with reservations (Eldredge et al., 1996; see Table 6). Six studies did not meet evidence standards; five of these studies failed to meet the criterion of equivalence of groups at baseline, and all six failed on the basis of differential attrition. Most studies failed the attrition criteria because they did not report enough information about the samples at pretest and posttest for coders to determine attrition. Interrater reliability for the WWC evidence standards was assessed on five (50%) of the group studies. The mean interrater agreement across categories was 94.05% (range, 66-100%).
Table 6
Evaluation of Screened Studies with WWC Group Criteria (n = 10)

Study               Year  Study Design/  Lack of   Overall    Differential  Baseline     Rating
                          Randomization  Confound  Attrition  Attrition     Equivalence
Conte & Humphreys   1989  met            met       met        met           failed       meets evidence standards
Eldredge            1990  met            met       met        failed        failed       does not meet evidence standards
Eldredge et al.     1996  met            met       failed     failed        met          meets evidence standards with reservations
Homan et al.        1993  met            met       met        met           failed       meets evidence standards
Mathes & Fuchs      1993  met            met       failed     failed        failed       does not meet evidence standards
Simmons et al.      1994  met (QED)      met       met        failed        failed       does not meet evidence standards
Simmons et al.      1995  met (QED)      met       met        failed        met          does not meet evidence standards
Thomas & Clapp      1989  met            met       met        failed        failed       does not meet evidence standards
Vaughn et al.       2000  met            failed    met        failed        failed       does not meet evidence standards
Young et al.        1996  met            met       met        met           failed       meets evidence standards

Note. QED = quasi-experimental design.
Single case design studies. One single case study (Swain & Allinder, 1996) was evaluated with Horner et al.'s (2005) criteria. The study failed to meet criteria in six of the seven evaluation categories. Interrater reliability was assessed on this study. The mean interrater agreement across categories was 95.12%.
The one single case study (Swain & Allinder, 1996) that passed screening was rated on the WWC single case design criteria. This study met standards for systematic manipulation of the independent variable, experimental control and sufficient number of data points, but failed to meet standards for intervention fidelity (not reported) or measurement of the dependent variable (authors did not report interrater agreement in each phase). Therefore, this study does not meet WWC evidence standards. Interrater reliability was assessed on this study and was 100%.
Summarizing Outcomes from Individual Studies
Three of the four review systems yielded no studies of adequate methodological quality, so no study outcomes could be examined under those systems. The WWC Group Designs review system identified four studies of sufficient methodological quality to warrant evaluation of outcomes (Conte & Humphreys, 1989; Eldredge et al., 1996; Homan et al., 1993; Young et al., 1996). For phonics outcomes, Conte and Humphreys (1989) found a statistically significant effect size (Hedges' g corrected for small sample size; WWC, 2008) of -.57 favoring the control group on a word reading measure. Eldredge et al. (1996) reported statistically significant effects of .42 on a phonics outcome. Homan et al. (1993) and Young et al. (1996) reported small, nonsignificant effects on phonics from repeated reading. On the outcome of fluency, Homan et al. (1993) and Young et al. (1996) found small, nonsignificant effects for repeated reading. Conte and Humphreys (1989) and Eldredge et al. (1996) did not include adequate fluency outcomes. For comprehension outcomes, Conte and Humphreys (1989), Homan et al. (1993), and Young et al. (1996) reported small, nonsignificant effects for repeated reading. Eldredge et al. (1996) reported a statistically significant but small effect size (.18) for comprehension.
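For readers unfamiliar with the effect size metric, the small-sample-corrected standardized mean difference (Hedges' g) used in these comparisons can be computed roughly as follows. This is a generic textbook formula, not necessarily the WWC's exact implementation:

```python
from math import sqrt

def hedges_g(mean_t, mean_c, sd_t, sd_c, n_t, n_c):
    """Standardized mean difference (treatment minus control) divided by the
    pooled standard deviation, with Hedges' small-sample correction."""
    pooled_sd = sqrt(((n_t - 1) * sd_t**2 + (n_c - 1) * sd_c**2)
                     / (n_t + n_c - 2))
    d = (mean_t - mean_c) / pooled_sd
    # Correction factor shrinks the estimate toward zero for small samples.
    correction = 1 - 3 / (4 * (n_t + n_c) - 9)
    return correction * d
```

With equal standard deviations and two groups of 20, for example, an uncorrected difference of 0.5 pooled SD shrinks to roughly 0.49.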
Describing the Strength of Evidence for a Treatment
The two systems for reviewing single case research (Horner et al., 2005; Kratochwill et al., 2010) found no studies of acceptable quality to evaluate repeated reading. Therefore, both systems find no evidence in support of the treatment. The system for reviewing group research based on the Gersten et al. (2005) quality indicators also found that no studies met criteria. As a result, repeated reading would not be considered a promising or empirically supported treatment. The review system based on the WWC standards identified four acceptable studies. This results in a finding that repeated reading has "Mixed Effects" for phonics and comprehension and "No Discernible Effects" for fluency outcomes.
Recall that the WWC screening criterion of "Relevant Timeframe" was not enforced in the screening step of the review process so that we could examine how this criterion would affect subsequent steps and the eventual findings. If this criterion had been implemented with a cutoff date of 1991 (i.e., 20 years before the current review), one study (Conte & Humphreys, 1989) deemed to have adequate quality would have been excluded from the review. However, the exclusion of this study would not have changed the WWC ratings of repeated reading.
The first research question asked whether the findings from systematic reviews designed to identify empirically supported treatments (based on criteria from Gersten et al., 2005; Horner et al., 2005; Kratochwill et al., 2010; WWC, 2008) correspond to findings from previous reviews (i.e., Chard et al., 2002; Chard et al., 2009; Dowhower, 1989; Meyer & Felton, 1999; NICHD, 2000; Therrien, 2004). The finding that repeated reading is generally not empirically supported directly contradicts the conclusions of two previous narrative literature reviews and three meta-analyses which concluded that repeated reading is effective for increasing reading fluency and comprehension. In contrast, our finding based on these systematic reviews concurs with the result of the systematic review by Chard et al. (2009) which concluded that there was not enough evidence to declare repeated reading to be empirically supported for use with students with learning disabilities. These results raise important questions about the validity of using each kind of review for determining which interventions are likely to improve student outcomes.
The second research question asked whether applying the different review systems (Gersten et al., 2005; Horner et al., 2005; Kratochwill et al., 2010; WWC, 2008) results in similar conclusions about the evidence when they are applied to a single set of research studies. In general, the results of the review systems converged on the conclusion that repeated reading is not an empirically supported treatment. The fact that the reviews of repeated reading using these systems resulted in similar conclusions about the practice provides convergent evidence that they may be measuring similar constructs. However, in the evaluation of group studies, the WWC (2008) and Gersten et al. (2005) systems did not agree about which studies were of acceptable quality. Given 10 studies to evaluate, the Gersten et al. (2005) system found no studies to have acceptable quality and the WWC (2008) system found four to have acceptable quality. This result suggests that these two review systems emphasize different aspects of the construct of "empirically supported treatment." For example, Gersten et al.'s (2005) and Horner et al.'s (2005) criteria require that if students are described as having disabilities, the disability must be defined and confirmed in the study. The WWC group and single case systems do not have such strict criteria on participant description. Another criterion that differed between the review tools for group studies was the evaluation of treatment fidelity. The Gersten et al. (2005) quality indicators include a criterion requiring that treatment fidelity be assessed and reported, whereas the WWC (2008) evidence standards do not require a measure of treatment fidelity. These examples show some of the differences that contribute to different ratings for the same studies.
Application of the review systems began with a relatively large pool of 59 studies that had been included in previous reviews; however, 48 of these studies (81%) were rejected in the initial screening step based on the WWC screening criteria. In the subsequent step of evaluating methodological quality, both of the systems for evaluating single case studies rejected all of the remaining studies, and one of the systems for evaluating group studies rejected all the remaining studies. Only one of the four systems found any studies to be of sufficient methodological quality, and that system accepted 4 of the 41 group studies that entered the review (90.3% rejection rate). This result is consistent with topic reports previously issued by WWC (e.g., WWC, 2007a; WWC, 2007c; WWC, 2007e). For example, in the topic report on beginning reading, 887 studies were initially considered and 836 (94%) were rejected either in the screening step or in the evaluation of methodological quality (WWC, 2007a, p. 1). In the elementary school math topic report, 236 studies were located and 227 (96%) were rejected (WWC, 2007c). In the middle school math topic report, 137 of 158 studies (87%) were eliminated based on screening and quality evaluation (WWC, 2007e). These results are characteristic of empirically supported treatment review approaches that use a "threshold" of quality (Detrich, 2008; Drake, Latimer, Leff, McHugo, & Burns, 2004): studies above the threshold are evaluated further, and studies below are not considered to be evidence. In contrast, a "hierarchical" (Detrich, 2008; Drake et al., 2004) or best available evidence (Kazdin, 2004) approach considers studies with a wider range of methodological quality and provides ratings of both methodological quality and outcomes. This type of review is able to consider a larger body of research that may be relevant to the treatment in question, while still weighing the quality of that evidence.
At this point, there is no evidence to suggest which form of review (threshold or hierarchical) most accurately identifies practices that produce positive student outcomes. An advantage of using a threshold approach with high standards is the likely minimization of false positives (i.e., identifying practices as effective when they are not). In addition, the clear and public standards set forth by systematic reviews may have the social consequence of "raising the bar" of generally accepted standards for intervention research. Currently, the use of these standards greatly reduces the number of studies considered to be legitimate evidence; however, these standards may have the long-term effect of increasing the number of high quality studies. A disadvantage of the threshold approach is the increased risk of false negatives (e.g., a treatment deemed "not empirically supported" by the review process, but which would actually be effective if implemented) that is inherent in making the criteria very strict (Detrich, 2008). In this study, the conclusion that repeated reading is not empirically supported could be a false negative. If this is the case, one of the consequences of using the threshold approach would be to discourage use of an intervention that is, in fact, effective.
Hierarchical approaches typically can provide more information about a practice, because they summarize the evidence that is available, even if it is not of high methodological quality. The review of the intervention is qualified by a rating of the quality of evidence supporting it (Drake et al., 2004; Kazdin, 2004). Although most of the meta-analyses and literature reviews on repeated reading did not directly address the quality of studies, the results of these reviews are similar to the results of a hierarchical review approach. For example, these reviews gave recommendations about which components of repeated reading were likely to be more effective for certain outcomes. This information would not be available from the threshold reviews, because the contributing studies were eliminated.
There is a trade-off between reduction of false positive results and false negative results, and both types of errors are costly. The cost of incorrectly recommending ineffective treatments is obvious, and reducing this negative outcome is a main focus of evidence-based practice. But failing to recommend effective treatments is also costly: it leaves practitioners without guidance on how to address important educational objectives. In considering the approaches for identifying ESTs, stakeholders will need to take into account whether the education system can tolerate more false positives or more false negatives in evaluating treatments.
There are ways to reduce false negatives without increasing the risk of false positives. One such strategy is to ensure that no studies are eliminated from consideration for reasons that do not reflect important methodological flaws. For example, the WWC's "Relevant Timeframe" criterion eliminates studies based on their age alone. Under this criterion, even studies judged to have adequate quality would be excluded if they fall outside the 20-year window. Such an arbitrary exclusion criterion might be tolerable when the problem is managing a large number of high quality studies; however, it seems less reasonable when decisions are being made amid a near-complete lack of eligible studies. To explore this issue, we did not exclude studies based on the "Relevant Timeframe" criterion in the screening step (as would have been done by WWC). After evaluating the studies for methodological quality, we applied the "Relevant Timeframe" criterion. Using this method, we found that one high quality study would have been excluded by the "Relevant Timeframe" screening criterion; however, inclusion of this study did not change the overall rating of repeated reading. In another review, this screening criterion could easily affect the rating of an intervention, because it implies that the evidence for a practice must be renewed at least every 20 years. The WWC system allows the principal investigator for a review to determine review-specific screening and rating criteria based on their knowledge of the topic area. The "Relevant Timeframe" criterion may benefit from this more nuanced approach.
This review has several limitations. First, our intent was to compare the review systems applied to a common set of studies. This focus means that we could not consider whether the various review systems would obtain the same corpus of studies to review. In actual use, it is likely that different review systems would differ in how the literature was searched and obtained, resulting in different sets of studies entering the review process. For example, WWC includes unpublished manuscripts and dissertations as possible evidence for interventions, while many other review systems limit their scope to studies published in peer-reviewed journals. Our decision to begin the review process with a single set of studies would tend to reduce differences between review systems and increase convergence. Second, we used the WWC screening criteria to screen studies prior to evaluation by the various quality rating systems. We felt this was reasonable because some means of refining our pool of studies was necessary and the Horner et al. (2005) and Gersten et al. (2005) systems did not include screening criteria. The fact that we used a single, common screening procedure undoubtedly inflated the correspondence among the final ratings for the four sets of criteria and overestimated the convergent evidence among them. Third, coders read each article and then coded it using two sets of criteria in succession (i.e., Gersten et al., 2005, and WWC, 2008, for group studies; Horner et al., 2005, and Kratochwill et al., 2010, for single case studies). Therefore, the experience of applying the first set of standards may have influenced ratings on the second set of standards. Again, this may have increased the convergence among the sets of criteria, compared to using each set alone to rate the studies. The additive effect of these three factors may have substantially increased the apparent convergence among review systems. Future research should attempt to reduce these biases.
As we have emphasized throughout, the purpose of this review was primarily methodological: we were interested in how the various review systems would correspond when applied to a single set of literature. This review should not be read as a current review of the evidence related to repeated reading as an intervention. We included only literature that had been cited in the previous narrative reviews and meta-analyses in order to establish a common basis among these reviews.
The overall results from this review indicated that the systematic review systems reached different conclusions about repeated reading than previous narrative reviews and meta-analyses, even though the literature base was the same. The systematic reviews generally corresponded with one another in finding that virtually the entire literature on repeated reading should be excluded from consideration on methodological grounds. The few remaining acceptable studies did not find statistically significant or sizeable differences between repeated reading and the comparison condition(s), or found mixed results. Therefore, repeated reading would be judged as "not empirically supported" (or, in the case of the WWC evaluation of group studies, to have mixed effects on phonics and comprehension, and indeterminate effects on fluency). These conclusions are distinctly different from the very positive conclusions about repeated reading in previous literature reviews and meta-analyses. This divergence highlights a basic difference in the review processes: threshold reviews (i.e., Gersten et al., 2005; Horner et al., 2005; Kratochwill et al., 2010; WWC, 2008) consider only relatively high quality studies in deciding whether a practice is effective; as a result, they would be expected to produce more false negative judgments of practices. Conversely, because previous meta-analyses and narrative reviews were informed by lower quality studies, their conclusions bear a higher risk of false positive judgments.
The identification of empirically supported treatments is still a relatively new endeavor in education, and the methods for conducting effective reviews to identify ESTs are in their infancy. Based on our experience comparing these review systems, we believe that review methods can and should continue to develop. There remains a great need for review methods that can take advantage of a larger range of studies while still providing recommendations in which practitioners can place a high degree of confidence.
Baker, S. K., Chard, D. J., Ketterlin-Geller, L. R., Apichatabutra, C., & Doabler, C. (2009). Teaching writing to at-risk students: The quality of evidence for self-regulated strategy development. Exceptional Children, 75, 303-318.
Bellini, S., & Akullian, J. (2007). A meta-analysis of video modeling and video self-modeling interventions for children and adolescents with autism spectrum disorders. Exceptional Children, 73, 264-287.
Briggs, D. C. (2008). Synthesizing causal inferences. Educational Researcher, 37(1), 15-22.
Browder, D. M., Wakeman, S. Y., Spooner, F., Ahlgrim-Delzell, L., & Algozzine, B. (2006). Research on reading instruction for individuals with significant cognitive disabilities. Exceptional Children, 72, 392-408.
Carnine, D. (1997). Bridging the research-to-practice gap. Exceptional Children, 63, 513-521.
Chard, D. J., Ketterlin-Geller, L. R., Baker, S. K., Doabler, C., & Apichatabutra, C. (2009). Repeated reading interventions for students with learning disabilities: Status of the evidence. Exceptional Children, 75, 263-281.
Chard, D. J., Vaughn, S., & Tyler, B. (2002). A synthesis of research on effective interventions for building reading fluency with elementary students with learning disabilities. Journal of Learning Disabilities, 35, 386-406.
Cohen, A. L. (1988). An evaluation of the effectiveness of two methods for providing computer-assisted repeated reading training to reading disabled students. Unpublished doctoral dissertation, Florida State University, Tallahassee.
Cohen, A. L., Torgesen, J. K., & Torgesen, J. L. (1988). Improving speed and accuracy of word recognition in reading disabled children: An evaluation of two computer program variations. Learning Disability Quarterly, 11, 333-341.
Confrey, J. (2006). Comparing and contrasting the National Research Council Report On Evaluating Curricular Effectiveness with the What Works Clearinghouse approach. Educational Evaluation and Policy Analysis, 28, 195-213.
Conte, R., & Humphreys, R. (1989). Repeated readings using audiotaped material enhances oral reading in children with reading difficulties. Journal of Communications Disorders, 22, 65-79.
Detrich, R. (2008, September). Evidence-based education: Can we get there from here? Presentation at the Association for Behavior Analysis International Education Conference, Reston, VA.
Detrich, R., Keyworth, R., States, J. (2007). A roadmap to evidence-based education: Building an evidence-based culture. Journal of Evidence-Based Practices for Schools, 8, 26-44.
Deutsch-Smith, D. (1979). The improvement of children's oral reading through the use of teacher modeling. Journal of Learning Disabilities, 12, 172-175.
Dowhower, S. L. (1989). Repeated reading: Research into practice. The Reading Teacher, 42, 502-507.
Drake, R. E., Latimer, E. A., Leff, H. S., McHugo, G. J., & Burns, B. J. (2004). What is evidence? Child and Adolescent Psychiatric Clinics of North America, 13, 717-728.
Eldredge, J. L., Reutzel, D. R., & Hollingsworth, P. M. (1996). Comparing the effectiveness of two oral reading practices: Round-robin reading and the shared book experience. Journal of Literacy Research, 28, 201-225.
Faulkner, H. J., & Levy, B. A. (1999). Fluent and nonfluent forms of transfer in reading: Words and their message. Psychonomic Bulletin and Review, 6, 111-116.
Fuchs, D., & Fuchs, L. S. (1996). Bridging the research-to-practice gap with mainstream assistance teams: A cautionary tale. School Psychology Quarterly, 11(3), 244-266.
Gersten, R., Fuchs, L. S., Compton, D., Coyne, M., Greenwood, C., & Innocenti, M. (2005). Quality indicators for group experimental and quasi-experimental research in special education. Exceptional Children, 71, 149-164.
Gersten, R., Vaughn, S., Deshler, D., & Schiller, E. (1997). What we know about using research findings: Implications for improving special education practice. Journal of Learning Disabilities, 30, 466-476.
Gottfredson, G. D., & Gottfredson, D. C. (2001). What schools do to prevent problem behavior and promote safe environments. Journal of Educational and Psychological Consultation, 12, 313-344.
Green, J. L., & Skukauskaite, A. (2008). Becoming critical readers: Issues in transparency, representation, and warranting of claims. Educational Researcher, 37(1), 30-40.
Greenwood, C. R., & Abbott, M. (2001). The research to practice gap in special education. Teacher Education and Special Education, 24, 276-289.
Hollingsworth, P. M. (1970). An experiment with the impress method of teaching reading. Reading Teacher, 24, 112-114, 187.
Hollingsworth, P. M. (1978). An experimental approach to the impress method of teaching reading. Reading Teacher, 31, 624-626.
Homan, S. P., Klesius, J. P., & Hite, C. (1993). Effects of repeated readings and nonrepetitive strategies on students' fluency and comprehension. Journal of Educational Research, 87, 94-99.
Horner, R. H., Carr, E. G., Halle, J., McGee, G., Odom, S., & Wolery, M. (2005). The use of single-subject research to identify evidence-based practice in special education. Exceptional Children, 71, 165-179.
Individuals with Disabilities Education Improvement Act of 2004, 20 U.S.C. § 1400 et seq.
Jitendra, A. K., Burgess, C., & Gajria, M. (2011). Cognitive strategy instruction for improving expository text comprehension of students with learning disabilities: The quality of evidence. Exceptional Children, 77, 135-159.
Kazdin, A. E. (2004). Evidence-based treatments: challenges and priorities for practice and research. Child and Adolescent Psychiatric Clinics of North America, 13, 923-940.
Kratochwill, T. R., Hitchcock, J., Horner, R. H., Levin, J. R., Odom, S. L., Rindskopf, D. M., & Shadish, W. R. (2010). Single-case designs technical documentation. Retrieved from http://ies.ed.gov/ncee/wwc/pdf/wwc_scd.pdf
Kratochwill, T. R., & Stoiber, K. C. (2002). Evidence-based interventions in school psychology: Conceptual foundations of the Procedural and Coding Manual of Division 16 and the Society for the Study of School Psychology Task Force. School Psychology Quarterly, 17, 341-389.
Lee, J., Grigg, W., & Dion, G. (2007). The Nation's Report Card: Mathematics 2007 (NCES Report 2007-494). Washington, D.C.: U.S. Department of Education, National Center for Education Statistics, Institute of Education Sciences.
Levy, B. A., Abell, B., & Lysynchuk, L. (1997). Transfer from word training to reading in context: Gains in reading fluency and comprehension. Learning Disability Quarterly, 20, 173-188.
Maggin, D. M., & Chafouleas, S. M. (2010). PASS-RQ: Protocol for assessing single-subject research quality. Unpublished research instrument.
Meyer, M. S., & Felton, R. H. (1999). Repeated reading to enhance fluency: Old approaches and new directions. Annals of Dyslexia, 49, 283-306.
Monda, L. E. (1989). The effects of oral, silent, and listening repetitive reading on the fluency and comprehension of learning disabled students. Unpublished doctoral dissertation, Florida State University, Tallahassee.
Morgan, R. T. (1976). "Paired reading" tuition: A preliminary report on a technique for cases of reading deficit. Child: Care, Health and Development, 2, 13-28.
Morgan, R., & Lyon, E. (1979). "Paired reading" -- A preliminary report on a technique for parental tuition of reading-retarded children. Journal of Child Psychology and Psychiatry, 20, 151-160.
National Institute of Child Health and Human Development. (2000). Report of the National Reading Panel. Teaching children to read: An evidence-based assessment of the scientific research literature on reading and its implications for reading instruction: Reports of the subgroups (NIH Publication No. 00-4754). Washington, D.C.: U.S. Government Printing Office.
No Child Left Behind (NCLB) Act of 2001, 20 U.S.C.A. § 6301 et seq. (West 2003)
Odom, S. L., Brantlinger, E., Gersten, R., Horner, R. H., Thompson, B., & Harris, K. R. (2005). Research in special education: Scientific methods and evidence-based practices. Exceptional Children, 71(2), 137-148.
Perie, M., Grigg, W., & Donahue, P. (2005). The nation's report card: Reading 2005. (NCES Report 2006-451). Washington, D.C.: U.S. Department of Education, National Center for Educational Statistics, U.S. Government Printing Office.
Reutzel, D. R., & Hollingsworth, P. M. (1993). Effects of fluency training on second graders' reading comprehension. Journal of Educational Research, 86, 325-331.
Samuels, S. J. (1979). The method of repeated readings. The Reading Teacher, 32, 403-408.
Schoenfeld, A. H. (2006). What doesn't work: The challenges and failure of the What Works Clearinghouse to conduct meaningful reviews of studies of mathematics curricula. Educational Researcher, 35, 13-21.
Shany, M. T., & Biemiller, A. (1995). Assisted reading practice: Effects on performance for poor readers in grades 3 and 4. Reading Research Quarterly, 30, 382-395.
Slavin, R. E. (1989). PET and the pendulum: Faddism in education and how to stop it. Phi Delta Kappan, 70, 752-758.
Slavin, R. E. (2002). Evidence-based education policies: Transforming educational practice and research. Educational Researcher, 31, 15-21.
Slavin, R. E. (2008). What works? Issues in synthesizing educational program evaluations. Educational Researcher, 37(1), 5-14.
Slocum, T. A., Detrich, R., & Spencer, T. D. (2012). Evaluating the validity of systematic reviews to identify empirically supported treatments. Education and Treatment of Children, 35(2), 201-233.
Stenhoff, D. M., & Lignugaris/Kraft, B. (2007). A review of the effects of peer tutoring on students with mild disabilities in secondary settings. Exceptional Children, 74, 8-30.
Stout, T. W. (1997). An investigation of the effects of a repeated reading intervention on the fluency and comprehension of students with language-learning disabilities. Unpublished doctoral dissertation, Georgia State University, Atlanta.
Sutton, P. A. (1991). Strategies to increase oral reading fluency of primary resource students. Unpublished manuscript, Nova University.
Swain, K. D., & Allinder, R. M. (1996). The effects of repeated reading on two types of CBM: Computer maze and oral reading with second-grade students with learning disabilities. Diagnostique, 21, 51-66.
Therrien, W. J. (2004). Fluency and comprehension gains as a result of repeated reading: A meta-analysis. Remedial and Special Education, 25, 252-261.
Vaughn, S., Moody, S. W., & Schumm, J. S. (1998). Broken promises: Reading instruction in the resource room. Exceptional Children, 64, 211-225.
Wendt, O., & Miller, B. (2012). Quality appraisal of single-subject experimental designs: An overview and comparison of different appraisal tools. Education and Treatment of Children, 35(2), 235-268.
What Works Clearinghouse. (2006). What Works Clearinghouse phonological awareness training intervention report. Retrieved January 14, 2008, from http://ies.ed.gov/ncee/wwc/pdf/WWC_Phonological_Awareness_121406.pdf
What Works Clearinghouse. (2007a). What Works Clearinghouse beginning reading topic report. Retrieved from http://ies.ed.gov/ncee/wwc/pdf/BR_TR_08_13_07.pdf
What Works Clearinghouse. (2007b). What Works Clearinghouse beginning reading topic report, technical appendix. Retrieved from http://ies.ed.gov/ncee/wwc/pdf/BR_APP_08_13_07.pdf
What Works Clearinghouse. (2007c). What Works Clearinghouse elementary school math topic report. Retrieved from http://ies.ed.gov/ncee/wwc/pdf/ESM_TR_07_16_07.pdf
What Works Clearinghouse. (2007d). What Works Clearinghouse evidence review protocol for beginning reading interventions. Retrieved from http://ies.ed.gov/ncee/wwc/PDF/BR_protocol.pdf
What Works Clearinghouse. (2007e). What Works Clearinghouse middle school math topic report. Retrieved from http://ies.ed.gov/ncee/wwc/pdf/MSM_TR_07_30_07.pdf
What Works Clearinghouse. (2008). Procedures and standards handbook (Version 2.0). Retrieved from http://ies.ed.gov/ncee/wwc/DocumentSum.aspx?sid=19
Young, A. R., Bowers, P. C., & MacKinnon, G. E. (1996). Effects of prosodic modeling and repeated reading on poor readers' fluency and comprehension. Applied Psycholinguistics, 17, 59-84.
Breda V. O'Keeffe, University of Utah; Timothy A. Slocum, Utah State University; Cheryl Burlingame, University of Connecticut; Katie Snyder, Utah State University; Kaitlin Bundock, University of Utah
Education & Treatment of Children, May 1, 2012.