Comparing results of systematic reviews: parallel reviews of research on repeated reading.

Abstract

Education and related services are relying increasingly on empirically supported treatments (ESTs), which have been shown to improve student outcomes through rigorous research. Many organizations have developed review systems with guidelines for judging the quality of studies and identifying ESTs. However, little explicit attention has been paid to issues of validity of these review systems. In this study, we used the criteria developed by Horner and colleagues (2005), Gersten and colleagues (2005), and the What Works Clearinghouse (WWC, 2008; Kratochwill et al., 2010) to evaluate the research base on repeated reading. The corpus of literature reviewed was derived from previous narrative literature reviews and meta-analyses that concluded that repeated reading was an effective intervention for improving reading fluency. However, the review systems employed in this study resulted in the conclusion that repeated reading did not have enough high quality research support to be considered an EST. The current reviews relied on strict criteria for the quality of each individual study, whereas the previous reviews and meta-analyses included studies with a wider range of quality. These results demonstrate that systematic reviews that strictly appraise the quality of studies and reject those not meeting standards can be substantially more conservative than other scientific review methods. The finding that these different review methods (narrative, meta-analysis, and systematic) can produce diverging recommendations raises issues of validity for practice recommendations.

Widespread concern about the level of academic achievement of American students persists, especially in the areas of reading and mathematics. Year after year, fewer than half of students in the U.S. are deemed "proficient" on national standardized tests in math and reading (Lee, Grigg, & Dion, 2007; Perie, Grigg, & Donahue, 2005). This is a complex problem demanding multifaceted solutions. However, one key aspect of any solution must focus on the specific programs and interventions adopted and implemented in schools. In the past, selection of educational treatments has too frequently been driven by ideology, superficial novelty, and marketing hype rather than empirically demonstrated effectiveness (Slavin, 1989, 2002, 2008). Numerous educators have identified a substantial gap between what is known about effective educational practices and what is actually implemented in schools (Carnine, 1997; Fuchs & Fuchs, 1996; Gersten, Vaughn, Deshler, & Schiller, 1997; Gottfredson & Gottfredson, 2001; Greenwood & Abbott, 2001; Vaughn, Moody, & Schumm, 1998). Major recent federal education legislation such as the No Child Left Behind Act (NCLB, 2003), Reading First (NCLB, 2003), and the Individuals with Disabilities Education Improvement Act (IDEIA, 2004) has attempted to promote, and even mandate, the use of practices that are supported by scientific research. Thus, the movement toward using empirically supported treatments to enhance evidence-based education is at the forefront of education as never before (Detrich, Keyworth, & States, 2007; Slavin, 2008).

The success of evidence-based education is critically dependent on educators' ability to identify treatments that have sufficient research support to suggest that they will be effective if widely disseminated and well implemented--that is, empirically supported treatments (ESTs). Therefore, the process and standards for identifying ESTs are crucial to the entire enterprise of evidence-based education. There is widespread consensus that ESTs should be identified through a particular type of scientific literature review. These systematic reviews use explicit and replicable methods to (a) search for relevant studies, (b) screen studies for relevance and general methodological features, (c) determine the methodological adequacy of each study, (d) summarize the outcomes from individual studies, and (e) describe the nature and strength of evidence related to the treatment. Numerous organizations have developed review systems that specify more detailed procedures, guidelines, standards, and criteria for each of these steps in the review process (e.g., Gersten et al., 2005; Horner et al., 2005; Kratochwill & Stoiber, 2002; www.bestevidence.org; www.campbellcollaboration.org). These review systems have been used to produce many reviews. What Works Clearinghouse (WWC, 2008) has posted reviews of interventions on its website for beginning reading, English language learners, elementary math, middle school math, character education, and dropout prevention. Other organizations such as the Best Evidence Encyclopedia (BEE; www.bestevidence.org) have also produced reviews of interventions relevant to broadly defined educational outcomes. In addition, systematic reviews with these features have been published as stand-alone articles in professional journals (e.g., Bellini & Akullian, 2007; Browder, Wakeman, Spooner, Ahlgrim-Delzell, & Algozzine, 2006; Stenhoff & Lignugaris/Kraft, 2007) and the journal Evidence-Based Practice Briefs is devoted to publishing short summaries of systematic reviews.

However, the process of evaluating the reliability and validity of these review systems in education has just begun (e.g., Briggs, 2008; Confrey, 2006; Green & Skukauskaite, 2008; Schoenfeld, 2006; Slavin, 2008; Slocum, Detrich, & Spencer, 2012 [this issue]; Wendt & Miller, 2012 [this issue]). One important source of validity evidence for these review systems is how well their results correspond with those of other methods for reviewing scientific research. The type of systematic review that we described above is not the only approach to reviewing research literature as a basis for making practical recommendations. The traditional approach to this task is the narrative review in which the reviewer describes and discusses various research studies that in his or her expert opinion are most relevant to the topic. Based on this discussion, the reviewer draws conclusions about the treatments that he or she believes are most effective. A more recent approach is the meta-analysis in which the reviewer systematically searches for all relevant literature, describes features of the studies, and derives effect size statistics that describe the overall effects of treatments. Systematic reviews (as described above) are the most recent approach to reviewing scientific research and identifying recommended treatments.

This paper focuses on the degree to which results from various types of literature reviews correspond with one another -- that is, criterion evidence of validity. Two important questions regarding this type of evidence are: (a) to what degree do results from systematic reviews correspond with results from other types of scientific literature reviews, and (b) to what degree do results from systematic reviews based on one review system correspond with results from other reviews that use different review systems (i.e., different procedures and standards)? Initial reports of correlations among review systems are sobering. For example, Briggs (2008) examined ratings given to elementary-level mathematics programs by two prominent organizations that perform systematic reviews. He found that when WWC and BEE reviewed the same set of programs, their ratings had a correlation of only .57. Although this is only one example, it clearly suggests that careful examination of the validity of review systems is warranted.

In order to assess the degree to which two review systems converge with each other and with traditional narrative and meta-analytic literature reviews, we selected an instructional practice that has been endorsed by numerous previous narrative and meta-analytic reviews. Repeated reading is an intervention designed to increase reading fluency, and in turn, reading comprehension. The most basic form of repeated reading "consists of rereading a short, meaningful passage several times until a satisfactory level of fluency is reached" (Samuels, 1979, p. 404). Dozens of studies have evaluated the efficacy of repeated reading (with numerous variations on the basic procedure) for improving various aspects of reading fluency and comprehension. Six major reviews of research on repeated reading have been published (Chard, Ketterlin-Geller, Baker, Doabler, & Apichatabutra, 2009; Chard, Vaughn & Tyler, 2002; Dowhower, 1989; Meyer & Felton, 1999; National Institute of Child Health and Human Development [NICHD], 2000; Therrien, 2004).

Dowhower (1989) conducted a narrative review of ten studies on repeated reading. In general, she reported whether or not the intervention in each study resulted in more improvement for the treatment group than the control group, and summarized the effects. She did not evaluate the methodological quality of the studies. She concluded, "We have the research evidence to show that repeated reading procedures produce gains in speed and accuracy, result in better phrasing and expression, and enhance recall and understanding for both good and poor readers" (p. 506).

Meyer and Felton (1999) also narratively summarized ten studies on repeated reading and five studies on single word and phrase fluency training, without addressing the methodological quality of studies. The authors reported whether the students in the repeated reading group showed greater improvement than the students in the other groups. The authors concluded:
  In spite of many limitations such as length of intervention, early
  studies provided positive evidence for the efficacy of fluency
  training. Later research helped define variables such as reader skill
  level and characteristics, type of RR [repeated reading] technique,
  number of passages read, and length of practice to be considered.
  (Meyer & Felton, 1999, p. 297)


Although they noted limitations in the literature, they ultimately identified repeated reading as efficacious and recommended three curricula that used repeated reading to improve fluency.

The National Reading Panel (NICHD, 2000) analyzed 50 studies on "repeated and guided repeated oral reading" (p. 3-11). Their meta-analysis included effect sizes comparing treatment and no-treatment control groups. Studies were selected according to the general methods of the National Reading Panel which required: (a) screening for relevance of the topic, (b) careful description of participants, interventions, methods, and outcome measures, and (c) publication in a peer reviewed journal. In addition, the experimental studies of repeated and guided oral reading were included only if they reported outcomes at pre- and posttest for a treatment group and a control group. Based on outcome measures, mean weighted effect sizes favoring the repeated reading groups were .55 for accuracy, .44 for fluency, and .35 for comprehension. The panel concluded, "Guided repeated oral reading and repeated reading provide students with practice that substantially improves word recognition, fluency, and--to a lesser extent--reading comprehension" (NICHD, 2000, p. 3-3). They also concluded that the patterns found in the single case and other studies supported the results of the meta-analysis. However, the broad definition of repeated reading and "guided repeated oral reading" included methods such as neurological impress (Hollingsworth, 1970, 1978) and paired reading (Morgan, 1976; Morgan & Lyon, 1979; Shany & Biemiller, 1995) in which text was not read repeatedly (i.e., more than once).
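
For reference, a weighted mean effect size of this kind is obtained by weighting each study's effect estimate before averaging, typically by the inverse of its sampling variance; the panel's exact weighting scheme is not restated here, so the expression below is the generic form rather than the panel's specific computation:

  \bar{d} = \frac{\sum_{i=1}^{k} w_i d_i}{\sum_{i=1}^{k} w_i}, \qquad w_i = \frac{1}{\widehat{\mathrm{Var}}(d_i)}

where d_i is the effect size contributed by study i and k is the number of studies available for a given outcome.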

Chard, Vaughn and Tyler (2002) reported a meta-analysis of published and unpublished literature on interventions for helping elementary students with learning disabilities build reading fluency. They included 24 studies, 21 of which examined the effects of repeated reading on the reading fluency, accuracy, and/or comprehension of students with learning disabilities. The studies included group experimental, single case, single group, and case study designs, with no explicit characterization of methodological quality. For studies that reported enough information, the authors calculated a standardized mean difference effect size (Cohen's d) for each relevant comparison. For the other studies, the results were summarized in a narrative. On fluency outcomes, they found average effect sizes of d = .68 for repeated reading without a model, and d = .71 for interventions that included multiple features plus repeated reading. The authors concluded, "In general, the findings from this synthesis suggested that repeated reading interventions for students with LD are associated with improvements in reading rate, accuracy, and comprehension" (Chard et al., 2002, p. 402).
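
The standardized mean difference used in this synthesis follows the usual definition of Cohen's d: the difference between the treatment and comparison group means divided by the pooled standard deviation,

  d = \frac{\bar{X}_T - \bar{X}_C}{s_p}, \qquad s_p = \sqrt{\frac{(n_T - 1) s_T^2 + (n_C - 1) s_C^2}{n_T + n_C - 2}}

where \bar{X}, s^2, and n denote each group's mean, variance, and sample size.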

Therrien (2004) conducted a meta-analysis of 18 studies on repeated reading to determine the "essential instructional components of repeated reading and the effect of repeated reading on reading fluency and comprehension" (p. 252). Therrien (2004) included only studies that were experimental, reported quantitative information, and included enough information from which to calculate an effect size. No other methodological quality criteria were reported. The meta-analysis provided a great deal of information on the differential effects of the various components of repeated reading (e.g., repeated reading with a model vs. repeated reading without a model) on reading fluency and comprehension. In addition, Therrien (2004) separated these effects into results for "nontransfer" (assessment using a previously practiced text) and "transfer" (assessment using a previously unread passage) of reading skills. For transfer measures, the mean effect sizes for growth were .50 for fluency and .25 for comprehension. Therrien (2004) concluded that, "results from this analysis ... confirmed previous findings that repeated reading improves students' reading fluency and comprehension" (p. 258).

Based on these narrative and meta-analytic literature reviews, repeated reading has been accepted as a research-supported practice for almost 20 years. The authors of these reviews included a broad range of research to provide as much information as possible on the treatment. However, this effort resulted in certain limitations that may decrease our confidence in the authors' conclusions. For example, these reviews included minimal or no criteria for evaluating the methodological quality of the studies upon which the conclusions are based. This inclusive strategy has the potential advantage of providing information from a large sample of research on the treatment and reduces the problem of excluding potentially valuable studies. However, including lower quality studies may produce erroneous conclusions based on critically flawed research.

More recently, Chard et al. (2009) conducted a systematic review of the literature on repeated reading with students with learning disabilities using an adaptation of Gersten et al.'s (2005) and Horner et al.'s (2005) review systems. They found that much of the literature included in previous narrative and meta-analytic reviews did not meet their methodological standards, and that based only on the methodologically acceptable studies, the treatment would not be considered empirically supported for improving reading fluency of students with learning disabilities. This outcome suggests that results from systematic reviews may not converge with those from narrative reviews and meta-analyses. Importantly though, Chard et al. (2009) focused their review on students with learning disabilities. Thus, one explanation for divergent results could be that relatively few studies have been conducted with children with learning disabilities, but a similar review of the entire body of research on repeated reading might support the practice. In addition, Chard and colleagues (2009) assessed the literature base using the criteria from Gersten et al. (2005) and Horner et al. (2005). Using other criteria may result in different conclusions about the empirical support for repeated reading.

Given the large body of literature on repeated reading, five literature reviews that have concluded that the practice is effective, and one systematic review that concluded that it is not empirically supported, the goal of the current review is to evaluate this research base using review systems based on (a) Gersten et al.'s (2005) quality indicators for group research, (b) Horner et al.'s (2005) quality indicators for single case research, (c) WWC (2008) standards for group research, and (d) WWC standards for single case research (Kratochwill et al., 2010), and to determine whether the results from these evaluations converge with results from previous literature reviews. We addressed the following research questions:

1. Do systematic reviews using Gersten et al.'s (2005), Horner et al.'s (2005), and WWC systems yield conclusions about repeated reading similar to those of previous literature reviews?

2. Are the results of systematic reviews using Gersten et al.'s (2005) and Horner et al.'s (2005) systems similar to those using WWC standards?

Method

Search for Relevant Studies

To evaluate the degree to which results from different review systems converge, we established a common set of primary studies to be submitted to these review processes. In order to ensure that the systematic reviews would be comparable to previous reviews, our common set of studies included the primary studies cited in one or more of three recent literature reviews on repeated reading (i.e., Chard et al., 2002; NICHD, 2000; Therrien, 2004). Seventy-four (74) articles were identified from these previous literature reviews. The National Reading Panel's 14 "immediate effects" (nontransfer) studies (NICHD, 2000) were excluded, since the current review focuses on transfer studies alone. Three studies were excluded because they were unpublished manuscripts or dissertations (Monda, 1989; Stout, 1997; Sutton, 1991). One study was excluded (Reutzel & Hollingsworth, 1993) because it was a duplicate report of the same data set as another study that was included (Eldredge, Reutzel, & Hollingsworth, 1996). One dissertation (Cohen, 1988) was replaced with a published article that appeared to be the same study (Cohen, Torgesen, & Torgesen, 1988). This process resulted in 56 published articles, three of which contained two separate studies (Deutsch-Smith, 1979; Faulkner & Levy, 1999; Levy, Abello, & Lysynchuk, 1997), for a total of 59 studies. A full reference list of articles from the literature reviews is available from the first author.

Screening for Relevant Studies

Articles were screened to eliminate studies that were not relevant to the review. WWC uses an explicit system for screening studies that includes the following categories: 1) Relevant Topic, 2) Relevant Timeframe, 3) Relevant Intervention, 4) Relevant Sample, 5) Relevant Outcome, 6) Adequate Outcome Measure, 7) Adequate Statistics Reported (WWC, 2008; see Table 1). We constructed a screening protocol that included these categories and was based on the model of the WWC Beginning Reading protocol (WWC, 2007d).
Table 1

Screening Criteria for Group and Single Case Designs.

Screening        Group Studies               Single Case
Criterion

Relevant Topic   Study must be about         Study must be about
                 reading and include an      reading and include an
                 intervention                intervention

Relevant         Published within 20 years   Published within 20 years
Timeframe        of review. *                of review. *

Relevant         Participants read           Participants read
Intervention     connected text more than    connected text more than
                 one time, in English,       one time, in English;
                 Include comparison or       Include baseline and
                 control group               treatment phases

Relevant Sample  Students in grades K-12     Students in grades K-12

Relevant         At least one reading        At least one reading
Outcome          outcome with adequate face  outcome with adequate face
                 validity                    validity

Adequate         Name outcome test (if       Measures are relevant to
Measures         standardized); measures     topic; measure reading
                 are relevant to topic;      skill on unpracticed
                 minimum reliability;        passage
                 measure reading skill on
                 unpracticed passage

Adequate         Group design with pre- and  Single case reported as
Reporting        post-measures; inferential  unit of intervention and
                 statistics used; means,     analysis; Outcome variable
                 standard deviations, and    data is reported
                 group sizes reported        repeatedly over each
                                             phase

* Relevant timeframe criterion was coded but older studies were
not excluded so we could examine the effect of this criterion.


The relevant topic criterion included only studies with reading outcomes that evaluated the effects of an intervention rather than student characteristics or an assessment tool. The relevant timeframe for a WWC review includes only studies published within the 20 years prior to the review. To test whether this criterion influenced the outcome of this review, we first screened studies with the other six screening criteria. Studies that would have been screened out only on the "relevant timeframe" criterion were retained in the next steps of the review process (assessing methodological quality, summarizing outcomes for individual studies, and describing the strength of evidence for a treatment). This process allowed us to determine whether the timeframe criterion arbitrarily screened out studies, solely because of publication date, that would otherwise have qualified as having adequate methodological quality. A relevant intervention was defined as one that included at least two oral readings of a given passage by the student (one of these may also have been used as an assessment). Two criteria were added to the relevant intervention definition that were clearly used by WWC to generate the intervention and topic reports (see WWC, 2006, 2007a), but that were not explicitly listed in the general criteria (WWC, 2008) or the beginning reading protocol (WWC, 2007d). WWC excludes studies in which the intervention being evaluated is combined with another intervention (e.g., an intervention that includes a repeated reading component and a self-questioning strategy). In addition, WWC does not use studies in which two versions of the same intervention were compared (e.g., repeated reading with a model vs. repeated reading without a model) as evidence for the intervention. These studies were screened out. Optional features of relevant interventions included modeling by an adult or peer, error correction procedures, cueing, and rewards for engagement or improvement. A relevant sample was defined as one that included students in kindergarten through 12th grade. Relevant outcomes included reading fluency, comprehension, accuracy, prosody, and/or general reading achievement. Adequate outcome measures required a transfer passage--that is, a passage that had not been practiced during the intervention. Adequate statistics reported included means, standard deviations, and numbers of students in each group. Gersten et al.'s (2005) criteria do not include a screening procedure; however, some screening procedure is necessary to identify studies that are relevant to the review. Therefore, we used the screening criteria described above to select studies to be evaluated based on the Gersten et al. (2005) standards.

Neither WWC's single case design technical documentation (Kratochwill et al., 2010) nor Horner et al.'s (2005) criteria include specific screening procedures for single-case designs. We used the following criteria from the group screening process: 1) Relevant Topic; 2) Relevant Timeframe; 3) Relevant Intervention; 4) Relevant Sample; 5) Relevant Outcome; 6) Adequate Outcome Measure (WWC, 2008; see Table 1). In place of the group criterion "Adequate Statistics Reported," we used the WWC single case design documentation (Kratochwill et al., 2010) to develop a seventh criterion, "Single Case Design," to determine whether the study included adequate individual case data. To meet the criterion, the individual case had to be the unit of intervention and the unit of data analysis, be measured repeatedly over time, and include a baseline phase. Studies that failed to meet one or more of the screening criteria (except relevant timeframe) were eliminated from the pool and not reviewed further. As with group studies, the relevant timeframe criterion was to be applied after the quality of studies was determined to explore whether the timeframe criterion screened out studies that would have met the methodological quality criteria.
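
To make the screening step concrete, the sketch below shows how these seven screening categories might be represented as a simple checklist, with the relevant timeframe criterion recorded but not used for exclusion. The field names and structure are hypothetical illustrations of the procedure described above, not the coding forms actually used in this review.

  # Minimal sketch of the screening step (hypothetical field names).
  # "relevant_timeframe" is recorded but never used to exclude a study,
  # mirroring the procedure described above.

  CRITERIA = ["relevant_topic", "relevant_timeframe", "relevant_intervention",
              "relevant_sample", "relevant_outcome", "adequate_measures"]

  def screen_study(ratings, single_case=False):
      """ratings maps each criterion name to True (met) or False (not met)."""
      final = "single_case_design" if single_case else "adequate_statistics"
      failed = [c for c in CRITERIA + [final]
                if not ratings.get(c, False) and c != "relevant_timeframe"]
      return {"screened_in": not failed,
              "failed_criteria": failed,
              "outside_timeframe": not ratings.get("relevant_timeframe", False)}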

Review Protocols

We developed four detailed protocols to guide the remaining steps in the review process: evaluation for methodological quality, summarization of outcomes, and finally, determination of the nature and strength of evidence related to repeated reading.

Gersten Quality Indicators: Group Research

Determining methodological adequacy. The quality indicators for group experimental and quasi-experimental research (Gersten et al., 2005) were used to develop a review protocol for evaluating the methodological quality of the group studies (see Table 2). The protocol listed each of the criteria and summarized the description of each criterion. Several authors have constructed Likert-type scales for each of these quality indicators; however, they have sometimes found it difficult to establish adequate interrater agreement using these scales (Baker, Chard, Ketterlin-Geller, Apichatabutra, & Doabler, 2009; Chard et al., 2009; Jitendra, Burgess, & Gajria, 2011). Due to this difficulty, we divided each category into more specific questions that could be answered as yes, no, or not applicable (adapted from Maggin and Chafouleas' [2010] method). To operationalize the categories, additional specifications had to be made to Gersten et al.'s (2005) general criteria. For example, Gersten et al. (2005) included a category on describing participants adequately. We divided that category into seven more specific criteria, each addressing whether or not a particular participant attribute was described.
Table 2

Standards of Methodological Quality of Group Studies
Based on Gersten et al. (2005) and WWC Standards (2008).

                  Based on Gersten et al.     Based on WWC Standards

Participant       Documentation of            No criteria beyond
Description       disability status; gender,  screening criteria
                  race, English status,
                  economic status,
                  Equivalent characteristics
                  across groups.

Interventionist   Interventionists described  No criteria
Description       & equivalent across
                  groups.

Study Design,     Randomization not           Randomized control
Randomization     required.                   trial (RCT), or
                                              quasi-experimental
                                              design (QED)

Pretest Group     Groups equivalent (d [less  Groups equivalent (d
Equivalence       than or equal to] .25       [less than or equal
                  difference) at pretest on   to] .25 difference) at
                  at least one measure.       pretest only if RCT
                                              with high attrition or
                                              QED

Attrition         Overall attrition [less     Low overall and
                  than or equal to] 30%;      differential attrition
                  Differential attrition      based on bias model
                  [less than or equal to]     (WWC, 2008, p. 14)
                  10% across groups.

Intervention      Clearly described           No confounds with
                  procedures, materials,      intervention (e.g., no
                  time, comparison            teacher/group
                  condition; Details of       confound)
                  comparison condition
                  documented. Audio or video
                  excerpts to describe
                  intervention.

Intervention      Intervention fidelity       No criteria
Fidelity          described & assessed;
                  Fidelity measures include
                  quality features.

Outcome Measures  Multiple measures;          No criteria beyond
                  Measures delivered at       screening criteria
                  appropriate times;
                  Multiple reliability stats
                  as appropriate; Data
                  collectors blind and
                  equally familiar;
                  Maintenance assessment;
                  Evidence of criterion and
                  construct validity.

Data Analysis     Analysis linked to          Criteria used for
                  research questions; Unit    summarizing study
                  of assignment and analysis  outcomes and adjusting
                  aligned; Rationale for      significance and effect
                  analyses; Adjustment for    sizes for
                  pretest differences d >     misalignment.
                  .25; Inferential
                  statistics & effect sizes
                  reported; Clear, coherent
                  report of results.

Note: Standards in bold are considered "essential" and those
in plain text are "desirable."


Summarizing study outcomes. Gersten et al. (2005) defined a "high quality" study as one that meets all but one of the "Essential Quality Indicators" and at least four of the "Desirable Quality Indicators." "Acceptable quality" studies meet all but one of the "Essential Quality Indicators" and at least one of the "Desirable Quality Indicators" (Gersten et al., 2005).

Describing the strength of evidence for a treatment. For an intervention to be considered "evidence-based" according to Gersten et al. (2005), a minimum of four acceptable quality studies or two high quality studies must support the intervention; and the weighted effect size must be significantly greater than zero. For a practice to be considered "promising," the same number and quality of studies are required, but there must be a "20% confidence interval for the weighted effect size that is greater than zero" (Gersten et al., 2005, p. 162). These standards were applied as written.
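
These thresholds can be expressed as a simple decision rule. The sketch below is a hypothetical rendering of the counts described by Gersten et al. (2005), not an official scoring tool; in particular, it treats high quality studies as also counting toward the four acceptable quality studies.

  # Sketch of the Gersten et al. (2005) decision rules (hypothetical helpers).

  def rate_study(essential_met, essential_total, desirable_met):
      """Classify one group study from its quality-indicator counts."""
      if essential_met < essential_total - 1:      # may miss at most one
          return "not acceptable"
      if desirable_met >= 4:
          return "high quality"
      if desirable_met >= 1:
          return "acceptable quality"
      return "not acceptable"

  def treatment_status(study_ratings, weighted_es_sig_gt_zero, ci20_gt_zero):
      """Classify the treatment from the set of study ratings."""
      high = study_ratings.count("high quality")
      acceptable = high + study_ratings.count("acceptable quality")
      enough_studies = acceptable >= 4 or high >= 2
      if enough_studies and weighted_es_sig_gt_zero:
          return "evidence-based"
      if enough_studies and ci20_gt_zero:
          return "promising"
      return "insufficient evidence"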

Horner Quality Indicators: Single Case Research.

Determining methodological adequacy. Horner et al. (2005) identified 21 quality indicators for high quality single case studies. These criteria were used to develop a review protocol for evaluating the single case studies. We adapted the Horner et al. criteria to be assessed through a series of dichotomous (yes/no) questions, as described above for the Gersten et al. (2005) criteria. Additional specifications to the criteria were made, such as requiring that treatment fidelity be measured in at least 20% of sessions, with at least one measure of treatment fidelity per phase, and that treatment fidelity reach at least 90%, if quantified as percentage of steps completed accurately. The specifications were added to increase the objectivity of the criteria. No explicit criteria are noted for determining minimum methodological adequacy; therefore, we required that all 21 quality indicators be met for a study to have minimally acceptable methodological adequacy.

Summarizing study outcomes. Horner et al. (2005) did not present specific criteria for summarizing individual study outcomes. We established the criterion that experimental control had to be demonstrated (based on visual inspection) and had to favor the repeated reading treatment across the majority of cases in the study.

Describing the strength of evidence for a treatment. According to Horner et al. (2005), an intervention can be considered empirically supported when:
  (a) a minimum of five single case studies that meet minimally
  acceptable methodological criteria and document experimental control
  have been published in peer-reviewed journals, (b) the studies are
  conducted by at least three different researchers across at least
  three different geographical locations, and (c) the five or more
  studies include a total of at least 20 participants (p. 176).


WWC Evidence Standards, Group Designs.

Determining methodological adequacy. A review protocol was developed for the WWC evidence standards based on the WWC general criteria (WWC, 2008) and modeled on the beginning reading protocol (WWC, 2007d; see Table 2). The WWC Procedures and Standards Handbook (WWC, 2008) includes specific criteria for rating the methodological quality of studies based on whether random assignment is used (i.e., randomized control trials versus quasi-experimental designs), problems with overall and differential attrition, confounds with the intervention (e.g., one teacher teaching the intervention group and another teaching the comparison group), and pretest equivalence. We followed these standards as written.

Summarizing study outcomes. Based on the ratings for methodological quality, each study was rated as (a) meets evidence standards, (b) meets evidence standards with reservations, or (c) does not meet evidence standards. Randomized control trials without attrition problems were rated as meeting evidence standards (regardless of pretest equivalence). Randomized control trials with attrition problems that showed pretest equivalence and quasi-experimental designs without attrition problems that showed pretest equivalence were rated as meeting evidence standards with reservations. Randomized control trials with confounds, or with both attrition problems and a lack of pretest equivalence, and quasi-experimental designs with attrition problems or a lack of pretest equivalence, were rated as not meeting evidence standards.
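
This rating logic can be summarized as follows. The sketch is a simplification that reduces the full WWC handbook procedures (including the attrition bias model) to boolean inputs; the function and argument names are hypothetical.

  # Simplified sketch of the WWC (2008) group-design rating logic described
  # above; attrition and equivalence judgments are assumed to have already
  # been made and are passed in as booleans.

  def wwc_group_rating(randomized, confound, attrition_problem, pretest_equivalent):
      if confound:
          return "does not meet evidence standards"
      if randomized and not attrition_problem:
          return "meets evidence standards"        # regardless of equivalence
      if pretest_equivalent and (randomized or not attrition_problem):
          # RCT with attrition problems, or QED without attrition problems,
          # that demonstrates pretest equivalence
          return "meets evidence standards with reservations"
      return "does not meet evidence standards"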

Describing the strength of evidence for a treatment. After each study was rated, the WWC Intervention Rating Scheme (WWC, 2008) was used to determine whether or not repeated reading could be considered an empirically supported treatment. WWC gives each intervention one of six ratings based on the number of studies meeting evidence standards, the effect size of the comparisons, and the statistical significance of the comparisons. The ratings are: "positive effects, potentially positive effects, mixed effects, no discernible effects, potentially negative effects, or negative effects" (WWC, 2008, p. 22-23).

WWC Evidence Standards, Single Case Designs.

Determining methodological adequacy. WWC (Kratochwill et al., 2010) has established criteria for evaluating the evidence presented in single case designs. These criteria include two stages for determining methodological quality. First, studies are evaluated for the quality of design (adequate measurement of the dependent variable, systematic manipulation of the independent variable, sufficient number of contrasts between intervention and comparison conditions at different times, and sufficient number of data points per phase). Each study is categorized as (a) meets evidence standards, (b) meets evidence standards with reservations, or (c) does not meet evidence standards. Only studies that meet evidence standards (with or without reservations) are evaluated further. Second, studies are evaluated for whether they demonstrate a causal relation. A causal relationship is established through visual analysis of the data with respect to level, trend, variability, immediacy of effect, proportion of data overlap between phases, and consistency across cases. The patterns of these effects are used to summarize study outcomes.

Summarizing study outcomes. Studies are rated as showing (a) strong evidence if they include at least three demonstrations of an effect with no non-effects, (b) moderate evidence if they have at least three demonstrations of an effect and at least one demonstration of non-effects or (c) no evidence if they fail to show three demonstrations of an effect (Kratochwill et al., 2010).
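
Expressed as a simple rule (a hypothetical sketch of the counts just described):

  # Sketch of the WWC single case outcome rating (Kratochwill et al., 2010),
  # based on the number of demonstrated effects and non-effects in a study.

  def single_case_outcome(effects, non_effects):
      if effects >= 3 and non_effects == 0:
          return "strong evidence"
      if effects >= 3:
          return "moderate evidence"
      return "no evidence"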

Describing the strength of evidence for a treatment. Studies that meet evidence standards (with or without reservations) are then used to determine whether the treatment is supported by adequate evidence. For a treatment to be supported by adequate evidence, Kratochwill et al. (2010, p. 21) suggest that the treatment be supported by five studies that meet evidence standards (with or without reservations), conducted by three different research teams in three different geographic locations, and including at least 20 cases across the studies.
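
Horner et al. (2005) and the WWC single case standards thus share the same threshold structure, which can be expressed as a single check; the sketch below is a hypothetical illustration of those thresholds.

  # Sketch of the "five studies, three teams, three locations, 20 cases"
  # threshold shared by Horner et al. (2005) and Kratochwill et al. (2010).

  def single_case_evidence(qualifying_studies, research_teams, locations, total_cases):
      return (qualifying_studies >= 5 and research_teams >= 3
              and locations >= 3 and total_cases >= 20)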

Results

Screening for Relevant Studies

From the original pool of 59 studies derived from previous reviews, 48 studies (81%) failed to meet the screening criteria (see Figure 1).

Group design studies. Of the 41 group design studies in our original review set, 31 did not meet the screening criteria: none failed on the relevant topic criterion, 68% failed on the relevant intervention criterion, 3% failed the relevant sample criterion, none failed the relevant outcome criterion, 58% failed the criterion of relevant measures, and 45% failed the criterion of statistical reporting (see Table 3). These percentages add to more than 100% because some studies failed on more than one criterion. Studies did not meet criteria either because they did not report enough information or because the information they reported did not meet the criterion. For the relevant intervention and relevant sample criteria, most of the studies failed because the information they reported did not correspond to the criterion. For the relevant measures and statistical reporting criteria, the studies tended to fail because they did not report adequate information. Most studies failed to meet multiple criteria. Ten studies met the screening criteria and were coded further.
Table 3

Summary of Group Design Studies that Did Not Meet
Screening Criteria (n =31).

Study          Year   Topic     Intervention  Sample    Outcome

Bryant et al.  2000   met       Failed        met       met
                      criteria  criteria      criteria  criteria

Cohen et al.   1988   met       Failed        met       met
                      criteria  criteria      criteria  criteria

Dixon-Krauss   1995   met       Failed        met       met
                      criteria  criteria      criteria  criteria

Dowhower       1987   met       Failed        met       met
                      criteria  criteria      criteria  criteria

Faulkner &     1999   met       Failed        met       met
Levy, Study 1         criteria  criteria      criteria  criteria
Study 2               met       Failed        Failed    met
                      criteria  criteria      criteria  criteria

Fuchs et al.   1997   met       Failed        met       met
                      criteria  criteria      criteria  criteria

Herman         1985   met       met           met       met
                      criteria  criteria      criteria  criteria

Hollingsworth  1970   met       Failed        met       met
                      criteria  criteria      criteria  criteria

Hollingsworth  1978   met       Failed        met       met
                      criteria  criteria      criteria  criteria

Labbo & Teale  1990   met       met           met       met
                      criteria  criteria      criteria  criteria

Levy et al.,   1997   met       met           met       met
Study 1               criteria  criteria      criteria  criteria
Study 2               met       met           met       met
                      criteria  criteria      criteria  criteria

Lindsay et     1985   met       Failed        met       met
al.                   criteria  criteria      criteria  criteria

Lorenz &       1979   met       Failed        met       met
Vockell               criteria  criteria      criteria  criteria

Marston et     1995   met       Failed        met       met
al.                   criteria  criteria      criteria  criteria

Mercer et al.  2000   met       Failed        met       met
                      criteria  criteria      criteria  criteria

Miller et al.  1986   met       Failed        met       met
                      criteria  criteria      criteria  criteria

Moseley        1993   met       Failed        met       met
                      criteria  criteria      criteria  criteria

O'Shea et al.  1984   met       Failed        met       met
                      criteria  criteria      criteria  criteria

O'Shea et al.  1985   met       Failed        met       met
                      criteria  criteria      criteria  criteria

O'Shea et al.  1987   met       met           met       met
                      criteria  criteria      criteria  criteria

Rashotte &     1985   met       met           met       met
Torgesen              criteria  criteria      criteria  criteria

Rasinski       1990   met       met           met       met
                      criteria  criteria      criteria  criteria

Rasinski et    1994   met       met           met       met
al.                   criteria  criteria      criteria  criteria

Shany &        1995   met       Failed        met       met
Biemiller             criteria  criteria      criteria  criteria

Sindelar et    1990   met       met criteria  met       met
al.                   criteria                criteria  criteria

Stoddard et    1993   met       Failed        met       met
al.                   criteria  criteria      criteria  criteria

Van Bon et     1991   Met       Failed        met       met
al.                   criteria  criteria      criteria  criteria

Winter         1986   met       Failed        met       met
                      criteria  criteria      criteria  criteria

Winter         1988   met       Failed        met       met
                      criteria  criteria      criteria  criteria

Number of      Count  0/31      21/31         1/31      0/31
Group Studies
Failing Each
Criterion
               %      0%        68%           3%        0%

Study          Year   Measures  Statistical
                                Reporting

Bryant et al.  2000   Failed    met
                      criteria  criteria

Cohen et al.   1988   Failed    Failed
                      criteria  criteria

Dixon-Krauss   1995   Failed    Failed
                      criteria  criteria

Dowhower       1987   met       met
                      criteria  criteria

Faulkner &     1999   Failed    met
Levy, Study 1         criteria  criteria
Study 2               Failed    met
                      criteria  criteria

Fuchs et al.   1997   met       met
                      criteria  criteria

Herman         1985   Failed    met
                      criteria  criteria

Hollingsworth  1970   met       Failed
                      criteria  criteria

Hollingsworth  1978   met       Failed
                      criteria  criteria

Labbo & Teale  1990   met       Failed
                      criteria  criteria

Levy et al.,   1997   Failed    met
Study 1               criteria  criteria

Study 2               Failed    met
                      criteria  criteria

Lindsay et     1985   met       Failed
al.                   criteria  criteria

Lorenz &       1979   met       Failed
Vockell               criteria  criteria

Marston et     1995   met       met
al.                   criteria  criteria

Mercer et al.  2000   met       met
                      criteria  criteria

Miller et al.  1986   Failed    met
                      criteria  criteria

Moseley        1993   Failed    Failed
                      criteria  criteria

O'Shea et al.  1984   Failed    Failed
                      criteria  criteria

O'Shea et al.  1985   Failed    Failed
                      criteria  criteria

O'Shea et al.  1987   Failed    met
                      criteria  criteria

Rashotte &     1985   met       Failed
Torgesen              criteria  criteria

Rasinski       1990   Failed    met
                      criteria  criteria

Rasinski et    1994   Failed    met
al.                   criteria  criteria

Shany &        1995   met       met
Biemiller             criteria  criteria

Sindelar et    1990   Failed    met
al.                   criteria  criteria

Stoddard et    1993   Failed    met
al.                   criteria  criteria

Van Bon et     1991   Failed    Failed
al.                   criteria  criteria

Winter         1986   met       Failed
                      criteria  criteria

Winter         1988   met       Failed
                      criteria  criteria

Number of      Count  18/31     14/31
Group Studies
Failing Each
Criterion
               %      58%       45%


To assess the reliability of coding based on the WWC Screening, three doctoral students were trained to use the review tools. Raters scored three initial studies until they achieved at least 90% agreement with the first author. Agreement between each rater and the first author was calculated for each rating category as the number of agreements divided by the sum of agreements and disagreements. Interrater agreement was assessed on the screening of 28 group studies (68% of group studies). The mean interrater agreement across group study screening categories was 96.43%, with a range of 82-100%. Agreement for the final rating of studies (i.e., whether a study should be screened out or in based on the previously coded categories) was 92.86%.
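
Expressed as a formula, the agreement statistic used in these reliability checks is:

  \text{percent agreement} = \frac{\text{agreements}}{\text{agreements} + \text{disagreements}} \times 100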

Single case design studies. The original set of research included 18 single case studies; 17 of these failed to meet the screening criteria. Of those that failed, none failed the relevant topic criterion, 35% did not meet the relevant intervention criterion, none failed the relevant sample or outcome criteria, 59% did not meet the relevant measures criterion, and 47% did not meet the criteria for being an adequate single case design (see Table 4). These failures occurred primarily because the information reported did not meet the screening criteria, rather than because relevant information was not reported. One study (Swain & Allinder, 1996) met the screening criteria and was coded further.
Table 4

Single Case Design Studies that Did Not Meet
Screening Criteria (n = 17).

Study         Year   Topic     Intervention  Sample    Outcome


Blum et al.   1995   met       met criteria  met       met
                     criteria                criteria  criteria

Carver &      1981   met       met criteria  met       met
Hoffman              criteria                criteria  criteria

Daly &        1994   met       met criteria  met       met
Martens              criteria                criteria  criteria

Deutsch       1979   met       failed        met       met
Smith, Study         criteria  criteria      criteria  criteria
1
Study 2              met       met criteria  met       met
                     criteria                criteria  criteria

Gilbert et    1986   met       met criteria  met       met
al.                  criteria                criteria  criteria

Kamps et      1994   met       met criteria  met       met
al.                  criteria                criteria  criteria

Langford et   1974   met       met criteria  met       met
al.                  criteria                criteria  criteria

Law &         1993   met       failed        met       met
Kratochwill          criteria  criteria      criteria  criteria

Lovitt &      1976   met       met criteria  met       met
Hansen               criteria                criteria  criteria

Mefferd &     1997   met       met criteria  met       met
Pettegrew            criteria                criteria  criteria

Morgan        1976   met       failed        met       met
                     criteria  criteria      criteria  criteria

Morgan &      1979   met       failed        met       met
Lyon                 criteria  criteria      criteria  criteria

Rose          1984   met       met criteria  met       met
                     criteria                criteria  criteria

Rose &        1986   met       failed        met       met
Beattie              criteria  criteria      criteria  criteria

Tingstrom et  1995   met       failed        met       met
al.                  criteria  criteria      criteria  criteria

Weinstein &   1992   met       met criteria  met       met
Cooke                criteria                criteria  criteria

Number of     Count  0/17      6/17          0/17      0/17
Single
Subject
Studies
Failing Each
Criterion
              %      0%        35%           0%        0%

Study         Year   Measures  Single
                               Case
                               Design

Blum et al.   1995   failed    failed
                     criteria  criteria

Carver &      1981   met       failed
Hoffman              criteria  criteria

Daly &        1994   failed    met
Martens              criteria  criteria

Deutsch       1979   failed    failed
Smith, Study         criteria  criteria
1
Study 2              failed    met
                     criteria  criteria

Gilbert et    1986   failed    met
al.                  criteria  criteria

Kamps et      1994   failed    met
al.                  criteria  criteria

Langford et   1974   met       failed
al.                  criteria  criteria

Law &         1993   met       failed
Kratochwill          criteria  criteria

Lovitt &      1976   failed    met
Hansen               criteria  criteria

Mefferd &     1997   met       failed
Pettegrew            criteria  criteria

Morgan        1976   met       failed
                     criteria  criteria

Morgan &      1979   met       failed
Lyon                 criteria  criteria

Rose          1984   failed    met
                     criteria  criteria

Rose &        1986   failed    met
Beattie              criteria  criteria

Tingstrom et  1995   met       met
al.                  criteria  criteria

Weinstein &   1992   failed    met
Cooke                criteria  criteria

Number of     Count  10/17     8/17
Single
Subject
Studies
Failing Each
Criterion
              %      59%       47%


Interrater agreement was assessed on the screening of 13 single case studies (72% of single case studies). Mean interrater agreement on single case screening criteria was 95.73% (range, 92-100%). Overall agreement on whether a study should be screened in or out was 100%.

Determining Methodological Quality

The ten group studies that met screening criteria were reviewed with two protocols: one based on the Gersten Quality Indicators for group studies and one based on the WWC Evidence Standards for group designs (WWC, 2008). The one single case study that met screening criteria was also reviewed with two protocols: one based on the Horner Quality Indicators for single case studies and one based on the WWC Evidence Standards for single case studies.

Group design studies. Ten group studies were evaluated with the protocol based on the Gersten quality indicators; none of the studies met the criteria to be considered high or acceptable quality (see Table 5). All studies failed to adequately describe their participants (i.e., they missed one or more criteria in this category), and failed to conduct data analyses as Gersten et al. (2005) specified. Most studies (7 of 10) failed one or more of the criteria regarding group equivalence and failed one or more of the intervention fidelity criteria. All but one study met all of the intervention description criteria. All studies met the outcome measure criteria. Interrater reliability was assessed on six of the ten group studies. The mean interrater agreement across categories was 91.98% (range, 83-100%). On the overall rating of each study (i.e., not acceptable, acceptable or high quality), there was 100% agreement.
Table 5

Evaluation of Screened Studies with Gersten et al.
(2005) Group Quality Indicators Criteria (n = 10).

                                  Essential Quality Indicators

                  Participants'  Group        Intervention  Intervention
Study       Year  Description    Equivalence  Description   Fidelity

Conte &     1989  failed         failed       met criteria  met criteria
Humphreys         criteria       criteria

Eldredge    1990  failed         failed       met criteria  failed
                  criteria       criteria                   criteria

Eldredge    1996  failed         failed       met criteria  failed
et al.            criteria       criteria                   criteria

Homan et    1993  failed         failed       failed        failed
al.               criteria       criteria     criteria      criteria

Mathes &    1993  failed         failed       met criteria  met criteria
Fuchs             criteria       criteria

Simmons et  1994  failed         met          met criteria  failed
al.               criteria       criteria                   criteria

Simmons et  1995  failed         met          met criteria  failed
al.               criteria       criteria                   criteria

Thomas &    1989  failed         failed       met criteria  failed
Clapp             criteria       criteria                   criteria

Vaughn et   2000  failed         met          met criteria  met criteria
al.               criteria       criteria

Young et    1996  failed         failed       met criteria  failed
al.               criteria       criteria                   criteria


                                      # of
                                      Desirable
                                      Quality
                  Outcome   Data      Indicators
Study       Year  Measures  Analysis  Met

Conte &     1989  met       failed             4
Humphreys         criteria  criteria

Eldredge    1990  met       failed             2
                  criteria  criteria

Eldredge    1996  met       failed             3
et al.            criteria  criteria

Homan et    1993  met       failed             3
al.               criteria  criteria

Mathes &    1993  met       failed             4
Fuchs             criteria  criteria

Simmons et  1994  met       failed             4
al.               criteria  criteria

Simmons et  1995  met       failed             5
al.               criteria  criteria

Thomas &    1989  met       failed             3
Clapp             criteria  criteria

Vaughn et   2000  met       failed             4
al.               criteria  criteria

Young et    1996  met       failed             3
al.               criteria  criteria


Of the 10 group studies evaluated with the protocol based on the WWC evidence standards, 4 met these standards. Three of these studies met evidence standards (Conte & Humphreys, 1989; Homan et al., 1993; Young et al., 1996), and one study met evidence standards with reservations (Eldredge et al., 1996; see Table 6). Six studies did not meet evidence standards; five of these studies failed to meet the criterion of equivalence of groups at baseline, and all six failed on the basis of differential attrition. Most studies did not meet the attrition criteria because they did not report enough information about the samples at pretest and posttest for coders to determine attrition. Interrater reliability for the WWC evidence standards was assessed on five (50%) of the group studies. The mean interrater agreement across categories was 94.05% (range, 66-100%).
Table 6

Evaluation of Screened Studies with WWC Group Criteria (n = 10).

                                             Criteria

                    Study Design,  Lack of   Overall    Differential
Study         Year  Randomization  Confound  Attrition  Attrition

Conte &       1989  met criteria   met       met        met criteria
Humphreys *                        criteria  criteria

Eldredge      1990  met criteria   met       met        failed
                                   criteria  criteria   criteria

Eldredge      1996  met criteria   met       failed     failed
et al. **                          criteria  criteria   criteria

Homan et      1993  met criteria   met       met        met criteria
al. *                              criteria  criteria

Mathes &      1993  met criteria   met       failed     failed
Fuchs                              criteria  criteria   criteria

Simmons et    1994  met criteria   met       met        failed
al.                 QED            criteria  criteria   criteria

Simmons et    1995  met criteria   met       met        failed
al.                 QED            criteria  criteria   criteria

Thomas &      1989  met criteria   met       met        failed
Clapp                              criteria  criteria   criteria

Vaughn et     2000  met criteria   failed    met        failed
al.                                criteria  criteria   criteria

met evidence  1996  met criteria   met       met        met criteria
standards                          criteria  criteria
Young et
al.


                    Baseline
                    Equivalence

Study         Year

met evidence  1989  failed
standards           criteria
Conte &
Humphreys

Eldredge      1990  failed
                    criteria

met evidence  1996  met
standards           criteria
with
reservation
Eldredge et
al.

met evidence  1993  Failed
standards           criteria
Homan et
al.

Mathes &      1993  Failed
Fuchs               criteria

Simmons et    1994  failed
al.                 criteria

Simmons et    1995  met
al.                 criteria

Thomas &      1989  Failed
Clapp               criteria

Vaughn et     2000  Failed
al.                 criteria

met evidence  1996  failed
standards           criteria
Young et
al.

Note. # = met criteria, # = failed criteria,
* met evidence standards, ** met evidence
standards with reservation


Single case design studies. One single case study (Swain & Allinder, 1996) was evaluated with Horner et al.'s (2005) criteria. The study failed to meet criteria in six of the seven evaluation categories. Interrater reliability was assessed on this study; the mean interrater agreement across categories was 95.12%.

The one single case study (Swain & Allinder, 1996) that passed screening was also rated with the WWC single case design criteria. This study met standards for systematic manipulation of the independent variable, experimental control, and sufficient number of data points, but failed to meet standards for intervention fidelity (not reported) and measurement of the dependent variable (the authors did not report interrater agreement in each phase). Therefore, this study does not meet WWC evidence standards. Interrater reliability was assessed on this study and was 100%.
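
The interrater reliability figures reported here and for the group studies above appear to be simple percent agreement statistics across coding categories. As a point of reference, the following is a minimal sketch of how such a statistic could be computed; the coder ratings shown are hypothetical and are not drawn from the actual coding data.

```python
# Minimal sketch (hypothetical data): percent agreement between two coders
# across evaluation categories, assuming simple percent agreement is the
# interrater reliability statistic reported here.

def percent_agreement(coder_a, coder_b):
    """Return the percentage of categories on which the two coders agree."""
    assert len(coder_a) == len(coder_b), "Coders must rate the same categories"
    agreements = sum(1 for a, b in zip(coder_a, coder_b) if a == b)
    return 100.0 * agreements / len(coder_a)

# Hypothetical ratings for one study across five evaluation categories.
coder_a = ["met", "met", "failed", "met", "failed"]
coder_b = ["met", "failed", "failed", "met", "failed"]
print(percent_agreement(coder_a, coder_b))  # 80.0
```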

Summarizing Outcomes from Individual Studies

Three of the four review systems yielded no studies of adequate methodological quality, so no outcomes were examined under those systems. The WWC group design review system identified four studies of sufficient methodological quality to warrant evaluation of outcomes (Conte & Humphreys, 1989; Eldredge et al., 1996; Homan et al., 1993; Young et al., 1996). For phonics outcomes, Conte and Humphreys (1989) found a statistically significant effect size (Hedges' g corrected for small sample size; WWC, 2008) of -.57 favoring the control group on a word reading measure. Eldredge et al. (1996) reported a statistically significant effect of .42 on a phonics outcome. Homan et al. (1993) and Young et al. (1996) reported small, nonsignificant effects of repeated reading on phonics. On the outcome of fluency, Homan et al. (1993) and Young et al. (1996) found small, nonsignificant effects for repeated reading; Conte and Humphreys (1989) and Eldredge et al. (1996) did not include adequate fluency outcomes. For comprehension outcomes, Conte and Humphreys (1989), Homan et al. (1993), and Young et al. (1996) reported small, nonsignificant effects for repeated reading, and Eldredge et al. (1996) reported a statistically significant but small effect size (.18).
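
For readers unfamiliar with the effect size metric cited above, the following is a minimal sketch of the standard small-sample-corrected Hedges' g computation; the group means, standard deviations, and sample sizes are hypothetical and are not taken from any of the reviewed studies.

```python
# Minimal sketch (hypothetical numbers): Hedges' g, a standardized mean
# difference with a small-sample correction, as the effect size metric cited above.
import math

def hedges_g(m_t, m_c, sd_t, sd_c, n_t, n_c):
    """Small-sample-corrected standardized mean difference."""
    # Pooled standard deviation of the treatment and control groups.
    sd_pooled = math.sqrt(((n_t - 1) * sd_t ** 2 + (n_c - 1) * sd_c ** 2)
                          / (n_t + n_c - 2))
    d = (m_t - m_c) / sd_pooled
    # Approximate small-sample correction factor.
    correction = 1 - 3 / (4 * (n_t + n_c) - 9)
    return d * correction

# Hypothetical posttest means, SDs, and sample sizes for two groups of 15.
print(round(hedges_g(52.0, 48.0, 10.0, 9.0, 15, 15), 2))  # 0.41
```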

Describing the Strength of Evidence for a Treatment

The two systems for reviewing single case research (Horner et al., 2005; Kratochwill et al., 2010) found no studies of acceptable quality with which to evaluate repeated reading; therefore, both found no evidence in support of the treatment. The system for reviewing group research based on the Gersten et al. (2005) quality indicators also found that no studies met criteria. As a result, repeated reading would not be considered a promising or empirically supported treatment under these systems. The review system based on the WWC standards identified four acceptable studies, resulting in a finding that repeated reading has "Mixed Effects" for phonics and comprehension and "No Discernible Effects" for fluency outcomes.

Recall that the WWC screening criterion of "Relevant Timeframe" was not enforced in the screening step of the review process so that we could examine how this criterion would affect subsequent steps and the eventual findings. If this criterion had been implemented with a cutoff date of 1991 (i.e., 20 years before the current review), one study (Conte & Humphreys, 1989) deemed to have adequate quality would have been excluded from the review. However, the exclusion of this study would not have changed the WWC ratings of repeated reading.

Discussion

The first research question asked whether the findings from systematic reviews designed to identify empirically supported treatments (based on criteria from Gersten et al., 2005; Horner et al., 2005; Kratochwill et al., 2010; WWC, 2008) corresponded to findings from previous reviews (i.e., Chard et al., 2002; Chard et al., 2009; Dowhower, 1989; Meyer & Felton, 1999; NICHD, 2000; Therrien, 2004). The finding that repeated reading is generally not empirically supported directly contradicts the conclusions of two previous narrative literature reviews and three meta-analyses, which concluded that repeated reading is effective for increasing reading fluency and comprehension. In contrast, our finding based on these systematic reviews concurs with the result of the systematic review by Chard et al. (2009), which concluded that there was not enough evidence to declare repeated reading empirically supported for use with students with learning disabilities. These results raise important questions about the validity of using each kind of review for determining which interventions are likely to improve student outcomes.

The second research question asked whether applying the different review systems (Gersten et al., 2005; Horner et al., 2005; Kratochwill et al., 2010; WWC, 2008) to a single set of research studies results in similar conclusions about the evidence. In general, the results of the review systems converged on the conclusion that repeated reading is not an empirically supported treatment. The fact that the reviews of repeated reading using these systems reached similar conclusions about the practice provides convergent evidence that they may be measuring similar constructs. However, in the evaluation of group studies, the WWC (2008) and Gersten et al. (2005) systems did not agree about which studies were of acceptable quality. Given the same 10 studies to evaluate, the Gersten et al. (2005) system found no studies to have acceptable quality, whereas the WWC (2008) system found four. This result suggests that the two review systems emphasize different aspects of the construct of empirically supported treatment. For example, Gersten et al.'s (2005) and Horner et al.'s (2005) criteria require that if students are described as having disabilities, the disability must be defined and confirmed in the study; the WWC group and single case systems do not have such strict criteria for participant description. Another criterion that differed between the review tools for group studies was the evaluation of treatment fidelity: the Gersten et al. (2005) quality indicators require that treatment fidelity be assessed and reported, whereas the WWC (2008) evidence standards do not require a measure of treatment fidelity. These examples illustrate some of the differences that contribute to divergent ratings of the same studies.

Application of the review systems began with a relatively large pool of 59 studies that had been included in previous reviews; however, 48 of these studies (81%) were rejected in the initial screening step based on the WWC screening criteria. In the subsequent step of evaluating methodological quality, both systems for evaluating single subject studies rejected all of the remaining studies, and one of the systems for evaluating group studies rejected all of the remaining studies. Only one of the four systems found any studies to be of sufficient methodological quality, and that system accepted 4 of the 41 group studies that entered the review (a 90.3% rejection rate). This result is consistent with topic reports previously issued by WWC (e.g., WWC, 2007a; WWC, 2007c; WWC, 2007e). For example, in the topic report on beginning reading, 887 studies were initially considered and 836 (94%) were rejected either in the screening step or in the evaluation of methodological quality (WWC, 2007a, p. 1). In the elementary school math topic report, 236 studies were located and 227 (96%) were rejected (WWC, 2007c). In the middle school math topic report, 137 of 158 studies (87%) were eliminated based on screening and quality evaluation (WWC, 2007e). These results are characteristic of empirically supported treatment review approaches that use a "threshold" of quality (Detrich, 2008; Drake, Latimer, Leff, McHugo, & Burns, 2004)--studies above the threshold are evaluated further and studies below it are not considered to be evidence. In contrast, a "hierarchical" (Detrich, 2008; Drake et al., 2004) or best available evidence (Kazdin, 2004) approach considers studies with a wider range of methodological quality and provides ratings of both methodological quality and outcomes. This type of review is able to consider a larger body of research that may be relevant to the treatment in question while still accounting for the quality of that evidence.

At this point, there is no evidence to suggest which form of review (threshold or hierarchical) most accurately identifies practices that produce positive student outcomes. An advantage of using a threshold approach with high standards is the likely minimization of false positives (i.e., identifying practices as effective when they are not). In addition, the clear and public standards set forth by systematic reviews may have the social consequence of "raising the bar" of generally accepted standards for intervention research. Currently, the use of these standards greatly reduces the number of studies considered to be legitimate evidence; however, these standards may have the long-term effect of increasing the number of high quality studies. A disadvantage of the threshold approach is the increased risk of false negatives (i.e., treatments deemed "not empirically supported" by the review process that would actually be effective if implemented) that is inherent in making the criteria very strict (Detrich, 2008). In this study, the conclusion that repeated reading is not empirically supported could be a false negative. If so, one consequence of using the threshold approach would be to discourage use of an intervention that is, in fact, effective.

Hierarchical approaches can typically provide more information about a practice because they summarize the evidence that is available, even if it is not of high methodological quality; the review of the intervention is then qualified by a rating of the quality of evidence supporting it (Drake et al., 2004; Kazdin, 2004). Although most of the meta-analyses and literature reviews on repeated reading did not directly address the quality of studies, the results of these reviews are similar to the results of a hierarchical review approach. For example, these reviews offered recommendations about which components of repeated reading were likely to be more effective for certain outcomes. This information would not be available from the threshold reviews, because the contributing studies were eliminated.

There is a trade-off between reducing false positive results and false negative results, and both types of errors are costly. The cost of incorrectly recommending ineffective treatments is obvious--and reducing this error is a main focus of evidence-based practice. But failing to recommend effective treatments is also costly--it leaves practitioners without guidance on how to address important educational objectives. In considering approaches for identifying ESTs, stakeholders will need to take into account whether the education system can better tolerate false positives or false negatives in evaluating treatments.

There are ways to reduce false negatives without increasing the risk of false positives. One strategy is to ensure that no studies are eliminated from consideration for reasons that do not reflect important methodological flaws. For example, the WWC's "Relevant Timeframe" criterion eliminates studies based on their age alone: even studies judged to have adequate methodological quality are excluded if they fall outside the 20-year window. Such an arbitrary exclusion criterion might be tolerable when the problem is managing a large number of high quality studies; however, it seems less reasonable when decisions must be made with few or no eligible studies. To explore this issue, we did not exclude studies based on the "Relevant Timeframe" criterion in the screening step (as would have been done by WWC); instead, we applied the criterion after evaluating the studies for methodological quality. Using this method, we found that one high quality study would have been excluded by the "Relevant Timeframe" screening criterion; however, excluding this study would not have changed the overall rating of repeated reading. In another review, this screening criterion could easily affect the rating of an intervention. In effect, the criterion implies that the evidence for a practice must be renewed at least every 20 years. The WWC system allows the Principal Investigator for a review to determine review-specific screening and rating criteria based on his or her knowledge of the topic area; the "Relevant Timeframe" criterion may benefit from this more nuanced approach.
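
To make the mechanics of such a screening rule concrete, here is a minimal sketch of applying a 20-year timeframe window to a pool of candidate studies; the study entries, field names, and review year are illustrative assumptions, not part of the WWC tooling.

```python
# Minimal sketch (hypothetical data): screening candidate studies with a
# 20-year "Relevant Timeframe" window before quality evaluation.

REVIEW_YEAR = 2011  # year the review is conducted (illustrative)
WINDOW_YEARS = 20   # width of the relevant timeframe window

studies = [
    {"id": "Study A", "year": 1989},
    {"id": "Study B", "year": 1996},
    {"id": "Study C", "year": 2005},
]

eligible = [s for s in studies if REVIEW_YEAR - s["year"] <= WINDOW_YEARS]
excluded = [s for s in studies if REVIEW_YEAR - s["year"] > WINDOW_YEARS]

print([s["id"] for s in eligible])  # ['Study B', 'Study C']
print([s["id"] for s in excluded])  # ['Study A']
```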

This review has several limitations. First, our intent was to compare the review systems as applied to a common set of studies. This focus meant that we could not consider whether the various review systems would obtain the same corpus of studies to review. In actual use, it is likely that review systems would differ in how the literature was searched and obtained, resulting in different sets of studies entering the review process. For example, WWC includes unpublished manuscripts and dissertations as possible evidence for interventions, whereas many other review systems limit their scope to studies published in peer reviewed journals. Our decision to begin the review process with a single set of studies would tend to reduce differences between review systems and increase convergence. Second, we used the WWC screening criteria to screen studies prior to evaluation by the various quality rating systems. We felt this was reasonable because some means of refining our pool of studies was necessary and the Horner et al. (2005) and Gersten et al. (2005) systems did not include screening criteria. Nevertheless, the fact that we used a single, common screening procedure undoubtedly inflated the correspondence among the final ratings for the four sets of criteria and overestimated the convergent evidence among them. Third, coders read each article and then coded it using two sets of criteria in succession (i.e., Gersten et al., 2005, and WWC, 2008, for group studies; Horner et al., 2005, and Kratochwill et al., 2010, for single case studies). Therefore, the experience of applying the first set of standards may have influenced ratings on the second set. Again, this may have increased the convergence among the sets of criteria compared to using each set alone. The additive effect of these three factors may have substantially inflated the apparent convergence among review systems. Future research should attempt to reduce these biases.

As we have emphasized throughout, the purpose of this review was primarily methodological--we were interested in how the various review systems would correspond when applied to a single set of literature. This review should not be read as a current review of the evidence related to repeated reading as an intervention. We included only literature that had been cited in the previous narrative reviews and meta-analyses in order to establish a common basis among these reviews.

The overall results from this review indicated that the systematic review systems reached different conclusions about repeated reading than previous narrative reviews and meta-analyses, even though the literature base was the same. The systematic reviews generally corresponded with one another in finding that virtually the entire literature on repeated reading should be excluded from consideration on methodological grounds. The few remaining acceptable studies did not find statistically significant or sizeable differences between repeated reading and the comparison condition(s), or found mixed results. Therefore, repeated reading would be judged "not empirically supported" (or, in the case of the WWC evaluation of group studies, to have mixed effects on phonics and comprehension and no discernible effects on fluency). These conclusions are distinctly different from the very positive conclusions about repeated reading drawn by previous literature reviews and meta-analyses. The differences in these results highlight a difference in the review processes: threshold reviews (i.e., Gersten et al., 2005; Horner et al., 2005; Kratochwill et al., 2010; WWC, 2008) consider only relatively high quality studies in their decisions about the effectiveness of a practice; as a result, they would be expected to produce more false negative judgments of practices. Conversely, previous meta-analyses and narrative reviews were informed by lower quality studies, so their conclusions bear a higher risk of false positive judgments about practices.

The identification of empirically supported treatments is still a relatively new innovation in education, and the methods for conducting effective reviews to identify ESTs are in their infancy. Based on our experience in comparing these review systems, we believe that review methods can and should continue to develop. There is still a great need to build review methods that can take advantage of a larger range of studies yet provide recommendations in which practitioners can place a high degree of confidence.

References

Baker, S. K., Chard, D. J., Ketterlin-Geller, L. R., Apichatabutra, C., & Doabler, C. (2009). Teaching writing to at-risk students: The quality of evidence for self-regulated strategy development. Exceptional Children, 75, 303-318.

Bellini, S., & Akullian, J. (2007). A meta-analysis of video modeling and video self-modeling interventions for children and adolescents with autism spectrum disorders. Exceptional Children, 73, 264-287.

Briggs, D. C. (2008). Synthesizing causal inferences. Educational Researcher, 37(1), 15-22.

Browder, D. M., Wakeman, S. Y., Spooner, F., Ahlgrim-Delzell, L., & Algozzine, B. (2006). Research on reading instruction for individuals with significant cognitive disabilities. Exceptional Children, 72, 392-408.

Carnine, D. (1997). Bridging the research-to-practice gap. Exceptional Children, 63, 513-521.

Chard, D. J., Ketterlin-Geller, L. R., Baker, S. K., Doabler, C., & Apichatabutra, C. (2009). Repeated reading interventions for students with learning disabilities: Status of the evidence. Exceptional Children, 75, 263-281.

Chard, D. J., Vaughn, S., & Tyler, B. (2002). A synthesis of research on effective interventions for building reading fluency with elementary students with learning disabilities. Journal of Learning Disabilities, 35, 386-406.

Cohen, A. L. (1988). An evaluation of the effectiveness of two methods for providing computer-assisted repeated reading training to reading disabled students. Unpublished doctoral dissertation, Florida State University, Tallahassee.

Cohen, A. L., Torgesen, J. K., & Torgesen, J. L. (1988). Improving speed and accuracy of word recognition in reading disabled children: An evaluation of two computer program variations. Learning Disability Quarterly, 11, 333-341.

Confrey, J. (2006). Comparing and contrasting the National Research Council Report On Evaluating Curricular Effectiveness with the What Works Clearinghouse approach. Educational Evaluation and Policy Analysis, 28, 195-213.

Conte, R., & Humphreys, R. (1989). Repeated readings using audiotaped material enhances oral reading in children with reading difficulties. Journal of Communications Disorders, 22, 65-79.

Detrich, R. (2008, September). Evidence-based education: Can we get there from here? Presentation at the Association for Behavior Analysis International Education Conference, Reston, VA.

Detrich, R., Keyworth, R., States, J. (2007). A roadmap to evidence-based education: Building an evidence-based culture. Journal of Evidence-Based Practices for Schools, 8, 26-44.

Deutsch-Smith, D. (1979). The improvement of children's oral reading through the use of teacher modeling. Journal of Learning Disabilities, 12, 172-175.

Drake, R. E., Latimer, E. A., Leff, H. S., McHugo, G. J., & Burns, B. J. (2004). What is evidence? Child and Adolescent Psychiatric Clinics of North America, 13, 717-728.

Dowhower, S. L. (1989). Repeated reading: Research into practice. The Reading Teacher, 42, 502-507.

Eldredge, J. L., Reutzel, D. R., & Hollingsworth, P. M. (1996). Comparing the effectiveness of two oral reading practices: Round-robin reading and the shared book experience. Journal of Literacy Research, 28, 201-225.

Faulkner, H. J., & Levy, B. A. (1999). Fluent and nonfluent forms of transfer in reading: Words and their message. Psychonomic Bulletin and Review, 6, 111-116.

Fuchs, D., & Fuchs, L. S. (1996). Bridging the research-to-practice gap with mainstream assistance teams: A cautionary tale. School Psychology Quarterly, 11(3), 244-266.

Gersten, R., Fuchs, L. S., Compton, D., Coyne, M., Greenwood, C., & Innocenti, M. (2005). Quality indicators for group experimental and quasi-experimental research in special education. Exceptional Children, 71, 149-164.

Gersten, R., Vaughn, S., Deshler, D., & Schiller, E. (1997). What we know about using research findings: Implications for improving special education practice. Journal of Learning Disabilities, 30, 466-476.

Gottfredson, G. D., & Gottfredson, D. C. (2001). What schools do to prevent problem behavior and promote safe environments. Journal of Educational and Psychological Consultation, 12, 313-344.

Green, J. L., & Skukauskaite, A. (2008). Becoming critical readers: Issues in transparency, representation, and warranting of claims. Educational Researcher, 37(1), 30-40.

Greenwood, C. R., & Abbott, M. (2001). The research to practice gap in special education. Teacher Education and Special Education, 24, 276-289.

Hollingsworth, P. M. (1970). An experiment with the impress method of teaching reading. Reading Teacher, 24, 112-114, 187.

Hollingsworth, P. M. (1978). An experimental approach to the impress method of teaching reading. Reading Teacher, 31, 624-626.

Homan, S. P., Klesius, J. P., & Hite, C. (1993). Effects of repeated readings and nonrepetitive strategies on students' fluency and comprehension. Journal of Educational Research, 87, 94-99.

Horner, R. H., Carr, E. G., Halle, J., McGee, G., Odom, S., & Wolery, M. (2005). The use of single-subject research to identify evidence-based practice in special education. Exceptional Children, 71, 165-179.

Individuals with Disabilities Education Improvement Act of 2004, 20 U.S.C. § 1400 et seq.

Jitendra, A. K., Burgess, C., & Gajria, M. (2011). Cognitive strategy instruction for improving expository text comprehension of students with learning disabilities: The quality of evidence. Exceptional Children, 77, 135-159.

Kazdin, A. E. (2004). Evidence-based treatments: challenges and priorities for practice and research. Child and Adolescent Psychiatric Clinics of North America, 13, 923-940.

Kratochwill, T. R., Hitchcock, J., Horner, R. H., Levin, J. R., Odom, S. L., Rindskopf, D. M., & Shadish, W. R. (2010). Single-case designs technical documentation. Retrieved from http://ies.ed.gov/ncee/wwc/pdf/wwc_scd.pdf

Kratochwill, T. R., & Stoiber, K. C. (2002). Evidence-based interventions in school psychology: Conceptual foundations of the Procedural and Coding Manual of Division 16 and the Society for the Study of School Psychology Task Force. School Psychology Quarterly, 17, 341-389.

Lee, J., Grigg, W., & Dion, G. (2007). The Nation's Report Card: Mathematics 2007 (NCES Report 2007-494). Washington, D.C.: U.S. Department of Education, National Center for Education Statistics, Institute of Education Sciences.

Levy, B. A., Abello, B., & Lysynchuk, L. (1997). Transfer from word training to reading in context: Gains in reading fluency and comprehension. Learning Disability Quarterly, 20, 173-188.

Maggin, D. M., & Chafouleas, S. M. (2010). PASS-RQ: Protocol for assessing single-subject research quality. Unpublished research instrument.

Meyer, M. S., & Felton, R. H. (1999). Repeated reading to enhance fluency: Old approaches and new directions. Annals of Dyslexia, 49, 283-306.

Monda, L. E. (1989). The effects of oral, silent, and listening repetitive reading on the fluency and comprehension of learning disabled students. Unpublished doctoral dissertation, Florida State University, Tallahassee.

Morgan, R. T. (1976). "Paired reading" tuition: A preliminary report on a technique for cases of reading deficit. Child: Care, Health and Development, 2, 13-28.

Morgan, R., & Lyon, E. (1979). "Paired reading" -- A preliminary report on a technique for parental tuition of reading-retarded children. Journal of Child Psychology and Psychiatry, 20, 151-160.

National Institute of Child Health and Human Development. (2000). Report of the National Reading Panel. Teaching children to read: An evidence-based assessment of the scientific research literature on reading and its implications for reading instruction: Reports of the subgroups (NIH Publication No. 00-4754). Washington, D.C.: U.S. Government Printing Office.

No Child Left Behind (NCLB) Act of 2001, 20 U.S.C.A. § 6301 et seq. (West 2003)

Odom, S. L., Brantlinger, E., Gersten, R., Horner, R. H., Thompson, B., & Harris, K. R. (2005). Research in special education: Scientific methods and evidence-based practices. Exceptional Children, 71(2), 137-148.

Perie, M., Grigg, W., & Donahue, P. (2005). The nation's report card: Reading 2005. (NCES Report 2006-451). Washington, D.C.: U.S. Department of Education, National Center for Educational Statistics, U.S. Government Printing Office.

Reutzel, D. R., & Hollingsworth, P. M. (1993). Effects of fluency training on second graders' reading comprehension. Journal of Educational Research, 86, 325-331.

Samuels, S. J. (1979). The method of repeated readings. The Reading Teacher, 32, 403-408.

Schoenfeld, A. H. (2006). What doesn't work: The challenges and failure of the What Works Clearinghouse to conduct meaningful reviews of studies of mathematics curricula. Educational Researcher, 35, 13-21.

Shany, M. T., & Biemiller, A. (1995). Assisted reading practice: Effects on performance for poor readers in grades 3 and 4. Reading Research Quarterly, 30, 382-395.

Slavin, R. E. (1989). PET and the pendulum: Faddism in education and how to stop it. Phi Delta Kappan, 70, 752-758.

Slavin, R. E. (2002). Evidence-based education policies: Transforming educational practice and research. Educational Researcher, 31, 15-21.

Slavin, R. E. (2008). What works? Issues in synthesizing educational program evaluations. Educational Researcher, 37(1), 5-14.

Slocum, T. A., Detrich, R., & Spencer, T. D. (2012). Evaluating the validity of systematic reviews to identify empirically supported treatments. Education and Treatment of Children, 35(2), 201-233.

Stenhoff, D. M., & Lignugaris/Kraft, B. (2007). A review of the effects of peer tutoring on students with mild disabilities in secondary settings. Exceptional Children, 74, 8-30.

Stout, T. W. (1997). An investigation of the effects of a repeated reading intervention on the fluency and comprehension of students with language-learning disabilities. Unpublished doctoral dissertation, Georgia State University, Atlanta.

Sutton, P. A. (1991). Strategies to increase oral reading fluency of primary resource students. Unpublished manuscript, Nova University.

Swain, K. D., & Allinder, R. M. (1996). The effects of repeated reading on two types of CBM: Computer maze and oral reading with second-grade students with learning disabilities, Diagnostique, 21, 51-66.

Therrien, W. J. (2004). Fluency and comprehension gains as a result of repeated reading: A meta-analysis. Remedial and Special Education, 25, 252-261.

Vaughn, S., Moody, S. W., & Schumm, J. S. (1998). Broken promises: Reading instruction in the resource room. Exceptional Children, 64, 211-225.

Wendt, O., & Miller, B. (2012). Quality appraisal of single-subject experimental designs: An overview and comparison of different appraisal tools. Education and Treatment of Children, 35(2), 235-268.

What Works Clearinghouse. (2006). What Works Clearinghouse phonological awareness training intervention report. Retrieved January 14, 2008, from http://ies.ed.gov/ncee/wwc/pdf/WWC_Phonological_Awareness_121406.pdf

What Works Clearinghouse. (2007a). What Works Clearinghouse beginning reading topic report. Retrieved from http://ies.ed.gov/ncee/wwc/pdf/BR_TR_08_13_07.pdf

What Works Clearinghouse. (2007b). What Works Clearinghouse beginning reading topic report, technical appendix. Retrieved from http://ies.ed.gov/ncee/wwc/pdf/BR_APP_08_13_07.pdf

What Works Clearinghouse. (2007c). What Works Clearinghouse elementary school math topic report. Retrieved from http://ies.ed.gov/ncee/wwc/pdf/ESM_TR_07_16_07.pdf

What Works Clearinghouse. (2007d). What Works Clearinghouse evidence review protocol for beginning reading interventions. Retrieved from http://ies.ed.gov/ncee/wwc/PDF/BR_protocol.pdf

What Works Clearinghouse. (2007e). What Works Clearinghouse middle school math topic report. Retrieved from http://ies.ed.gov/ncee/wwc/pdf/MSM_TR_07_30_07.pdf

What Works Clearinghouse. (2008). Procedures and standards handbook (Version 2.0). Retrieved from http://ies.ed.gov/ncee/wwc/DocumentSum.aspx?sid=19

Young, A. R., Bowers, P. C., & MacKinnon, G. E. (1996). Effects of prosodic modeling and repeated reading on poor readers' fluency and comprehension. Applied Psycholinguistics, 17, 59-84.

Breda V. O'Keeffe, University of Utah
Timothy A. Slocum, Utah State University
Cheryl Burlingame, University of Connecticut
Katie Snyder, Utah State University
Kaitlin Bundock, University of Utah