Statistical conclusion validity of early intervention research with handicapped children.
The ultimate goal of early intervention research with children who are identified as handicapped or at risk is to improve therapeutic and educational services. To achieve this goal, early intervention research must be clinically relevant and empirically valid. Cook and Campbell (1979) have identified four primary types of validity necessary to ensure the integrity of applied research. They are (a) statistical conclusion validity, (b) internal validity, (c) construct validity, and (d) external validity. The latter three types of validity (internal, construct, and external) are familiar to early intervention researchers. Statistical conclusion validity, however, has received little attention in the early intervention or developmental literature. Statistical conclusion validity deals with the sensitivity of the experimental and statistical manipulations, and the ability of the study procedures to reveal a relationship or covariation between variables (Cook & Campbell, 1979; Fagley, 1985).
Three questions are central to determining the statistical conclusion validity of an investigation. First, is the study sensitive enough to permit reasonable statements about relationships between variables? Second, if the investigation is sensitive enough, is there reasonable evidence from which to infer that the presumed cause and effect co-vary? Finally, if there is such evidence, how strongly do the variables co-vary (cf. Cook & Campbell, 1979, p. 39)? Most early intervention studies address the second question, which involves various quantitative assumptions and criteria used in statistical significance testing. The first and third questions, however, concerning statistical power and the estimation of effect size, respectively, are commonly ignored in early intervention research with handicapped children.
Researchers in early intervention have recently argued that the questions of importance to the field are no longer related to efficacy per se (Dunst, 1986; Guralnick & Bennet, 1987). Meisel (1985), for example, contends that the accumulated data provide a clear and positive answer to the question raised in Bronfenbrenner's (1975) classic paper, "Is early intervention effective?" Along this line, Casto, Mastropieri, and White have reported a series of quantitative reviews (Casto & Mastropieri, 1986; Casto & White, 1985; White & Casto, 1984; White, Mastropieri & Casto, 1984) that provide support for the argument that investigators should move beyond the basic efficacy question and begin to address issues related to how early intervention programs for handicapped children can be most effectively and efficiently implemented (Meisel, 1985).
Dunst (1986) recently argued that investigators should stop asking the question "Does early intervention work?" and begin to explore "What dimensions of early intervention are related to changes in different outcome measures?" and "How much variance does early intervention account for beyond that attributed to other formal or informal treatments?" (Dunst, 1986, p. 80). Traditional tests of statistical significance provide all-or-none information regarding the null hypothesis and are of limited value in addressing the questions proposed by Dunst (1986).
TYPE I AND TYPE II ERRORS
The problem of drawing inferences from early intervention studies is complicated by low statistical power and study findings that are reported as not statistically significant. The results of a statistical evaluation may lead the researcher to either (a) support the null hypothesis (i.e., decide based on the sample data that probability indicates that the null hypothesis is true) or (b) fail to support (reject) the null hypothesis. In the latter case, the researcher has produced indirect evidence to support the research hypothesis. The possibility exists that the researcher will make an error in the decision to reject or not reject the null hypothesis. A Type I error occurs when the researcher rejects the null hypothesis when in fact it was true and should have been supported. The probability of making a Type I error is equal to the significance level used in the investigation. For example, if the significance level is set at .05 then there is a .05 (5%) chance that the data will mislead the researcher into rejecting the null hypothesis when it is true.
The converse of a Type I error is a Type II error, which occurs when the researcher supports the null hypothesis when it is false and should have been rejected. Unlike Type I error, no simple, direct relationship exists between the level of significance and the probability of making a Type II error. A general inverse relationship exists between Type I and Type II errors, so that, as the probability of committing a Type I error decreases, the probability of making a Type II error increases and vice versa. The conventional .05 level of significance has traditionally been accepted as the best balance between Type I and Type II errors in behavioral science and educational research (Cowles & Davis, 1982).
Statistical power is usually defined in terms of the probability of making a Type II error, or beta. Specifically, power is equal to 1 - beta. The smaller the Type II error or beta, the greater the statistical power (Cohen, 1977). Statistical power is intimately related to the significance level, sample size, and the effect size associated with a study. The significance level and sample size are commonly understood concepts. The effect size refers to "the degree to which the phenomenon is present in the population" or "the degree to which the null hypothesis is false" (Cohen, 1977, pp. 9-10). The traditional null hypothesis referred to earlier assumes no difference exists between groups (i.e., the effect size is equal to zero). If the groups differ on some outcome measure, this difference is expressed as some non-zero number, which is the effect size. Cohen (1977) has provided standardized metrics to compute effect sizes from commonly used statistical tests.
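As an illustration of one such standardized metric, the following minimal sketch computes a d-index for a two-group comparison as the difference between group means divided by the pooled standard deviation. The function name `cohens_d` and the example scores are hypothetical and are not taken from any of the reviewed studies.

```python
import math
from statistics import mean, variance


def cohens_d(group_a, group_b):
    """Standardized mean difference using the pooled standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    # Pool the two sample variances, weighting each by its degrees of freedom.
    pooled_var = ((n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)) / (
        n_a + n_b - 2
    )
    return (mean(group_a) - mean(group_b)) / math.sqrt(pooled_var)


if __name__ == "__main__":
    # Hypothetical outcome scores for a treatment and a comparison group.
    treatment = [2, 4, 6]
    comparison = [1, 3, 5]
    print(round(cohens_d(treatment, comparison), 2))
```

A value of zero corresponds to the traditional null hypothesis of no difference between groups.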
The larger the effect size, the more powerful the statistical test, assuming a constant significance level and sample size. A similar reciprocal relationship exists between other variables. For example, greater statistical power is required to reveal a small effect size as statistically significant when the sample size and significance level are held constant.
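The relationship among effect size, sample size, significance level, and power can be sketched numerically. The example below is only an approximation under stated assumptions: it uses the normal distribution in place of the noncentral t-distribution for a two-sided, two-group comparison with equal group sizes, so its values will differ slightly from tabled values such as those in Cohen (1977). The function name `approx_power_two_group` is hypothetical.

```python
from statistics import NormalDist


def approx_power_two_group(d, n_per_group, alpha=0.05):
    """Approximate power of a two-sided, two-group test (normal approximation)."""
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)  # two-sided critical value
    # Expected value of the test statistic when the true effect size is d.
    shift = d * (n_per_group / 2) ** 0.5
    # Probability of landing in either rejection region.
    return z.cdf(shift - z_crit) + z.cdf(-shift - z_crit)


if __name__ == "__main__":
    # With sample size and alpha held constant, power rises with effect size.
    for d in (0.20, 0.50, 0.80):
        print(d, round(approx_power_two_group(d, n_per_group=20), 2))
```

When the effect size is zero the function returns the significance level itself, which matches the definition of the Type I error rate.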
The issue of statistical power becomes important when a change in research focus occurs. As noted previously, Dunst (1986) has proposed that researchers in early intervention focus their empirical efforts on differentiating the components of early intervention rather than addressing traditional efficacy questions such as "Does early intervention work?" Focusing on comparisons between different components or aspects of early intervention programs is likely to result in smaller treatment effects (effect sizes). This is true because all the groups being evaluated are usually receiving some component of the intervention program. When all groups are receiving intervention the difference between the groups on any given outcome measure is likely to be smaller than in those cases where one group receives treatment and the other does not. The net result of the proposed change in research focus may, therefore, be a loss in statistical power.
PURPOSE
In view of the change in research focus proposed by Dunst (1986) it seemed timely to explore the statistical conclusion validity (statistical power) of early intervention research to detect small, medium, and large treatment effects. A secondary purpose of this study was to examine the relationship between statistical power, treatment effect size, and cost-effectiveness in early intervention.
METHODS
Fifty-seven data-based research articles investigating the effectiveness of early intervention with children identified as handicapped were analyzed. The 57 articles were originally included in a comprehensive review by Dunst (1986) and met the following criteria: (a) the intervention began prior to 3 years of age, (b) the majority of children in each program had one or more organic impairments, and (c) the program included some type of evaluation component designed to assess the effect of the intervention (cf. Dunst, 1986; Dunst & Rheingrover, 1981).
Determination of Statistical Power
For purposes of the present analysis it was necessary to formulate a set of standard conditions on the basis of which the power of each statistical test of the primary hypothesis could be determined. The .05 significance level (Type I error rate) was uniformly assumed when determining power levels across all 57 studies. Further, whether or not otherwise specified, the nondirectional version of the null hypothesis was used for all tests. Thus, a two-sided test for normal, binomial, and t-distributions, as they are usually tabled and used in hypothesis testing, was adopted for all tests. Although this criterion may have produced underestimates of statistical power in some cases, it avoids the more serious problem of inflated significance levels and power in nonpredicted directions.
To determine the power of the statistical evaluations included in the early intervention articles, the type of statistical test and the sample size were coded for each test of a primary hypothesis included in the analyzed articles. Power coefficients were obtained by consulting the appropriate tables in Cohen's (1977) text, Statistical Power Analysis for the Behavioral Sciences. The power of each statistical test to assess a small, medium, or large effect size was read directly from the appropriate table in Cohen's (1977) text, or calculated by interpolation between tabled values. Effect sizes computed for different statistical tests have varying numerical ranges for what Cohen considers small, medium, and large effects. For example, for a study including a two-group comparison (t-test) the appropriate effect size is the d-index. A d-index of .20 to .50 is considered a small effect according to Cohen's guidelines. A d-index of .50 to .80 is labeled as medium, and d-indexes of over .80 are considered large effect sizes.
A note of caution is in order regarding Cohen's (1977) effect size labels. He states that the proposed labels are a "conventional frame of reference" to be used when no other basis for estimating the effect size is available (Cohen, 1977, p. 25). An effect size considered small or trivial in one area may be substantial in another area. The evaluation of effect size in a specific study should include considerations relating to the general difficulty of explaining the phenomenon under investigation, the sensitivity of the measures, and any restrictions placed on the population.
The procedure for determining power values used in this investigation is similar to that used by several previous researchers who have examined power in different areas of behavioral science and education (cf. Brewer, 1972; Chase & Chase, 1976; Freiman, Chalmers, Smith, & Kuebler, 1978; Orme & Combs-Orme, 1986).
Tabled power values were occasionally not available for certain nonparametric tests. When a nonparametric test was reported for which power values were not available, the power coefficients were obtained from the tables for parametric equivalents. Also, in estimating power for studies that used ANOVA with repeated measures, the denominator error term for a particular F-index was set at Cohen's small, medium, or large effect size value, depending on which effect size was being assessed. A medium effect size, however, was assumed for the remaining total denominator term. A medium effect size value was also assumed for the total effect of the covariate or covariates in studies where analysis of covariance was employed.
Coding of Power Data
All studies were read and coded by two investigators familiar with research design, statistical methodology, and early intervention research. The title, authors, and source of the study were recorded along with information on the type of statistical test or tests performed, the sample size for each test, and the total number of tests related to the primary hypothesis under investigation. Statistical tests that were clearly manipulation checks or concerned with reliability or other exploratory issues not directly associated with the primary hypothesis were not included in the analysis. Interrater agreement for the codings ranged from 91% to 100%. Despite the high level of agreement, any post hoc analysis of study results is necessarily somewhat arbitrary in determining the number of tests conducted and those directly related to the primary null hypothesis. The actual number of tests addressing the null hypothesis cannot be determined precisely without access to the original data.
RESULTS
Of the 57 articles contained in the original review by Dunst (1986), 49 were included in the final analysis of statistical power reported in this investigation.
Eight of the original 57 articles were single-subject studies or did not contain sufficient information concerning sample size or the statistical tests used to allow estimation of power for the three levels of effect size. Statistical power coefficients were estimated for a total of 484 quantitative tests contained in the 49 articles.
Table 1 presents the frequency and summary statistics for the three levels of effect size (small, medium, and large) for various power ranges. The power coefficients included in Table 1 are based on the average power for each article. For example, if an article contained five statistical tests of the primary hypothesis, power coefficients were determined for each of those tests using the procedures described earlier. The average power of the five tests in that article to detect a small, medium, and large effect size was then determined using the tables in Cohen's (1977) text, and those values are reported in Table 1. Thus, the unit of analysis for determining the power values that appear in Table 1 was the study itself.
The sample sizes for the early intervention studies included in the analysis are presented in Table 2. Each sample size is indicated by a combination of "stem and leaf." The "stem and leaf" table provides all the information of a traditional bar graph, but also shows the actual values for all sample sizes. The "stem" includes the initial or beginning value for the sample and appears at the left-hand margin of the table. The "leaf" values represent individual numbers, each of which is associated with the corresponding stem. For example, the numbers 3, 3, and 6 (leaf) to the right of 7 (stem) in Table 2 represent sample sizes of 73, 73, and 76 respectively. Below the stem and leaf plot, the minimum and maximum sample sizes are given along with the first and third quartile (Q1, Q3), mean, median, and standard deviation. Inspection of Table 2 reveals that the distribution of sample sizes is positively skewed, indicating that the majority of the samples in the reviewed studies were relatively small. The sample sizes ranged from 7 to 198 with a median of 25. Seventy-five percent of the samples included 40 or fewer subjects.
DISCUSSION
The results of this investigation suggest that much of the reported early intervention research with handicapped and at-risk children is characterized by low power and inadequate statistical conclusion validity. Early intervention investigators have a relatively poor chance of rejecting the null hypothesis unless the effect size they are evaluating is in the category Cohen (1977) defines as large. The argument could be made that much of the research in the field of early intervention will be concerned with small or medium effect sizes due to the nature of the variables under investigation, the limited number of sensitive and reliable measuring instruments, and the lack of opportunity for rigorous experimental control in applied settings.
In view of these limitations researchers must be concerned with both pre- and post-study power evaluations if they intend to find statistically significant effects and reduce the probability of Type II error in their research. Prestudy power analysis can provide the investigator with information that will ensure adequate power to detect small, medium, and large effects at statistically significant levels. Cohen (1977) and others (Kraemer & Thiemann, 1987) provide detailed guidelines for planning research, including methods to estimate effect size, sample sizes, and power.
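A prestudy calculation of this kind can be sketched as follows. The example estimates the number of subjects needed per group in a two-sided, two-group comparison. It is a rough normal approximation rather than the procedure given in the cited texts, and the target power of .80, the function name `approx_n_per_group`, and the use of .20, .50, and .80 as representative small, medium, and large d-index values are assumptions made here for illustration. Exact values based on the t-distribution tend to be slightly larger.

```python
import math
from statistics import NormalDist


def approx_n_per_group(d, power=0.80, alpha=0.05):
    """Approximate subjects per group for a two-sided, two-group comparison."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # critical value for the two-sided test
    z_power = z.inv_cdf(power)  # quantile matching the desired power
    # Standard normal-approximation formula, rounded up to whole subjects.
    return math.ceil(2 * ((z_alpha + z_power) / d) ** 2)


if __name__ == "__main__":
    for label, d in (("small", 0.20), ("medium", 0.50), ("large", 0.80)):
        print(label, approx_n_per_group(d))
```

Because the required sample size grows with the inverse square of the effect size, halving the expected effect roughly quadruples the number of subjects needed.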
The ability to reject a null hypothesis and thus detect a statistically significant effect is directly related to the power of the quantitative manipulations. The information in Table 1 indicates that the power of tests to detect small, medium, and large effects in the early intervention studies included in the analysis was relatively poor. The average (median) power to detect a large effect was only .46. One way of viewing the results contained in Table 1 is to determine the proportion of the tests that would meet the criterion of a Type II error level similar to the conventional Type I level, namely, .05 (power would then be at .95 or higher). In such a situation as much emphasis would be placed on Type II errors as Type I errors. Some authorities have suggested such an approach (Cohen, 1962). Table 1 reveals that none of the studies evidenced sufficient power to detect a small or medium effect size with a power of .95 or greater. Only 6 (12%) of the studies had adequate power to detect even a large effect size when the criterion for power was set at .95.
A more traditional approach to establishing power for an investigation is to use the formula: power = 1 - 4(alpha). For instance, if the .05 alpha level is selected for a study, then power would be established at 1 - 4(.05) or .80. Using this criterion, none of the studies evidenced sufficient power to detect a small effect. The percentage of studies with adequate power (.80 or greater) to detect medium and large effect sizes was 4% and 18% respectively. Table 1 reveals that none of the studies had even a 50% chance of successfully detecting a small effect size, assuming a .05 alpha level and a two-tailed test of significance.
IMPLICATIONS AND CONCLUSIONS
The problem of statistical conclusion validity and Type II errors is of particular concern to the field of early intervention in relation to the recommendations made by Dunst (1986), Meisel (1985), and others (see Bricker & Littman, 1982; Guralnick & Bennet, 1987; Sheehan & Keogh, 1982). These recommendations argue that the focus of early intervention research should shift from all-or-none questions of efficacy to comparing and evaluating various components of early intervention programs.
Research on implementation and service delivery models is certain to increase as the result of recent federal legislation (P.L. 99-457) expanding early intervention programming for handicapped children. Future research will be concerned with examining which types of early intervention programs are most cost effective, with what types of children, using what types of service delivery model, and in what settings (Guralnick & Bennet, 1987).
Figure 1 presents a diagram depicting decision rules for making evaluative determinations relative to cost-effectiveness and efficacy criteria. The diagram is based on a model developed by Fishman (1981) and is widely used by program evaluators, policy analysts, and administrators. The diagram of possible outcomes and costs for two different programs clearly indicates the appropriate action for all but two of the cells. The results of the present investigation should alert early intervention professionals to potential problems related to statistical conclusion validity and Type II errors when conducting a cost-effectiveness investigation using the model developed by Fishman (1981). A failure to consider factors such as low statistical power when two or more programs are being empirically compared may result in invalid inferences, poor programmatic decisions, and ineffective service delivery.
As early intervention programs and services expand, questions related to cost effectiveness are certain to receive increased empirical attention. Sacco (1982) has pointed out that in most cost-effectiveness studies two or more programs are quantitatively compared. If the two programs are found to be equally effective (i.e., no statistically significant difference exists between the programs), then the more expensive program is likely to be eliminated. Implicit in this logic is the empirical demonstration of equal effectiveness, which requires a research conclusion that the null hypothesis is "true." As noted earlier, statistical power and Type II errors become especially critical in decisions made concerning a failure to reject the null hypothesis.
Frequently, problems with the internal validity of a study are difficult to correct, particularly if the study is conducted in an applied setting. It is often impossible, for example, to randomly assign children to conventional treatment and control groups, for both practical and ethical reasons (Bricker & Littman, 1982). Problems associated with statistical conclusion validity, however, can be addressed more directly. While it may not be feasible to increase sample size in many early intervention studies, it is possible to determine the impact a particular sample size will have on the ability to achieve statistical significance. It is also possible to report post hoc power coefficients and effect size measures for specific studies. This information allows both the investigator and reader to more accurately interpret the study findings. An awareness of the issues associated with statistical conclusion validity will help ensure that future early intervention research will contribute to empirical clarification rather than statistical confusion.