# An Analysis of Statistical Power in Behavioral Accounting Research

ABSTRACT: Attention to statistical power and effect size can improve the design and the reporting of behavioral accounting research. Three accounting journals representative of current empirical behavioral accounting research are analyzed for their power (1 - [beta]), or control of Type II errors ([beta]), and compared to research in other disciplines. Given this study's findings, additional attention should be directed to the adequacy of sample sizes and study design to ensure sufficient power when Type I error is controlled at [alpha] = .05 as a baseline. We do not suggest replacing traditional significance testing, but rather augmenting it with the reporting of [beta] to complement and interpret the relevance of a reported [alpha] in any given study. In addition, the presentation of results in alternative formats, such as those suggested in this study, will enhance the current reporting of significance tests. In turn, this will allow the reader a richer understanding of, and an increased trust in, a study's results and implications.

Statistical significance testing is such an integral part of behavioral accounting research that the importance of demonstrating statistical significance is probably unquestioned by researchers. Yet for decades, researchers in other disciplines such as psychology, education, and management have discussed deficiencies of null-hypothesis testing for making inferences in the behavioral sciences. Critics argue that results of statistical significance testing are often misinterpreted and that the likelihood of Type II errors (failing to reject [H.sub.0] when [H.sub.0] is false) is ignored (Brewer 1972; Cohen 1992; Greenwald et al. 1996; Mone et al. 1996). The debate among methodologists in psychology on ascribing meaning to failure to reject the null hypothesis has gone so far that the American Psychological Association (APA) is considering banning significance tests from its journals and devoted the January 1997 issue of Psychological Science to this question (Shrout 1997).

Accounting researchers would probably accept that the power of a statistical test (the probability of rejecting [H.sub.0] when [H.sub.0] is false) is important, but they may not be aware of the attention that has been given to the calculation and reporting of statistical power in other disciplines such as psychology and education. Yet the discussion is especially pertinent to behavioral accounting researchers who often employ methodologies that have been derived from research in these disciplines. Attention to statistical power and effect size [1] can improve both the design and the reporting of behavioral accounting research. For example, if a researcher cannot reject the null hypothesis, then the ability to demonstrate that the statistical tests performed were of sufficient power to detect an effect would strengthen conclusions that could be drawn from the research.

In particular, Burgstahler (1987, 204) argues that many accounting researchers are "Bayesians who revise their prior beliefs based on observed empirical evidence." If accounting studies are designed such that their hypothesis tests are of low power, then "little probability revision should be induced regardless of whether significant results are observed" (Burgstahler 1987, 212) (emphasis added). In other words, a highly significant test with very low power should "properly have little or no impact on the beliefs of a Bayesian" (Burgstahler 1987, 203). However, most published accounting research reports only significance levels without reporting power. How can readers judge the true impact of a study's findings on their prior beliefs if they cannot assess the actual strength and reliability of those findings?

In this paper, after a brief review of the concepts of statistical power and effect size, we report an analysis of the statistical power of published behavioral accounting research from three journals, Issues in Accounting Education, Behavioral Research in Accounting, and Journal of Management Accounting Research, published 1993 through 1997, and compare our results to retrospective power analyses in other disciplines. This selection of accounting journals allows a comparison of statistical power among studies that use student subjects and studies that use professional accountants as subjects. Obtaining a sufficient sample size when using professionals as subjects is often a critical issue in designing behavioral accounting research. We expect that studies using student subjects will be more powerful because researchers can obtain larger samples at a lower cost than when using professional accountants as subjects.

There is one previous study on statistical power that examines accounting research. Lindsay (1993) analyzed statistical power in studies on budgetary planning and control. Results were analyzed by journal to detect trends in the reporting of power due to type of journal, and articles were examined for evidence that researchers incorporated power considerations into planning their studies. This study extends Lindsay's (1993) research in three ways: (1) by significantly increasing the number of studies included in the relevant time period; [2] (2) by analyzing more recent research to assess changes in power in response to Lindsay (1993) and other authors of power analyses; and (3) by not restricting the analysis to a single research topic. Another contribution of this study is a series of techniques through which accounting researchers can report their research results to provide more information on the power of tests and on effect size.

STATISTICAL POWER ANALYSIS

In 1962, Jacob Cohen published the first retrospective study of the statistical power of published research. In what is now considered a classic of power analysis, Cohen argues that researchers in psychology place disproportionate attention on control of Type I error (i.e., concluding that there is a relation or effect when there is none), while in designing research they largely ignore the power of a statistical test, which is related to Type II error. As Cohen (1988, 1) noted, "Since statistical significance is so earnestly sought and devoutly wished for by behavioral scientists, one would think that the a priori probability of its accomplishment would be routinely determined and well understood." Yet he demonstrated that studies published in the 1960 Journal of Abnormal and Social Psychology had, on average, very low power to detect even a moderate effect in the population. Subsequent studies in a number of disciplines have found similar results.

Power Determination

Type II error is the probability that a statistical test will fail to reject the null hypothesis when it is false. The probability of a Type II error is referred to as [beta], and power is (1 - [beta]). Power is primarily a function of three determinants: the level of significance ([alpha]), sample size (n), and the effect size in the population (the difference between the null and the alternative hypothesis). In a retrospective power analysis of published research, the power of reported statistical tests can be computed if values can be obtained for the three power determinants (Cohen 1962). For example, the following formula is used to calculate the power of a one-tailed test of differences between two population means:

[beta] = Probability [Z [less than] Z[alpha] - d[square root of](n/2)], so Power = 1 - [beta]

where [beta] = probability of Type II error, d = effect size = [absolute value of ([[micro].sub.0] - [[micro].sub.a])]/[sigma], [alpha] = level of significance, n = the (equal) size of each sample, [[micro].sub.0] and [[micro].sub.a] = the respective means of the two populations, and [sigma] = the common standard deviation (the two population standard deviations are assumed to be equal).
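This calculation can be sketched in a few lines of code. The sketch below uses a normal approximation (Cohen's 1988 tables are based on the noncentral t distribution, so tabled values may differ slightly), and the function name and the illustrative sample of 30 per group are our own:

```python
from math import sqrt
from statistics import NormalDist

def one_tailed_power(d, n_per_group, alpha=0.05):
    """Power of a one-tailed, two-sample z test of means:
    beta = P(Z < z_alpha - d * sqrt(n/2)), power = 1 - beta."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha)                 # one-tailed critical value (1.645 at alpha = .05)
    beta = z.cdf(z_alpha - d * sqrt(n_per_group / 2))
    return 1 - beta

# e.g., a medium effect (d = .50) with 30 subjects per group
print(round(one_tailed_power(0.50, 30), 2))        # → 0.61
```

Note that power falls well short of the conventional .80 at this sample size, which is the pattern the retrospective analyses below document.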

Level of Significance

Traditionally, researchers have placed more emphasis on avoiding Type I errors than Type II errors (Pollard 1993; Schmidt 1996). "As we do not wish the research literature to be riddled with spurious effects, nor to encourage pointless experiments that build on such effects, we are more concerned with avoiding Type I errors" (Pollard 1993, 450). However, while [alpha] levels have been controlled, Cohen (1962) demonstrated that [beta] levels had been ignored in psychological research and that the power of published research ranged from 50 to 80 percent. In practice, this means that a study with power of 55 percent has a Type II error rate of 45 percent, making it difficult to accept that study's results without serious reservations about their validity.

Cohen (1988) suggests that the conventional Type II error rate should be 0.20, which would set power conventionally at 0.80. A materially smaller power would result in an unacceptable risk of Type II error, while a significantly larger value would probably require a larger sample size than is generally practical (Cohen 1992). Setting [beta] at 0.20 is consistent with the prevailing view that Type I error is more serious. Since [alpha] is conventionally set at 0.05, Cohen suggested setting [beta] at four times that value (Cohen 1988, 1992).

Sample Size

The second determinant of power is sample size. Power increases as the number of observations increases. As the sample size increases, the standard deviations of the sampling distributions for [H.sub.0] and [H.sub.1] decrease, which results in less overlap of the distributions and increased power (Sedlmeier and Gigerenzer 1989). The relationship of sample size and statistical power is especially salient for behavioral accounting researchers. Since behavioral accounting research focuses on the response of individuals to accounting issues or information, researchers can rarely use archival data with a large number of observations. Laboratory experiments, and even surveys, often yield smaller samples out of necessity because of the cost of data collection.

Effect Size

The final determinant of power is effect size (d), the true size of the difference between [H.sub.0] and [H.sub.1] (the null hypothesis is that the effect size is 0). Alternatively, effect size can be described as the strength of a relationship among two or more variables (Sawyer and Ball 1981). Other things being equal, the greater the effect size, the greater the power. Probably the most difficult aspect of power analysis is specifying, or at least estimating, the effect size. Although effect size is rarely known in advance, the researcher must have some idea of the degree to which [H.sub.0] is believed to be false (the effect size) in order to determine statistical power of the test and necessary sample size (Cohen 1988, 1992). While the determination of effect size is subjective, a researcher must have some idea a priori of the effect size; otherwise, why would the researcher be studying a given phenomenon?

If the effect size is unknown, which is frequently the case, then it must be estimated in order to determine the power of a test. Fortunately, Cohen (1988) has facilitated estimation of an effect size so that power can be calculated. Based on a review of prior behavioral research, he developed operational definitions for what can be described as small, medium, and large effect-size values for each type of statistical test. Medium effect size is intended to represent an effect "likely to be visible to the naked eye of a careful observer" (Cohen 1992, 156), while small and large effects were set subjectively to be noticeably different from medium effects. [3] Sedlmeier and Gigerenzer (1989) subsequently found that his judgment of a medium effect size closely corresponded to the actual sample median effect size of the articles in his sample, and his definitions have been supported by post hoc calculations of effect size in related research areas (Haase et al. 1982). Therefore, most subsequent power studies have relied on Cohen's (1988) definitions.

Small effects are more likely to be seen in studies of personality and social psychological research because measures used in that type of research often have lower validity (Cohen 1988). Large effects are more likely in research fields such as experimental psychology where there may be experimental control and more reliable measurements. In behavioral accounting research, as in many behaviorally focused disciplines, studies should be designed to detect, at a minimum, medium effect size in order to provide valid information to the reader. [4]

Cost/Benefit Analysis of Reporting Power

A small effect size coupled with a small sample size will yield research with insufficient power. While the benefits of increasing the power of a study have been discussed, there are other benefits, as well as significant costs, associated with increasing power. Expecting a researcher to consider power and control for Type II errors when analyzing data has been likened to expecting a researcher to control Type I errors more stringently by changing the allowable standard for [alpha] from .05 to .01. While the benefit is increased reliability of the statistical inferences, does this benefit outweigh the cost of developing a large enough sample to achieve power of .80? Given the results of prior studies reported in Table 1, it seems either that editors and reviewers have not considered power important, or that the enhanced reliability is not cost justified.

The paramount cost associated with increasing the power of studies is that of achieving the sample size necessary to meet the power requirements. For example, Cohen (1992, 158) calculated that at [alpha] = .05, a sample size of 128 (64 in each group) is required to detect a medium effect size if the researcher is using the t-test for means; at [alpha] = .01, the required sample size jumps to 190. It may be that the difficulty of obtaining adequate sample sizes leads behavioral accounting researchers to disregard or not report their studies' power.
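Required sample sizes of this kind can be approximated with a short search over n. The sketch below uses a two-tailed normal approximation (the function name is ours), so it lands within a participant or two per group of the t-based figures Cohen reports:

```python
from math import sqrt
from statistics import NormalDist

def n_per_group_needed(d, alpha=0.05, target_power=0.80):
    """Smallest per-group n giving the target power for a two-tailed,
    two-sample z test of means (normal approximation)."""
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)
    n = 2
    while True:
        ncp = d * sqrt(n / 2)                      # shift of the test statistic under H1
        power = z.cdf(ncp - z_crit) + z.cdf(-ncp - z_crit)
        if power >= target_power:
            return n
        n += 1

print(n_per_group_needed(0.50, alpha=0.05))        # ~63 per group (Cohen's t tables: 64)
print(n_per_group_needed(0.50, alpha=0.01))        # ~94 per group (Cohen: 190 in total)
```

The jump in required n when [alpha] is tightened from .05 to .01 illustrates the cost trade-off discussed above.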

A second serious cost is caused by the need for the behavioral accounting researcher to subjectively categorize what level of effect size (small, medium, or large) is appropriate for the type of study in question. This is not easily or intuitively done. Consistent with Cohen's (1988) work, a medium-effect size could be assumed to be the norm for behavioral accounting research.

The benefits of conducting studies with adequate controls on both Type I and Type II errors derive from an increased reliability in the findings of the reported statistical analyses. Increased reliability allows readers to realistically revise (or not) their prior beliefs in a Bayesian framework. Another potential benefit could be an increase in the publication of studies that fail to reject the null hypothesis with very high power. Such studies should receive the same consideration for publication now accorded studies that control only for Type I error, many of which report highly significant findings despite very low power.

Power Studies in Accounting and Other Disciplines

Cohen's (1962) analysis of the power of studies published in the 1960 volume of Journal of Abnormal and Social Psychology was the first retrospective power analysis and the basis for power analyses that followed it. Cohen (1962) calculated power for three effect sizes--small, medium, and large--based on the relevant metric-free population parameter. He found mean power of 0.18, 0.48, and 0.83 for detecting small, medium, and large effects, respectively, and concluded that the investigators publishing in the Journal that year had a relatively poor chance of rejecting their null hypothesis unless the effect size was large.

Subsequent power analyses have been conducted in a variety of disciplines, most notably communications, education, and psychology (see Table 1). In particular, there have been several studies on the power of psychological research, using different journals and different time frames than Cohen's (1962) study. Examples of power analyses in other disciplines, starting with the seminal research by Cohen (1962), and the results of this study, are presented in Table 1. A comparison across disciplines reveals that the average power to detect small, medium, and large effect sizes for nonbusiness journals was 0.23, 0.61, and 0.84, respectively. The power of studies analyzed in business-related journals was 0.26, 0.71, and 0.91 for small, medium, and large effects, respectively.

Sedlmeier and Gigerenzer (1989) suggest that the concept of statistical power has been largely ignored, despite Cohen's (1988) work, because it is not a concept in Fisherian statistics. Fisherian statistical methods dominate fields such as psychology, finance, medicine, and accounting. They also suggest that the reason power has not increased significantly over time is the frequent use of alpha-adjusted procedures, which were not employed when Cohen did his original study. Alpha-adjusted procedures limit the experiment-wise error rate by reducing the error rate (alpha) per comparison. [5] Since power increases with alpha, the use of alpha-adjusted procedures decreases the power of the test. Sedlmeier and Gigerenzer (1989) examined the power of studies published in the 1984 volume of Journal of Abnormal Psychology and found that the median power of tests using alpha adjustments was .10, .28, and .72 for small, medium, and large effects, respectively.

In the years since Sedlmeier and Gigerenzer (1989), a comparison across the 35 years from 1962 to 1997 shows that, decade by decade, the power of published studies has increased (assuming that the journals listed in Table 1 are representative of social science and business research). Compared with Cohen's original 1962 study, which found power of .48 to detect a medium effect size, studies from the 1970s show an average power of .60. Power increases to .66 in the 1980s studies, while the average power of the 1990s studies is .70. The increase may reflect greater awareness of the importance of power in research design. However, even the most recent average remains below the .80 conventionally required to detect a medium effect size.

There has been one retrospective power analysis of accounting research, but it was limited in scope. Lindsay (1993) analyzed the power of 43 studies on budgetary planning and control published from 1970 to 1987 in three accounting journals, Journal of Accounting Research, The Accounting Review, and Accounting, Organizations and Society. The mean power of these studies to detect small, medium, and large effect sizes was .16, .59, and .83, respectively. Given the small size of his sample (43 studies over an 18-year period), the lack of reporting by year and by publication, and the limitation to a single research topic, no conclusions can be drawn about trends in increasing power either by year or by publication. [6]

Although the power of published studies is generally sufficient to detect large effect sizes, reported power levels are still a matter of concern. While it is encouraging that power has improved over time, it is disheartening to realize that much published research does not control for Type II errors. The power of tests to detect a small effect size remains quite low, and a number of researchers have suggested that the effect size of interest is often small, particularly in exploratory research where there is less experimental control (Brewer 1972; Clark-Carter 1997). A well-designed study with insufficient power to detect significant results might never be reported if researchers are reluctant to submit research lacking significant results to journals (Chase and Chase 1976).

Most retrospective power studies have been done on psychological research, which often uses student subjects. Certain behavioral accounting research may have even lower statistical power than psychological research given the difficulty of accessing large numbers of accounting professionals to serve as research subjects. However, this should not be the case when considering behavioral accounting research using student subjects.

METHOD

The journals included in this study were chosen as representative of the wide variety of recently published empirical studies of behavioral accounting research. This study includes all regular and supplemental issues of Issues in Accounting Education, Behavioral Research in Accounting, and Journal of Management Accounting Research published during the five-year period 1993-1997. The three journals selected yielded 96 behavioral research articles over the five-year period, more than double the number of articles that Lindsay (1993) was able to identify over an 18-year period.

There were ten issues of Issues in Accounting Education (IAE), Volumes 8-12, in the 1993-1997 period, with a total of 82 articles exclusive of instructional resource and Educator's Forum articles. Of these, 34 articles contained a total of 559 statistical tests, for an average of 16 tests per article. For the same time period, the eight issues of Behavioral Research in Accounting (BRIA), Volumes 5-9 plus Supplements, contained 80 articles, of which 48 presented the results of 819 statistical tests, for an average of 17 tests per article. Of the 48 articles in Volumes 5-9 of the Journal of Management Accounting Research (JMAR), 14 reported a total of 404 statistical tests, for an average of 29 tests per article.

Statistical power was assessed at small, medium, and large effect-size levels using Cohen's (1992) values for common statistical tests, such as the t-test for means, which are presented in Table 1. In total, the 96 articles reported 1,782 statistical tests, each evaluated for statistical power using Cohen's (1988) standard power tables. [7] For example, if a study were undertaken with 100 participants (50 male, 50 female, so [n.sub.1] = [n.sub.2] = 50) to detect differences (nondirectional, therefore two-tailed) in learning styles, and alpha were set at .05, the power of the study at different effect sizes would be as follows:

Power to detect small (d = .20) effect sizes .17

Power to detect medium (d = .50) effect sizes .70

Power to detect large (d = .80) effect sizes .98

This study could be considered to have sufficient power ([greater than or equal to] .80) only if the researcher intended to detect a large effect size, which is not usually the case. A sample of 128 participants (where [n.sub.1] = [n.sub.2] = 64) would be necessary to test for a medium-sized effect. The methodology in this study of counting every test in each article (even though tests within an article typically reuse the same sample) is consistent with that of all the retrospective power analyses presented in Table 1. Power varies not only with sample size, but also with the type of statistical test used in a study.
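The three power figures in the example above can be reproduced without tables under a normal approximation to Cohen's (1988) t-based values, agreeing with the listed .17, .70, and .98 to within a point of rounding:

```python
from math import sqrt
from statistics import NormalDist

def two_tailed_power(d, n_per_group, alpha=0.05):
    """Power of a two-tailed, two-sample z test of means."""
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)              # 1.96 at alpha = .05
    ncp = d * sqrt(n_per_group / 2)                # shift of the test statistic under H1
    # probability the statistic lands in either rejection region
    return z.cdf(ncp - z_crit) + z.cdf(-ncp - z_crit)

for label, d in [("small", 0.20), ("medium", 0.50), ("large", 0.80)]:
    print(f"{label:6} d = {d:.2f}: power = {two_tailed_power(d, 50):.2f}")
```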

Certain methodological criteria were standardized so that results of this study could be compared with prior power analyses. The use of three effect-size levels allows for a comparison of the power of accounting research with that reported in other disciplines, and for a comparison across the three accounting journals. In this study, as in prior power analyses, the level of significance is assumed to be [alpha] = .05 for all tests at each effect size, which is stringent enough to control Type I error but not overly restrictive. The power level is specified at .80 so that [beta] = .20; this level is considered an acceptable control of the risk of Type II errors (Cohen 1988, 1992). To be consistent with other studies, the nondirectional version of the null hypothesis was used in all power calculations. [8]

RESULTS AND DISCUSSION

For the behavioral sciences, Cohen (1992, 156) defines a medium effect size as "an effect likely to be visible to the naked eye of a careful observer," and recommends its use in data analysis unless the nature of the research explicitly demands the use of a small or large effect size. In counselor education research, "the assumption of an average medium effect size is certainly more warranted than for that of either a small or a large effect size" (Haase 1974, 130), an assumption echoed by researchers in social psychology (Cooper and Findley 1982) and applied psychology (Chase and Chase 1976). The nature of behavioral accounting research is similar to these fields, warranting the use of a medium effect size when assessing statistical results. In general, in educational and psychological research, unlike in medical and pharmaceutical research, failing to detect a small effect size does not result in personal injury or death. Usually, large effect sizes are easily detectable and do not require a statistical analysis to state the obvious. The following analyses therefore assume a medium effect size when assessing the power of the studies included during the relevant time period.

Results of Current Power Analyses

The results of the power analysis at [alpha] = .05 for each effect-size level are summarized by year and by journal in Table 2. There are no apparent trends in any of the three journals over the five-year period that indicate any awareness of, or attempt to address, the power issues raised by insufficient sample size and the consequent low power of the published articles. Only two volumes of IAE (1993 and 1996) and two volumes of JMAR (1995 and 1996) contained articles with an average power exceeding the acceptable level of .80 at the medium-effect-size level. However, the overall power for each effect-size level (.23, .71, .93) for the five-year period beginning in 1993 is a distinct improvement over that reported by Lindsay (1993) for the 18-year period ending in 1987 (.16, .59, .83). This improvement in power can be attributed to an increase in the sample sizes used in the more recent studies included in the current analysis, for several reasons: first, in both studies the critical level of [alpha] was .05; second, both studies used Cohen's (1988) definitions and levels of effect sizes for the appropriate statistical tests; and, third, the current study analyzes a much larger sample over a narrower period of time and more truly represents the state of current behavioral accounting research. Sample size is the only remaining factor that affects power.

The details by journal are presented in Table 3. When the power of the articles in the three accounting journals is assessed at the medium effect-size level and Type I error is controlled at [alpha] = .05, only 34 percent of the 96 articles had adequate control of Type II error ([beta]), with power of at least .80. In other words, in 66 percent of these articles, the probability of failing to reject a false null hypothesis ([beta]) ranged from 20 to 80 percent. If the aim of a study is to detect a small effect size, as could be the case in some accounting research, then only one of the 96 articles had power of at least .80 and could be considered reliable.

The beta-to-alpha ratio can be interpreted as a researcher's "conception of the relative seriousness of the two possible errors" (Sedlmeier and Gigerenzer 1989, 312). An acceptable ratio is assumed to be 4-to-1 or better, where power = .80, so [beta] = .20 and [alpha] = .05. For these three journals, the ratio is 6-to-1 ([beta] = 1 - .71 = .29, [alpha] = .05) when detecting medium-size effects. This ratio suggests that researchers are willing to design studies as though they believe that mistakenly rejecting the null hypothesis is six times more serious than mistakenly accepting it. While this may be due to the inability of the researcher to obtain a large enough sample, the result is still a study with low power, and therefore low reliability. When assessed by individual journal, the ratios are as follows: BRIA 6-to-1, IAE 6-to-1, and JMAR 4-to-1. At the level necessary to detect small effect sizes, the average ratio for the three journals is 15-to-1 ([beta] = .77, [alpha] = .05), ranging from 16-to-1 for BRIA and IAE to 13-to-1 for JMAR.

When comparing the three journals, JMAR appears to have acceptable power and beta-to-alpha ratios at the medium effect-size level (.80). This analysis is deceptive, however. When the individual years are analyzed, the 1996 volume of JMAR was an aberration, with articles having much higher power than in any of the other sample years, skewing the five-year average.

To further understand these results, articles were classified by type of sample: professional, student, and mixed (both professional and student). Of the 96 articles analyzed, 50 used professional subjects, such as auditors, CPAs, and accounting faculty; 39 used student subjects from graduate and/or undergraduate classes; and seven used a mix of both professional and student subjects. A priori, it was assumed that studies using student subjects would be more prevalent and more likely to have higher power because of (1) the ready availability of students; (2) the captive nature of students as participants; and (3) the difficulty of getting busy professionals to participate in surveys, case studies, or laboratory experiments.

The findings do not support that assumption. Regardless of sample composition, the results show a remarkable consistency regarding power. The power of studies using professional subjects to detect small effect sizes is .2017, compared to .2067 for student-sample studies; for medium effect sizes, .6791 vs. .6555; and for large effect sizes .9053 vs. .8939. Perhaps some studies using professional subjects were sponsored, endorsed, or funded by professional organizations, increasing sample sizes to mirror those of student-sample studies. Whether analyzed by journal or type of sample, the final interpretation is unchanged; power is still at levels too low to detect small and medium effect sizes in most studies.

Analyses of Power Studies over Time

A gross analysis of current results compared with the prior power studies reported in Table 1 was inconclusive regarding any trend of increasing power in published research over time. To uncover any such trends, we undertook regression analyses of the effects of time, number of statistical tests, and number of articles on each study's power, along with timeplots of each study's power over time.

When analyzing the data in the timeplot and regressions, the definition of the variable YEAR must be clear. It is not the publication date of the study that reported the power analyses for various journals, but the publication date of the articles in the journals that were analyzed. For example, Cohen's study was published in 1962, but analyzed articles in the Journal of Abnormal and Social Psychology published in 1960. It is the latter date that is used in the regression analyses which follow.

A timeplot of small, medium, and large effect sizes is presented in Figure 1. A visual inspection reveals a mixed but slightly increasing trend for medium and large effect sizes. However, there seems to be no such increase for small effect sizes over time. Data were also grouped by five-year increments, with similar results shown in Table 4, Panel A. Regressions with effect sizes as the dependent variables were then undertaken to better interpret the timeplot and incremental results.

The small, medium, and large effect sizes were the dependent variables in the regression models, with the year of the journal articles (YEAR), the number of articles in the journal studied (ARTICLE), and the number of tests (TEST) as independent variables. Results for these three regressions are presented in Table 4, Panel B. It was assumed a priori that studies have become more powerful over time, given the continued emphasis on power in articles stemming from Cohen's seminal 1962 study. This assumption is moderately supported by the regression results. Neither the number of articles nor the number of tests significantly affected overall power. Both medium and large effect sizes are significantly and positively related to YEAR; i.e., power has increased over time, with more recent studies more powerful (when looking for medium and large effect sizes) than older studies. Regressions using the five-year increments rather than individual years yielded very similar results in both significance and direction.

ENHANCING TRADITIONAL SIGNIFICANCE TESTING

Power analysis need not be cast as a substitute for traditional significance testing. However, its role as a complement to significance testing should not be ignored. Certainly testing statistical significance is essential to evaluating the inferential validity of a hypothesis against a chance distribution. However, significance testing does not provide information on the magnitude of an effect. A sufficiently large sample size could lead to rejection of the null hypothesis at an appropriate alpha level, and yet the effect size would have little practical importance (Haase et al. 1982).
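The last point can be made concrete with a quick calculation: with a large enough sample, even a trivially small standardized effect produces a "significant" p-value. A minimal sketch using the normal approximation to the two-sample t-test (all numbers illustrative, not from the paper):

```python
from math import sqrt, erf

def normal_cdf(x):
    """Standard normal CDF."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

# A trivially small standardized effect (d = 0.05) tested with 10,000
# subjects per group, using the normal approximation to the two-sample
# t-test.
d, n_per_group = 0.05, 10_000
z = d * sqrt(n_per_group / 2)            # standardized test statistic
p_two_tailed = 2 * (1 - normal_cdf(z))
print(f"z = {z:.2f}, two-tailed p = {p_two_tailed:.5f}")
# The result is "highly significant," yet the effect amounts to only
# 5 percent of a standard deviation: statistical, not practical,
# significance.
```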

There are a number of steps that researchers can take in reporting results that would provide readers with more information on the statistical power of the tests. Both [alpha] and [beta] levels can be reported, and journal editors should consider both levels and an effect-size estimate rather than emphasizing only [alpha] levels (Carver 1993). Then a decision on whether to replicate a study whose results are statistically insignificant, but consistent with the direction of the hypothesis, can be based on the power and effect size of the study and on whether either can reasonably be increased in a replication (Sawyer and Ball 1981). This would require that researchers, editors, and reviewers become more comfortable with the ex post subjective categorization of effect sizes as small, medium, and large.

Some psychologists have suggested replacing significance tests with confidence intervals. Reporting confidence intervals allows the reader to gauge statistical power by examining the size of the confidence interval; the smaller the interval, the greater the power (Hunter 1997; Loftus 1996). We do not suggest replacing significance testing with confidence intervals, but confidence intervals could be provided as additional information to increase the understanding and analysis of the research findings. For example, the earlier comparison of studies using professional samples with those using student samples would be greatly improved by the addition of confidence intervals. For detecting a medium effect size, the power of the 50 studies using professional samples was reported as .6791; the 95 percent confidence interval would be .6791 [+ or -] .0685. For the 39 student-sample studies, the power and confidence interval were .6555 and .6555 [+ or -] .0735, respectively. The additional confidence interval reporting allows readers to reach more informed conclusions about the strength of a study's findings.
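If one assumes the reported intervals are standard normal-approximation intervals for a mean (mean [+ or -] 1.96 s/[square root]n), the .6791 [+ or -] .0685 figure can be reproduced by back-solving for the standard deviation. The sd below is that back-solved value, an assumption rather than a number reported in the study:

```python
from math import sqrt

def ci95(mean, sd, n):
    """95 percent confidence interval for a mean (normal approximation)."""
    half_width = 1.96 * sd / sqrt(n)
    return mean - half_width, mean + half_width

# sd = .2471 is back-solved so that the half-width matches the reported
# .0685 for the 50 professional-sample studies; it is an assumption.
lo, hi = ci95(0.6791, 0.2471, 50)
print(f"mean power .6791, 95% CI ({lo:.4f}, {hi:.4f})")
```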

The practical, rather than the statistical, significance of the data should be the focus of reported research results (Kirk 1996). Greenwald et al. (1996) provide detailed recommendations for future research, which include the retention of significance testing, but with additional information not commonly reported:

* augment null hypothesis significance testing by the exact reporting of p-values;

* minimize the importance of the p = .05 level;

* set power at .80 so that p = .05 for all but the smallest sample sizes;

* report p-values for all tests, not just "significant" tests, and report data in enough detail, including effect sizes, to permit secondary analysis.

Other alternatives to null-hypothesis significance testing include plotting the data in place of tabular presentations, using meta-analyses, reporting effect sizes, and performing planned comparisons on a data set (Loftus 1996).

Bayesian analysis also offers a means of accepting the null (Greenwald 1975). In a Bayesian analysis, a posterior likelihood distribution can be constructed from the observed data. The posterior likelihood distribution can then be compared to an assumed uniform prior distribution to determine the posterior probability measure of acceptance of the null hypothesis (Greenwald 1975).
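A minimal sketch of this Bayesian idea, using a grid approximation with a uniform prior over a hypothetical binomial parameter (the data, the prior discretization, and the "null region" are all illustrative assumptions, not numbers from Greenwald 1975):

```python
from math import comb

# Grid-approximation sketch of the Bayesian route to "accepting the
# null." All numbers are illustrative assumptions.
k, n = 27, 50           # hypothetical observed data: k successes in n trials
null_halfwidth = 0.05   # treat |theta - 0.5| <= .05 as "the null"

grid = [i / 1000 for i in range(1001)]               # candidate theta values
likelihood = [comb(n, k) * t**k * (1 - t)**(n - k) for t in grid]
prior = [1.0] * len(grid)                             # uniform prior
posterior = [p * l for p, l in zip(prior, likelihood)]
total = sum(posterior)
posterior = [p / total for p in posterior]            # normalize to sum to 1

# Posterior probability mass falling inside the null region.
p_null = sum(p for t, p in zip(grid, posterior)
             if abs(t - 0.5) <= null_halfwidth)
print(f"posterior P(null region) = {p_null:.3f}")
```

The posterior mass inside the null region is the "posterior probability measure of acceptance" described above; with a larger n and the same observed proportion, that mass would shrink or grow depending on where the data fall.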

There are a few voices supporting traditional significance testing in psychological research (Harris 1997; Abelson 1997). Defenders of significance testing point out that confidence intervals are subject to the same Type I and Type II error rates, and that any of the suggested alternative procedures should augment, not replace, significance testing. One solution is to expand traditional hypothesis testing to a three-valued ("sensible") hypothesis-testing logic that includes two-tailed, split-tailed, and one-tailed versions of the two-sample t-test (Harris 1997). An alternative is to increase power without increasing sample size by choosing the most reliable measurement instrument given the parameters of the study's population variance (Williams and Zimmerman 1989).

CONCLUSIONS AND RECOMMENDATIONS

Only one-third of the studies in this analysis had power of 0.80 or more to detect a medium effect size. This should be of concern, especially since the power of these published studies could be higher than that of behavioral accounting research as a whole, published and unpublished. While Cohen (1962, 152) noted that "if anything, published studies are more powerful than those which do not reach publication, certainly not less powerful," this may no longer be valid. A well-designed study might never be published if its results are not statistically significant, causing researchers to be unaware of interesting studies with inadequate sample sizes that merit replication (Chase and Chase 1976). Further, even if editors are receptive to a well-designed study that failed to reject the null, researchers may be less likely to submit results for publication if the null has not been rejected (Greenwald 1975).

Power analysis can be a useful tool in research design. If the researcher specifies [alpha] and effect size a priori, then Cohen's (1988) tables can be used to furnish the minimal sample size required to attain a reasonably high power. If the sample size is fixed because of financial or practical constraints, then at least power can be computed for several effect sizes at a given [alpha]. If the power is unacceptably low for all effect sizes, then perhaps the study could be redesigned around a stronger statistical test that requires fewer subjects to attain an acceptable level of power (Brewer 1972). Experiments with higher power should increase the frequency of null hypothesis rejection, but awareness of statistical power may also cause researchers, reviewers, and editors to give null results more weight (Greenwald 1975).
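The a priori use described here can be sketched with the standard normal-approximation formula for the two-sample t-test. This is an approximation, not Cohen's exact tabled computation, but for a medium effect (d = .5) at [alpha] = .05 (two-tailed) and power = .80 it lands within rounding of the n = 64 per group in Cohen's (1992) table:

```python
def n_per_group(d, z_alpha=1.960, z_beta=0.842):
    """Approximate subjects per group for a two-sample t-test at
    alpha = .05 (two-tailed, z = 1.960) and power = .80 (z = 0.842),
    using the normal approximation."""
    return 2 * ((z_alpha + z_beta) / d) ** 2

# Small, medium, and large effects per Cohen's conventions.
for d in (0.2, 0.5, 0.8):
    print(f"d = {d}: about {n_per_group(d):.0f} subjects per group")
```

The approximation slightly understates the exact t-test requirement (e.g., it yields about 63 per group versus Cohen's tabled 64 for d = .5).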

Certainly there are costs and benefits associated with designing research to achieve higher power. Often the easiest way to increase statistical power is to increase sample size, but this can increase sampling costs, assuming that the researcher even has access to a larger pool of subjects. However, increasing the power of a test also increases the ability of a researcher to support retention of the null hypothesis when results are statistically insignificant, and there may be times when the intention of the study is to retain the null.

In conclusion, if these three journals are representative of a variety of current behavioral accounting research (including educational and professional issues, auditing, ethics, and many others), attention must be directed to adequacy of sample sizes and study design to ensure power of at least .80 ([beta] = 0.20), assuming Type I error is to be controlled at [alpha] = .05. At a minimum, the reporting of [beta] would complement and interpret the true value of a reported [alpha] in any given study. In addition, more reporting of results in alternative formats, such as those suggested in the prior section, will enhance the current reporting of significance tests, and allow the reader a richer understanding of, and an increased trust in, a study's results and implications.

We are indebted to two anonymous reviewers from the 1998 BRIA Conference for their comments and suggestions. Data are available upon request from the first author.

(1.) Effect size can be defined as "the degree to which the phenomenon is present in the population (i.e., the degree to which the 'null' hypothesis is not really null)" (Mazen et al. 1987, 404).

(2.) Lindsay (1993) reported on 43 studies published over an 18-year period, which is less than three articles per representative year.

(3.) For example, in an analysis of overall variance in scores that can be accounted for by differences in groups, small-, medium-, and large-effect sizes would account for approximately 1 percent, 6 percent and 14 percent, respectively, of the overall variance (Clark-Carter 1997).
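The percentages in this footnote follow from the standard conversion between a standardized mean difference d and the point-biserial correlation (equal group sizes assumed); a quick check:

```python
from math import sqrt

def variance_explained(d):
    """Proportion of variance accounted for by a standardized mean
    difference d (squared point-biserial r, equal group sizes)."""
    r = d / sqrt(d * d + 4)
    return r * r

# Cohen's small, medium, and large d map onto roughly 1 percent,
# 6 percent, and 14 percent of overall variance, matching the
# percentages cited in the footnote.
for d in (0.2, 0.5, 0.8):
    print(f"d = {d}: {100 * variance_explained(d):.1f}% of variance")
```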

(4.) Cohen (1992, 157-158) provides definitions of effect sizes and values for a number of statistical tests, as well as the required sample size to achieve power of 0.80.

(5.) Included in this group are F-tests and the Newman-Keuls, Duncan, and Scheffe procedures.

(6.) Bailey et al. (1999) surveyed 104 empirical articles in the journal Auditing between Fall 1990 and Fall 1996, and found 11 cases in which the authors' wording implied acceptance of the null hypothesis, but with no discussion of power. Bailey et al. (1999) provided an explanation of power analysis and applied it to two articles in their survey, but did not perform a retrospective power analysis.

(7.) A few studies using nonparametric tests could not be analyzed using Cohen's (1988) tables and were therefore omitted from the study.

(8.) If a test of a null hypothesis is directional (one-tailed) rather than nondirectional (two-tailed), then the power of the test increases. For example, in the study with 100 participants, changing from two-tailed to one-tailed would increase the power to detect small-, medium-, and large-effect sizes from .17 to .26, from .70 to .80 and from .98 to .99, respectively.
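The footnote's figures are consistent with a two-sample test with 50 participants per group under the normal approximation (the 50-per-group split is an assumption; the footnote states only the total of 100):

```python
from math import sqrt, erf

def normal_cdf(x):
    """Standard normal CDF."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def approx_power(d, n_per_group, z_crit):
    """Approximate power of a two-sample test (normal approximation)."""
    return normal_cdf(d * sqrt(n_per_group / 2) - z_crit)

# Moving from a two-tailed critical value (z = 1.960) to a one-tailed
# one (z = 1.645) raises power at every effect size, closely tracking
# the footnote's .17->.26, .70->.80, and .98->.99 figures.
for d in (0.2, 0.5, 0.8):
    two = approx_power(d, 50, 1.960)
    one = approx_power(d, 50, 1.645)
    print(f"d = {d}: two-tailed {two:.2f}, one-tailed {one:.2f}")
```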

REFERENCES

Abelson, R. 1997. On the surprising longevity of flogged horses: Why there is a case for the significance test. Psychological Science 8 (1) January: 12-15.

Bailey, C. D., L. L. Hoffman, and A. Sloan. 1999. Divulging statistical power in auditing research. Working paper, University of Central Florida.

Brewer, J. K. 1972. On the power of statistical tests in the American Educational Research Journal. American Educational Research Journal 9 (2): 391-410.

Burgstahler, D. 1987. Inference from empirical research. The Accounting Review 62 (1): 203-214.

Carver, R. 1993. The case against statistical significance testing. Journal of Experimental Education 61 (4): 287-292.

Chase, L. J., and S. J. Baran. 1976. An assessment of quantitative research in mass communication. Journalism Quarterly 53: 308-311.

-----, and R. B. Chase. 1976. A statistical power analysis of applied psychological research. Journal of Applied Psychology 61 (2): 234-237.

-----, and R. K. Tucker. 1975. A power-analytic examination of contemporary communication research. Speech Monographs 42 (3): 29-41.

Clark-Carter, D. 1997. The account taken of statistical power in research published in the British Journal of Psychology. British Journal of Psychology 88 (1): 71-83.

Cohen, J. 1962. The statistical power of abnormal-social psychological research: A review. Journal of Abnormal and Social Psychology 65 (3): 145-153.

-----. 1988. Statistical Power Analysis for the Behavioral Sciences. Second edition. Hillsdale, NJ: Lawrence Erlbaum.

-----. 1992. A power primer. Psychological Bulletin 112 (1): 155-159.

Cooper, H., and M. Findley. 1982. Expected effect sizes: Estimates for statistical power analysis in social psychology. Personality and Social Psychology Bulletin 8: 163-173.

Greenwald, A. G. 1975. Consequences of prejudice against the null hypothesis. Psychological Bulletin 82 (1): 1-20.

-----, R. Gonzalez, R. Harris, and D. Guthrie. 1996. Effect sizes and p values: What should be reported and what should be replicated? Psychophysiology 33 (2) March: 175-183.

Haase, R. F. 1974. Power analysis of research in counselor education. Counselor Education and Supervision 14 (December): 124-132.

-----, D. M. Waechter, and G. S. Solomon. 1982. How significant is a significant difference? Average effect size in research in counseling psychology. Journal of Counseling Psychology 29 (1): 58-65.

Harris, R. 1997. Significance tests have their place. Psychological Science 8 (1) January: 8-11.

Hunter, J., 1997. Needed: A ban on the significance test. Psychological Science 8 (1) January: 3-7.

Katzer, J., and J. Sodt. 1973. An analysis of the use of statistical testing in communication research. The Journal of Communication 23 (9): 251-265.

Kirk, R. 1996. Practical significance: A concept whose time has come. Educational and Psychological Measurement 56 (5) October: 746-759.

Kroll, R. M., and L. J. Chase. 1975. Communication disorders: A power analytic assessment of recent research. Journal of Communication Disorders 8: 237-247.

Lindsay, R. M. 1993. Incorporating statistical power into the test of significance procedure: A methodological and empirical inquiry. Behavioral Research in Accounting 5: 211-236.

Loftus, G. 1996. Psychology will be a much better science when we change the way we analyze data. Current Directions in Psychological Science 5 (6) December: 161-171.

Mazen, A. M., M. Hemmasi, and M. F. Lewis. 1987. Assessment of statistical power in contemporary strategy research. Strategic Management Journal 8 (July-August): 403-410.

Mone, M. A., G. C. Mueller, and W. Mauland. 1996. The perceptions and usage of statistical power in applied psychology and management research. Personnel Psychology 49 (Spring): 103-120.

Pollard, P. 1993. How significant is "Significance"? In A Handbook for Data Analysts in the Behavioral Sciences: Methodological Issues, edited by G. Keren and C. Lewis, 449-460. Hillsdale, NJ: Lawrence Erlbaum Associates.

Sawyer, A. G., and A. D. Ball. 1981. Statistical power and effect size in marketing research. Journal of Marketing Research 18 (8): 275-290.

Schmidt, F. L. 1996. Statistical significance testing and cumulative knowledge in psychology: Implications for training of researchers. Psychological Methods 1 (2): 115-129.

Sedlmeier, P., and G. Gigerenzer. 1989. Do studies of statistical power have an effect on the power of studies? Psychological Bulletin 105 (2): 309-316.

Shrout, P. E. 1997. Should significance tests be banned? Introduction to a special section exploring the pros and cons. Psychological Science 8 (1): 1-2.

Williams, R. H., and D. W. Zimmerman. 1989. Statistical power analysis and reliability of measurement. Journal of General Psychology 116 (4): 359-369.

Results of Selected Prior Power Studies

| Journal | Study | Publication Date | Small | Medium | Large |
|---|---|---|---|---|---|
| Journal of Abnormal and Social Psychology | Cohen | 1962 | .18 | .48 | .83 |
| American Educational Research Journal | Brewer | 1972 | .14 | .58 | .78 |
| Journal of Research in Science Teaching | Brewer | 1972 | .22 | .71 | .87 |
| The Research Quarterly | Brewer | 1972 | .14 | .52 | .80 |
| Journal of Communication | Katzer and Sodt | 1973 | .23 | .56 | .79 |
| Counselor Education and Supervision | Haase | 1974 | .10 | .37 | .74 |
| American Sociological Review | Spreitzer (cited in Chase and Tucker 1975) | 1974 | .55 | .84 | .94 |
| American Forensic Association Journal | Chase and Tucker | 1975 | .16 | .49 | .81 |
| Central States Speech Journal | Chase and Tucker | 1975 | .23 | .76 | .94 |
| Journal of Communication | Chase and Tucker | 1975 | .16 | .49 | .78 |
| Quarterly Journal of Speech | Chase and Tucker | 1975 | .28 | .61 | .88 |
| Southern Speech Communication Journal | Chase and Tucker | 1975 | .34 | .62 | .86 |
| Speech Monographs | Chase and Tucker | 1975 | .17 | .45 | .74 |
| Speech Teacher | Chase and Tucker | 1975 | .13 | .53 | .80 |
| Today's Speech | Chase and Tucker | 1975 | .08 | .26 | .56 |
| Western Speech | Chase and Tucker | 1975 | .11 | .44 | .78 |
| American Speech and Hearing Research; Journal of Communication Disorders | Kroll and Chase | 1975 | .16 | .44 | .73 |
| Journalism Quarterly; Journal of Broadcasting | Chase and Baran | 1976 | .34 | .76 | .91 |
| Journal of Applied Psychology | Chase and Chase | 1976 | .25 | .67 | .86 |
| Journal of Marketing Research | Sawyer and Ball | 1981 | .41 | .89 | .98 |
| Strategic Management Journal; Academy of Management Journal | Mazen et al. | 1987 | .23 | .59 | .83 |
| Journal of Abnormal Psychology | Sedlmeier and Gigerenzer | 1989 | .21 | .50 | .84 |
| Journal of Accounting Research; The Accounting Review; Accounting, Organizations and Society | Lindsay | 1993 | .16 | .59 | .83 |
| Academy of Management Journal | Mone et al. | 1996 | .20 | .62 | .93 |
| Administrative Science Quarterly | Mone et al. | 1996 | .32 | .80 | .95 |
| Journal of Applied Psychology | Mone et al. | 1996 | .35 | .82 | .95 |
| Journal of Management | Mone et al. | 1996 | .33 | .82 | .96 |
| Organizational Behavior and Human Decision Processes | Mone et al. | 1996 | .17 | .60 | .87 |
| Personnel Psychology | Mone et al. | 1996 | .30 | .83 | .97 |
| Strategic Management Journal | Mone et al. | 1996 | .18 | .63 | .87 |
| British Journal of Psychology | Clark-Carter | 1997 | .17 | .59 | .82 |

Summary of Power by Journal and by Year

| Journal | Time Period | n of Subjects | Small | Medium | Large |
|---|---|---|---|---|---|
| Behavioral Research in Accounting | 1993 | 524 | .17 | .67 | .92 |
| | 1994 | 1,135 | .16 | .62 | .90 |
| | 1995 | 387 | .16 | .65 | .95 |
| | 1996 | 1,230 | .14 | .56 | .87 |
| | 1997 | 1,830 | .24 | .78 | .96 |
| | Five-year average | | .20 | .69 | .93 |
| Issues in Accounting Education | 1993 | 2,469 | .32 | .83 | .96 |
| | 1994 | 2,298 | .22 | .66 | .90 |
| | 1995 | 219 | .13 | .55 | .85 |
| | 1996 | 914 | .18 | .83 | .97 |
| | 1997 | 1,834 | .16 | .67 | .93 |
| | Five-year average | | .21 | .70 | .92 |
| Journal of Management Accounting Research | 1993 | 669 | .27 | .76 | .94 |
| | 1994 | 472 | .19 | .75 | .96 |
| | 1995 | 143 | .38 | .99 | .99 |
| | 1996 | 574 | .76 | .98 | .99 |
| | 1997 | 219 | .19 | .65 | .78 |
| | Five-year average | | .35 | .80 | .94 |
| Average of three journals over five-year period | | | .23 | .71 | .93 |

Frequency and Cumulative Percentage Distributions of the Power of Articles in the 1993-1997 Issues by Journal

[Full frequency distributions omitted; summary statistics follow. Shaded areas in the original indicate articles falling within the acceptable power range (at least .80).]

| Journal (number of tests) | n of Articles | Small Mean | Small Median | Medium Mean | Medium Median | Large Mean | Large Median |
|---|---|---|---|---|---|---|---|
| Aggregate of the three journals (1,782 tests) | 96 | .23 | .15 | .71 | .69 | .93 | .78 |
| Behavioral Research in Accounting (819 tests) | 48 | .20 | .13 | .69 | .61 | .93 | .96 |
| Issues in Accounting Education (559 tests) | 34 | .21 | .15 | .70 | .75 | .92 | .96 |
| Journal of Management Accounting Research (404 tests) | 14 | .35 | .14 | .80 | .67 | .94 | .97 |

Means by Five-Year Increments and Regression Results

Panel A: Means by Five-Year Increments

| Five-Year Increment | Number of Journals | Small-Effect Size | Medium-Effect Size | Large-Effect Size | Number of Articles | Number of Tests |
|---|---|---|---|---|---|---|
| 1960-1964 | 1 | .18 | .48 | .83 | 70 | 2,088 |
| 1965-1969 | 0 | .00 | .00 | .00 | 0 | 0 |
| 1970-1974 | 18 | .21 | .56 | .81 | 437 | 10,546 |
| 1975-1979 | 1 | .41 | .89 | .98 | 23 | 475 |
| 1980-1984 | 2 | .22 | .55 | .84 | 98 | 4,735 |
| 1985-1989 | 1 | .16 | .59 | .83 | 43 | 1,871 |
| 1990-1994 | 14 | .24 | .71 | .92 | 309 | 28,572 |
| 1995-1999 | 9 | .26 | .74 | .92 | 50 | 924 |
| Subtotals | 46 | | | | 1,030 | 49,211 |
| Overall means | | .23 | .65 | .87 | 22 per journal | 1,070 per journal |

Panel B: Regression Results
Actual (Expected) Direction

| Effect Size | YEAR | ARTICLE | TEST | Probability [greater than] F | [R.sup.2] |
|---|---|---|---|---|---|
| Small | + (+) | - (?) | + (?) | .6474 | .0382 |
| Medium | + (+) [**] | - (?) | + (?) | .0039 [**] | .2700 |
| Large | + (+) [**] | - (?) | + (?) | .0006 [**] | .3360 |

(**) Significant at [alpha] = .01.

[Figure 1: Timeplot of small, medium, and large effect sizes by year of publication; graph omitted]

Author: Borkowski, Susan C.; Welsh, Mary Jeanne; Zhang, Qinke Michael
Publication: Behavioral Research in Accounting
Date: Jan 1, 2001