# Using confidence intervals in supply chain and operations research.

Confidence intervals ... should be used for major findings in both the main text of a paper and its abstract. (Gardner and Altman, Statistics in Medicine, 1986, Vol. 292, p. 746)

INTRODUCTION

The usefulness of hypothesis testing methods has long been the subject of debate in the scientific community (Boring 1919; Kaiser 1960; Morrison and Henkel 1970; Hunter 1997; Wilkinson and the Task Force on Statistical Inference 1999), with both the tenor and tone of these criticisms becoming more pronounced in recent years (Harlow, Mulaik and Steiger 1997; Kline 2004). This controversy has resulted in much debate in the social sciences regarding the appropriate use of hypothesis testing. In an attempt to provide further clarity to this highly charged subject, we make the important distinction between informative and noninformative hypothesis testing and explain why, whenever possible, confidence intervals should replace the use of hypothesis testing.

Interestingly, business researchers have been mostly apathetic with regard to the hypothesis testing controversy. As a consequence, article-acceptance decisions in leading business journals, including those in well-regarded supply chain and operations management outlets, continue to rely almost exclusively on the results of hypothesis testing methods. In fact, and notwithstanding the sage advice from Gardner and Altman (1986), many supply chain management researchers remain unfamiliar with the use and interpretation of confidence intervals. For instance, out of 234 empirical articles published in 2005, 2006 and 2007 in the Journal of Operations Management, Production and Operations Management and the Journal of Supply Chain Management, only 26 articles (11.1 percent) reported a confidence interval. This finding is typical across the various business disciplines. Bonett and Wright (2007) reported on two general management journals, the Academy of Management Journal and Administrative Science Quarterly, in which only one of the more than 130 empirical articles published in 2003 and 2004 reported a confidence interval.

This continued reliance on hypothesis testing is the result of a number of misconceptions regarding populations, population parameters and the type of information provided by tests of hypotheses (Bonett and Wright 2007). In this article we will: (1) briefly review the basic concepts of a population and a population parameter, which provide the necessary foundation for the interpretation of confidence intervals; (2) outline limitations of statistical hypothesis testing; (3) provide five examples to illustrate how confidence intervals may be used in place of hypothesis testing methods; and (4) close with five publication guideline suggestions designed to foster a more effective use of confidence intervals in supply chain research. We begin our discussion with some basic definitions and ideas regarding populations and population parameters.

POPULATIONS AND POPULATION PARAMETERS

Research involving the planning, implementing and controlling of the operations of such business functions as the supply chain often requires the use of inferential statistical methods to answer the questions of interest. Answers to these research questions typically involve an understanding of the characteristics of some large set of units. The units might be firms, suppliers, customers, employees, products, packages, orders or items in inventory, just to name a few. The set of units under investigation is called the study population. Some examples of study populations in supply chain research are: all Fortune 1,000 firms, all customers in the firm's database, all employees who work in a network of firms, all packages shipped by a firm during some period of time, all orders received by the firm during some period of time and all items of a particular type in a firm's inventory.

The researcher will typically be interested in particular characteristics of the units that make up the study population. For instance, the researcher may want to study the amount of communication between a firm and its suppliers in a study population of all Fortune 1,000 companies. Examples of other characteristics include the length of time a supplier has worked with a firm, time-to-delivery expectations of customers, employee knowledge of inventory management, product quality, package delivery time, order size and age of inventory items. These characteristics may be measured using various scales or questionnaires.

It is often the case that a summary statistic of the measurements of all units in the study population can provide useful information for research or decision-making purposes. Commonly used summary statistics are means, proportions, standard deviations, correlations and slopes. The value of a summary statistic for the entire study population is called a population parameter.

Researchers may not have the time or resources to measure every unit in the study population and therefore may be unable to compute the population parameter of interest. In situations such as these, the researcher can instead measure a random sample of units from the study population and use inferential statistical methods to make certain types of statements about the population parameter value. A random sample of size n is selected in such a way that every possible sample of size n from the specified study population has the same chance of being selected.
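The definition of a random sample can be sketched in a few lines of Python. This is a minimal illustration only: it assumes that the study population can be listed as a sampling frame, and the unit labels and the helper name `draw_simple_random_sample` are hypothetical. The standard-library function `random.sample` draws n units without replacement so that every subset of size n is equally likely.

```python
import random

def draw_simple_random_sample(frame, n, seed=None):
    """Draw n distinct units from the sampling frame without replacement."""
    rng = random.Random(seed)  # a fixed seed makes the draw reproducible
    return rng.sample(frame, n)

# Hypothetical sampling frame: ID labels for a study population of 1,000 firms.
frame = [f"firm_{i:04d}" for i in range(1, 1001)]
sample = draw_simple_random_sample(frame, n=50, seed=2008)
print(len(sample), sample[:3])
```

Recording the seed alongside the sampling frame is one simple way to document how the sample was obtained.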

Most inferential methods can be classified into hypothesis testing methods and interval estimation methods. The difference between these two methods can be illustrated in the case of a simple linear regression model where variable x predicts variable y and [beta] is the unknown slope parameter. If the researcher were able to measure x and y for every unit in the study population, the value of [beta] could be computed and the researcher would be able to report its exact value. When [beta] has been estimated from a random sample, the researcher will not be able to determine the exact value of [beta] but instead will only be able to make specific types of vague and less-than-certain statements regarding its value.

In a hypothesis testing application, the researcher might want to test the null hypothesis [H.sub.0]: [beta] = 0, and if rejected, conclude that either [beta] > 0 or [beta] < 0. Concluding that [beta] > 0, for instance, is a vague statement about the value of [beta]. Furthermore, the researcher cannot be certain that [beta] > 0 but must admit, with probability [alpha], that [beta] might be [less than or equal to] 0. In hypothesis testing applications, the researcher has control over the value of [alpha] and typically sets [alpha] to be a small value such as .05 or .01.

Alternatively, the researcher could use the information in a random sample to specify a range of possible values for [beta], called a confidence interval, with some specified degree of confidence. For instance, the researcher might report a 95 percent confidence interval for [beta] equal to (2.1, 4.5). Like the hypothesis testing result, the confidence interval also is a vague statement about [beta] because only a range of possible values for [beta] is given. Furthermore, the researcher cannot be certain that [beta] is within this range but can be only 95 percent confident. Bonett and Wright (2007) provide more details on how confidence intervals may be interpreted using a degree-of-belief definition of probability rather than the relative frequency definition used in many introductory statistics texts.

Confidence intervals are almost always preferred to hypothesis tests because confidence intervals provide more information than hypothesis tests. Confidence intervals provide information about the magnitude of the population parameter. Confidence intervals may also be used to test certain types of hypotheses. For instance, in our example where the 95 percent confidence interval for [beta] is (2.1, 4.5), the researcher also could reject [H.sub.0]: [beta] = 0 and conclude that [beta] > 0.
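To make the slope example concrete, the following Python sketch estimates a simple regression slope and attaches a large-sample 95 percent confidence interval. Everything in it is an assumption made for illustration: the data are simulated, the function name `slope_confidence_interval` is hypothetical, and a normal critical value is used in place of the t critical value that would be more exact in small samples.

```python
import math
import random
from statistics import NormalDist

def slope_confidence_interval(x, y, confidence=0.95):
    """Least-squares slope with a large-sample (normal-theory) interval."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Residual sum of squares; the residual variance uses n - 2 degrees of freedom.
    sse = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    se = math.sqrt(sse / (n - 2) / sxx)
    # Normal critical value; a t critical value is more exact for small n.
    crit = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return slope, slope - crit * se, slope + crit * se

# Hypothetical random sample of 200 units (simulated for illustration only).
rng = random.Random(1)
x = [rng.uniform(0, 10) for _ in range(200)]
y = [2.0 + 3.3 * xi + rng.gauss(0, 8) for xi in x]

slope, lower, upper = slope_confidence_interval(x, y)
print(f"slope = {slope:.2f}, 95% CI = ({lower:.2f}, {upper:.2f})")
if lower > 0:
    print("Interval excludes zero: reject H0: beta = 0 and conclude beta > 0")
```

The single interval delivers both the test decision and a range of plausible slope values, which is the sense in which it carries more information than the test alone.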

LIMITATIONS OF HYPOTHESIS TESTING

Well aware of the limitations of hypothesis testing, Graybill and Iyer (1994) recommended "that traditional statistical tests of hypotheses for a parameter, say [theta] (where one rejects or does not reject the null hypothesis) never be used if a confidence interval for [theta] is available because confidence intervals always provide more information than tests" (p. 35). We note that [theta] may represent a single parameter, a difference or ratio of parameters, or a linear or nonlinear function of multiple parameters. Unfortunately, and contrary to Graybill and Iyer's (1994) advice, business researchers rely almost exclusively on hypothesis-testing methods to address research questions and report results as being "significant" when the null hypothesis has been rejected or "nonsignificant" when the null hypothesis has not been rejected. In fact, a "significant" result is often a necessary criterion for acceptance in such well-regarded journal outlets as the Journal of Operations Management, Production and Operations Management and the Journal of Supply Chain Management. However, a "significant" result simply means that the null hypothesis has been rejected and does not imply that an interesting or meaningful result has been obtained. Likewise, a "nonsignificant" result should not be interpreted as evidence that the null hypothesis is true.

Bonett and Wright (2007) classify tests of hypotheses into informative and noninformative hypotheses. Informative hypotheses provide useful information but noninformative hypotheses do not. To highlight this important distinction, we provide several examples of both informative and noninformative hypotheses. For instance, three examples of informative hypotheses are listed below where [theta] denotes a population parameter (or some function of two or more population parameters) and h is some specified value. The first example of an informative hypothesis is a directional one-sided hypothesis,

[H.sub.0]: [theta] [less than or equal to] h

[H.sub.1]: [theta] > h

the second example is a directional two-sided hypothesis,

[H.sub.0]: [theta] = h

[H.sub.1]: [theta] > h

[H.sub.2]: [theta] < h

and the last example is a finite-interval hypothesis,

[H.sub.0]: |[theta]| [less than or equal to] h

[H.sub.1]: |[theta]| > h

In each of these examples, a statistical hypothesis testing procedure will provide useful information. For instance, in a directional two-sided hypothesis, if the null hypothesis is rejected the researcher will conclude that [theta] > h or will conclude that [theta] < h. This is an informative test because the researcher would typically not know with certainty, before conducting the test, if [theta] > h is true or if [theta] < h is true.

Although informative tests of hypotheses are useful, especially in applied decision-making settings, informative tests of the type given above can be obtained from confidence intervals. For instance, if a one-sided lower limit for [theta] is greater than h, then [H.sub.1]: [theta] > h may be accepted. Or if a two-sided confidence interval for [theta] has a lower limit that is greater than h, then [H.sub.1]: [theta] > h may be accepted, and if the two-sided upper limit is less than h, then [H.sub.2]: [theta] < h may be accepted. Finally, using a two-sided confidence interval to test a finite-interval hypothesis, if the two-sided confidence interval for [theta] has a lower limit that is greater than -h and an upper limit that is less than h, then [H.sub.0]: |[theta]| [less than or equal to] h may be accepted.
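These decision rules are simple enough to state as code. The Python sketch below is illustrative only: the function names and the "inconclusive" label are our own additions, and the rule for accepting |theta| > h (the whole interval lies above h or below -h) is a natural extension that the rules above do not spell out.

```python
def directional_decision(lower, upper, h):
    """Three-way decision for H0: theta = h from a two-sided interval."""
    if lower > h:
        return "accept H1: theta > h"
    if upper < h:
        return "accept H2: theta < h"
    return "inconclusive"  # the interval contains h

def finite_interval_decision(lower, upper, h):
    """Decision for H0: |theta| <= h versus H1: |theta| > h."""
    if lower > -h and upper < h:
        return "accept H0: |theta| <= h"
    if lower > h or upper < -h:
        # Assumed extension: the whole interval lies outside [-h, h].
        return "accept H1: |theta| > h"
    return "inconclusive"

# The slope interval (2.1, 4.5) from the earlier example, tested against h = 0.
print(directional_decision(2.1, 4.5, h=0))
# A hypothetical interval tested against a tolerance of h = 0.2.
print(finite_interval_decision(-0.10, 0.15, h=0.2))
```

In each case the decision is a by-product of the interval, so reporting the interval loses nothing relative to reporting the test.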

The confidence interval approach is preferred to informative hypothesis testing because confidence intervals provide more information than the hypothesis tests. Specifically, confidence intervals provide information about the magnitude of the population parameter so that instead of simply concluding, for example, that [theta] > h, the confidence interval would provide a range of plausible values of [theta].

As explained by Bonett and Wright (2007), not all hypothesis tests are informative. In fact, a number of frequently used hypothesis tests are noninformative. For instance, F-tests in ANOVA, ANCOVA and regression models are examples of noninformative tests. The basic type of a noninformative hypothesis test is a test of null hypotheses such as

[H.sub.0]: [[theta].sub.1] = [[theta].sub.2] = ... = [[theta].sub.k]

or

[H.sub.0]: [[theta].sub.1] = [[theta].sub.2] = ... = [[theta].sub.k] = 0.

For [H.sub.0]: [[theta].sub.1] = [[theta].sub.2] = ... = [[theta].sub.k], the alternative hypothesis states that there is at least one pair of parameter values that are not identical. For [H.sub.0]: [[theta].sub.1] = [[theta].sub.2] = ... = [[theta].sub.k] = 0, the alternative hypothesis states that there is at least one parameter that is not exactly equal to zero. Statistical tests of this type of hypothesis are noninformative because we know in advance, with near certainty, that the null hypothesis is false and the alternative hypothesis is true. Tests of noninformative hypotheses are common in supply chain management research. Unfortunately, these methods are also overemphasized in many business statistics textbooks. As a consequence, we advise supply chain researchers to pay very careful attention to the admonition given by Casella and Berger (2002, p. 525) that "The classic ANOVA test is a test of the null hypothesis [H.sub.0]: [[theta].sub.1] = [[theta].sub.2] = ... = [[theta].sub.k], a hypothesis that, in many cases, is silly, uninteresting, and not true." The classic ANOVA test is but one example of the noninformative tests described by Bonett and Wright (2007).

The F-test reported in a multiple regression analysis with k predictor variables is another widely used noninformative test. This F-test is a test of the null hypothesis

[H.sub.0]: [[beta].sub.1] = [[beta].sub.2] = ... = [[beta].sub.k] = 0,

which is noninformative because we know in advance (with near certainty in any real application) that none of the [[beta].sub.i] values are exactly equal to zero and therefore we know that [H.sub.0] must be false. Another common example of a noninformative test is the [[chi].sup.2] goodness-of-fit test with k degrees of freedom routinely reported in structural equation models as a test of [H.sub.0]: [[theta].sub.1] = [[theta].sub.2] = ... = [[theta].sub.k] = 0, where [[theta].sub.i] is a model parameter. We know with near certainty that none of the [[theta].sub.i] values are exactly equal to zero and thus a statistical test that rejects [H.sub.0] does not tell us anything we did not already know. In general, F-tests with numerator degrees of freedom > 1 and [[chi].sup.2] tests with degrees of freedom > 1 are used to test noninformative hypotheses.

As in the other business disciplines, supply chain management researchers often compound the problem by misinterpreting the results of noninformative tests. For instance, when the statistical test lacks power to reject [H.sub.0], the researcher often interprets the result as evidence that the null hypothesis is true. Alternatively, when the statistical test rejects [H.sub.0], the researcher will claim that a "significant" result has been obtained with an added implication that the findings have important scientific or practical value. It is the widespread misinterpretation of hypothesis testing results, coupled with the noninformative nature of certain popular statistical tests, that has led to recommendations to ban the reporting of statistical tests in scientific journals (Gardner and Altman 1986; Hunter 1997).

In response to the growing concerns about the inappropriate use of hypothesis testing in the social sciences, the American Psychological Association commissioned a group of leading statisticians and researchers to examine the issue and make recommendations. One of their key recommendations is that "Interval estimates should be given for any effect size involving principal outcomes" (Wilkinson and Task Force on Statistical Inference 1999, p. 599). This is not a new recommendation. As noted almost 50 years ago by one of the world's greatest statisticians, John Tukey (1960, p. 429), "Probably the greatest ultimate importance, among all types of statistical procedures we now know, belongs to the confidence procedure which, by making interval estimates, attempt to reach as strong conclusions as are reasonable." The hypothetical examples given in the following section illustrate how confidence intervals may be used to provide more useful information than hypothesis test results.

EXAMPLES

Example 1

A random sample of 275 Certified Purchasing Managers (CPMs) was obtained from a study population of 11,468 CPMs who are members of a particular professional group. Each of the 275 CPMs was contacted and asked to report the percent change in number of primary suppliers used by their firm over the last 2 years and to also report the quality of the relationship with the suppliers on a 1-10 scale with larger numbers indicating a higher quality relationship. The typical method of analyzing the data from this study would involve a test of the null hypothesis of a zero correlation between percent change in the number of suppliers and the quality of the relationship with the suppliers. If the null hypothesis of a zero correlation can be rejected, the hypothesis testing results might be reported as: "A highly significant negative correlation between percent change in number of suppliers and quality of relationship with suppliers was found, t(273) = -5.003, p < .001." This hypothesis testing result simply indicates that the population correlation is less than zero.

An alternative analysis could report the following confidence interval result: "A 95% confidence interval for the correlation between reported percent change in number of suppliers and supplier relationship in the study population of 11,468 managers ranges from -0.395 to -0.178." This confidence interval result provides all the information needed to reject the null hypothesis but also provides additional information regarding the magnitude of the unknown population correlation. Specifically, the confidence interval suggests that the absolute value of the population correlation between percent change in number of suppliers and supplier relationship could be as small as 0.178 or as large as 0.395.
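An interval of this kind can be obtained with the Fisher z-transformation, sketched below in Python. The example does not state which method produced the reported limits, so the choice of Fisher's method is an assumption, and the function names are hypothetical. The sample correlation is not reported either; it is recovered here from the reported t statistic through the identity r = t / sqrt(t^2 + df).

```python
import math
from statistics import NormalDist

def correlation_from_t(t, df):
    """Recover the sample correlation from its t statistic."""
    return t / math.sqrt(t * t + df)

def fisher_ci(r, n, confidence=0.95):
    """Fisher z-transformation interval for a Pearson correlation."""
    z = math.atanh(r)              # transform r to the z scale
    se = 1 / math.sqrt(n - 3)      # approximate standard error of z
    crit = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    # Back-transform the limits to the correlation scale.
    return math.tanh(z - crit * se), math.tanh(z + crit * se)

r = correlation_from_t(-5.003, 273)
lower, upper = fisher_ci(r, n=275)
print(f"r = {r:.3f}, 95% CI = ({lower:.3f}, {upper:.3f})")
```

With the reported t(273) = -5.003, the sketch gives a sample correlation of about -0.29 and limits that agree closely with the -0.395 and -0.178 quoted in the example.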

Example 2

A random sample of 250 suppliers of stamped metal products was obtained from a study population of about 6,400 suppliers. The CEOs of the 250 sample firms were contacted and asked to answer a few questions as part of a university-sponsored research project in which all information would be treated confidentially. The CEOs were asked to name the firm's largest customer and rate on a 1-7 scale the extent of use of concurrent engineering ([x.sub.1]) and computer aided design ([x.sub.2]) in their firms. Each firm's largest customer was contacted and asked to rate its satisfaction (y) with the supplier on a 1-7 scale. Hypothesis testing results might be reported as: "A multiple linear regression analysis was conducted with satisfaction as the outcome variable and extent of concurrent engineering and extent of computer aided design as the two predictor variables. The overall test for the predictive ability of the two explanatory variables was highly statistically significant, F(2, 247) = 4.48, p = .012." This is a noninformative test because we know in advance that the alternative hypothesis, which states that at least one of the two explanatory variables has a nonzero regression coefficient in the population, is almost certainly true.

A confidence interval for the population squared multiple correlation might be more useful and could be reported as: "We are 95% confident that differences in extent of concurrent engineering and computer-aided design are associated with 0.2% to 8.3% of the variance of customer satisfaction in this population of 6,400 stamped-metal suppliers." The confidence interval results show that the explanatory variables are at best weakly associated with customer satisfaction, even though the hypothesis testing results were "significant."
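The sample squared multiple correlation behind the reported F statistic can be recovered from the standard identity F = (R^2 / k) / ((1 - R^2) / (n - k - 1)). The short Python sketch below (the function name is hypothetical) solves that identity for R^2 and shows how modest the point estimate is despite the "significant" test.

```python
def r_squared_from_f(f_stat, df_model, df_error):
    """Sample R^2 implied by the overall regression F statistic."""
    return (f_stat * df_model) / (f_stat * df_model + df_error)

# Values reported in the example: F(2, 247) = 4.48.
r2 = r_squared_from_f(4.48, df_model=2, df_error=247)
print(f"R^2 = {r2:.3f}")
```

The implied estimate is about 3.5 percent of the variance, which lies inside the reported interval of 0.2% to 8.3%.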

Confidence intervals for the unstandardized regression coefficients also provide useful information. For instance, the effect of concurrent engineering could be reported as: "The 95% confidence interval for the concurrent engineering regression coefficient in the study population of 6,400 suppliers ranges from 0.08 to 0.36. This result suggests that any 1-point increase in the concurrent engineering rating is associated with a 0.08 to 0.36 increase in customer satisfaction scores." The test that the population regression coefficient equals zero would be "significant" at the .05 level but the confidence interval for the population regression coefficient suggests that the effect of concurrent engineering on customer satisfaction is very weak.

Example 3

A random sample of 100 firms was selected from a study population of 1,200 component-assembly firms. The 100 sample firms were classified into one of the four stages of supply chain integration proposed by Stevens (1989) and the 2-year percent change in sales was determined for each of the 100 firms. A typical statistical analysis for a study such as this would use a one-way ANOVA and these results might be reported as "The one-way ANOVA revealed highly significant differences in 2-year percent change in sales across the four stages of supply chain integration, F(3, 96) = 54.5, p < .001." This result is noninformative because it simply indicates that the population means are not identical across the four types of firms. The F-test does not indicate how the population means are ordered nor does it provide any information about the magnitudes of the differences among the population means. A more informative analysis would report simultaneous confidence intervals for all six pairwise comparisons as shown in Table 1.

TABLE 1

Simultaneous Confidence Intervals for Six Pairwise Comparisons

| Comparison | 95% lower limit | 95% upper limit |
|---|---|---|
| Stage 2 - Stage 1 | -1.32 | 1.26 |
| Stage 3 - Stage 1 | 2.49 | 4.11 |
| Stage 4 - Stage 1 | 4.59 | 6.27 |
| Stage 3 - Stage 2 | 1.95 | 3.84 |
| Stage 4 - Stage 2 | 4.06 | 5.94 |
| Stage 4 - Stage 3 | 2.20 | 3.99 |

Suppose that a difference in population means < 1.5 is considered to be small and unimportant in the context of this study. Note that the confidence interval includes zero for the Stage 2 - Stage 1 comparison and that the population mean for Stage 2 firms is at most 1.26 greater than the population mean for Stage 1 firms. This result suggests that the advantage of moving from Stage 1 to Stage 2 may be small and unimportant. The confidence intervals for the other comparisons indicate that Stage 4 firms in the study population have greater mean sales growth than Stage 3 firms, Stage 3 firms have greater mean sales growth than Stage 2 firms, and that these differences are not trivial because the lower limits are all > 1.5.
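The reasoning applied to Table 1 can be written as a small classification rule. In the Python sketch below the limits are copied from Table 1 and the 1.5 threshold is the one supposed above. The function name and the category labels are our own, and the "inconclusive" category (an interval that straddles the threshold) is an added assumption.

```python
# Simultaneous 95% limits copied from Table 1.
intervals = {
    "Stage 2 - Stage 1": (-1.32, 1.26),
    "Stage 3 - Stage 1": (2.49, 4.11),
    "Stage 4 - Stage 1": (4.59, 6.27),
    "Stage 3 - Stage 2": (1.95, 3.84),
    "Stage 4 - Stage 2": (4.06, 5.94),
    "Stage 4 - Stage 3": (2.20, 3.99),
}
THRESHOLD = 1.5  # differences smaller than this are treated as unimportant

def classify(lower, upper, threshold):
    """Label a mean difference using its confidence limits."""
    if lower > threshold:
        return "meaningfully positive"
    if upper < -threshold:
        return "meaningfully negative"
    if lower > -threshold and upper < threshold:
        return "small and unimportant"
    return "inconclusive"  # the interval straddles the threshold

for comparison, (lower, upper) in intervals.items():
    print(f"{comparison}: {classify(lower, upper, THRESHOLD)}")
```

Only the Stage 2 - Stage 1 comparison is classified as small and unimportant. The other five intervals lie entirely above 1.5.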

Example 4

A study population of about 4,500 small manufacturing firms was identified and a random sample of 150 firms was selected from this study population. Each of the 150 firms was contacted and asked about their use of the just-in-time (JIT) inventory strategy in 2006 and in 2007. Although JIT can lower a firm's storage and carrying costs, the researchers wanted to know if the sharp increase in gasoline prices from 2006 to 2007 affected the use of JIT during this period. Of the 150 firms, 36 reported that they used JIT in 2006, 20 reported that they used JIT in 2007 and 16 reported that they used JIT in both 2006 and 2007. The McNemar test is the appropriate statistical test for this repeated measures design and the results might be reported as: "The McNemar test was computed and a significant decrease from 2006 to 2007 in the population proportion of firms using JIT was found (z = -3.27, p < .001)." An alternative analysis would report a 95 percent confidence interval for the difference in population proportions. The results might be stated as: "We are 95% confident that the proportion of firms using JIT in 2007 is 0.04-0.17 less than the proportion of firms using JIT in 2006 in our study population of 4,500 small manufacturing firms."
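The McNemar statistic and an interval for the difference in paired proportions both follow from the two discordant counts, as the Python sketch below shows. The variable names are hypothetical, and the Wald interval used here is an assumption: the example does not name its interval method, and other methods give slightly different limits.

```python
import math
from statistics import NormalDist

n = 150
used_2006, used_2007, used_both = 36, 20, 16

# Discordant counts: firms that used JIT in exactly one of the two years.
only_2006 = used_2006 - used_both  # dropped JIT by 2007
only_2007 = used_2007 - used_both  # adopted JIT in 2007

# McNemar z statistic (2007 minus 2006, so a decrease is negative).
z = (only_2007 - only_2006) / math.sqrt(only_2007 + only_2006)

# Wald interval for the difference in paired proportions (assumed method).
diff = (only_2007 - only_2006) / n
se = math.sqrt((only_2006 + only_2007) - (only_2007 - only_2006) ** 2 / n) / n
crit = NormalDist().inv_cdf(0.975)
lower, upper = diff - crit * se, diff + crit * se

print(f"z = {z:.2f}")
print(f"difference = {diff:.3f}, 95% CI = ({lower:.3f}, {upper:.3f})")
```

The z statistic reproduces the reported -3.27. The interval adds what the test cannot: an estimate of how large the decline in JIT use plausibly is.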

Example 5

Covariance structure models (also known as causal models or structural equation models) are being used with increasing frequency in supply chain management research. As noted in Bonett and Wright (2007), a particular causal hypothesis will imply that some of the model parameters should be very small while other model parameters should be meaningfully large. Suppose that a particular causal hypothesis implies that variable A (e.g., amount of communication between the firm and supplier) causes variable B and variable C (e.g., length of relationship between the firm and supplier), with both variables B and C causing variable D (e.g., quality performance of firm). This hypothesis implies that there will be meaningfully large path coefficients from variables A to B, A to C, B to D and C to D. The causal hypothesis also implies that the path coefficients from variables A to D and variables B to C should be small and unimportant. In current practice, the researcher fits a model that includes the path coefficients from variables A to B, A to C, B to D and C to D. The path coefficients from variables A to D and B to C are constrained to equal zero. The [[chi].sup.2] goodness-of-fit test in this example has 2 degrees of freedom. A typical, but inappropriate, interpretation of a [[chi].sup.2] goodness-of-fit test for this model might be stated as "The chi-square goodness-of-fit test, [[chi].sup.2](2) = 4.55, p > .10, was nonsignificant and thus provides support for our hypothesized causal model."

This data analysis approach has two serious limitations. First, the [[chi].sup.2] test is actually a test that the population path coefficients from variables A to D and B to C are exactly zero. A "nonsignificant" [[chi].sup.2] test does not imply that these two population path coefficients are zero. These coefficients are almost certainly nonzero in the population and the [[chi].sup.2] test will become "significant" if the sample size is large enough even if the path coefficients are small and unimportant. Second, a nonsignificant [[chi].sup.2] test (which researchers interpret as a "good fit") does not provide information about the magnitude of the population path coefficients from variables A to B, A to C, B to D and C to D. Even when the [[chi].sup.2] test suggests a "good fit," these four path coefficients could be small and unimportant. The goodness-of-fit indices popularized by Bentler and Bonett (1980) do not provide the type of information needed to assess the size and importance of the path coefficients.
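The dependence of the goodness-of-fit test on sample size can be illustrated numerically. For 2 degrees of freedom the upper-tail chi-square probability has the closed form exp(-x/2), so no statistical library is needed. The Python sketch below assumes, as is standard for maximum likelihood estimation, that the test statistic is roughly (N - 1) times the minimized fit function, so that holding the degree of misfit fixed while multiplying the sample size multiplies the statistic by about the same factor. The function name and the multipliers are hypothetical.

```python
import math

def chi_square_p_value_2df(statistic):
    """Upper-tail p-value of a chi-square statistic with 2 degrees of freedom."""
    return math.exp(-statistic / 2)

reported = 4.55  # statistic from the example, 2 degrees of freedom
p_reported = chi_square_p_value_2df(reported)
print(f"statistic = {reported:.2f}, p = {p_reported:.3f}")

# Assumption: statistic is approximately (N - 1) * F_min, so the same misfit
# in a sample about `factor` times larger scales the statistic by `factor`.
for factor in (2, 4):
    scaled = reported * factor
    print(f"sample {factor}x larger: statistic = {scaled:.2f}, "
          f"p = {chi_square_p_value_2df(scaled):.4f}")
```

Under this assumption the same misfit that is "nonsignificant" at the original sample size falls below the .05 level once the sample is about twice as large, although nothing about the path coefficients has changed.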

Bonett and Wright (2007) suggest that a better way to assess the hypothesized model involves the computation of simultaneous Bonferroni confidence intervals for all six path coefficients. If the simultaneous confidence intervals suggest that the two path coefficients that should be small are small and the four path coefficients that should be large are large, then this would provide strong support for the hypothesized model. It may be difficult to specify "small" and "large" values of the path coefficients because their values depend on the variance of the predictor variables. However, standardized path coefficients less than about 0.2 could be considered small and standardized path coefficients greater than about 0.5 could be considered large in most applications. In practice, a causal model will often be partially supported because not all path coefficients that should be large will be large and not all path coefficients that should be small will be small. This approach to causal model assessment is consistent with the notion of exposing theories to "risky tests" described by Edwards (2008).
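A Bonferroni assessment of the six path coefficients might be sketched as follows in Python. The path estimates and standard errors are hypothetical values invented for illustration, the function names are our own, and large-sample normal critical values are assumed. The cutoffs of 0.2 and 0.5 are the rough guidelines for standardized path coefficients given above.

```python
from statistics import NormalDist

SMALL, LARGE = 0.2, 0.5  # rough cutoffs for standardized path coefficients

def bonferroni_critical_value(k, family_confidence=0.95):
    """Normal critical value giving simultaneous coverage over k intervals."""
    alpha = 1 - family_confidence
    return NormalDist().inv_cdf(1 - alpha / (2 * k))

def assess_paths(estimates, std_errors, family_confidence=0.95):
    """Return (lower, upper, verdict) for each path coefficient."""
    crit = bonferroni_critical_value(len(estimates), family_confidence)
    results = {}
    for path, est in estimates.items():
        lower = est - crit * std_errors[path]
        upper = est + crit * std_errors[path]
        if lower > -SMALL and upper < SMALL:
            verdict = "small"
        elif lower > LARGE or upper < -LARGE:
            verdict = "large"
        else:
            verdict = "inconclusive"
        results[path] = (lower, upper, verdict)
    return results

# Hypothetical standardized estimates and standard errors (illustration only).
estimates = {"A->B": 0.68, "A->C": 0.64, "B->D": 0.70,
             "C->D": 0.45, "A->D": 0.05, "B->C": 0.06}
std_errors = {path: 0.04 for path in estimates}

crit = bonferroni_critical_value(len(estimates))
results = assess_paths(estimates, std_errors)
print(f"Bonferroni critical value for 6 intervals: {crit:.3f}")
for path, (lower, upper, verdict) in results.items():
    print(f"{path}: ({lower:.3f}, {upper:.3f}) {verdict}")
```

In this hypothetical output the two constrained paths are small and three of the four hypothesized paths are large, but the C to D interval straddles 0.5. This is the kind of partial support described above.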

RECOMMENDATIONS

The following recommendations address some common weaknesses in articles that frequently appear in prestigious business journals, including those in production, operations management and supply chain journal outlets. If supply chain, production and operations researchers follow these basic recommendations, the quality of their research will improve. More specifically, and consistent with Carter, Sanders and Dong (2008), our recommendations are designed to help provide the necessary "tipping point" for the results of studies reported in these outlets to have a greater scientific and practical impact.

Recommendation 1

Inferential statistical methods allow the researcher to use information in a random sample to make certain types of statements about the study population from which the random sample was taken. It is important then to clearly define the study population so that there is no confusion about the study population for which the statistical results apply. The study population will often be a subset of a larger population, called the target population, which is the population of theoretical interest. In these cases, authors will want to generalize the results from the study population to the target population using logical arguments. Unless the study population is clearly defined, it will be difficult for the authors to argue convincingly that the results of the study population can be extended to some target population.

Recommendation 2

Clearly describe the sampling procedure and explain how the sample was obtained. If the sample is a convenience sample and not a true random sample, make this fact clear to the reader. Random samples can be difficult to obtain and many supply chain researchers avoid this difficulty by using convenience samples and then tacitly assume that the convenience sample is a random sample of some "hypothetical" target population to which the inferential statistical results can be generalized. When convenience samples are used, it is the responsibility of the researcher to provide the reader with information that would support the claim that inferential statistical results will generalize to the hypothetical target population. These arguments can be extremely difficult to make and supply chain researchers will find that it may be easier to take random samples where the results of the statistical inferential methods (assuming assumptions of the methods are satisfied) may be generalized to the study population without explanation.

Recommendation 3

Always report and interpret a confidence interval for all important effect-size parameters and not just the point estimates of these parameters. Bonett and Wright (2007) explain why the current practice of reporting only point estimates of effect-size measures can be very misleading.

When studying the relation between two variables, if the variables are measured in a metric that is well understood (time, dollars, distance, etc.), report a confidence interval for an unstandardized measure of the effect size (e.g., difference in means or unstandardized regression coefficient). If the metric of the variable is not well understood (e.g., scores from author-constructed questionnaires), a confidence interval for a standardized measure of effect size (e.g., standardized difference in means, correlation coefficient, standardized regression coefficient) should be reported along with or in place of the unstandardized measure. General methods for computing confidence intervals for standardized differences in means are described by Bonett (2008).

Recommendation 4

Clearly define all variables used in the statistical analysis. It is helpful to report large-sample estimates of means and standard deviations for every quantitative variable so that the reader will be better able to interpret the population parameters that are described by the confidence intervals (it may be possible to obtain these large-sample estimates). For instance, in a linear regression analysis the reader will not be able to meaningfully interpret a confidence interval for a population slope coefficient in the absence of accurate estimates of the population standard deviations of the variables.

If a questionnaire is used to measure some attribute, report the psychometric properties of the questionnaire. Measures of test-retest reliability are often more informative than the more commonly reported measures of internal consistency such as Cronbach's alpha. Furthermore, researchers should not assume that the name of the questionnaire (e.g., "Customer Satisfaction Survey") accurately reflects the attribute measured by the questionnaire. Ideally, several forms of validity evidence such as convergent validity, discriminant validity and predictive validity (Kerlinger and Lee 2000) should be reported to support the claim that the questionnaire is measuring what it claims to measure.

It is now common practice in supply chain management research to report point estimates of reliability and validity coefficients. It is important to also report confidence intervals for the study population values of the reliability and validity coefficients. When researchers develop new scales, they often perform only the most rudimentary psychometric analyses and do not provide adequate evidence of what Messick (1995) refers to as consequential validity. Evidence of consequential validity provides information to help understand the metric of the response variable. In the absence of adequate consequential validity, standardized measures of effect size are usually preferred to unstandardized measures.
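When a reliability or validity coefficient is a Pearson correlation (a test-retest reliability coefficient, for example), one widely used way to obtain a confidence interval is the Fisher z transformation. The sketch below assumes approximate bivariate normality and a simple random sample, and it does not apply to coefficients such as Cronbach's alpha, which require other methods. The function name and the example values are hypothetical.

```python
from math import atanh, sqrt, tanh
from statistics import NormalDist


def correlation_ci(r, n, level=0.95):
    """Fisher z confidence interval for a population Pearson correlation."""
    if not -1 < r < 1:
        raise ValueError("r must lie strictly between -1 and 1")
    if n <= 3:
        raise ValueError("n must exceed 3")
    z_crit = NormalDist().inv_cdf(0.5 + level / 2)
    center = atanh(r)                 # Fisher transformation of r
    half_width = z_crit / sqrt(n - 3)  # approximate standard error is 1/sqrt(n - 3)
    # Transform the endpoints back to the correlation scale.
    return tanh(center - half_width), tanh(center + half_width)


# Hypothetical test-retest correlation of 0.80 from 50 respondents.
lower, upper = correlation_ci(0.80, 50)
print(f"95% CI for test-retest reliability: {lower:.2f} to {upper:.2f}")
```

The interval is not symmetric around the sample correlation, and with a sample of this size it is wide enough that a point estimate of 0.80 reported alone would overstate how precisely the reliability is known.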

Recommendation 5

Do not use sophisticated statistical methods when simpler and easier-to-understand methods yield the same basic conclusions. When using a new "state-of-the-art" statistical method, be sure to clearly describe the assumptions of the method and the effects of violating the assumptions. Most sophisticated statistical methods are based on assumptions that are difficult to justify and do not perform properly when their assumptions have been violated. The problem with assumption violations is exacerbated by the fact that standard diagnostic methods often are unable to detect the types of assumption violations that would have serious negative effects on the performance of the statistical procedure.

CONCLUSION

Consistent with Bonett and Wright's (2007) findings for general management journals, only a small percentage of empirical articles in leading supply chain and operations journals currently report confidence intervals. Over 20 years ago, Nelder (1985, p. 238) noted that "The grotesque emphasis on significance tests in statistics courses of all kinds...is taught to people, who if they come away with no other notion, will remember that statistics is tests about significant differences". This "grotesque emphasis" still exists today. Meaningful change in the reporting of statistical results will not occur until journal gatekeepers provide more tangible guidelines and support for authors seeking to implement this change. One suggestion is for journals to use "moral suasion" tactics and actively encourage the authors of all accepted manuscripts to report confidence intervals whenever appropriate.

We propose that leading production, supply chain and operations management journals, such as the Journal of Supply Chain Management, actively recruit editorial board members who are able to work with authors of conditionally accepted papers to ensure that the appropriate confidence intervals have been reported and correctly interpreted. Furthermore, and consistent with Bonett and Wright's (2007) recommendations for general management research, we recommend that the publication criteria focus not on the size of the p-value, but on more important criteria such as the quality of the sampling design, the quality of the research design, the reliability and validity of the variables under investigation and the confidence interval widths. It is our hope that these suggestions will stimulate further dialogue and the acceptance of our recommendations regarding the use of hypothesis tests and confidence interval methods in supply chain management journals.

REFERENCES

Bentler, P.M. and D.G. Bonett. "Significance Tests and Goodness of Fit in the Analysis of Covariance Structures," Psychological Bulletin, (88), 1980, pp. 588-606.

Bonett, D.G. "Confidence Intervals for Standardized Linear Contrasts of Means," Psychological Methods, (13), 2008, pp. 99-109.

Bonett, D.G. and T.A. Wright. "Comments and Recommendations Regarding the Hypothesis Testing Controversy," Journal of Organizational Behavior, (28), 2007, pp. 647-659.

Boring, E.G. "Mathematical Versus Statistical Importance," Psychological Bulletin, (16), 1919, pp. 335-338.

Carter, C.R., N.R. Sanders and Y. Dong. "Paradigms, Revolutions, and Tipping Points: The Need for Using Multiple Methodologies within the Field of Supply Chain Management," Journal of Operations Management, (26:6), 2008, pp. 693-696.

Casella, G. and R.L. Berger. Statistical Inference, 2nd ed., Duxbury, Pacific Grove, CA, 2002.

Edwards, J.R. "To Prosper, Organizational Psychology Should...Overcome Methodological Barriers to Progress," Journal of Organizational Behavior, (29), 2008, pp. 469-491.

Gardner, M.J. and D.G. Altman. "Confidence Intervals Rather Than P Values: Estimation Rather Than Hypothesis Testing," British Medical Journal, (292), 1986, pp. 746-750.

Graybill, F.A. and H.K. Iyer. Regression Analysis: Concepts and Applications, Duxbury, Belmont, CA, 1994.

Harlow, L.L., S.A. Mulaik and J.H. Steiger (Eds.). What If There Were No Significance Tests?, Erlbaum, Mahwah, NJ, 1997.

Hunter, J.E. "Needed: A Ban on the Significance Test," Psychological Science, (8), 1997, pp. 3-7.

Kaiser, H.F. "Directional Statistical Decisions," Psychological Review, (67), 1960, pp. 160-167.

Kerlinger, F.N. and H.B. Lee. Foundations of Behavioral Research, 4th ed., Wadsworth, Belmont, CA, 2000.

Kline, R.B. Beyond Significance Testing: Reforming Data Analysis Methods in Behavioral Research, APA, Washington, DC, 2004.

Messick, S. "Validity of Psychological Assessment: Validation of Inferences from Persons' Responses and Performances as Scientific Inquiry into Score Meaning," American Psychologist, (50), 1995, pp. 741-749.

Morrison, D.E. and R.E. Henkel (Eds.). The Significance Test Controversy, Aldine, Chicago, 1970.

Nelder, J.A. "Discussion on Papers by Chatfield," Journal of the Royal Statistical Society, Series A, (148), 1985, p. 238.

Stevens, G.C. "Integrating the Supply Chain," International Journal of Physical Distribution and Logistics Management, (19:8), 1989, pp. 3-8.

Tukey, J.W. "Conclusions Vs. Decisions," Technometrics, (2), 1960, pp. 423-433.

Wilkinson, L. and the Task Force on Statistical Inferences. "Statistical Methods in Psychology Journals: Guidelines and Explanations," American Psychologist, (54), 1999, pp. 594-604.

Douglas G. Bonett (Ph.D., UCLA) is a professor of statistics and psychology at Iowa State University in Ames, IA. Dr. Bonett's current research interests focus on the development of interval estimation methods and sample size requirement methods.

Thomas A. Wright (Ph.D., University of California, Berkeley) is the Jon Wefald Leadership Chair and a professor of organizational behavior at Kansas State University in Manhattan, KS. In recognition of his career accomplishments, Dr. Wright was awarded Fellow status in the Association for Psychological Science (APS) in 2007. He has published his work in such outlets as the Academy of Management Review, Journal of Applied Psychology, Psychometrika, Academy of Management Executive, Organizational Dynamics, Journal of Organizational Behavior, Journal of Occupational Health Psychology, Journal of Management and the Journal of Management Inquiry.

* Like all invited papers and invited notes, the original version of this manuscript underwent a double-blind review process.

DOUGLAS G. BONETT

Iowa State University

THOMAS A. WRIGHT

Kansas State University

Author: Bonett, Douglas G.; Wright, Thomas A.

Publication: Journal of Supply Chain Management

Date: Jan 1, 2009