# Using Excel 2013 for Regression-Based Cost Estimation: Part 1

Cost estimation is one of the most fundamental tools in the management accountant's toolkit. It is used for establishing predetermined overhead application rates for product costing purposes, preparing budgets (both financial and operating), decision making (such as projecting costs associated with various decision alternatives), and contract bidding. Mechanisms for generating cost estimates range from the relatively simple--like professional judgment and intuition or the use of a scattergraph--to the sophisticated, such as the use of regression analysis.

Microsoft Excel is a flexible and powerful tool that can be used to help generate cost estimates for managerial planning and decision making. For example, Excel can be used to fit alternative regression models to a set of past observations of a cost and one or more cost drivers (i.e., independent or explanatory variables). Yet the use and interpretation of output from regression models can be complex and intimidating to some.

This article is the first in a two-part series that aims to help demystify the cost-estimation process using Excel in a managerial accounting setting. Part 1 illustrates the estimation of a simple (i.e., one-variable) linear cost function using two different methods in Excel: (1) a chart and its related functionality, and (2) the Regression analysis tool in the Analysis ToolPak. (1) Part 2 covers the estimation of learning curve models (one form of a nonlinear model) and the use of linear regression models that include more than a single independent variable.

This series should be of interest to managers who require decision-useful information, management accountants charged with the responsibility of supplying that information, and accounting students who have not had prior exposure to the use of Excel for estimating cost functions or who found that exposure (e.g., in a statistics class) less than satisfying.

To illustrate the process, we will use a hypothetical example of a package-delivery service located in the Charlotte, N.C., metropolitan area. (2) With a fleet of four small vans and six pickup trucks, the company spends a considerable amount of money on vehicle upkeep. Figure 1 shows a screenshot of the monthly data from the past year in Excel, including the total vehicle expenses vs. number of deliveries.

It is possible to fit a one-variable linear regression model to the data in Figure 1 using Excel. This model could then be used to estimate monthly vehicle upkeep costs for the coming year. The functional form of the model to be estimated is: Y = a + bX, where Y is the monthly vehicle upkeep cost (the dependent variable), X is the number of deliveries per month (the independent variable), a is the fixed cost component of total monthly cost, and b is the variable cost rate (the slope of the cost function or, equivalently, the rate of change in Y per unit change in X).

GRAPHING THE DATA

The first step in fitting a regression equation to the data set in Figure 1 is to plot the data in a scattergraph in Excel (see Figure 2). (3) This initial step serves three important purposes:

1. The chart provides a visual means for assessing whether a linear relationship between the two variables (X and Y) is plausible.

2. It provides a visual means for identifying seasonal patterns, which is an issue when time-series data is used to estimate the cost function.

3. The chart allows the user to isolate any possible atypical or abnormal observations. This ability is critical because regression estimates of the cost coefficients (a and b) are sensitive to the existence of outlier observations, as we will see.

Looking at Figure 2, the data points appear to fall roughly along a straight line (which suggests that a linear cost function is plausible), and there are no apparent outliers in the data set. With that established, the next step is to begin the cost estimation. Excel provides several options for fitting a linear regression model to a set of observations. We will demonstrate two of these: (1) use of options under Excel's chart functionality, and (2) use of the built-in Regression analysis tool.

COST ESTIMATION USING AN EXCEL CHART

After generating the scattergraph in Figure 2, right-click anywhere on the set of data points and select Add Trendline from the shortcut menu. (4) (Alternatively, click on the chart, and a green plus sign will appear in the upper-right-hand corner of the chart. This will allow you to add a trendline along with other components such as axis titles.) Excel will add the trendline to the chart and open the Format Trendline dialog. Make sure that the type of trendline is set to "Linear" and that the boxes for "Display Equation on chart" and "Display R-squared value on chart" are both checked.

The estimated cost function for the given data set is Y = -\$22,297 + \$29.351X. (The equation Excel displays on the chart might appear as y = 29.351x - 22297, but for illustration purposes and consistency, we are following the Y = a + bX format used previously.) Thus, the variable cost rate (b) is approximately \$29 per delivery. The negative estimate for the constant term (a) reflects the fact that the relevant range of the data set (approximately 5,000 to 7,500 deliveries per month) is far from the zero point for X. Thus, in this case, the coefficient a cannot properly be interpreted as the fixed cost per month. Put another way, the estimated cost equation is useful only within the relevant range and therefore may not be reliable for estimating costs (Y) for values of X outside this range.

The calculated R-squared value for this function is 89.7%. (5) R-squared is a relative measure of "goodness-of-fit." It tells us the proportion of the variability in the dependent variable (Y) that is explained by varying the independent variable (X). Alternatively, it tells us something about the strength of the relationship between these two variables: The higher the R-squared value, the stronger the relationship between X and Y and, as such, the greater the "fit" of the regression model to the data set. The minimum value for R-squared is zero, which implies there is no linear relationship between X and Y. The maximum value would be 100%, which implies a situation where all the data points lie directly on the regression line--in other words, a "perfect fit" of the model to the data set. (6)

There are several options for using the regression equation for cost-estimation purposes:

Option #1: If time-series data is used, go to Trendline Options and simply extend the trendline in the chart forward or backward a specified number of periods. Since the data in our example is not in the form of a time series, a different option must be used here.

Option #2: Enter a formula for Y (cost) using the estimated coefficients a (-\$22,297) and b (\$29.351) and an assumed value of X (the cost driver/independent variable) within the relevant range.

Option #3: Use Excel's INDEX function, in combination with LINEST, to enter the estimated coefficients into separate cells. (7) The formula to use for the estimated slope coefficient (b) is =INDEX(LINEST(known_y's, known_x's),1). (8) For our example, it would be =INDEX(LINEST(C2:C13,B2:B13),1), which returns a value of \$29.351. The "1" in the formula directs Excel to return the slope coefficient (b) of the cost function we are estimating. The estimated constant term (a) is found by entering =INDEX(LINEST(C2:C13,B2:B13),2) into a different cell. (9) That returns a value of -\$22,297. The "2" in the function directs Excel to return the estimate of the constant (a) in the linear cost function. Remember, this form of LINEST is valid only when estimating a one-variable linear regression model. In a third cell, enter the formula to estimate Y (cost) for a given value of X (the cost driver), incorporating into the formula the values for a (constant term) and b (variable cost rate) derived from the two cells referenced above.
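The LINEST arithmetic is easy to replicate outside Excel. As a rough Python analogue (the function name `linest` here is our own, not Excel's), the slope and intercept come from the standard least-squares formulas:

```python
def linest(known_ys, known_xs):
    """Return (slope, intercept) of a one-variable least-squares fit,
    mirroring what =INDEX(LINEST(...),1) and =INDEX(LINEST(...),2)
    pull out of Excel's LINEST result."""
    n = len(known_xs)
    mean_x = sum(known_xs) / n
    mean_y = sum(known_ys) / n
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(known_xs, known_ys))
    sxx = sum((x - mean_x) ** 2 for x in known_xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# On data that lie exactly on Y = 5 + 3X, the fit recovers those values:
slope, intercept = linest([5, 8, 11, 14], [0, 1, 2, 3])  # → (3.0, 5.0)
```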

Option #4: Use the FORECAST function to generate cost estimates based on a simple linear regression model. (10) The syntax for this function is =FORECAST(x, known_y's, known_x's). The FORECAST function returns an estimate of Y (cost) for a given value of X (the cost driver). As an example, for X = 6,000 and using the data set in Figure 1, the formula would be =FORECAST(6000,C2:C13,B2:B13). That yields an estimated value (Y) of \$153,811 (when X = 6,000). This is the sum of -\$22,297 and \$176,108 (\$29.351 per delivery x 6,000 deliveries); the slight difference from a hand calculation with the rounded coefficients arises because Excel works with the unrounded coefficient values.
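A Python sketch of what FORECAST does under the hood (fit the line, then evaluate it at the new x) might look like the following. A handy property for checking any such implementation is that the forecast at the mean of the known x's always equals the mean of the known y's:

```python
def forecast(x_new, known_ys, known_xs):
    """Analogue of Excel's =FORECAST(x, known_y's, known_x's):
    fit Y = a + bX by least squares, then return a + b * x_new."""
    n = len(known_xs)
    mean_x = sum(known_xs) / n
    mean_y = sum(known_ys) / n
    b = (sum((x - mean_x) * (y - mean_y)
             for x, y in zip(known_xs, known_ys))
         / sum((x - mean_x) ** 2 for x in known_xs))
    a = mean_y - b * mean_x
    return a + b * x_new

# On data lying exactly on Y = 5 + 3X, forecasting at X = 10 gives 35:
estimate = forecast(10, [5, 8, 11, 14], [0, 1, 2, 3])  # → 35.0
```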

COST ESTIMATION USING EXCEL'S REGRESSION ANALYSIS TOOL

The second possible method for cost estimation is the Regression analysis tool, which is part of the Analysis ToolPak. (11) To access it, go to the Data tab and click on Data Analysis. (12) A window will pop up listing the data analysis tools for Excel. Select Regression from the list, and click OK to open the Regression dialog box. Figure 3 shows the completed dialog box using the data set for our example. (13) As shown in Figure 4, the summary output (pasted to cell A18) consists of the coefficient estimates for the variable cost rate (b) and the constant (a), R-squared for the estimated cost function, and a complete ANOVA (Analysis of Variance) table. (14)

Note that the results are the same as what we found using the chart functionality: The slope coefficient (b), the estimated constant term (a), and the calculated R-squared value are all exactly the same. The difference is that the Regression option provides much more detailed information than was obtained via the chart functionality.

The additional information available in the ANOVA table (starting in row 27 of Figure 4) allows us to see precisely how R-squared is calculated. As noted previously, R-squared is a measure of the strength of the relationship between X and Y. More precisely, it represents the proportion of the variation in Y in the data set that is related to (or explained by) varying X. The following equation relates to the results in the column labeled "SS" (for sums of squares) in the ANOVA table generated for the data set contained in Figure 1:

Σ(Yᵢ − Ȳ)² = Σ(Yᵢ − Ŷᵢ)² + Σ(Ŷᵢ − Ȳ)²

Total Sum of Squares (TSS) = Error Sum of Squares (ESS) + Regression Sum of Squares (RSS) (15)

The term on the left side of the equation, the total sum of squares (TSS), should look familiar: It is basically the numerator in the calculation of the variance in a set of data. It represents a measure of the total variability of the actual Y values around the mean Y value ([bar.Y]). As the equation shows, this total variance can be broken down into a portion that is explained by the regression equation (the regression sum of squares, RSS) and a portion that is unexplained by the regression equation (the error sum of squares, ESS). These variance components are used to calculate R-squared: (16)

R² = RSS/TSS = 4,664,748,529/5,198,842,973 = 89.7%

Alternatively, we can say that:

R² = 1 − (ESS/TSS)

= 1 − (534,094,444/5,198,842,973) = 1 − 10.3% = 89.7%
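The sums-of-squares identity, and the equivalence of the two R-squared formulas, can be verified numerically. A minimal Python sketch using hypothetical data (not the Figure 1 values):

```python
import statistics

# Hypothetical observations (illustrative only)
x = [5, 6, 7, 8, 9, 10, 11, 12]
y = [14.2, 16.1, 17.8, 20.3, 21.9, 24.2, 25.8, 28.1]

x_bar, y_bar = statistics.mean(x), statistics.mean(y)
b = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
     / sum((xi - x_bar) ** 2 for xi in x))
a = y_bar - b * x_bar
y_hat = [a + b * xi for xi in x]

tss = sum((yi - y_bar) ** 2 for yi in y)               # total SS
ess = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))  # error SS
rss = sum((yh - y_bar) ** 2 for yh in y_hat)           # regression SS
# TSS = ESS + RSS, and RSS/TSS = 1 - ESS/TSS
```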

In addition, the results from the Regression routine help illustrate how the "least-squares" criterion works. Basically, the tool chooses the values for the constant term (a) and the slope coefficient (b) so as to minimize ESS. For no other values of these two coefficients will the calculated ESS be smaller. It is in this sense that the ordinary least squares (OLS) procedure provides us with the "line of best fit." In the example, the ESS is minimized (at 534,094,444) when a = -\$22,297 and b = \$29.351 per delivery. For no other combination of a and b will the ESS associated with the data set presented in Figure 1 be smaller.
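The least-squares claim (no other values of a and b produce a smaller ESS) is also easy to demonstrate: perturbing either OLS coefficient in either direction can only increase ESS. A small Python sketch on hypothetical data:

```python
# Hypothetical data (illustrative only)
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

def ess(a, b):
    """Error sum of squares for the candidate line Y = a + bX."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

# Closed-form OLS coefficients
n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
b_ols = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
a_ols = my - b_ols * mx

# Any deviation from (a_ols, b_ols) raises ESS:
assert ess(a_ols + 0.5, b_ols) > ess(a_ols, b_ols)
assert ess(a_ols, b_ols - 0.1) > ess(a_ols, b_ols)
```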

The Standard Error of the Regression

The regression results contain a second goodness-of-fit measure: the standard error of the regression (SE). While R-squared is a relative measure of goodness-of-fit (since it is expressed in percentage terms), SE is an absolute measure whose amount is partly a function of the measuring unit used for the dependent variable.

SE is analogous to the standard deviation. While the standard deviation is a measure of the dispersion of a set of values around the mean of those values, the SE is a measure of the dispersion (or spread) of the actual Y values in a data set around the regression line. SE is calculated by taking the square root of the mean square error term, which is equal to the ESS divided by the degrees of freedom (df). In the example, ESS = 534,094,444 (see Figure 5) and df = n - k - 1 = 10, where k = number of independent variables in the model and n = sample size. Thus, for the present data set, the calculation is:

SE = √(534,094,444/10) = √53,409,444 = 7,308

This is exactly the amount reflected in the regression results reported in Figure 4. Note that the minimum value for SE is zero, which would imply that all of the actual observations ([Y.sub.i]) are on the regression line.
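The SE arithmetic from the example can be checked directly; a short Python sketch using the figures reported above:

```python
import math

ess = 534_094_444   # error sum of squares (Figure 4)
n, k = 12, 1        # 12 monthly observations, one independent variable
df = n - k - 1      # degrees of freedom = 10

se = math.sqrt(ess / df)  # standard error of the regression, approx. 7,308
```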

Relationship between R-Squared and the Standard Error of the Regression

Based on the results contained in the ANOVA table, we can now show the relationship between R-squared and SE. SE is the square root of the mean squared error, [square root of (ESS/(n - k - 1))], while R-squared is RSS (the explained variance) divided by TSS (the total variance), or 1 - (ESS/TSS). From this formula, we can see that SE decreases as the ratio of ESS to degrees of freedom (n - k - 1) decreases.

A decrease in SE caused by a reduction in ESS leads directly to an increase in calculated R-squared. That is, a decrease in ESS results in a decrease in SE; a decrease in SE causes an increase in R-squared. Also, when ESS is zero, SE is zero; and when SE equals zero, R-squared is 1, which implies a perfect fit to the data.

Using the Standard Error of the Regression to Construct Confidence Intervals

Our estimated regression model, Ŷ = -\$22,297 + \$29.351X, produces point estimates (or forecasts) of costs (Y) for alternative values of X. Such "future values" are subject to uncertainty. One way to capture this uncertainty is to provide a range estimate--or confidence interval--around the point estimate for Y. To develop a confidence interval around a regression-predicted value of Y for a specified value of X, we need to determine the variance associated with the estimate and choose a confidence level, such as 90%. To construct such an interval, the approximation Ŷ ± tα × SE (where Ŷ is the point estimate of Y for a specified value of X and SE is the standard error of the regression) is generally considered to be "good enough" for practical purposes. (17) In building the confidence interval, we must specify a confidence level or percentage (like 90%), which is equal to 1 - α (thus, a 90% confidence interval corresponds to α = 10%) and is reflected in a t-value with n - 2 degrees of freedom. The upper end of the confidence interval is defined as Ŷ + tα × SE, while Ŷ - tα × SE equals the lower end of the confidence interval.

A table containing values of the t-distribution can be consulted to determine tα for a specific number of degrees of freedom. (18) Alternatively, the T.INV.2T function in Excel can be used to generate the critical t-value for constructing a confidence interval. (19) For example, the critical t-value for a two-tailed t-distribution for α = 0.10 (which corresponds to a 90% confidence interval) and df = 10 is 1.812. (20) Table 1 contains sample t-values for alternative combinations of probabilities and degrees of freedom (df). The table shows that there is a positive relationship between the desired confidence percentage and the size of the confidence interval.

For the present cost function, for X = 6,000 deliveries, the point estimate for monthly vehicle expense (Ŷ) is \$153,811 (that is, -\$22,297 + [\$29.351 per delivery x 6,000 deliveries]). Since SE for the estimated cost function is \$7,308 (see Figure 4) and df is 10, a 90% confidence interval built around the point estimate (\$153,811) is:

Ŷ = \$153,811 ± (t × SE) = \$153,811 ± (1.812 × \$7,308) = \$153,811 ± \$13,242

Thus, the lower limit on the 90% confidence interval is \$140,569, while the upper limit is \$167,053. The better the "fit" of the regression model to the data (as measured by the standard error of the regression), the tighter (or smaller) the range estimate for a given confidence level.
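The interval arithmetic can be replicated in a few lines; a minimal Python sketch using the numbers from the example:

```python
point_estimate = 153_811  # predicted cost at X = 6,000 deliveries
t_crit = 1.812            # two-tailed critical t, alpha = 0.10, df = 10
se = 7_308                # standard error of the regression (Figure 4)

half_width = t_crit * se             # approx. $13,242
lower = point_estimate - half_width  # approx. $140,569
upper = point_estimate + half_width  # approx. $167,053
```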

Improving the Standard Error of the Regression

Since SE plays an important role in the cost-estimation process, how can SE be improved? After all, for a given data set, the Regression tool already chooses the constant term (a) and the slope coefficient (b) so as to minimize ESS (and hence SE). Therefore, what strategies are available for decreasing SE and, by extension, improving the quality of the cost estimates? This is both a legitimate and complex topic that's beyond the scope of this article, but we can present an intuitive and general argument that can serve as the basis for embracing particular strategies.

SE is defined as the square root of the mean squared error (MSE), where MSE is the error sum of squares divided by degrees of freedom (see Figure 4). From this formulation, we can see that SE can be decreased (improved) by decreasing the numerator in the MSE calculation, by increasing the denominator, or by doing both. Since the denominator is defined as n - k - 1, it might appear that uncertainty is reduced simply by increasing the sample size (n). In other words, holding all else constant, if n increases, then the denominator (n - k - 1) increases, which would decrease SE. But the operative phrase here is "holding all else constant": whether and to what extent SE changes is a function of the rate of change in the numerator of the SE calculation relative to the rate of change in the denominator. Thus, the change in SE is jointly a function of both a "numerator effect" and a "denominator effect." A corollary of this is that increasing the sample size does not necessarily decrease the SE. Put another way, adding poor-quality data to an existing data set may increase, not decrease, the standard error of the regression.
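The corollary can be demonstrated with a small experiment: start from well-behaved hypothetical data, then append one poor-quality observation. The sample size (denominator) grows, but ESS (numerator) grows faster, so SE rises:

```python
import math

def regression_se(xs, ys):
    """Standard error of a one-variable regression:
    sqrt(ESS / (n - k - 1)), with k = 1."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    a = my - b * mx
    ess = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    return math.sqrt(ess / (n - 2))

# Well-behaved hypothetical data, close to the line Y = 8 + 2X ...
xs = [1, 2, 3, 4, 5, 6]
ys = [10.1, 12.0, 13.9, 16.1, 18.0, 19.9]
se_before = regression_se(xs, ys)

# ... plus one poor-quality observation far off the line:
# n rises, but SE rises too.
se_after = regression_se(xs + [7], ys + [40.0])
```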

Other Output from the ANOVA Table

The amount labeled "Multiple R" in Figure 4 seems as if it is a misnomer when associated with results from a one-variable linear model such as the cost-function model in our example. In this situation, Multiple R is simply the bivariate correlation (called the Pearson correlation coefficient) between the actual Y values in the data set and the regression-predicted Y values. (21) The value of Multiple R ranges from 0 to 1; a value of 0 implies that knowledge of X does not provide better explanatory power for Y than would be obtained by simply assuming the mean value of Y in the data set--in other words, Ȳ.

As noted before, SE is an absolute measure of goodness-of-fit--it measures the variability of the actual Y observations around the regression line. SE also is analogous to the standard deviation. Notice from the ANOVA table in Figure 4 that the mean squared error (MSE) term for our regression equation is 53,409,444 (534,094,444/10). This MSE is analogous to the variance of a set of data. If we take the square root of the variance, we get the standard deviation; if we take the square root of the mean square error term, we get the standard error of the regression. In the example, SE = √53,409,444 = 7,308. Thus, we can think of SE as the standard deviation of the actual Y values around the regression line. As such, SE can be zero only if all the actual values of Y lie on the regression line.

Earlier, we addressed one sampling-related issue: the construction of range estimates around point estimates generated by our regression equation. We now address two additional sampling-related issues: (1) testing for the statistical significance of R-squared, which, in a simple regression, is equivalent to a test on the slope coefficient or variable cost rate (b), and (2) assessing the reliability of the estimated slope coefficient (b).

The ANOVA table in Figure 4 includes an F-statistic (87.34) associated with the output line labeled "Regression." This statistic equals the ratio of MSreg to MSerror (in the example, it is 4,664,748,529/53,409,444, which equals 87.34). We know that the R-squared value associated with a linear model fit to the present data set is 89.7%. We also know that this result was produced using sample data. This suggests the following statistical test: If the population R-squared is zero, what is the probability that the sample R-squared value would be 89.7%? Intuitively, you know that the probability is likely very, very low. The reported F-statistic is associated with the null hypothesis statistical test that the population R-squared value is zero. The reported significance level (0.00000295) confirms our intuition: The probability is close to zero that a sample R-squared value of 89.7% would be produced if the population R-squared value is zero. (22) Thus, the null hypothesis is rejected. (23)

The bottom part of the ANOVA table contains the confidence intervals associated with each of the two estimated coefficients, a and b. Since R-squared is based on the sample data set, so, too, are these estimated coefficients. To assess the reliability of these estimates, calculate a t-value for each estimate, defined as the ratio of the coefficient to the standard error for that coefficient. In Figure 4, the t-value of the constant term (a) is -1.134, while the t-value for the slope coefficient (b) is 9.346. Thus, the t-value gives an indication of the reliability of each coefficient estimate.

The same t-statistic is used to test, in each case, the null hypothesis that the population coefficient is zero. We see from the ANOVA table that the null hypothesis for the constant term cannot be rejected (not surprising given the distance of the intercept term from the "relevant range" exhibited in the data set), while the null hypothesis on the slope coefficient is strongly rejected. In technical terms, the probability that the population slope coefficient is zero given a sample value of \$29.351 is very remote. These two null hypothesis tests are confirmed by the 95% confidence intervals constructed around each coefficient estimate (and included in the bottom portion of the ANOVA table). In the case of the intercept (constant) term, the confidence interval includes the value of zero, while it does not for the slope coefficient. The spread of 22.35 to 36.35 provides more specific information regarding the reliability of the estimated variable cost rate (b).
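The reliability arithmetic for the slope can be reproduced from the reported figures. In the sketch below, the slope's standard error is backed out as coefficient ÷ t-value, and 2.228 is the standard two-tailed critical t for 95% confidence with df = 10:

```python
b = 29.351        # estimated slope (Figure 4)
t_b = 9.346       # reported t-value for the slope
se_b = b / t_b    # standard error of the slope estimate

t_crit_95 = 2.228  # two-tailed critical t, alpha = 0.05, df = 10
lower = b - t_crit_95 * se_b  # approx. 22.35
upper = b + t_crit_95 * se_b  # approx. 36.35
```

The recovered interval matches the 22.35-to-36.35 spread reported in the output, confirming how the confidence limits on the coefficient are built.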

Now that we have looked at two Excel-based methods for estimating simple (i.e., one-variable) linear cost functions using regression analysis and using these functions to estimate future costs, Part 2 of the series will look at the use of Excel for estimating learning curve functions (one type of nonlinear cost function) and multiple-regression models (i.e., models containing more than a single explanatory variable).

David E. Stout, Ph.D., is the John S. and Doris M. Andrews Professor of Accounting at the Williamson College of Business Administration, Youngstown State University. He is also a member of IMA's Akron Chapter. He can be reached at destout@ysu.edu.

ENDNOTES

(1) The calculations and the methods described in the article were performed in Excel 2013, but using Excel 2010 or 2016 should yield equivalent results and require similar steps--though we did not verify this.

(2) This example is taken from Edward J. Blocher, David E. Stout, Paul G. Juras, and Gary Cokins, Cost Management: A Strategic Emphasis, 7th edition, McGraw-Hill Education, 2016, p. 276.

(3) A useful online resource for creating simple charts in Excel 2013 is available at www.youtube.com/watch?v=7vkIB7VN75k. A more comprehensive resource is available at www.youtube.com/watch?v=xGLXQ1WrJO0. For the chart presented in Figure 2, the bounds on the horizontal axis were set as 4,500 (minimum) to 8,000 (maximum).

(4) Tutorials for fitting trendlines (both linear and nonlinear) to a data set in Excel 2013 are available at www.youtube.com/watch?v=Z8UNmk9b0qo and www.youtube.com/watch?v=-KRtKU98i50.

(5) A useful tutorial covering the calculation of the correlation coefficient (r) is given at www.youtube.com/watch?v=DVQ1kA1Mj_w; a more comprehensive treatment of the topic is given at www.youtube.com/watch?v=4EXNedimDMs. A tutorial explaining how to calculate the correlation coefficient (r) and the coefficient of determination (R-squared) is given at www.youtube.com/watch?v=RnnK2vP5gwE. An explanation of the difference between r and R-squared is provided at www.youtube.com/watch?v=keuTgD71058, www.youtube.com/watch?v=F54cdyvcyRA, and www.youtube.com/watch?v=qQMAjsOihYc. A lucid explanation of how to calculate the coefficient of determination (R-squared) is provided at www.youtube.com/watch?v=w2FKXOa0HGA. Use of the RSQ built-in function in Excel to estimate the coefficient of determination is available at www.ehow.com/how_8498030_calculate-r2-excel.html. Alternative explanations of R-squared are available at www.youtube.com/watch?v=IMjrEeeDB-, www.youtube.com/watch?v=KkfjPZ6KEAg, and www.youtube.com/watch?v=lZtyxpHsBlg.

(6) The R-squared value can also be interpreted as the average increase in accuracy obtained by using the regression function to estimate the value of Y for the values of X in the data set, rather than using the mean value of Y. R-squared, for a data set, can be estimated using the RSQ function in Excel, which uses the syntax =RSQ(known_y's, known_x's). For the data set presented in Figure 1, the formula would be =RSQ(C2:C13,B2:B13), which returns the value 89.7%.

(7) Explanatory videos regarding the use of the INDEX function in Excel 2013 can be found at www.youtube.com/watch?v=lKaxcpBkLLc and at www.youtube.com/watch?v=ZPBbWRtUZWk. Videos for using the LINEST function in Excel can be found at www.youtube.com/watch?v=6wbcPbYbq6M and at www.youtube.com/watch?v=XI58EyvyPlo. Note that the formulas under this option apply only when a one-variable function is being estimated.

(8) As an alternative, the SLOPE function in Excel could be used. Its syntax is =SLOPE(known_y's, known_x's). Given the data set presented in Figure 1, enter the following formula into a cell: =SLOPE(C2:C13,B2:B13). This produces a value of \$29.351.

(9) As an alternative, the INTERCEPT function in Excel could be used, which has a syntax of =INTERCEPT(known_y's, known_x's). Given the data set presented in Figure 1, the formula to enter into a cell is =INTERCEPT(C2:C13,B2:B13). This produces a value of -\$22,297.

(10) Tutorials for using Excel's FORECAST function are available at http://support2.microsoft.com/kb/828236, www.youtube.com/watch?v=ZJa7RDJ6TJo, and www.youtube.com/watch?v=j5wSnUFYkkk.

(11) Tutorials for using the Regression analysis tool in Excel are available at www.youtube.com/watch?v=ExfknNCvBYg (this clip shows both the use of charts and the regression tool), http://blog.yojimbocorp.com/2012/05/03/linear-regression-with-excel-2010/, www.wikihow.com/Run-Regression-Analysis-in-Microsoft-Excel, and www.excel-easy.com/examples/regression.html.

(12) If Data Analysis is not visible on the Data tab, go to File and click Options. In the Excel Options dialog, choose Add-Ins on the left. In the Manage dropdown, make sure "Excel Add-ins" is selected, then click the Go button to open the Add-Ins dialog. Check the box next to "Analysis ToolPak" and click OK. The Data Analysis button should now appear on the Data tab.

(13) If the "Labels" box is checked (see Figure 3), the range for the X and Y values must include headings (here, entries in cells B1 and C1).

(14) Clips providing a discussion of regression-related output are available at www.youtube.com/watch?v=8R6UcK91Cec, www.youtube.com/watch?v=c5blVUkkjTM, and www.youtube.com/watch?v=aq8VU5KLmkY.

(15) Yᵢ = actual (observed) values; Ȳ = mean of the Y values; Ŷ = regression-predicted Y values for the set of Xs.

(16) As a check on the results produced by the Regression analysis tool, use Excel to calculate TSS, ESS, and RSS for the data set. TSS equals the sum of the squared deviations of the actual Y values from the mean value, \$160,427. This sum is 5,198,842,973, which is exactly the total sum of squares (TSS) reported in Figure 4. The error sum of squares (ESS) is found by summing the squared differences between the regression-predicted values of Y (i.e., Ŷ) and the actual values of Y for all values of X in the data set. This sum is 534,094,444, which is exactly the amount shown in Figure 4. The regression (explained) sum of squares (RSS) is found by summing the squared differences between the regression-predicted values of Y (i.e., Ŷ) and the mean value of Y (Ȳ) for all values of X in the data set. This sum is 4,664,748,529, which is the same as the amount shown in Figure 4. Detailed calculations for TSS, ESS, and RSS are provided in Columns D, E, and F, respectively, of Figure 5.

(17) W. E. Leininger, Quantitative Methods in Accounting, D. Van Nostrand Company, New York, N.Y., 1980, pp. 106-107, notes that if working with the population of observations (rather than a sample) or if the sample size (n) is relatively large (e.g., over 30) or if the error terms are normally distributed around the regression line, then the standard error of the regression (SE) can be used to construct confidence intervals around point estimates generated from a regression equation.

(18) The t-distribution (also known as the Student's t-distribution) is a probability distribution that is used to estimate population parameters when the sample size is small and/or when the population variance is unknown. The t-distribution allows us to conduct statistical tests on certain data sets that are not appropriate for analysis using the normal (i.e., Z) distribution. As noted in the text, the form of the t-distribution is a function of its degrees of freedom (i.e., the number of independent observations in the data set). With very large degrees of freedom, the t-distribution is the same as the standard normal distribution. See http://stattrek.com/probability-distributions/t-distribution.aspx. A t-distribution table, for various degrees of freedom and probability levels, is available at www.itl.nist.gov/div898/handbook/eda/section3/eda3672.htm or at www.medcalc.org/manual/t-distribution.php.

(19) See: www.excelfunctions.net/Excel-T-Inv-2t-Function.html or http://msdn.microsoft.com/en-us/library/office/ff821541.aspx.

(20) This is found by entering the formula = T.INV.2T(0.1,10) into an open cell in Excel.

(21) This can be confirmed in Excel by generating a column (i.e., an array) of predicted Y values, using the function Y = -\$22,297 + \$29.351X, where X = number of deliveries per month. Next, find the Pearson product-moment correlation between the actual Y values and an array containing the predicted Y values by inserting the formula =CORREL(array1, array2) into an open cell. The result you obtain should be 0.947, which is exactly the amount for "Multiple R" reported in the ANOVA output depicted in Figure 4.

(22) The F.DIST.RT function could also be used to test for the statistical significance of the calculated R-squared value in Excel. The null hypothesis is that the population R-squared value is zero. To conduct this test, we need the F-statistic from the ANOVA table and its associated degrees of freedom (both numerator and denominator). As noted, the F-statistic is defined as the ratio of MSreg to MSerror; the numerator degrees of freedom is k; the denominator degrees of freedom is n - k - 1, where k = the number of independent variables in the regression model. These three pieces of information are then entered into the following formula in an open cell: =F.DIST.RT(F,df_n,df_d), where F = the calculated F-statistic from the ANOVA table, df_n = numerator degrees of freedom, and df_d = denominator degrees of freedom. In this instance, the formula is =F.DIST.RT(87.34,1,10). The resulting value (0.00000294531735) represents the probability of observing an F-statistic equal to F (or higher) if the population value for R-squared is zero. An explanation of the F.DIST.RT function in Excel is available at www.excelfunctions.net/Excel-F-DistRt-Function.html and http://msdn.microsoft.com/en-us/library/office/ff196140.aspx.

(23) In the case of a simple linear regression model, the test on R-squared is equivalent to the t-test on the slope coefficient (b) in the model. In the present case (from Figure 4), t = √F = √87.34 = 9.346. The null hypothesis is that the population slope coefficient = 0. The probability of generating an estimated slope coefficient of $29.351 if the population value is zero is 2.945E-06, which is exactly the value associated with the null hypothesis statistical test conducted on R-squared. As such, the two statistical tests are equivalent.

Caption: Figure 2: Scattergraph of Data

Caption: Figure 3: Completed Regression Dialog Box
Table 1: Two-Tailed t-values for Selected Confidence Levels (%)
and 30 Degrees of Freedom (df)

Confidence     α        df     Two-Tailed
Interval %                     t-value

99%            0.010    30     2.750
95%            0.050    30     2.042
90%            0.100    30     1.697
80%            0.200    30     1.310
75%            0.250    30     1.173
67%            0.333    30     0.983

Note: The Excel formula =T.INV.2T(probability, deg_freedom)
returns the two-tailed inverse of the t-distribution. That
is, it provides the t-value (of the t-distribution) based on
a specified two-tailed probability (α = 1 - confidence level
%) and the number of degrees of freedom (df) associated with
the t-distribution. The confidence interval around a point
estimate from a regression equation (where "point estimate"
= a + bX for a given value of X) is constructed as: Lower
Limit of the confidence interval = Point estimate - (t-value
x SE), while the Upper Limit of the confidence interval
= Point estimate + (t-value x SE).

There is a different t-distribution for each number of degrees
of freedom (basically, sample size). The following results
are generated using Excel's T.INV.2T function and show that
using 2.0 as the t-value for constructing a 95% confidence
interval is an approximation (i.e., it applies, strictly speaking,
only to a t-distribution with 60 degrees of freedom):

α         df        Two-Tailed
                    t-value

0.05      10        2.228
0.05      30        2.042
0.05      60        2.000
0.05      90        1.987
0.05      270       1.969
0.05      810       1.963
0.05      2,430     1.961
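To make the confidence-interval construction in the note concrete: suppose we want a 95% interval around the predicted vehicle expense at an activity level of 6,500 deliveries. The 6,500 figure is invented for this illustration; the intercept, slope, standard error, and t-value below are taken from the article's regression output and the tables above.

```python
# Regression results from the article: Y = -22,297 + 29.351X, SE = 7,308.
a, b, se = -22_297, 29.351, 7_308

# Hypothetical activity level for this illustration (not from the article).
x = 6_500

# Two-tailed t-value for a 95% confidence interval with 10 df,
# i.e. =T.INV.2T(0.05, 10) in Excel.
t = 2.228

point = a + b * x        # point estimate of monthly vehicle expense
lower = point - t * se   # Lower Limit = Point estimate - (t-value x SE)
upper = point + t * se   # Upper Limit = Point estimate + (t-value x SE)

print(f"point estimate = {point:,.1f}")
print(f"95% CI: {lower:,.1f} to {upper:,.1f}")
```

Note that with only 10 degrees of freedom the interval is noticeably wider than the rule-of-thumb t-value of 2.0 would suggest.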

Figure 1: Assumed Data Set:
Simple Linear Regression Model

      A        B                C
               Monthly Total    Vehicle
1     Month    Deliveries (X)   Expense (Y)

2     1        5,882            $145,329
3     2        5,567            133,245
4     3        5,166            123,245
5     4        6,621            164,295
6     5        6,433            163,937
7     6        6,681            176,229
8     7        7,182            180,553
9     8        6,577            177,293
10    9        5,942            155,389
11    10       5,622            150,832
12    11       5,599            152,993
13    12       7,433            201,783
Figure 4: Output from Applying Regression Routine to Data

SUMMARY OUTPUT

Regression Statistics

Multiple R             0.95
R Square               0.90
Adjusted R Square      0.89
Standard Error         7,308
Observations           12

ANOVA

                        df    SS               MS               F        Significance F
Regression               1    4,664,748,529    4,664,748,529    87.34    2.94532E-06
Residual                10      534,094,444       53,409,444
Total                   11    5,198,842,973

                        Coefficients    Standard Error    t Stat    P-value        Lower 95%    Upper 95%
Intercept                    -22,297            19,665    -1.134    0.283            -66,114        21,520
Total Deliveries (X)          29.351             3.141     9.346    2.94532E-06        22.35         36.35

Notes:
df = n - k - 1 = 10
ESS = Σ(i=1 to n) (Ŷi - Yi)² = 534,094,444
MSE = ESS/df = 534,094,444/10 = 53,409,444
SE = √MSE = √53,409,444 = 7,308
F = MSregression/MSresidual = 4,664,748,529/53,409,444 = 87.34
t² = F: 9.346² = 87.34

Figure 5: Direct Calculation of Variance Components: TSS, ESS, and RSS

         Monthly
         Total         Vehicle       Sq dev           Squared         Explained        Predicted
Month    Deliveries    Expense       from mean        Errors          Variance         Y Values
         (X)           (Y)

1        5,882         $145,329        227,947,088     25,182,330       101,600,691    $150,347
2        5,567         133,245         738,856,594     61,725,354       373,469,857    $141,102
3        5,166         123,245       1,382,494,927     37,047,764       966,913,281    $129,332
4        6,621         164,295          14,962,069     59,950,822       134,812,417    $172,038
5        6,433         163,937          12,320,685      6,670,608        37,122,643    $166,520
6        6,681         176,229         249,705,838      5,905,541       178,809,084    $173,799
7        7,182         180,553         405,059,230     63,216,119       788,314,552    $188,504
8        6,577         177,293         284,464,767     42,858,816       106,490,406    $170,746
9        5,942         155,389          25,380,604     10,763,148        69,199,777    $152,108
10       5,622         150,832          92,062,426     65,871,590       313,681,226    $142,716
11       5,599         152,993          55,263,117    119,950,921       338,049,682    $142,041
12       7,433         201,783       1,710,325,629     34,951,432     1,256,284,913    $195,871

Totals                               5,198,842,973    534,094,444     4,664,748,529
                                     (TSS)            (ESS)           (RSS)
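The three variance components in Figure 5 can be recomputed directly as a cross-check. This sketch (plain Python; data transcribed from Figure 5) refits the least-squares line and verifies that the total sum of squares decomposes into the error and explained sums of squares, with R-squared = explained/total of roughly 0.90.

```python
# Deliveries (X) and vehicle expense (Y), transcribed from Figure 5.
x = [5882, 5567, 5166, 6621, 6433, 6681, 7182, 6577, 5942, 5622, 5599, 7433]
y = [145329, 133245, 123245, 164295, 163937, 176229, 180553, 177293,
     155389, 150832, 152993, 201783]

n = len(x)
mean_x, mean_y = sum(x) / n, sum(y) / n

# Least-squares line (the same fit that produced Figure 4's coefficients)
b = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / \
    sum((xi - mean_x) ** 2 for xi in x)
a = mean_y - b * mean_x
y_hat = [a + b * xi for xi in x]

tss = sum((yi - mean_y) ** 2 for yi in y)               # total SS
ess = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))   # error (residual) SS
rss = sum((yh - mean_y) ** 2 for yh in y_hat)           # explained (regression) SS

print(round(tss), round(ess), round(rss))   # cf. the three column totals
print(round(rss / tss, 2))                  # R-squared, cf. 0.90 in Figure 4
```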
Author: David E. Stout. Management Accounting Quarterly, Jan 1, 2017.