# Using Excel 2013 for regression-based cost estimation: Part 2

Part 1 of this series demonstrated two methods in Excel 2013 for estimating a simple (i.e., one-variable) linear regression model using the example of a hypothetical package-delivery company. The dependent variable (Y) was the company's monthly vehicle expense, and the independent (or explanatory) variable (X) was the number of deliveries per month. Part 2 extends the application of Excel to estimate and interpret the output from a multiple-regression model--in other words, a cost function that includes more than a single independent variable--and a learning curve model, which is one form of a nonlinear cost function. (1) This involves fitting a multiple (i.e., more than one variable) linear regression model to a new data set. This approach is helpful when more than one X variable helps to explain variations in the dependent variable.

The overall purpose of these articles is to demystify the use of Excel 2013 for estimating a variety of cost functions. (2) They should be of interest to managers who require decision-useful information, such as estimated cost data generated by regression models; managerial accountants charged with the responsibility of supplying that information; accounting students who are looking for user-friendly resources that can be used to learn advanced topics in the area of cost estimation using Excel; and individuals preparing for the CMA® (Certified Management Accountant) exam. (3)

A MULTIPLE LINEAR REGRESSION MODEL USING EXCEL 2013

To illustrate the multiple linear regression model, we will use the hypothetical example of a nursery and retail store specializing in house and garden plants and supplies. (4) The data is presented in Figure 1. (5) Using Excel, the data in Figure 1 will be used to estimate a linear cost function where the dependent variable (Y) represents the monthly supplies expense and where there are three independent variables: monthly sales in units ([X.sub.1]), monthly sales dollars ([X.sub.2]), and December ([X.sub.3]). (6) Data for the dependent variable (Y) is in Column E, while data for [X.sub.1], [X.sub.2], and [X.sub.3] is in Columns B, C, and D, respectively. The goal is to develop a multiple regression model based on the 31 months of data (June 2012 through December 2014) presented in Figure 1. This model will be used to predict monthly supplies expense for each month of the upcoming year (2015) based on projected values for each of the three independent variables, as shown in Columns B, C, and D of Figure 2.

We consider the monthly sales in units ([X.sub.1]) and monthly sales dollars ([X.sub.2]) to be continuous variables, meaning that we will look at one-unit changes in them for interpretation. December ([X.sub.3]) is a dummy variable, which means that its effect is only manifested in the month of December. The dummy variable takes a value of 1 for each observation in the data set from December and 0 for any other month. Dummy variables (season of the year, presence or absence of a medical procedure, order complexity, and so forth) can potentially be used to improve the descriptive and predictive power of an estimated cost function. (7)

Running a multiple regression in Excel using the Regression analysis tool follows the same procedure as required for a simple, one-variable regression. The difference is the inclusion of two or more independent variables, which we denote as [X.sub.i] (i = 1 to n, where n = the number of independent variables). Excel has a built-in control feature that prevents users from including two columns for the Y (dependent) variable. One limitation in Excel's multiple regression procedure, however, is that it requires that all included independent (X) variables be in adjacent columns. When exploring sets of independent variables to include in the estimated model, you must be able to manipulate the spreadsheet in a way that preserves source data but allows for alternative setups. (8) In the present case, this means that independent variables need to be in adjacent columns.

In our example, the independent variables appear in Columns B, C, and D of Figure 1. We want to use the Regression analysis tool to regress monthly supplies expense (Y) simultaneously on monthly unit sales ([X.sub.1]), sales dollars per month ([X.sub.2]), and December ([X.sub.3]). The form of our regression model is: Y = a + [b.sub.1][X.sub.1] + [b.sub.2][X.sub.2] + [b.sub.3][X.sub.3], where the coefficient estimates [b.sub.1] and [b.sub.2] refer to the variable cost per unit of activity (or cost driver) for [X.sub.1] and [X.sub.2], respectively, and where a refers to the constant (or intercept) term--in other words, the value of Y when [X.sub.1] and [X.sub.2] both equal zero. The coefficient estimate [b.sub.3] refers to the effect of the month December on monthly supplies expense. Thus, the value of [X.sub.3] for any observation from December is 1. The estimate for [b.sub.3] will have no effect on Y in any month other than December since for those 11 months the value of [X.sub.3] is 0.
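The mechanics of this cost function can be sketched in Python. The coefficient values below are placeholders chosen purely for illustration (the actual estimates come later from the regression output); the sketch simply shows how the December dummy variable enters the calculation:

```python
def monthly_supplies_expense(units, dollars, is_december,
                             a=1000.0, b1=2.0, b2=0.5, b3=-900.0):
    """Evaluate Y = a + b1*X1 + b2*X2 + b3*X3 for one month.

    Coefficient defaults are hypothetical placeholders, not the
    article's estimates.
    """
    x3 = 1 if is_december else 0  # dummy variable: 1 only for December
    return a + b1 * units + b2 * dollars + b3 * x3

# The b3 term changes the estimate only when is_december is True.
regular = monthly_supplies_expense(200, 1500, False)
december = monthly_supplies_expense(200, 1500, True)
```

For any month other than December, the dummy term contributes nothing; for December, it shifts the estimate by the full value of b3.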

Data Input

Begin the regression analysis by clicking the Data tab and selecting Data Analysis. (9) Figure 3 shows the Regression dialog used to fit a multiple-regression model to the data in Figure 1. (10) Choosing the "Labels" option tells Excel to interpret entries in the first row of data as labels for the variables in the model, thereby ignoring the contents of these cells in estimating the coefficients (a, [b.sub.1], [b.sub.2], and [b.sub.3]). When electing this option, be sure that the first cell is indeed a label and not data for the first month. Erroneously enabling "Labels" can be easily detected by reviewing the regression output to see whether a number is reported in place of the independent variable name and whether the sample size (n) for fitting the regression model is correct (in this example, 31).

In the Regression dialog, the confidence interval defaults to 95% for each estimated coefficient in the multiple-regression model. (11) The confidence interval is equivalent to a null hypothesis test on each independent variable--that is, a test that the population value for each coefficient is equal to zero.

We also have checked the Residuals box so that output from the regression analysis will provide additional information for this estimation process. Details regarding the analysis and interpretation of residuals from a regression model will be discussed later.

The Regression dialog also lets you select where you want the regression output to go. Although you may elect to report these results anywhere, be sure not to overwrite any other data or objects in the workbook.

The default option is for the output to appear beginning in cell A1 of a new worksheet. We have selected cell A1 on the worksheet titled "Regression Output" as our choice. Once everything is set, click OK to run the regression analysis.

Cost Function Estimation

Figure 4 presents rows 1-20 of the multiple regression results for monthly supplies expense (Y) with sales (in units) per month ([X.sub.1]), sales dollars per month ([X.sub.2]), and December ([X.sub.3]) as explanatory variables. (12) The ANOVA section of the regression output (cells A17:B20) presents the variable names and their corresponding coefficients, which are used to form the following estimated cost function:

Monthly supplies expense = $1,160 per month + ($1.74 per sales unit x Sales units) + ($0.48 per sales dollar x Sales dollars) - ($918.04 x December). (13)

This result could be interpreted to mean that, based on data obtained over the previous 31 months, monthly supplies expense (Y) consists of a fixed component ($1,160), two variable components ($1.74 per unit sold, and $0.48 per sales dollar), and a dummy component (-$918.04 in the month of December).

As monthly unit sales ([X.sub.1]) or sales dollars per month ([X.sub.2]) are varied, the monthly supplies expense would be expected to change at the respective per-unit rates noted in the estimated cost function. For example, while holding sales dollars constant, increasing the number of units sold by 10 is expected to increase estimated monthly supplies expense by $17.40 ($1.74 per unit x 10 units). (14) Similarly, while holding sales units constant, increasing sales dollars by $2,000 will increase the estimated monthly supplies expense by $960 ($0.48 per sales dollar x $2,000). Further, the month of December (relative to other months) is expected to decrease total supplies expense by $918.04.

Assessing the Precision of the Coefficient Estimates

How confident are we that we will be able to rely on this set of variables to make predictions about future monthly supplies expense? First, explore the strength (or reliability) of our coefficient estimates for the independent variables by seeing whether these amounts for [b.sub.1] and [b.sub.2] are statistically significant. (15) It is common to conduct a null hypothesis test on each estimated coefficient--in other words, a test whether the population coefficient is zero. (16) For a null hypothesis test, the hypothesized value (by definition) for the coefficient is 0, meaning that the particular X variable has no effect on Y. A sufficiently high t-statistic (low p-value) provides evidence in support of rejecting the null hypothesis of no relationship between a particular independent variable, [X.sub.i], and Y based on the sample data used to estimate the value of the coefficient, [b.sub.i].

Consider, for example, unit sales per month ([X.sub.1]). From a statistical standpoint, the estimated slope coefficient [b.sub.1] ($1.74) can be transformed into a t-statistic by dividing [b.sub.1] by the standard error of the estimate for [b.sub.1]. (17) Data to calculate the t-statistic for each estimated coefficient is available from the ANOVA output shown in Figure 4.

For example, in the case of [b.sub.1], divide cell B18 (1.74, rounded) by cell C18 (0.72, rounded). The resulting amount (2.42, rounded) should match exactly the output reported in cell D18. The reliability of the coefficient [b.sub.1] is assessed by conducting a null hypothesis test on [b.sub.1]. Recall that the null hypothesis is that the population value for [b.sub.1] is zero. The appropriate test statistic is t, with degrees of freedom (df) equal to n - k - 1, where n = sample size and k = number of independent variables in the model. (18) The reported p-value in cell E18 (0.023) can be verified by entering the following formula in an open cell: =T.DIST.2T(D18, B13). (19) The first argument in this formula is the t-statistic (2.42), and the second argument is the number of degrees of freedom for the test (27, or 31 - 3 - 1). On the basis of these test results, we can reject the null hypothesis that the population value for [b.sub.1] is zero. Alternatively, we reject the hypothesis that Y and [X.sub.1] are unrelated: Based on the sample data, the probability of obtaining a t value at least as extreme as 2.42 when the true (population) value of the coefficient is zero is less than 3%.

The preceding result regarding the hypothesis test is verified by the confidence interval (CI) results for [b.sub.1] that appear in cells F18 and G18 of Figure 4. Notice that this confidence interval does not include the value zero. While Excel 2013 provides confidence interval results for each estimated coefficient in the multiple-regression model, these results can be checked by calculating the confidence intervals directly using the ANOVA results. Consider, for example, the construction of a 95% confidence interval around the estimated coefficient for [b.sub.1], $1.74 (cell B18). To form a 95% confidence interval around this estimated value, use the standard error for this coefficient estimate, 0.72 (cell C18), along with a critical value for the t distribution based on the desired significance level (here, 1 - 95% = 5%) and the appropriate degrees of freedom (df). The formula in Excel is =T.INV.2T(1-95%,B13), which yields a critical t value of 2.05. The 95% confidence interval for [b.sub.1] (as reported in cells F18 and G18 of Figure 4) can be verified as follows:

1. Take the estimated coefficient for [b.sub.1] (1.74).

2. Subtract (add) the product of the standard error for [b.sub.1] in cell C18 (0.72) and the critical t value (i.e., 2.05) for the lower (upper) limits of a 95% confidence interval (as reported in cells F18 and G18).

Notice that the 95% confidence interval for [b.sub.1] (0.263 to 3.212) does not include the value of zero. This result is consistent with the t-test results on [b.sub.1]. These results are equivalent: Both indicate that at the 0.05 significance level, the null hypothesis of no relationship between Y and [X.sub.1] should be rejected. That is, sales units per month ([X.sub.1]) has a nonzero effect on monthly supplies expense (Y).
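As a rough cross-check outside Excel, both the t-statistic and the 95% confidence interval can be replicated with the rounded values from Figure 4. Because the inputs here are rounded, the interval limits differ slightly from those in cells F18 and G18. A minimal Python sketch:

```python
b1, se_b1 = 1.74, 0.72   # coefficient and standard error (cells B18 and C18)
t_crit = 2.052           # critical t for 95% confidence, 27 df (=T.INV.2T(0.05, 27))

t_stat = b1 / se_b1                      # about 2.42, matching cell D18
lower = b1 - t_crit * se_b1              # lower 95% limit
upper = b1 + t_crit * se_b1              # upper 95% limit
reject_null = not (lower <= 0 <= upper)  # interval excludes zero
```

Because the interval excludes zero, the t-test and the confidence interval deliver the same verdict, just as the article notes.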

The data in rows 19 and 20 of Figure 4 can also be used to verify the 95% confidence intervals for [b.sub.2] and [b.sub.3]. In all cases, the null hypothesis that the relationship between the individual independent variable and the dependent variable (Y) is zero is rejected at the 0.05 significance level.

Assessing the Goodness-of-Fit

Next, compare how closely our estimated cost function matches the actual history of monthly supplies expense. Recall that in the Regression dialog box, the Residuals box was checked so that it was included as part of the output. Figure 5 reports the bottom section (rows 24-57) of the multiple regression output generated through the Regression tool. Column A reports observation numbers (in historical order), while column B presents Predicted Monthly Supplies Expense (Ŷ) generated by the estimated cost function.

To see how close these predicted amounts are to the actual monthly supplies expense (Column E of Figure 1), run a correlation by pasting into an open cell the formula =CORREL(array 1, array 2), where "array 1" indicates the cells for Predicted Monthly Supplies Expense (Ŷ) from Figure 5 and "array 2" indicates the cells for actual Monthly Supplies Expense from Figure 1. The result, 88% (rounded), is reported in Figure 4 as Multiple R (cell B4), which is R for a multiple regression equation. Multiple R is equal to a simple correlation between the actual values of the dependent variable and the values for the dependent variable predicted by the regression equation. Note, too, that the square of Multiple R is equal to unadjusted R-squared. As seen in Figure 4, Multiple R = 0.88 (rounded), so that the unadjusted R-squared for the multiple regression model in cell B5 is 0.77, which is the square of 0.88. (20)

Returning to the ANOVA output in Figure 4, the amount in the error sum of squares (ESS) for the multiple regression model (cell C13) reports the sum of the squared error terms--in other words, the sum of the squared distances of the actual data points from the regression-predicted data points. (Note that Excel labels this "Residual" sum of squares--see cell A13.) Excel chooses the coefficient estimates so as to minimize ESS. Looking at the residuals output in Figure 5 can help demystify the concept of ESS. Column C reports the Residuals, which are the difference between Actual Monthly Supplies Expense (Y) and Regression-Predicted Monthly Supplies Expense (Ŷ). To replicate the ESS reported in Figure 4, square the values in column C of Figure 5 and add up all the results. (The resulting total is 2,277,158.)
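The ESS computation itself is just a sum of squared residuals. A small Python sketch, using four hypothetical residual values for illustration (the article's model has 31):

```python
# Hypothetical residuals (actual minus predicted) for four months.
residuals = [120.5, -80.2, 45.0, -150.7]

# The "Residual" (error) sum of squares that Excel's least-squares
# procedure minimizes when choosing the coefficient estimates.
ess = sum(r ** 2 for r in residuals)
```

Squaring ensures that positive and negative residuals do not cancel, which is why large misses in either direction inflate ESS.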

Now take a look at the R-squared value of 77% presented in cell B5 of the regression output in Figure 4. Our interpretation is that 77% of the observed variation in monthly supplies expense is explained by variations in the two independent variables, [X.sub.1] and [X.sub.2], coupled with the effect of December ([X.sub.3]). (21) We can replicate the R-squared value from the ANOVA output by dividing the regression sum of squares (RSS) in cell C12 (7,641,294) by the total sum of squares (TSS) in cell C14 (9,918,452).

Although R-squared is the traditional measure of goodness-of-fit, it may suffer from inflation due to the inclusion of independent variables that have little or no connection with the dependent variable. Simply put, adding more independent variables will never decrease R-squared, but the added variables might not truly explain more. A more conservative goodness-of-fit measure is Adjusted R-squared, which is reported as 74% in cell B6 of Figure 4. It is calculated as =1 - ((1 - B5)*(B8 - 1)/(B8 - B12 - 1)). (22) The emphasis in the calculation is on cell B12--the number of independent variables (k) in the regression model fit to our data set, which is reported in the ANOVA output as the degrees of freedom (df) for the regression. As can be seen from the formula, the calculation for adjusted R-squared penalizes the multiple regression for adding independent variables. (23)
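The adjusted R-squared formula can be verified outside Excel with the values from cells B5, B8, and B12. A quick Python sketch:

```python
r2, n, k = 0.77, 31, 3  # R-squared (B5), observations (B8), regression df (B12)

# Same formula as Excel's =1-((1-B5)*(B8-1)/(B8-B12-1)); the (n-1)/(n-k-1)
# factor is the penalty for each added independent variable.
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)
```

With three independent variables and 31 observations, the penalty pulls the 77% unadjusted figure down to roughly 74%, matching cell B6.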

Continuing along the ANOVA output in Figure 4, cell D12 reports the Mean Square of the Regression amount (2,547,098), which is calculated by dividing the RSS (cell C12, 7,641,294) by its associated degrees of freedom (cell B12, 3). Cell D13 (84,339) similarly reports the Mean Square Error by dividing the ESS in cell C13 (2,277,158) by its associated degrees of freedom (B13, 27). Cell E12 reports the F-statistic of 30.20, which is equal to the Mean Square Regression (2,547,098) divided by the Mean Square Error (84,339).

Since it has been derived from a sample of past data, this F-statistic serves as a test for the R-squared. In other words, it tests the overall fit of the model to the data set. Similar to the coefficient tests for [b.sub.1], [b.sub.2], and [b.sub.3], a null hypothesis can be formed that the population R-squared value is zero. That is, what is the probability of obtaining the sample R-squared value if the population R-squared value is zero? For a sufficiently high F-statistic, there will be evidence contrary to the null hypothesis, supporting the conclusion to reject the null hypothesis of no relationship. The Significance F column performs this calculation in cell F12 with the formula =F.DIST.RT(E12, B12, B13), corresponding to the F-statistic, degrees of freedom for the regression, and degrees of freedom for the residual. The resulting amount, 8.91E-09, is the probability of observing an F-statistic of at least 30.20 if the population R-squared is 0, providing strong evidence against the null hypothesis of no relationship.
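The mean-square and F-statistic calculations are straightforward to replicate. A Python sketch using the sums of squares and degrees of freedom from Figure 4:

```python
rss, ess = 7_641_294, 2_277_158  # regression and residual sums of squares
df_reg, df_resid = 3, 27         # degrees of freedom (cells B12 and B13)

msr = rss / df_reg               # mean square regression (cell D12)
mse = ess / df_resid             # mean square error (cell D13, rounded)
f_stat = msr / mse               # about 30.20, matching cell E12
```

A large F-statistic means the variation explained by the regression dwarfs the unexplained variation per degree of freedom.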

Cost Prediction Using the Estimated Cost Function

To predict 2015 monthly supplies expense, use the coefficient estimates from the regression output in conjunction with predicted values for each of the three independent variables ([X.sub.1], [X.sub.2], and [X.sub.3]) in the model. For example, the predicted contribution to cost for units sold in January equals the per-unit cost from the regression output ($1.74) multiplied by January's projected number of units (180). Figure 6 focuses on the 2015 data, with the coefficient results from the ANOVA output in Figure 4 included to accommodate cell referencing. The calculation for January 2015's total monthly supplies expense of $2,262 can be found by inputting =$E$17+($E$18*B2)+($E$19*C2)+($E$20*D2) into cell E2. The coefficients in cells E17, E18, E19, and E20 will be used throughout the remaining calculations for the year, so the dollar signs are included to lock the cell references. (In Excel, these are called "absolute references.") The dollar signs can be manually added as you type the formula, or they can be added by hitting the F4 key to toggle through the available options as you enter each cell reference. By including them, the formula will always refer to those particular coefficients when you copy it to other rows.

The references to cells B2, C2, and D2 are without dollar signs because these cells should update depending on the period. For example, in the estimate for February's monthly supplies expense, you want the formula to use February sales units of 230 (cell B3), sales dollars of $2,000 (cell C3), and the zero effect of the December dummy variable (cell D3). The formula in row 4 for March, however, should refer to cells B4, C4, and D4 instead.

Since the calculations of predicted supplies expense in column E of Figure 6 make use of estimated information, the actual supplies expense will likely differ somewhat. After all, R-squared is not 100%. When continuing with cost prediction, consider the amounts in cells E2 through E13 as more of a "ballpark estimate" of the actual supplies expense and allow some room for error.

The regression output in Figure 4 allows a more formal approach to defining this "ballpark estimate" by drawing upon the ESS amount from cell C13. To replicate the Standard Error of the Regression (SE) reported in cell B7 of Figure 4 (290.41), take the square root of the ESS in cell C13 (2,277,158) divided by the residual degrees of freedom in cell B13 (27)--in other words, the square root of the Mean Square Error (84,339). The Standard Error considers the overall relationship that all independent variables have with the dependent variable. Similar to the approach used to construct confidence intervals around the coefficient estimates for the individual independent variables, the Standard Error value (290.41) can be multiplied by the desired critical value (2.05) for t to construct a 95% confidence interval (+/- 595.88) around the values of Predicted Monthly Supplies Expense (Ŷ). (24)

Columns G and H in Figure 6 report the lower and upper limits, respectively, of the 95% confidence interval built around each of the estimated values reported in column E. To begin the process, calculate the amount in cell G2 (for January) by entering the formula =E2-595.88. Then copy this formula down the rest of the column to get values for the lower limit for each month, such as =E3-595.88 for February, =E4-595.88 for March, and so on. To calculate the upper limit in column H, simply add 595.88 to the corresponding amounts in column E instead. For example, cell H2 contains the formula =E2+595.88 to calculate the upper limit for January. Copy this formula down column H for the remaining months.
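The prediction-plus-interval mechanics of Figure 6 can be sketched outside Excel. The half-width below (about 595.9) differs slightly from the article's 595.88 because a rounded critical t value is used here:

```python
A, B1, B2, B3 = 1160.0, 1.74, 0.48, -918.04  # rounded coefficients from Figure 4
SE, T_CRIT = 290.41, 2.052                   # standard error and critical t (27 df)

def predict(units, dollars, december):
    """Point estimate of monthly supplies expense (december is 0 or 1)."""
    return A + B1 * units + B2 * dollars + B3 * december

def interval(units, dollars, december):
    """Approximate 95% interval around the point estimate."""
    y = predict(units, dollars, december)
    half = T_CRIT * SE  # close to the article's 595.88
    return y - half, y + half
```

This mirrors the spreadsheet: one absolute set of coefficients reused across all twelve months, with a constant +/- band around each point estimate.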

Considerations

Because the highest three observations do not behave in the same manner as the other 28 observations, the linear model is limited in its ability to produce a "smooth" estimation for monthly supplies expense. The regression output in Figures 4 and 5 supports the case of fitting the lowest 28 observations more closely through a steeper slope, coupled with a downward adjustment for the highest three observations in the form of the December dummy variable. In cases where you are not able to connect the points with a single straight line, nonlinear estimation may be more appropriate. This reinforces the point that the process of cost estimation requires both knowledge of statistics and good judgment.

ESTIMATING THE OUTPUT OF A CUMULATIVE-AVERAGE LEARNING CURVE MODEL

This next example involves Virilli, Inc., a manufacturer of high-end furniture, and production of one of its products, MJ-208. Virilli sells principally to low-end retailers, with some direct sales made via the internet. Until the recent past, it enjoyed healthy levels of profitability and few direct competitors. More recently, however, Virilli lost several contract bids to current and former customers. Management of the company is quite concerned over the increasingly competitive situation that it foresees and is considering various options for responding to these competitive pressures. One of these measures is the development of more sophisticated models of cost behavior, whose application could help Virilli make more competitive bids.

Table 1 includes production data for MJ-208, where X is cumulative output (in units) and Y is the cumulative average time per unit for each value of X. (25) Design and production engineers for the company are convinced that the production process for product MJ-208 is subject to a learning effect, that is, to productivity and efficiency improvements related to experience. Put another way, the underlying cost function associated with product MJ-208 is thought to be nonlinear in nature.

Using the data in Table 1, Excel can be used to test this conjecture. In particular, it can test whether the data is consistent with a learning curve model of the following form: Y = a[X.sup.b], where Y is the cumulative-average production time per unit (the dependent variable), X is the unit number (the independent variable), a is the production time for the first unit produced, and b is an index of learning equal to the natural log (ln) of the learning curve rate divided by the natural log of 2 (0.693147).

The learning curve model is an example of an "exponential function." The basic idea in this cost-function model is that cost incurred is not a linear function of experience (measured either in terms of volume/output or time). In other words, the learning curve model assumes productivity/efficiency gains as employees gain experience in the production process. In the cumulative-average learning curve model, the learning curve rate is used to estimate the new cumulative-average production time per unit (or cost per unit) each time cumulative output doubles (hence the link to ln 2). (26)

In the cumulative-average learning curve model, the learning curve rate must be greater than 50% and less than 100%. (27) When the learning curve rate is less than 1 (indicating that an experience effect exists), b will be less than 0. This implies that the value of Y (the labor-hour consumption or production time per unit) decreases as the number of items produced increases. It is this negative exponent value that produces a downward-sloping cost function. Table 2 contains sample values of the exponent b for different values of learning curve rate in the learning curve model Y = a[X.sup.b]. Logs base 10 can also be used in the calculations rather than natural logs. The results will be the same.
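The relationship between the learning curve rate and the exponent b, as tabulated in Table 2, can be computed directly. A Python sketch:

```python
import math

def learning_index(lcr):
    """Exponent b in Y = a*X**b implied by a learning curve rate.

    For a meaningful cumulative-average model, 0.5 < lcr < 1, so b < 0.
    """
    return math.log(lcr) / math.log(2)

b_80 = learning_index(0.80)  # about -0.3219 for an 80% curve
```

Note that a rate of 1.0 (no learning) yields b = 0, and rates below 1.0 yield the negative exponents that produce the downward-sloping cost function.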

Thus, the task at hand is to use Excel to estimate a and b in the model and to assess the goodness-of-fit of the resulting cost function.

Linear or Nonlinear Model?

Before attempting to fit a nonlinear model to the data set in Table 1, the first step is to plot the data in an Excel chart. (28) This is useful in terms of helping to isolate any possible data problems, like data-recording errors or atypical observations, and to assess whether a linear or nonlinear model characterizes the data set. (29) Figure 7 presents a scattergraph of the data.

The plot of data in Figure 7 suggests that there are no apparent outliers in the data set and that a linear cost function is plausible. A simple linear regression model fit to the 14 observations in Table 1 (using the "Add Linear Trend Line" option in Excel, as discussed in Part 1) yields the following model: Y = 24.087 - 0.9874X, where X is the unit number and Y is the cumulative-average production time (hours) per unit for the Xth unit produced. In this result, R-squared is equal to 0.774. This model can then be used to estimate the value of the dependent variable, Y, for a given value of X, the independent variable. The question is whether the fit of the cost-estimation model could be improved by using a nonlinear rather than a linear model.

Fitting a Learning Curve Model to the Data Set

Two approaches can be used to estimate the coefficients, a and b, in the learning curve model. (30)

Approach 1. The first approach involves fitting a linear equation to log-transformed data using the basic learning curve model Y = a[X.sup.b]. Taking the natural log of both sides of this equation results in the following equation, which is linear in logs: Ln(Y) = Ln(a) + bLn(X). Therefore, the following five steps can be used to estimate the coefficients a and b:

Step 1: Take the log of each X and each Y value. (Either natural logs or log base 10 is acceptable; the example here uses natural logs.) Original X and Y values along with their log-transformed counterparts are reported in Table 3.

Step 2: Fit an Ordinary Least Squares (OLS) regression to the log-transformed data using the Regression analysis tool in Excel. Results are presented in Figure 8.

Step 3: The estimated value of a in the model Y = a[X.sup.b] is found by raising e (the base of the natural logarithm, 2.7182818) to the power of the Intercept term obtained from Step 2. In the example, the intercept term is 3.357 (rounded). Thus, the value of a in the learning curve model is 28.706 (rounded). In other words, [e.sup.3.357] = [2.7182818.sup.3.357] = 28.706. The value for a could also have been found by using the exponential function (EXP) in Excel: =EXP(3.357). (31)

Step 4: The estimated exponent of the learning curve model, b, is equal to the regression-estimated coefficient for Ln(X), taken directly from the output from running the Regression tool in Excel. Therefore, the estimated learning curve model for the data presented in Table 3 is Y = 28.706[X.sup.-0.319].

Step 5: Finally, estimate the learning curve rate for this data set as follows:

b = Ln(LCR) / Ln(2)

-0.319 = Ln(LCR) / 0.693147

Ln(LCR) = -0.319 x 0.693147 = -0.22094

Thus, this approach finds that the learning curve rate is [e.sup.-0.22094], which is approximately 0.80.
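The five steps of Approach 1 can be sketched end to end in Python. The data below is synthetic, generated from an exact 80% curve with a = 28.706 (so the fit recovers those values exactly); the article's actual observations are in Table 1:

```python
import math

def fit_learning_curve(units, avg_times):
    """OLS fit of the log-transformed model Ln(Y) = Ln(a) + b*Ln(X)."""
    lx = [math.log(x) for x in units]
    ly = [math.log(y) for y in avg_times]
    n = len(lx)
    mx, my = sum(lx) / n, sum(ly) / n
    b = (sum((x - mx) * (y - my) for x, y in zip(lx, ly))
         / sum((x - mx) ** 2 for x in lx))  # slope = learning index b
    a = math.exp(my - b * mx)               # back-transform the intercept
    lcr = math.exp(b * math.log(2))         # implied learning curve rate
    return a, b, lcr

# Synthetic data from an exact 80% cumulative-average curve:
xs = [1, 2, 4, 8, 16]
ys = [28.706 * x ** (math.log(0.8) / math.log(2)) for x in xs]
a, b, lcr = fit_learning_curve(xs, ys)
```

Because the synthetic data lies exactly on a power curve, the log-transformed points are perfectly linear and the fit recovers a, b, and the 80% rate; real data, like Table 1's, will fit only approximately.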

Approach 2. The second approach estimates the coefficients in the learning curve model by using an Excel chart and trendline to fit a nonlinear curve to the data set:

Step 1: Graph the original data set (the X and Y values from Table 1) in a line graph.

Step 2: Right-click anywhere on the data line and select Add Trendline from the shortcut menu.

Step 3: In the Format Trendline dialog, set the Trend/Regression Type to Power.

Step 4: Check the boxes labeled "Display Equation on Chart" and "Display R-squared value on chart."

Step 5: Close the Format Trendline dialog.

The output appears in Figure 9. Note that the coefficient estimates are Y = 28.706[X.sup.-0.319], precisely the same as those obtained using the first approach. Both approaches show that, on the basis of the calculated R-squared value (the coefficient of determination), the nonlinear function is an excellent fit to the data set. In fact, the R-squared value has increased from 0.774 in the linear cost function (Figure 7) to 0.998 in the cumulative-average learning curve cost function (Figures 8 and 9).

Cost Estimation

Using the cumulative-average learning curve model, cost estimates can be obtained by taking the difference between the total (i.e., cumulative) hours associated with adjacent values of X (output). For example, the estimated hours for the 15th unit produced (X = 15) is obtained by subtracting the total cumulative hours for the 14th unit from the total cumulative hours for the 15th unit. Predicted cumulative hours for different values of X are found by multiplying X by [Ŷ.sub.x], where [Ŷ.sub.x] = a[X.sup.b] = 28.706 x [X.sup.-0.319]:

For X = 15, cumulative total hours = 15 x (28.706 x [15.sup.-0.319]) = 15 x 12.10 = 181.51

For X = 14, cumulative total hours = 14 x (28.706 x [14.sup.-0.319]) = 14 x 12.37 = 173.18

Estimated hours for unit #15 = 181.51 - 173.18 = 8.33
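This incremental-unit calculation can be sketched as:

```python
a, b = 28.706, -0.319  # coefficients from the estimated learning curve model

def cumulative_hours(x):
    """Total hours through unit x: x times the cumulative-average time."""
    return x * (a * x ** b)

# Hours attributable to the 15th unit alone: about 8.33 hours.
unit_15 = cumulative_hours(15) - cumulative_hours(14)
```

The same subtraction works for any adjacent pair of X values, which is what makes the model useful for bidding on incremental production runs.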

Because the data is in the form of a time series, it is possible to have Excel forecast future periods. The Format Trendline dialog contains a section labeled Forecast where you can enter the number of periods forward or backward to forecast. When you close the dialog, the trendline will be extended in the appropriate direction. The disadvantage of this method is that Excel does not provide a numerical estimate. The estimate must be read directly from the chart.

Based on the earlier calculations, the estimated learning curve rate is 80%. This means that each time output doubles, the new cumulative-average time per unit is estimated to be 80% of the preceding value. Estimated values of Y for X = 1, 2, 4, 8, and 16 are provided in Table 4.
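The doubling pattern behind Table 4 can be generated directly: each doubling of cumulative output multiplies the cumulative-average time by the learning curve rate. A Python sketch:

```python
a, lcr = 28.706, 0.80  # first-unit time and estimated learning curve rate

averages = {1: a}
x, y = 1, a
while x < 16:
    x, y = x * 2, y * lcr  # each doubling multiplies the average by the rate
    averages[x] = y
# averages maps X in {1, 2, 4, 8, 16} to the cumulative-average time per unit
```

The average falls from 28.706 hours at X = 1 to roughly 11.76 hours at X = 16, consistent with four successive 80% reductions.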

UNDERSTANDING COST ESTIMATION

The examples used in this article show how Excel 2013 can be used to address two approaches to cost estimation and prediction that are more advanced than the simple linear regression model. First, the simple linear model was extended to a multiple linear regression consisting of more than one independent variable. The inclusion of a dummy variable helps to elicit some of the story behind the estimated cost function. Second, the approach for nonlinear estimation provides a protocol for estimating and predicting when the dependent variable does not vary in a linear fashion. Specifically, using this approach, it is possible to fit a set of data to a cumulative-average learning curve model.

The overall goal of this two-part series is to demonstrate how to estimate various cost functions using Excel 2013 and to demystify the regression-related output produced by Excel. This can help provide students, practicing management accountants, and others with a useful method to get such results, as well as a straightforward way to increase their understanding of how the results were generated, thus making it easier for management accountants to use this important piece of their toolbox.

APPENDIX

USING THE LINEST FUNCTION IN EXCEL TO DETERMINE MULTIPLE REGRESSION VALUES

Excel's LINEST function uses the least squares method to calculate a straight line that best fits the data. (32) Here's how to use it with the data in Figure 1.

Step 1: Identify the dependent variable (Y) and the independent variables ([X.sub.1], [X.sub.2], and [X.sub.3]):

Y = Monthly Supplies Expense
[X.sub.1] = Sales Units per Month
[X.sub.2] = Sales Dollars per Month
[X.sub.3] = December (1 if yes, 0 otherwise)

Step 2: Select a four column by five row range of open cells, such as cells K2:N6.

Step 3: Click the Insert Function button. In the Insert Function dialog, search for LINEST and click Go. This will open the Function Arguments dialog.

Step 4: Enter the function arguments as shown in Figure 10, but do not hit OK. (33)

* Note that the first argument consists of the numeric data only (and not the column label) for the Y variable.

* Similarly, the second argument contains the three columns of data without the labels for [X.sub.1], [X.sub.2], and [X.sub.3].

* The third argument determines the treatment of the intercept term in the estimation process. Setting it equal to 1 tells the LINEST function to estimate this constant term; setting it equal to 0 forces the constant term to zero so that coefficient values are estimated only for [X.sub.1], [X.sub.2], and [X.sub.3].

* The fourth argument controls whether additional regression statistics are reported. Setting it equal to 1 returns the full five-row output; setting it equal to 0 returns only the first row, the coefficients.

Step 5: Hit OK. Excel will report output beginning in the upper left cell of the highlighted area. See Figure 11 for the output.

The output in cells K2:N6 maps to the corresponding traditional regression output terms in cells P2:S6. The first two rows of output can be manipulated to recreate the t-statistics and p-values found with the Regression routine. Divide cell M2 by M3 to determine the t-statistic equal to 2.42 in cell M8 (not shown here). (34) Use this result in the formula =T.DIST.2T(M8,L5), entered in cell M9, to find the p-value equal to 0.023.
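The mechanics behind those first two LINEST rows can be sketched outside Excel as well. The following Python snippet is our illustration with made-up data (not the article's nursery data): for a one-variable least-squares fit it computes the coefficient estimates, the standard error of the slope, and the resulting t-statistic, which are the same quantities LINEST reports in its first two output rows.

```python
# A minimal stdlib-only sketch of the quantities in LINEST's first two rows
# for a ONE-variable regression: coefficients (row 1) and standard errors
# (row 2), from which t = coefficient / standard error is formed.
# The data below is hypothetical, chosen only for illustration.

xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [2.1, 4.3, 5.9, 8.2, 9.8, 12.1, 14.2, 15.9]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

sxx = sum((x - mean_x) ** 2 for x in xs)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))

slope = sxy / sxx                       # LINEST row 1, left cell
intercept = mean_y - slope * mean_x     # LINEST row 1, right cell

# Residual sum of squares and the standard error of the slope (LINEST row 2)
rss = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
se_slope = (rss / (n - 2) / sxx) ** 0.5

t_stat = slope / se_slope               # compare to a t distribution, df = n - 2
print(round(slope, 3), round(intercept, 3), round(t_stat, 1))
```

The p-value step would then use the t distribution with n - 2 degrees of freedom, which is exactly what Excel's T.DIST.2T supplies.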

David E. Stout, Ph.D., is the John S. and Doris M. Andrews Professor of Accounting at the Williamson College of Business Administration, Youngstown State University. He is also a member of IMA's Akron Chapter. He can be reached at destout@ysu.edu.

Jeremy T. Schwartz, Ph.D., CPA, is an assistant professor of accounting at the Williamson College of Business Administration, Youngstown State University. He can be reached at jtschwartz@ysu.edu.

ENDNOTES

(1) Useful resources regarding the application of learning-curve functions in management accounting are available at http://maaw.info/LearningCurveSummary.htm and http://faculty.haas.berkeley.edu/ANDY/OldHandouts/Learning%20Curve%20(Handout).ppt.

(2) The calculations and the methods described in the article were performed in Excel 2013, but using Excel 2010 or 2016 should yield equivalent results and require similar steps--though we did not verify this.

(3) Topical requirements and learning objectives for the CMA exam (effective January 2015) can be found at www.imanet.org/-/media/58e9380baecd44df8c84f1949408f2bd.ashx. Regression-related learning objectives are found in Part 1, Section B.3, Forecasting techniques, which states that candidates should be able to: a. demonstrate an understanding of a simple regression equation; b. define a multiple regression equation and recognize when multiple regression is an appropriate tool to use for forecasting; c. calculate the result of a simple regression equation; d. demonstrate an understanding of learning curve analysis; e. calculate the results under a cumulative average-time learning model; and f. list the benefits and shortcomings of regression analysis and learning curve analysis.

(4) This example is adapted from Edward J. Blocher, David E. Stout, Paul E. Juras, and Gary Cokins, Cost Management: A Strategic Emphasis, 7th edition, McGraw-Hill Education, 2016, p. 293.

(5) The presentation here presumes a relationship between the three X variables and the Y variable. When estimating the Y variable, collect as much relevant data as feasible. Such evaluation of candidates for X variables included in the multiple regression model is beyond the scope of this article.

(6) From a presentation perspective, the use of subscripts for the independent variables--[X.sub.1], [X.sub.2], [X.sub.3], and so forth--reflects the fact that the dependent variable is a function of more than one independent (explanatory) variable.

(7) For a step-by-step explanatory video of a multiple regression with a dummy variable, see www.youtube.com/watch?v=H07l1zgM-cw.

(8) Explanatory videos describing the use of the Regression analysis tool in Excel 2013 for estimating a multiple linear regression model can be found at www.youtube.com/watch?v=SG-tokYEgX0, www.wikihow.com/Run-a-Multiple-Regression-in-Excel, www.real-statistics.com/multiple-regression/multiple-regression-analysis, www.jeremymiles.co.uk/regressionbook/extras/appendix2/excel, and www.youtube.com/watch?v=tlbdkgYz7FM.

(9) You may need to enable the Data Analysis icon under the Data tab. Go to File and click Options. In the Excel Options dialog, choose Add-Ins on the left. In the Manage dropdown, make sure "Excel Add-ins" is selected, then click the Go button to open the Add-Ins dialog. Check the box next to "Analysis ToolPak" and click OK. The Data Analysis button should now appear on the Data tab.

(10) The Appendix presents an alternative approach for generating multiple regression results using the LINEST function. The use of the LINEST function will be of particular interest for users of Apple products, which have historically exhibited compatibility issues with the Regression tool in Excel.

(11) Though not done in this example, checking the Confidence Level box enables you to choose an alternative level. Confidence intervals of 90% and 99% are the most generally accepted alternatives to the default value of 95%.

(12) The Regression tool also generated results on the residuals beginning in cell A24, which are shown in Figure 5 and discussed in the section called "Assessing the Goodness-of-Fit." Excel left rows 21 through 23 empty.

(13) The results presented in Figure 4 have been rounded for a simpler presentation.

(14) Note that this conclusion holds within the "relevant range" for the data set at hand. Put another way, be careful about extrapolating results outside the range of data used to build the regression predictive model.

(15) Although the intercept coefficient (a) is also statistically significant here (i.e., p < 0.05), it is usually not discussed as thoroughly. This amount is properly interpreted in the present case as the estimated value of the cost function, Y, when both [X.sub.1] and [X.sub.2] are zero. Because there is no data in the region where the independent variables are close to zero, this coefficient cannot be interpreted as the fixed component of the cost function. For additional discussion of these points, see http://blog.minitab.com/blog/adventures-in-statistics/regression-analysis-how-to-interpret-the-constant-y-intercept.

(16) For a primer on hypothesis testing, see www.youtube.com/watch?v=0zZYBALbZgg.

(17) More accurately, the t-statistic for a given coefficient = (estimated coefficient - hypothesized coefficient) / standard error of the estimate for that coefficient.

(18) The degrees of freedom (df) correspond to the number of values in the statistic that are free to vary. Since there are three independent variables and the constant term, df here are four less than the sample size.

(19) A two-tail test is used since the purpose is to test whether the coefficient is different from zero.

(20) Some of the more important underlying assumptions of regression analysis (such as linearity, normality of error terms, and homoscedasticity) can be tested through an analysis of the residuals. That discussion, however, is beyond the scope of this article. See Chapter 3, "The k-Variable Linear Equation," in Jack Johnston and John DiNardo, Econometric Methods, 4th edition, McGraw-Hill, New York, N.Y., 1997.

(21) Technically, an R-squared value of 77% means that 77% of the variability of the dependent variable (Y) around its mean value is explained by the regression equation, that is, by the joint variation of [X.sub.1], [X.sub.2], and [X.sub.3].

(22) This cell formula corresponds to the statistical calculation of Adjusted R-squared = 1 - {[(1 - [R.sup.2]) x (n - 1)] / (n - k - 1)}, where cells B5, B8, and B12 map to the (unadjusted) R-squared, number of observations (n), and number of independent variables (k), respectively.
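As a quick check of this formula, here is a two-line sketch (ours, not the article's) using the values reported in Figure 4:

```python
# Adjusted R-squared = 1 - [(1 - R^2)(n - 1)] / (n - k - 1),
# using the values reported in Figure 4: R^2 = 0.77, n = 31, k = 3.
r2, n, k = 0.77, 31, 3
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)
print(round(adj_r2, 2))  # 0.74, matching the Adjusted R Square in Figure 4
```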

(23) As variables are added to the multiple regression model, the ratio of ESS to TSS (1 - [R.sup.2]) will usually decrease because ESS decreases and TSS remains constant. Yet the ratio of (n - 1) to (n - k - 1) will increase because n - 1 is constant while n - k - 1 decreases. Thus, if a variable is added to the model that does not reduce ESS enough to compensate for the increase in k, the adjusted R-squared value will decrease.

(24) As indicated in Cliff T. Ragsdale, Spreadsheet Modeling and Decision Analysis, 2nd edition, South-Western College Publishing, 1998, p. 393, the method described here is an approximation. Techniques exist that are more accurate for constructing prediction intervals (range estimates) using multiple regression models. To do this, however, requires knowledge of matrix algebra. If interested, consult advanced texts on multiple regression analysis for a description of how to construct more accurate prediction intervals. The approximation method discussed in this article tends to give an underestimated (narrower) approximation of the more accurate prediction interval.

(25) This example is adapted from David E. Stout and Paul E. Juras, "Instructional Case: Estimating Learning-Curve Functions for Managerial Planning, Control, and Decision-Making," Issues in Accounting Education, Vol. 24, No. 2, May 2009, pp. 195-217.

(26) As such, this form of the learning curve model can be compared to the incremental unit-time learning curve model. In this alternative form of the model, the value of Y (the dependent variable) is defined as the new incremental time per unit each time output (X) doubles. For example, with a 70% learning-curve rate, if Y = 10 for X = 1, then Y = 7 for X = 2, and Y = 4.9 for X = 4, etc.

(27) A learning curve rate of 50% would imply infinite learning, while a 100% rate would imply no learning--that is, as output doubled, the new cumulative-average time (or cost) would be 100% of the previous amount.

(28) Online resources for creating charts in Excel 2013 include www.youtube.com/watch?v=7vkIB7VN75k, http://peltiertech.com/my-first-excel-2013-chart, and www.techrepublic.com/blog/10-things/10-cool-new-charting-features-in-excel-2013/.

(29) The least-squares regression procedure can be used to estimate the learning curve model. Note that the least-squares regression procedure minimizes the sum of the squared deviations of the actual data points from the regression line. As such, the procedure is sensitive to outliers (atypical observations). Therefore, the inclusion of outlier observations in a model can seriously bias the forecasts (estimates) it generates.

(30) For estimating the incremental unit-time learning curve model using Excel, see www.youtube.com/watch?v=RbejtKfCtEc.

(31) If logs base 10 were used to convert the original X and Y values, the estimate of a in the model = 10 raised to the power of the Intercept term determined in Step 2. The intercept estimate from the regression using logs base 10 is 1.458, which differs from the natural-log intercept because of the different base. Note that [10.sup.1.458] = 28.706. Alternatively, the same result could be obtained by using the POWER function in Excel, that is, =POWER(10,1.458).
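The base conversion can be verified in a couple of lines. This is a sketch of ours, not the article's Excel steps; it checks that the natural-log intercept (3.357, from Figure 8) and the base-10 intercept (1.458, quoted above) both recover the same value of a:

```python
import math

# Both intercepts should convert back to the same estimate of a (about 28.706)
a_from_ln = math.e ** 3.357       # natural-log intercept from Figure 8
a_from_log10 = 10 ** 1.458        # base-10 intercept quoted in this endnote
print(round(a_from_ln, 2), round(a_from_log10, 2))
```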

(32) An explanatory video can be found at www.youtube.com/watch?v=zclBPeEZ_D0.

(33) For those who are manually inputting the LINEST function, start by entering the formula as usual, =LINEST(E2:E32,B2:D32,1,1), but do not hit Enter. You must simultaneously hit Ctrl+Shift+Enter, which alerts Excel that the formula affects all cells in the highlighted range. The formula will then appear as {=LINEST(E2:E32,B2:D32,1,1)}.

(34) As in the article, this example assumes that the hypothesized value for [b.sub.1] is 0.

Caption: Figure 3: Regression Dialog

Caption: Figure 7: Fitting a Linear Function to the Data Set in a Scattergraph

Caption: Figure 9: Results from Fitting a Nonlinear Function to the Data Set

Caption: Figure 10: Function Arguments Dialog for LINEST

Table 1: Assumed Data Set for Product MJ-208

| X (Unit Number) | Production Time | Y (Cumulative Average Time per Unit) |
|---|---|---|
| 1 | 29 | 29.00 |
| 2 | 18 | 23.50 |
| 3 | 12 | 19.67 |
| 4 | 14 | 18.25 |
| 5 | 12 | 17.00 |
| 6 | 12 | 16.17 |
| 7 | 11 | 15.43 |
| 8 | 10 | 14.75 |
| 9 | 11 | 14.33 |
| 10 | 8 | 13.70 |
| 11 | 12 | 13.55 |
| 12 | 8 | 13.08 |
| 13 | 9 | 12.77 |
| 14 | 7 | 12.36 |

Note: The formula for calculating the cumulative average time per unit for any value [X.sub.i] (i = 2, ..., 14) is [Y.sub.i] = [([Y.sub.i-1] x [X.sub.i-1]) + Production [Time.sub.i]] / [X.sub.i]. For example, when X = 12, Y = [(13.55 x 11) + 8] / 12 = 13.08.

Table 2: Exponent Values (b) for Various Learning Curve Rates in the Learning-Curve Function Y = a[X.sup.b]

| LCR | Ln(LCR) | b |
|---|---|---|
| 1.00 | 0.0000 | 0.0000 |
| 0.95 | -0.0513 | -0.0740 |
| 0.90 | -0.1054 | -0.1520 |
| 0.85 | -0.1625 | -0.2345 |
| 0.80 | -0.2231 | -0.3219 |
| 0.75 | -0.2877 | -0.4150 |

Legend: b = Ln(LCR) / Ln(2), where LCR = learning curve rate and b = index of learning. For example, when LCR = 0.85, b = Ln(0.85) / Ln(2) = -0.1625 / 0.693147 = -0.2345. An LCR of 85% implies that each time the cumulative output doubles, the new cumulative average time per unit is 85% of the prior value.

Table 3: Original and Log-Transformed Data for Product MJ-208

| X (Unit Number) | Y (Cumulative Average Time per Unit) | Ln(X) | Ln(Y) |
|---|---|---|---|
| 1 | 29.00 | 0.0000 | 3.3673 |
| 2 | 23.50 | 0.6931 | 3.1570 |
| 3 | 19.67 | 1.0986 | 2.9789 |
| 4 | 18.25 | 1.3863 | 2.9042 |
| 5 | 17.00 | 1.6094 | 2.8332 |
| 6 | 16.17 | 1.7918 | 2.7830 |
| 7 | 15.43 | 1.9459 | 2.7362 |
| 8 | 14.75 | 2.0794 | 2.6912 |
| 9 | 14.33 | 2.1972 | 2.6626 |
| 10 | 13.70 | 2.3026 | 2.6174 |
| 11 | 13.55 | 2.3979 | 2.6061 |
| 12 | 13.08 | 2.4849 | 2.5713 |
| 13 | 12.77 | 2.5649 | 2.5470 |
| 14 | 12.36 | 2.6391 | 2.5142 |

Table 4: Estimated Cumulative-Average Hours per Unit ([??]) for Sample Values of X, Based on an 80% Learning Curve Rate

| X (Unit Number) | Estimated Y (Cumulative Average Production Time per Unit) | Calculation | Total Time for Cumulative Number of Units Produced, X |
|---|---|---|---|
| 1 | 28.706 | Given | 28.706 |
| 2 | 22.965 | 28.706 x 0.80 | 45.929 |
| 4 | 18.372 | 22.965 x 0.80 | 73.487 |
| 8 | 14.697 | 18.372 x 0.80 | 117.579 |
| 16 | 11.758 | 14.697 x 0.80 | 188.127 |

(Estimated Y values rounded to three decimal places.)
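Table 1's cumulative-average column can be reproduced from the production-time column with a short script (ours, not part of the article), applying the running-average recursion given in the table's note:

```python
# Production time per unit for units 1 through 14 (Table 1, column 2)
times = [29, 18, 12, 14, 12, 12, 11, 10, 11, 8, 12, 8, 9, 7]

cum_avg = []
total = 0
for i, t in enumerate(times, start=1):
    total += t
    cum_avg.append(total / i)   # cumulative average = running total / unit count

print([round(y, 2) for y in cum_avg])
```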
Figure 1: Hypothetical Data Set

| Date | Sales Units ([X.sub.1]) | Sales Dollars ([X.sub.2]) | December ([X.sub.3]) | Supplies Expense (Y) |
|---|---|---|---|---|
| Jun-12 | 354 | $2,205 | 0 | $2,745 |
| Jul-12 | 436 | 2,444 | 0 | 3,200 |
| Aug-12 | 525 | 2,878 | 0 | 3,232 |
| Sep-12 | 145 | 1,745 | 0 | 2,199 |
| Oct-12 | 199 | 1,615 | 0 | 2,321 |
| Nov-12 | 543 | 2,152 | 0 | 3,432 |
| Dec-12 | 1,189 | 3,829 | 1 | 3,957 |
| Jan-13 | 212 | 1,999 | 0 | 2,310 |
| Feb-13 | 284 | 2,190 | 0 | 2,573 |
| Mar-13 | 246 | 2,045 | 0 | 2,487 |
| Apr-13 | 278 | 1,918 | 0 | 2,484 |
| May-13 | 498 | 3,111 | 0 | 3,384 |
| Jun-13 | 424 | 2,850 | 0 | 2,945 |
| Jul-13 | 312 | 2,265 | 0 | 2,758 |
| Aug-13 | 485 | 2,601 | 0 | 3,394 |
| Sep-13 | 188 | 1,602 | 0 | 2,254 |
| Oct-13 | 276 | 2,232 | 0 | 2,763 |
| Nov-13 | 489 | 2,622 | 0 | 3,245 |
| Dec-13 | 1,045 | 3,645 | 1 | 4,212 |
| Jan-14 | 104 | 1,423 | 0 | 2,103 |
| Feb-14 | 167 | 2,045 | 0 | 2,056 |
| Mar-14 | 298 | 2,301 | 0 | 3,874 |
| Apr-14 | 398 | 2,345 | 0 | 2,784 |
| May-14 | 187 | 1,815 | 0 | 2,345 |
| Jun-14 | 334 | 2,142 | 0 | 2,912 |
| Jul-14 | 264 | 1,837 | 0 | 2,093 |
| Aug-14 | 333 | 2,111 | 0 | 2,873 |
| Sep-14 | 143 | 2,003 | 0 | 2,563 |
| Oct-14 | 245 | 1,857 | 0 | 2,384 |
| Nov-14 | 232 | 2,189 | 0 | 2,476 |
| Dec-14 | 1,122 | 2,999 | 1 | 3,364 |

Figure 2: Projected Information for Each Month in 2015

| Date | Sales Units ([X.sub.1]) | Sales Dollars ([X.sub.2]) | December ([X.sub.3]) | Supplies Expense ([??]) |
|---|---|---|---|---|
| Jan-15 | 180 | 1,600 | 0 | |
| Feb-15 | 230 | 2,000 | 0 | |
| Mar-15 | 190 | 1,900 | 0 | |
| Apr-15 | 450 | 2,400 | 0 | |
| May-15 | 350 | 2,300 | 0 | |
| Jun-15 | 350 | 2,300 | 0 | |
| Jul-15 | 450 | 2,500 | 0 | |
| Aug-15 | 550 | 3,000 | 0 | |
| Sep-15 | 300 | 2,500 | 0 | |
| Oct-15 | 300 | 2,500 | 0 | |
| Nov-15 | 450 | 3,200 | 0 | |
| Dec-15 | 950 | 3,900 | 1 | |

Note: Columns B and C present estimated monthly sales information in units and dollars, respectively, for 2015. Column D presents December ([X.sub.3]) as a dummy variable, which takes a value equal to 1 for each December and 0 otherwise. The highlighted Column E represents the predicted supplies expense ([??]) to be projected from the regression output for each month reported in Column A.
Figure 4: Output from Applying Multiple Regression Routine

SUMMARY OUTPUT

Regression Statistics

| Multiple R | 88% |
| R Square | 77% |
| Adjusted R Square | 74% |
| Standard Error | 290.41 |
| Observations | 31 |

ANOVA

| | df | SS | MS | F | Significance F |
|---|---|---|---|---|---|
| Regression | 3 | 7,641,294 | 2,547,098 | 30.201 | 8.91194E-09 |
| Residual | 27 | 2,277,158 | 84,339.2 | | |
| Total | 30 | 9,918,452 | | | |

| | Coefficients | Standard Error | t Stat | P-value | Lower 95% | Upper 95% |
|---|---|---|---|---|---|---|
| Intercept | $1,160.05 | 336.88 | 3.44 | 0.002 | 468.842 | 1,851.263 |
| Sales Units ([X.sub.1]) | $1.74 | 0.72 | 2.42 | 0.023 | 0.263 | 3.212 |
| Sales Dollars ([X.sub.2]) | $0.48 | 0.22 | 2.15 | 0.041 | 0.021 | 0.929 |
| December ([X.sub.3]) | $(918.04) | 431.46 | -2.13 | 0.043 | -1,803.323 | -32.751 |

Figure 5: Residual Output from Applying Multiple Regression Routine

RESIDUAL OUTPUT

| Observation | Predicted Supplies Expense (Y) | Residuals |
|---|---|---|
| 1 | $2,822.83 | $(77.83) |
| 2 | $3,078.86 | $121.14 |
| 3 | $3,439.71 | $(207.71) |
| 4 | $2,241.15 | $(42.15) |
| 5 | $2,273.19 | $47.81 |
| 6 | $3,126.00 | $306.00 |
| 7 | $4,127.13 | $(170.13) |
| 8 | $2,478.25 | $(168.25) |
| 9 | $2,694.10 | $(121.10) |
| 10 | $2,559.18 | $(72.18) |
| 11 | $2,554.42 | $(70.42) |
| 12 | $3,503.52 | $(119.52) |
| 13 | $3,250.94 | $(305.94) |
| 14 | $2,778.38 | $(20.38) |
| 15 | $3,238.59 | $155.41 |
| 16 | $2,247.91 | $6.09 |
| 17 | $2,700.15 | $62.85 |
| 18 | $3,255.52 | $(10.52) |
| 19 | $3,789.53 | $422.47 |
| 20 | $2,016.92 | $86.08 |
| 21 | $2,421.93 | $(365.93) |
| 22 | $2,771.16 | $1,102.84 |
| 23 | $2,965.80 | $(181.80) |
| 24 | $2,347.38 | $(2.38) |
| 25 | $2,758.15 | $153.85 |
| 26 | $2,491.61 | $(398.61) |
| 27 | $2,741.68 | $131.32 |
| 28 | $2,360.28 | $202.72 |
| 29 | $2,468.10 | $(84.10) |
| 30 | $2,603.28 | $(127.28) |
| 31 | $3,616.33 | $(252.33) |

Figure 6: Predicting 2015 Monthly Supplies Expense Using Multiple Regression Output

| Date | Sales Units ([X.sub.1]) | Sales Dollars ([X.sub.2]) | December ([X.sub.3]) | Predicted Supplies Expense ([??]) | Lower 95% | Upper 95% |
|---|---|---|---|---|---|---|
| Jan-15 | 180 | 1,600 | 0 | $2,233 | 1,637 | 2,829 |
| Feb-15 | 230 | 2,000 | 0 | $2,510 | 1,914 | 3,106 |
| Mar-15 | 190 | 1,900 | 0 | $2,393 | 1,797 | 2,989 |
| Apr-15 | 450 | 2,400 | 0 | $3,082 | 2,486 | 3,678 |
| May-15 | 350 | 2,300 | 0 | $2,861 | 2,265 | 3,457 |
| Jun-15 | 350 | 2,300 | 0 | $2,861 | 2,265 | 3,457 |
| Jul-15 | 450 | 2,500 | 0 | $3,130 | 2,534 | 3,726 |
| Aug-15 | 550 | 3,000 | 0 | $3,541 | 2,945 | 4,137 |
| Sep-15 | 300 | 2,500 | 0 | $2,869 | 2,273 | 3,465 |
| Oct-15 | 300 | 2,500 | 0 | $2,869 | 2,273 | 3,465 |
| Nov-15 | 450 | 3,200 | 0 | $3,462 | 2,867 | 4,058 |
| Dec-15 | 950 | 3,900 | 1 | $3,746 | 3,150 | 4,342 |

Coefficients (from Figure 4): Intercept $1,160.05; Sales Units ([X.sub.1]) $1.74; Sales Dollars ([X.sub.2]) $0.48; December ([X.sub.3]) -$918.04

Note: Cells B2:D13 report the estimated 2015 monthly information for the three independent variables in the estimated cost function. The partial regression output is taken from Figure 4. Column E uses the coefficients reported in cells E17:E20 and the estimated independent variable amounts in rows 2 through 13 to calculate the Predicted Supplies Expense ([??]) for each of the months reported in Column A. Columns G and H report the 95% confidence interval estimate for each Predicted Supplies Expense ([??]).
Figure 8: Output from Running the Regression Routine on the Log-Converted Data

SUMMARY OUTPUT

Regression Statistics

| Multiple R | 0.999 |
| R Square | 0.998 |
| Adjusted R Square | 0.997 |
| Standard Error | 0.013 |
| Observations | 14 |

ANOVA

| | df | SS | MS | F | Significance F |
|---|---|---|---|---|---|
| Regression | 1 | 0.7915 | 0.7915 | 4,935.9 | < 0.001 |
| Residual | 12 | 0.0019 | 0.0002 | | |
| Total | 13 | 0.7934 | | | |

| | Coefficients | Standard Error | t Stat | P-value | Lower 95% | Upper 95% |
|---|---|---|---|---|---|---|
| Intercept | 3.357 | 0.009 | 379.9 | < 0.001 | 3.338 | 3.376 |
| Ln(X) | -0.319 | 0.005 | -70.3 | < 0.001 | -0.329 | -0.309 |

Figure 11: LINEST Output

Values (cells K2:N6):

| -918.04 | 0.47518 | 1.7373 | 1,160.05 |
| 431.462 | 0.22135 | 0.71875 | 336.875 |
| 0.77041 | 290.412 | #N/A | #N/A |
| 30.2006 | 27 | #N/A | #N/A |
| 7,641,294 | 2,277,158 | #N/A | #N/A |

Corresponding labels (cells P2:S6):

| [b.sub.3] | [b.sub.2] | [b.sub.1] | a |
| Std Error [X.sub.3] | Std Error [X.sub.2] | Std Error [X.sub.1] | Std [Error.sub.a] |
| [R.sup.2] | Std Error Reg | | |
| F | df residual | | |
| RSS | ESS | | |

Author: David E. Stout and Jeremy T. Schwartz

Publication: Management Accounting Quarterly


Date: January 1, 2017

