Adjusting comparable sales using multiple regression analysis - the need for segmentation.

Multiple regression analysis (MRA) is becoming a popular tool for appraising single-family residential properties. It can be used as a guide to market value estimation or, perhaps more importantly, to calculate market adjustments for different units of comparison in the market sales comparison approach to value. [1] When using MRA, appraisers should be aware of some sources of statistical distortion. Problems of nonlinearity and multicollinearity have been discussed in articles that previously appeared in this journal. [2] This article shows how the problems associated with heteroscedasticity [3] may be minimized by segmenting the sale data by price.

HETEROSCEDASTICITY IN MRA ANALYSIS

A classical linear regression model, for which ordinary least squares (OLS) is the optimal estimator, is based on a number of assumptions. One of them is that the residual errors have the same variance across the sample. If this assumption is violated, a case of heteroscedasticity occurs: the OLS estimator remains unbiased, but the associated standard errors and t-values become biased. In other words, interval estimation and hypothesis testing based on OLS estimates can no longer be trusted in the presence of heteroscedasticity.
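In symbols, the assumption and its violation can be sketched as follows. The notation ($y_i$, $x_i$, $\beta$, $\varepsilon_i$, $\Omega$) is introduced here for illustration only and is not taken from the article.

```latex
% Linear model for sale i: price y_i, characteristics x_i
y_i = x_i'\beta + \varepsilon_i, \qquad E[\varepsilon_i \mid x_i] = 0

% Classical assumption (homoscedasticity): one variance for every sale
\operatorname{Var}(\varepsilon_i \mid x_i) = \sigma^2 \quad \text{for all } i

% Heteroscedasticity: the variance differs across sales
\operatorname{Var}(\varepsilon_i \mid x_i) = \sigma_i^2

% Covariance matrix of the OLS estimator, with \Omega = \operatorname{diag}(\sigma_1^2, \dots, \sigma_n^2)
\operatorname{Var}(\hat\beta \mid X) = (X'X)^{-1} X' \Omega X \, (X'X)^{-1}

% The usual formula \sigma^2 (X'X)^{-1} is recovered only when \Omega = \sigma^2 I
```

The point estimate $\hat\beta$ is unaffected, but the conventional standard errors are computed from $\sigma^2 (X'X)^{-1}$, which is the wrong matrix whenever the $\sigma_i^2$ differ.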

Heteroscedasticity is not a significant problem for MRA if it is still possible to assume that the sample data originate from the same underlying population. To avoid the problem of unreliable standard errors and t-values, a number of adjustments can be made to the regression equation in this case. These range from changing the functional form to using weighted least squares. In particular, a transformation of the model variables to a natural logarithm can sometimes help reduce or eliminate heteroscedasticity. A nonlinear functional form, however, complicates comparable sales analysis because the implicit prices of housing characteristics will now vary with the house value, rising or falling systematically over the range of house prices.
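A short derivation may make the last point concrete. It assumes, for illustration, a model in which the log of price depends on the log of living area (sqf) and on a fireplace dummy (fire); the symbols $\alpha$, $\beta$, and $\gamma$ are hypothetical coefficient names.

```latex
% Log-linear price function (illustrative two-regressor case)
\ln p = \alpha + \beta \ln \mathit{sqf} + \gamma \, \mathit{fire} + \varepsilon

% Implicit price of one additional square foot
\frac{\partial p}{\partial \mathit{sqf}} = \beta \, \frac{p}{\mathit{sqf}}

% Implicit price of a fireplace, relative to an otherwise identical house priced at p_0
p_1 - p_0 = p_0 \left( e^{\gamma} - 1 \right)
```

Both expressions contain the price level itself, so a single estimated coefficient implies a different dollar adjustment at each point in the price range.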

Heteroscedasticity cannot always be eliminated by manipulating the functional form. However, there are a number of alternative ways to deal with this problem. First, one can resort to a heteroscedasticity-resistant estimator. [4] This method tries to squeeze as much information as possible out of the OLS estimator. It does not change the point estimates but adjusts the estimates of the standard errors appropriately so they remain consistent in the presence of heteroscedasticity. Second, if the occurrence of heteroscedasticity can be linked to a particular variable or combination of variables, weighted least squares can be used. If no obvious variable or set of variables causes heteroscedasticity, however, this method breaks down.
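The first alternative can be sketched in a few lines of Python. This is a minimal illustration of the basic White covariance estimator (often labeled HC0), not the authors' code; the function name, the simulated data, and the single regressor are all assumptions made for the example.

```python
import numpy as np

def ols_with_white_se(X, y):
    """OLS estimates with conventional and White (HC0) standard errors.

    X must already contain a column of ones for the intercept.
    """
    n, k = X.shape
    xtx_inv = np.linalg.inv(X.T @ X)
    beta = xtx_inv @ X.T @ y
    resid = y - X @ beta
    # Conventional covariance s^2 (X'X)^-1, valid only under equal variance.
    s2 = resid @ resid / (n - k)
    se_ols = np.sqrt(np.diag(s2 * xtx_inv))
    # White covariance (X'X)^-1 X' diag(e_i^2) X (X'X)^-1.
    meat = (X * resid[:, None] ** 2).T @ X
    se_white = np.sqrt(np.diag(xtx_inv @ meat @ xtx_inv))
    return beta, se_ols, se_white

# Hypothetical data: the error spread grows with living area.
rng = np.random.default_rng(0)
sqf = rng.uniform(1.0, 4.0, size=500)          # living area in 1,000 sq ft
X = np.column_stack([np.ones_like(sqf), sqf])
y = 0.2 + 0.5 * sqf + rng.normal(0.0, 0.1 * sqf**2)

beta, se_ols, se_white = ols_with_white_se(X, y)
print("coefficients:", beta)
print("conventional s.e.:", se_ols)
print("White s.e.:", se_white)
```

The coefficients are identical under both approaches; only the standard errors, and therefore the t-values, differ.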

These methods only make sense if the data set has been drawn from one and only one underlying population. However, data sets used for comparable sales analysis routinely violate this condition. For these data sets, a more reasonable assumption is likely to be that the sample data derive from several underlying populations, each with different variance and regression coefficients. In this case, it would be meaningless to run one regression on the complete sample. The regression results would be impossible to interpret because they would be averages of the different underlying population parameters with unknown weights. When different variances exist among a number of subsamples, the proper procedure is to estimate a separate regression for each subsample. A dummy variable model is an appropriate alternative only when the various underlying populations have a common variance but differ in the values of the regression coefficients.
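The three modeling choices discussed here can be set side by side. The notation below (segment index $s$, segment dummies $D_{is}$) is an illustrative assumption and does not appear in the article.

```latex
% Single pooled regression: one population, one coefficient vector, one variance
y_i = x_i'\beta + \varepsilon_i, \qquad \operatorname{Var}(\varepsilon_i) = \sigma^2

% Dummy variable model: coefficients differ by segment, variance is common
y_i = \sum_{s=1}^{S} D_{is} \, x_i'\beta_s + \varepsilon_i, \qquad \operatorname{Var}(\varepsilon_i) = \sigma^2

% Separate regressions: both coefficients and variance differ by segment
y_i = x_i'\beta_s + \varepsilon_i, \qquad \operatorname{Var}(\varepsilon_i) = \sigma_s^2, \qquad i \in \text{segment } s
```

Here $D_{is}$ equals 1 if sale $i$ belongs to segment $s$ and 0 otherwise. Only the third formulation allows each segment its own error variance $\sigma_s^2$.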

In light of this discussion, two points should be established to justify running separate regressions on subsamples of the given data set. First, it is necessary to show that there is no heteroscedasticity within a proposed subsample. A convenient way to do this is through the Breusch-Pagan test or the White test. [5] Because these tests have very broad alternative hypotheses, they are most useful when it is unclear whether there is a case of heteroscedasticity or what causes the heteroscedasticity. Second, the regression variances must be significantly different among the various subsamples considered. This difference is best established using a Goldfeld-Quandt test, [6] which can be used to compare the variances of subsamples after the sample data are ordered with the variable thought to be responsible for the heteroscedasticity. [7]
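The first check can be sketched as follows. The article does not say which variant of the Breusch-Pagan test was used; this example assumes the studentized form usually attributed to Koenker, in which the statistic is n times the R-squared from regressing the squared OLS residuals on the regressors. The function names and the simulated data are hypothetical.

```python
import numpy as np

def ols_residuals(X, y):
    """Residuals from an OLS fit of y on X (X includes a constant column)."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return y - X @ beta

def breusch_pagan_lm(X, y):
    """n * R^2 from regressing squared residuals on X (studentized variant).

    Under equal variance the statistic is approximately chi-square with
    (number of columns of X minus 1) degrees of freedom.
    """
    e2 = ols_residuals(X, y) ** 2
    u = ols_residuals(X, e2)
    r2 = 1.0 - (u @ u) / np.sum((e2 - e2.mean()) ** 2)
    return len(y) * r2

CHI2_1_CRIT_5PCT = 3.841   # 5% critical value, chi-square with 1 degree of freedom

# Hypothetical data with one regressor, once with constant and once with rising spread.
rng = np.random.default_rng(0)
sqf = rng.uniform(1.0, 4.0, size=500)
X = np.column_stack([np.ones_like(sqf), sqf])
y_homo = 0.2 + 0.5 * sqf + rng.normal(0.0, 0.5, size=500)
y_hetero = 0.2 + 0.5 * sqf + rng.normal(0.0, 0.1 * sqf**2)

bp_homo = breusch_pagan_lm(X, y_homo)
bp_hetero = breusch_pagan_lm(X, y_hetero)
print("equal variance data:", bp_homo)
print("rising variance data:", bp_hetero, "reject:", bp_hetero > CHI2_1_CRIT_5PCT)
```

A statistic above the critical value rejects equal variance within the subsample, which signals that the proposed segment is still too broad.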

One final point should be discussed before the results are presented. This relates to the difficulty of actually finding the various subsamples that generate homoscedastic error terms. Based on the experience of appraisers, location would appear to be the key variable to segment markets. However, the authors' statistical efforts in this direction did not prove successful. None of the regressions by location resulted in homoscedastic error terms, regardless of whether separate regressions were run by location or whether dummy variables were included for different locations in one overall regression. A likely reason for the failure of location to solve the problem of heteroscedasticity in this case lies in the fact that the locations of the multiple listing files used are not detailed enough. In particular, most of the locations contain houses with both low and high prices. The implicit values of particular housing characteristics, however, are unlikely to be comparable across different price ranges. Location would be a more useful indicator if it were specified at the level of a housing subdivision, because at that level of disaggregation, housing values generally lie in a more narrowly defined band. All is not lost, however, if there is little useful information at the level of subdivisions. If location is strongly correlated with house price, as is suggested here, it is possible to find market segments either by (subdivision) location or by price. In our study, we tried housing price as the segmentation variable.

Choosing house price as the segmentation variable, however, is only a first step. The second step is to actually identify the prices at which market segments begin and end. It should be pointed out in this context that there is no well-known econometric technique available to do this. However, a heuristic method that was useful for a sample of new home sales consists of the following steps.

* Order the sample according to house value from lowest to highest;

* Starting from one end, split the sample at a housing price that would suggest the beginning of the next higher priced or next lower priced market segment;

* Run a regression on this first market segment and check for heteroscedasticity using the Breusch-Pagan test or White test; and

* Try different sample splits around this point to maximize the size of the market segment without causing the test statistic for the heteroscedasticity test to move into the rejection region for a given level of statistical significance.
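The steps above can be sketched in Python. This is one possible reading of the heuristic, not the authors' implementation: the candidate split points are supplied by the analyst, the largest candidate that passes the test is kept, and the test statistic is the studentized Breusch-Pagan form (an assumption). All function names and the simulated data are hypothetical.

```python
import numpy as np

def bp_statistic(X, y):
    """n * R^2 form of the Breusch-Pagan statistic (X includes a constant)."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    e2 = (y - X @ beta) ** 2
    gamma, *_ = np.linalg.lstsq(X, e2, rcond=None)
    u = e2 - X @ gamma
    return len(y) * (1.0 - (u @ u) / np.sum((e2 - e2.mean()) ** 2))

def largest_passing_split(X, y, start, candidate_ends, crit, stat_fn=bp_statistic):
    """Largest end index for which rows start..end-1 do not reject equal variance.

    X and y must already be sorted by sale price. Returns None if every
    candidate segment is rejected.
    """
    best = None
    for end in sorted(candidate_ends):
        if stat_fn(X[start:end], y[start:end]) < crit:
            best = end
    return best

# Hypothetical data: log price depends on log living area; smaller homes are noisier.
rng = np.random.default_rng(1)
ln_sqf = rng.uniform(-0.2, 1.2, size=400)
noise_sd = np.where(ln_sqf < 0.4, 0.15, 0.05)
ln_p = -0.6 + 0.8 * ln_sqf + rng.normal(0.0, noise_sd)

order = np.argsort(ln_p)                       # order the sample by price
X = np.column_stack([np.ones(400), ln_sqf])[order]
y = ln_p[order]

CHI2_1_CRIT_5PCT = 3.841                       # chi-square(1), 5% level
first_end = largest_passing_split(X, y, 0, range(40, 401, 20), CHI2_1_CRIT_5PCT)
print("first segment covers observations 1 to", first_end)
```

The next segment would be found by calling the function again with `start` set to the end of the previous segment. Passing the test statistic as `stat_fn` makes it easy to substitute the White test or another variant.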

REGRESSION RESULTS

The procedure outlined at the end of the preceding section is applied to a sample of 801 new home sales from a suburban Tennessee county for the year 1988. For all regressions, a simple price function is specified of the form

p = f(age, sqf, stor, bb, gar, sid, fire, sew, fin)

where

* p = Selling price

* age = Age of the house in years

* sqf = Square footage of living area

* stor = Number of stories

* bb = Baths per bedroom

* gar = Garage size

* sid = Type of siding (0 = vinyl, 1 = brick)

* fire = Dummy variable for a fireplace

* sew = Dummy variable for the connection to city sewer

* fin = Dummy variable for the type of financing (0 = FHA or VA, 1 = conventional)

Table 1 provides the parameter estimates for both the complete data set of 801 observations and the four segments identified according to this procedure. In line with previous research, all regression equations are specified in log-linear form to the extent possible. [8] The model estimated over the complete sample is clearly rejected based on the high value of both the Breusch-Pagan test statistic for heteroscedasticity and the Jarque-Bera test [9] for normality of regression residuals. By contrast, none of the four regressions that are estimated over one of four market segments has to be rejected on this basis.
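Written out, the specification behind Table 1 plausibly takes the following form. This is a reconstruction under stated assumptions: the dependent variable is taken to be the natural log of price, the sqf, stor, and bb rows of Table 1 are read as entering in natural logs, and the scaling of p and sqf follows the notes to the table. The coefficient symbols $\beta_0, \dots, \beta_9$ are introduced here.

```latex
\ln\!\left(\frac{p}{100{,}000}\right)
  = \beta_0
  + \beta_1 \, \mathit{age}
  + \beta_2 \ln\!\left(\frac{\mathit{sqf}}{1000}\right)
  + \beta_3 \ln \mathit{stor}
  + \beta_4 \ln \mathit{bb}
  + \beta_5 \, \mathit{gar}
  + \beta_6 \, \mathit{sid}
  + \beta_7 \, \mathit{fire}
  + \beta_8 \, \mathit{sew}
  + \beta_9 \, \mathit{fin}
  + \varepsilon
```

The nine regressors match the nine degrees of freedom reported for the Breusch-Pagan test. The dummy variables and age cannot be logged because they take the value zero, which is consistent with the phrase "to the extent possible."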

Table 2 illustrates the results of Goldfeld-Quandt tests for the various segments. The first three statistics ("1 versus 2" to "3 versus 4") provide tests for equal variance of adjoining segments, while the remaining two test equal variance for segments that are farther apart. In all cases, the test statistic either reaches or exceeds the critical value for the 5% level of significance. The evidence shown in Table 2 demonstrates that the variances are significantly different among the four segments.
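The arithmetic behind a Goldfeld-Quandt comparison can be illustrated with the standard errors of regression (SER) and observation ranges reported in Table 1. The sketch assumes that each SER is the square root of the residual sum of squares divided by its degrees of freedom, and that each regression estimates ten parameters (a constant plus nine regressors). The resulting ratios are computed here for illustration and are not necessarily the figures in Table 2.

```python
def goldfeld_quandt_f(ser_a, n_a, ser_b, n_b, k):
    """Ratio of the larger to the smaller residual variance, with its
    numerator and denominator degrees of freedom."""
    var_a, var_b = ser_a**2, ser_b**2
    df_a, df_b = n_a - k, n_b - k
    if var_a >= var_b:
        return var_a / var_b, df_a, df_b
    return var_b / var_a, df_b, df_a

# (SER, number of observations) per segment, taken from Table 1.
segments = {
    1: (0.0574, 90),     # observations 1-90
    2: (0.0499, 250),    # observations 91-340
    3: (0.0591, 312),    # observations 341-652
    4: (0.1103, 149),    # observations 653-801
}
K = 10  # assumed parameter count: constant plus nine regressors

for a, b in [(1, 2), (2, 3), (3, 4)]:
    f, df_num, df_den = goldfeld_quandt_f(*segments[a], *segments[b], k=K)
    print(f"{a} versus {b}: F({df_num}, {df_den}) = {f:.2f}")
```

Each ratio is compared with the critical value of the F distribution for the stated degrees of freedom; a ratio at or above that value rejects equal variance between the two segments.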

In interpreting the results of Table 1, large differences appear among the coefficients of different market segments. Not only are there statistically significant differences in the size of the coefficients, but signs also reverse from one market segment to another. The economic significance of the coefficient variation among the segments is illustrated in Table 3. This table translates the coefficients of Table 1 into dollar values for the units of comparison, employing in each case mean values for a given market segment. [10] The table also provides the house price ranges that correspond to the chosen segments. A number of points are noteworthy. First, the implicit prices for the housing characteristics do not increase or decrease smoothly as the housing price increases or decreases, as would be the case if the regression estimated for the complete sample were evaluated at the same points as the four subsample equations. Second, not all implicit prices are significant in all segments. Third, some implicit prices even change their signs.

```
TABLE 1 Estimates for Complete Sample and Market Segments, Log-Linear Model

                                   Range of Market Segments
           Complete Sample    1-90       91-340     341-652    653-801
Constant   -.6956             -.5753     -.5094     -.4392     -.4556
           (-41.5)            (-27.1)    (-27.8)    (-17.9)    (-4.8)
age        -.0201             -.0180     -.0081     -.0065     -.0225
           (-3.7)             (-1.8)     (-1.7)     (-1.3)     (1.8)
ln sqf     .8647              .2121      .2147      .3612      .7200
           (39.2)             (3.6)      (5.1)      (12.5)     (13.3)
ln stor    -.0233             -.0701     -.0287     .0240      -.0524
           (-1.4)             (-2.3)     (-.17)     (1.7)      (-1.6)
ln bb      .0494              .1106      -.0100     .0169      .0733
           (1.6)              (3.5)      (-.3)      (.5)       (1.3)
gar        .0561              .0536      .0185      .0173      .0117
           (9.5)              (2.1)      (3.9)      (2.6)      (.4)
sid        .0171              -.0188     -.0036     .0198      .0163
           (1.9)              (-1.1)     (-.5)      (2.6)      (.5)
fire       .0425              .0163      .0289      .0385      .0781
           (4.1)              (1.0)      (4.1)      (2.7)      (1.4)
sew        .0039              -.0716     .0279      .0108      .0253
           (.5)               (-3.9)     (3.4)      (1.5)      (1.1)
fin        .0113              -.0018     -.0020     .0100      .0125
           (1.3)              (-.1)      (-.2)      (1.4)      (.4)
R²         .8563              .5494      .3457      .4928      .6713
SER        .1065              .0574      .0499      .0591      .1103
BP(9)      114.6              6.7        13.0       9.3        13.4
JB(2)      72.33              1.88       4.65       5.09       0.29

NOTES: P divided by 100,000, sqf divided by 1000. t-values are given in
parentheses. R² is the coefficient of multiple determination. SER stands
for the standard error of the regression. BP(9) denotes Breusch-Pagan's
χ² test for heteroscedasticity with 9 degrees of freedom. Its 5% critical
value is 16.92. JB(2) is Jarque-Bera's χ² test for normality of regression
residuals. Its 5% critical value is 5.99.
```
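The translation from log-linear coefficients to dollar values can be illustrated with a short calculation. The coefficients are the segment 1 estimates for living area and city sewer from Table 1; the mean price of $60,000 and mean living area of 1,100 square feet are hypothetical placeholders, because the segment means used in Table 3 are not reproduced here. The function names are also assumptions, and the dependent variable is assumed to be the log of price.

```python
import math

def implicit_price_log_regressor(beta, mean_price, mean_x):
    """Dollar value of one more unit of x when ln(price) depends on ln(x)."""
    return beta * mean_price / mean_x

def implicit_price_dummy(beta, mean_price):
    """Dollar change when a 0/1 variable switches on, for a house that
    would otherwise sell at mean_price."""
    return mean_price * (math.exp(beta) - 1.0)

# Segment 1 coefficients from Table 1.
BETA_LN_SQF = 0.2121
BETA_SEW = -0.0716

# Hypothetical segment means (not taken from the article).
MEAN_PRICE = 60_000.0
MEAN_SQF = 1_100.0

per_sqft = implicit_price_log_regressor(BETA_LN_SQF, MEAN_PRICE, MEAN_SQF)
sewer = implicit_price_dummy(BETA_SEW, MEAN_PRICE)
print(f"value per additional square foot: ${per_sqft:,.2f}")
print(f"value of city sewer connection:   ${sewer:,.0f}")
```

Because the scaling of price and living area in a log-log specification affects only the constant, the coefficient on living area can be used directly as an elasticity with unscaled means.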

These results are of practical importance for appraisers. Consider, for example, the sign change identified for the city sewer variable. A location within the city limits provides a number of amenities, such as a sewer system and garbage pick-up, but these amenities come at the cost of a sizable city tax. Tax payments add to the overall cost of home ownership, and at a low level of income, they can figure prominently in a household's budget. It is not surprising, then, to find that a location within the city limits lowers housing values at the lower end of the range, while it adds to the value of higher priced homes. For obvious reasons, the implicit prices derived from the overall model fail to reflect such market adjustments.

[Tables 2 and 3: table data omitted]

CONCLUSION

The consequences of heteroscedastic error terms for statistical appraisal analysis have so far not attracted much attention in the appraisal literature. This appears to be quite unjustified in light of this study's finding of a considerable degree of heteroscedasticity in standard comparable sales multiple regression analyses. Heteroscedastic errors can be damaging for a number of reasons. First, OLS regression will produce biased standard errors regardless of the underlying cause of heteroscedasticity. This alone will invalidate hypothesis tests on the implicit prices estimated for housing characteristics. Second, and more importantly, if the sample data can be perceived as coming from different underlying market segments, each one with unique variance and coefficients, then an OLS regression for the complete sample will be even less useful. It will generate implicit prices that are averages, with unknown weights, of the true implicit prices of the underlying market segments. Both the estimated standard errors of the implicit prices and their point estimates will be meaningless. An additional problem in this situation is that the standard statistical techniques for dealing with heteroscedasticity will no longer work. Prior to estimation, it is necessary to identify the underlying market segments. In this article, one way to deal with this difficult problem is demonstrated.

[1] Lloyd T. Murphy III, "Determining the Appropriate Equation in Multiple Regression Analysis," The Appraisal Journal (October 1989): 498-517; James D. Shilling, John D. Benjamin, and C. F. Sirmans, "Adjusting Comparable Sales for Floodplain Location," The Appraisal Journal (July 1985): 429-36.

[2] William N. Weirick and F. Jerry Ingram, "Functional Form Choice in Applied Real Estate Analysis," The Appraisal Journal (January 1990): 57-73; Dennis Bialaszewski and Bobby A. Newsome, "Adjusting Comparable Sales for Floodplain Location: The Case of Homewood, Alabama," The Appraisal Journal (January 1990): 114-119.

[3] Heteroscedasticity is also spelled as heteroskedasticity in the literature.

[4] Halbert White, "A Heteroscedasticity-Consistent Covariance Matrix Estimator and a Direct Test for Heteroscedasticity," Econometrica (May 1980): 817-838.

[5] Trevor S. Breusch and Adrian R. Pagan, "A Simple Test for Heteroskedasticity and Random Coefficient Variation," Econometrica (September 1979): 1287-1294. See also White, 817-838.

[6] Stephen Goldfeld and Richard E. Quandt, "Some Tests for Homoscedasticity," Journal of the American Statistical Association (1965): 539-547.

[7] Standard textbook treatments of heteroscedasticity can be found in William H. Greene, Econometric Analysis (New York: MacMillan, 1990) and Jack Johnston, Econometric Methods, 3d ed. (New York: McGraw-Hill, 1984).

[8] See Weirick and Ingram. The linear regression generated a Breusch-Pagan test statistic of 433.

[9] Carlos M. Jarque and Anil K. Bera, "A Test for Normality of Observations and Regression Residuals," International Statistical Review (1987): 163-172.

[10] See Weirick and Ingram.

Bobby A. Newsome, PhD, is professor of real estate in the Department of Economics and Finance at Middle Tennessee State University. Dr. Newsome received a BA in history and political science from Brigham Young University and an MA and PhD in business administration from the University of Georgia. He has published previously in The Appraisal Journal and contributes regularly to other publications in the real estate field.

Joachim Zietz, PhD, is a professor of economics in the Department of Economics and Finance at Middle Tennessee State University. He received his master's and doctorate degrees in economics from the University of Goettingen in Germany. Dr. Zietz has been a consultant to a number of international organizations, including the World Bank and the Organization for Economic Cooperation and Development (OECD).