# An econometric perspective on differences of proportions

Most elementary statistics textbooks offer methods both to estimate confidence intervals and to test hypotheses for statistics of interest, and in most cases the same estimated variance is used for both. A notable exception is found in differences of proportions (DP) [Mann, Introductory Statistics, 2004]. For confidence intervals, the recommended estimator for the variance of the difference of proportions is

$$\operatorname{var}(\hat{p}_2 - \hat{p}_1) = \frac{\hat{p}_1\hat{q}_1}{n_1} + \frac{\hat{p}_2\hat{q}_2}{n_2} \tag{1}$$

where $\hat{p}_1$ is the sample estimate from population 1, $\hat{p}_2$ is the sample estimate from population 2, and $\hat{q}_i = 1 - \hat{p}_i$. The corresponding sample sizes are $n_1$ and $n_2$, respectively.

For hypothesis testing, the recommended estimator is

$$\operatorname{var}(\hat{p}_2 - \hat{p}_1) = \bar{p}\,\bar{q}\left(\frac{1}{n_1} + \frac{1}{n_2}\right) \tag{2}$$

where $\bar{p}$ is the sample-size-weighted average of $\hat{p}_1$ and $\hat{p}_2$ (the pooled sample proportion), and $\bar{q} = 1 - \bar{p}$. However, by formulating the DP problem as a linear regression and appealing to results from econometrics, it is argued below that equation (1) is appropriate for both hypothesis testing and confidence intervals.
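The gap between the two textbook estimators can be seen with a small calculation. The sketch below assumes equation (1) is the usual unpooled form, $\hat{p}_1\hat{q}_1/n_1 + \hat{p}_2\hat{q}_2/n_2$, and that $\bar{p}$ is the pooled proportion of successes. The function name `dp_variances` and the counts are hypothetical, chosen only for illustration.

```python
def dp_variances(x1, n1, x2, n2):
    """Return (unpooled, pooled) variance estimates for p2_hat - p1_hat.

    x1, x2 are success counts; n1, n2 are sample sizes.
    """
    p1, p2 = x1 / n1, x2 / n2
    # Equation (1), assumed here to be the unpooled form.
    unpooled = p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2
    # Equation (2): pooled proportion weighted by sample size.
    p_bar = (x1 + x2) / (n1 + n2)
    pooled = p_bar * (1 - p_bar) * (1 / n1 + 1 / n2)
    return unpooled, pooled


# Hypothetical data: 30 successes out of 100, and 90 out of 200.
unpooled, pooled = dp_variances(x1=30, n1=100, x2=90, n2=200)
print(f"equation (1): {unpooled:.7f}")
print(f"equation (2): {pooled:.7f}")
```

When the two sample proportions happen to coincide, the two formulas return the same value; they diverge only as $\hat{p}_1$ and $\hat{p}_2$ move apart.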

The linear DP regression model can be shown to be

$$Y = p_1 + (p_2 - p_1)\,d + \varepsilon \tag{3}$$

where both $Y$ and $d$ are binary. An observation with $d = 0$ is assumed to come from population 1, and an observation with $d = 1$ from population 2. Since $Y$ is binary, $\varepsilon$ is heteroskedastic, as shown by Pindyck and Rubinfeld [Econometric Models and Economic Forecasts, 1998]. Barring problems with sample estimation, weighted least squares (WLS) is the natural approach to estimation. It is assumed in what follows that $n_1$ and $n_2$ are sufficiently large to provide a joint normal distribution of the sampling statistics.
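As an illustration of the regression formulation, the following sketch fits equation (3) by ordinary least squares using the closed-form simple-regression formulas. The data are hypothetical and `ols_dummy` is an illustrative name. With a single binary regressor, the fitted intercept is the sample proportion in group 1 and the fitted slope is the difference of sample proportions.

```python
def ols_dummy(y, d):
    """OLS fit of y = a + b*d for a binary regressor d; returns (a, b)."""
    n = len(y)
    d_bar = sum(d) / n
    y_bar = sum(y) / n
    sxy = sum((di - d_bar) * (yi - y_bar) for di, yi in zip(d, y))
    sxx = sum((di - d_bar) ** 2 for di in d)
    slope = sxy / sxx
    intercept = y_bar - slope * d_bar
    return intercept, slope


# Hypothetical samples: 30/100 successes in population 1 (d = 0),
# 90/200 successes in population 2 (d = 1).
y = [1] * 30 + [0] * 70 + [1] * 90 + [0] * 110
d = [0] * 100 + [1] * 200

intercept, slope = ols_dummy(y, d)
print(intercept, slope)  # estimates of p1 and p2 - p1
```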

Since a binary outcome with success probability $p$ has variance $p(1 - p) = pq$, and $d$ is binary, the error variance is $\sigma_i^2 = p_1 q_1$ for $i = 1, \ldots, n_1$ and $\sigma_i^2 = p_2 q_2$ for $i = n_1 + 1, \ldots, n_1 + n_2$. Thus, the WLS weights are $1/\sqrt{p_1 q_1}$ and $1/\sqrt{p_2 q_2}$. Comparing the two group variances, it is clear that there is no heteroskedasticity when $p_1 = p_2$; by the contrapositive, if there is heteroskedasticity, we must have $p_1 \neq p_2$.

Thus, if $p_1 \neq p_2$, standard econometrics demands that this be considered in developing an estimator for $p_2 - p_1$. It follows that the WLS estimator for the variance of $\hat{p}_2 - \hat{p}_1$ is

$$\operatorname{var}(\hat{p}_2 - \hat{p}_1) = \frac{p_1 q_1}{n_1} + \frac{p_2 q_2}{n_2}. \tag{4}$$
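The link between the weighted regression and equation (4) can be checked numerically. The sketch below is a feasible version under stated assumptions: the unknown $p_1 q_1$ and $p_2 q_2$ are replaced by their sample analogues, each observation is weighted by the inverse of its estimated error variance (equivalent to scaling by $1/\sqrt{\hat{p}\hat{q}}$), and the slope variance is read from the inverse of the weighted cross-product matrix. The function name `fgls_dummy` and the data are hypothetical, and both sample proportions must lie strictly between 0 and 1 for the weights to exist.

```python
def fgls_dummy(y, d):
    """Feasible WLS for y = a + b*d with group-specific error variances.

    Returns (slope, slope_variance). Assumes 0 < p_hat < 1 in each group.
    """
    # Step 1: group proportions give the estimated error variances.
    n1, n2 = d.count(0), d.count(1)
    p1 = sum(yi for yi, di in zip(y, d) if di == 0) / n1
    p2 = sum(yi for yi, di in zip(y, d) if di == 1) / n2
    var = {0: p1 * (1 - p1), 1: p2 * (1 - p2)}

    # Step 2: accumulate the weighted normal equations (note d*d == d).
    s_w = s_wd = s_wy = s_wdy = 0.0
    for yi, di in zip(y, d):
        w = 1.0 / var[di]
        s_w += w
        s_wd += w * di
        s_wy += w * yi
        s_wdy += w * di * yi

    # Step 3: invert the 2x2 matrix [[s_w, s_wd], [s_wd, s_wd]] by hand.
    det = s_w * s_wd - s_wd ** 2
    slope = (s_w * s_wdy - s_wd * s_wy) / det
    slope_var = s_w / det
    return slope, slope_var


# Hypothetical samples: 30/100 successes with d = 0, 90/200 with d = 1.
y = [1] * 30 + [0] * 70 + [1] * 90 + [0] * 110
d = [0] * 100 + [1] * 200

slope, slope_var = fgls_dummy(y, d)
direct = 0.30 * 0.70 / 100 + 0.45 * 0.55 / 200  # sample analogue of (4)
print(slope, slope_var, direct)
```

In this example the slope variance from the weighted regression agrees with the sample analogue of equation (4) up to floating-point rounding.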

We must now consider the properties of sample estimators for $p_2 - p_1$ and $\operatorname{var}(\hat{p}_2 - \hat{p}_1)$. One difficulty with estimates of binary models, as explained by Pindyck and Rubinfeld [Econometric Models and Economic Forecasts, 1998], is that predicted values can lie outside the $[0, 1]$ interval, in which case WLS variances will not be efficient. In the case of the DP model, however, $\hat{p}_1$ and $\hat{p}_2$ are the predicted values, and these must always lie in $[0, 1]$.

There are at least three estimators for $p_2 - p_1$: ordinary least squares (OLS), feasible generalized least squares (FGLS), and direct computation from $\hat{p}_1$ and $\hat{p}_2$. It is easily shown that all these estimators give unbiased, consistent estimates. Efficiency in finite samples, however, is another matter.

Of our unbiased estimators, only FGLS has errors that are both serially independent and homoskedastic. Thus, only FGLS is a Gauss-Markov estimator [Davidson and MacKinnon, Econometric Theory and Methods, 2004]. Since equation (1) is the FGLS estimator, the econometric approach to DP supports equation (1) over equation (2). (JEL C10)

J. LLOYD BLACKWELL, University of North Dakota--U.S.A.