# Measuring the Effect of Data Dependence on Quality Control Limits

vasilopa@stjohns.edu

Abstract

The existing methodology for analyzing data to test whether a given process is in or out of control, using an [bar.x] chart to test whether the mean value of the process is in control and an s chart to test whether the process variability is in control, assumes the existence of normality and independence in the data. When either independence and/or normality is not present, the use of the existing methodology (i.e. the use of [bar.x] and s charts) introduces large errors in the data analysis and renders conclusions dubious, especially if data dependence is very pronounced. This paper introduces a methodology for deriving Shewhart control limit factors for a generalized ARMA (p,q) model, and shows, through specific examples, the substantial effect that data dependence has on the classical quality control factors.

I. Introduction

Traditionally the assumption is made that data (financial, engineering, etc.) are independent and normally distributed. Under these circumstances data are analyzed according to the existing standard methodology (6). However, when the assumptions of independence and/or normality are violated, using the standard methodology to analyze such data produces large errors that can prove very costly. Violation of normality can be compensated for, to some extent, by increasing the subgroup size, or an appropriate data transformation may be used to make the data normal. This paper presents a generalized method for analyzing data, which may or may not be independent. Thus the standard methodology becomes a subset of the new methodology.

The Durbin-Watson d statistic will be used to determine whether the data are correlated or not. If correlation exists the proposed methodology should be used. This paper shows the procedure to be followed when the process under consideration can be described by an ARMA (p,q) model. All the special cases will be derived from this model by the appropriate choice of the Autoregressive and Moving Average parameters. Functional examples shown are for the ARMA (1,1), and the ARMA (2,0) = AR(2) models, and a numerical illustration of the proposed methodology is also presented, following the discussion of the problem and the introduction of the proposed methodology.

II. Discussion

Reference (8) presented modified control limits for maintaining control of a process, when the data were assumed to be dependent and following a second order autoregressive process (AR(2) model). The present article extends the results to allow data dependence via a generalized ARMA (p,q) model. The control limits on the mean of the process depend on the variance of the sample mean, and the control limits on the variability of the process depend on E(s) and Std(s), where E(s) is the expected value and Std(s) is the standard deviation of the sample standard deviation. It is shown in references (1) and (2) that the variance [sigma.sup.2]([bar.x]) of the sample mean [bar.x] of the underlying process [x.sub.t] = [beta]+[xi.sub.t] is given in terms of [gamma.sub.k] by:

$$\sigma^2(\bar{x}) = \frac{\gamma_0}{n} + \frac{2}{n^2}\sum_{k=1}^{n-1}(n-k)\,\gamma_k \qquad (1)$$
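As an illustrative sketch (ours, not part of the original paper), equation (1), in the standard form $\sigma^2(\bar{x}) = \gamma_0/n + (2/n^2)\sum_{k=1}^{n-1}(n-k)\gamma_k$, can be evaluated numerically for any autocovariance sequence; for white noise ([gamma.sub.k] = 0 for k [greater than or equal to] 1) it collapses to the familiar [gamma.sub.o]/n, while positive autocorrelation inflates the variance of the sample mean:

```python
def var_sample_mean(gamma, n):
    """Variance of the sample mean of n observations, per equation (1).

    gamma[k] is the autocovariance of the process at lag k (gamma[0]
    is the process variance); only lags 0..n-1 are needed.
    """
    return gamma[0] / n + (2.0 / n ** 2) * sum((n - k) * gamma[k] for k in range(1, n))

# White noise with variance 4: reduces to gamma_0 / n = 0.8
print(var_sample_mean([4.0, 0.0, 0.0, 0.0, 0.0], 5))
# Positive autocorrelation inflates the variance of the mean
print(var_sample_mean([4.0, 2.0, 1.0, 0.5, 0.25], 5))
```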

where: [gamma.sub.k] = auto covariance of the process under consideration at lag k. We have shown (9) that to obtain an expression of E(s), for any n, a power expansion representation of [sigma] may be used, namely:

$$s = (s^2)^{1/2} \approx [E(s^2)]^{1/2} + \frac{s^2 - E(s^2)}{2[E(s^2)]^{1/2}} - \frac{[s^2 - E(s^2)]^2}{8[E(s^2)]^{3/2}} \qquad (2)$$

More terms can be used in the power expansion to obtain higher accuracy. However, it was shown in (9) that excellent results are obtained using the terms of the expansion shown.

Taking expectations of equation (2), we obtain:

$$E(s) \approx [E(s^2)]^{1/2} - \frac{V(s^2)}{8[E(s^2)]^{3/2}} \qquad (3)$$
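A quick numerical check of this approximation (our own sketch, not from the paper): for i.i.d. N(0,1) data with n = 5, E(s^2) = 1 and V(s^2) = 2/(n-1) = 0.5, so equation (3) gives 0.9375, close to the exact value E(s) = c4(5) [approximately equal to] 0.9400:

```python
import math

def e_s_approx(e_s2, v_s2):
    """Approximate E(s) from E(s^2) and V(s^2), per equation (3)."""
    return math.sqrt(e_s2) - v_s2 / (8.0 * e_s2 ** 1.5)

# i.i.d. N(0,1) data, n = 5: E(s^2) = 1 and V(s^2) = 2/(n-1) = 0.5
approx = e_s_approx(1.0, 0.5)
# Exact value c4(5) = sqrt(2/(n-1)) * Gamma(n/2) / Gamma((n-1)/2)
exact = math.sqrt(2.0 / 4.0) * math.gamma(2.5) / math.gamma(2.0)
print(approx, exact)
```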

For the process model [x.sub.t] = [beta]+[xi.sub.t], with the sample variance defined as

$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2,$$

we can easily show that:

$$E(s^2) = \frac{n}{n-1}\left[\gamma_0 - \sigma^2(\bar{x})\right] \qquad (4)$$

as is shown in reference (1).

The other term needed to completely evaluate equation (3) is $V(s^2)$. But

$$V(s^2) = E(s^4) - E^2(s^2),$$

where $E^2(s^2) = [E(s^2)]^2$.

To obtain an expression for $E(s^4)$ we start with

$$s^4 = (s^2)^2 = \left[\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2\right]^2$$

and, after taking expectations, simplifying the resulting expression, and substituting $E(s^4)$ and $E^2(s^2)$ into $V(s^2)$, we obtain:

[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII] (5)

It is obvious from equations (1), (4), and (5) that the variance of the sample mean, [sigma.sup.2]([bar.x]), and the expected value of the sample standard deviation, E(s), are both functions of the autocovariance of the underlying process. It follows that if a process is given and its autocovariance can be found then, at least theoretically, [sigma.sup.2]([bar.x]), E(s) and Std(s) can be found and, therefore, the control limits on [bar.x] and s can also be computed.

Whether or not closed forms can be found for the single, double, and triple summations of the above equations depends on the particular process chosen. If the autocovariance of the process is complicated and no closed form can be found, digital computers can be used to evaluate these summations to any desired accuracy. What is very important, however, is the fact that the equations above hold true for a more general time series, ARMA (p,q), other than the AR(2) = ARMA (2,0) for which they were originally derived.

II.a Generalization of the Problem

Let us define the process whose control will be investigated by

[x.sub.t] = [beta]+[xi.sub.t] (6)

where [beta] is a constant, and [xi.sub.t] is a wide sense stationary time series with zero mean and constant standard deviation. A stochastic process is said to be stationary if, in addition to having a mean and variance that do not change with time, the correlation coefficient between values of the process at two time points depends only on the distance between these two points and not on the time itself (7). The process will be considered under control if the estimate of the mean and the estimate of the standard deviation remain within prescribed control limits. This can be accomplished by maintaining an [bar.x] chart and an s chart. The control limits for the [bar.x] chart are functions of the variance of [bar.x], [sigma.sup.2]([bar.x]), given by equation (1), while the control limits on the s chart are functions of E(s), given by equation (3), and Std(s), where:

$$\mathrm{Std}(s) = [E(s^2) - E^2(s)]^{1/2} \qquad (7)$$

Once the nature of [xi] has been determined through identification techniques, the covariance of the process can be determined. Then equations (1), (3) and (4) can be used to obtain the control limits for the process [x.sub.t].

Identification methods are procedures applied to a set of data to identify the representational model that can be considered as generating the given data set. An ARMA (p,q) model is a subset of a more general class of models, called ARIMA models and defined by the three parameters (p,d,q) where: p is the number of autoregressive terms, q is the number of moving average terms, and d is the number of times the given time series [x.sub.t] must be differenced to produce stationarity. Since we are assuming that [xi.sub.t] is a stationary time series, d = 0 and ARIMA (p,0,q) = ARMA (p,q). Once a stationary time series has been obtained, the problem becomes one of finding the appropriate values for p and q.

The sample autocorrelation function (ACF) of lag k, (r.sub.k), and sample partial autocorrelation function (PACF) of lag k, (r.sub.kk), are of vital importance in this identification procedure. For a stationary process the correlation coefficient between [x.sub.t] and [x.sub.t+k] is called the theoretical autocorrelation function of lag k, and is given by (3):

$$\rho_k = \frac{\gamma_k}{\gamma_0} \qquad (8)$$

An estimate of [p.sub.k] is given by the sample autocorrelation function (ACF) of lag k, defined by:

$$r_k = \frac{\sum_{t=1}^{n-k}(x_t - \bar{x})(x_{t+k} - \bar{x})}{\sum_{t=1}^{n}(x_t - \bar{x})^2} \qquad (9)$$
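Equation (9) translates directly into code; the function below is a minimal sketch (the function and variable names are ours, not the paper's):

```python
def sample_acf(x, max_lag):
    """Sample autocorrelations r_1..r_max_lag, per equation (9)."""
    n = len(x)
    xbar = sum(x) / n
    denom = sum((v - xbar) ** 2 for v in x)
    return [sum((x[t] - xbar) * (x[t + k] - xbar) for t in range(n - k)) / denom
            for k in range(1, max_lag + 1)]

print(sample_acf([1.0, 2.0, 3.0, 4.0, 5.0], 2))
```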

The theoretical partial autocorrelation function [p.sub.kk] may be thought of as the autocorrelation between any two variables [x.sub.t] and [x.sub.t+k] separated by a lag of k units, with the effects of the intervening variables [x.sub.t+1], [x.sub.t+2], ... , [x.sub.t+k-1] eliminated.

The [p.sub.kk] satisfies the set of equations (3):

$$\rho_j = p_{k1}\rho_{j-1} + p_{k2}\rho_{j-2} + \cdots + p_{k(k-1)}\rho_{j-k+1} + p_{kk}\rho_{j-k}, \qquad j = 1,2,\ldots,k \qquad (10)$$

leading to the Yule-Walker equations, which may be written as:

$$\begin{bmatrix} 1 & \rho_1 & \cdots & \rho_{k-1} \\ \rho_1 & 1 & \cdots & \rho_{k-2} \\ \vdots & \vdots & & \vdots \\ \rho_{k-1} & \rho_{k-2} & \cdots & 1 \end{bmatrix} \begin{bmatrix} p_{k1} \\ p_{k2} \\ \vdots \\ p_{kk} \end{bmatrix} = \begin{bmatrix} \rho_1 \\ \rho_2 \\ \vdots \\ \rho_k \end{bmatrix} \qquad (11)$$

The sample partial auto correlation function [r.sub.kk] is an estimate of [p.sub.kk], satisfies the above equations and can be calculated from the observed time series values. In fact the [r.sub.kk] can be derived from [r.sub.k] as follows:

Solving the equations above for k = 1, 2, 3, ..., successively, we obtain the first three sample partial autocorrelations:

$$r_{11} = r_1, \qquad r_{22} = \frac{r_2 - r_1^2}{1 - r_1^2}, \qquad r_{33} = \frac{\begin{vmatrix} 1 & r_1 & r_1 \\ r_1 & 1 & r_2 \\ r_2 & r_1 & r_3 \end{vmatrix}}{\begin{vmatrix} 1 & r_1 & r_2 \\ r_1 & 1 & r_1 \\ r_2 & r_1 & 1 \end{vmatrix}} \qquad (12)$$
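Rather than expanding determinants lag by lag as in equation (12), the [r.sub.kk] can be computed for all lags by solving the Yule-Walker system (10) successively with the standard Durbin-Levinson recursion; a sketch (names ours):

```python
def pacf_from_acf(r):
    """Sample partial autocorrelations r_kk from r_1..r_K, obtained by
    solving the Yule-Walker equations successively (Durbin-Levinson
    recursion)."""
    pacf, phi = [], []
    for k in range(1, len(r) + 1):
        if k == 1:
            phi = [r[0]]
        else:
            num = r[k - 1] - sum(phi[j] * r[k - 2 - j] for j in range(k - 1))
            den = 1.0 - sum(phi[j] * r[j] for j in range(k - 1))
            rkk = num / den
            phi = [phi[j] - rkk * phi[k - 2 - j] for j in range(k - 1)] + [rkk]
        pacf.append(phi[-1])
    return pacf

# For an AR(1)-shaped ACF (rho_k = 0.5**k) all partials beyond lag 1 vanish
print(pacf_from_acf([0.5 ** k for k in range(1, 6)]))
```

At k = 2 the recursion reproduces the closed form r_22 = (r_2 - r_1^2)/(1 - r_1^2) of equation (12).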

The standard deviations of [r.sub.k] and [r.sub.kk] are given in [3]:

$$\sigma(r_k) \approx \left[\frac{1}{n}\left(1 + 2\sum_{j=1}^{k-1} r_j^2\right)\right]^{1/2} \qquad (13)$$

$$\sigma(r_{kk}) \approx \left(\frac{1}{n}\right)^{1/2} \qquad (14)$$

Having discussed briefly the sample ACF and PACF, let us now describe, also briefly, how they will be used in the identification procedure.

For an Autoregressive process of order p (AR(p)), the ACF tails off while the PACF has a cutoff after lag p.

For a Moving Average process of order q (MA(q)), the ACF has a cutoff after lag q, while its PACF tails off.

For a mixed process (i.e. one having both AR and MA terms) both the ACF and PACF tail off.

Therefore, to find the number of Autoregressive terms (p) and Moving Average terms (q) to be included in the model, we have to calculate the sample ACF and PACF from the observed time series, together with [sigma]([r.sub.k]) and [sigma]([r.sub.kk]). Since sample ACFs are only estimates of theoretical ACFs, the sample [r.sub.k] which are significantly different from 0 lie outside the limits [+ or -]2[sigma]([r.sub.k]) (7). Similarly, since sample PACFs are only estimates of theoretical PACFs, the sample [r.sub.kk] which are significantly different from 0 lie outside the limits [+ or -]2[sigma]([r.sub.kk]).

II. b The General Linear Process

When using a general linear stochastic model to describe a time series it is desirable to employ models which use parameters parsimoniously. Parsimony is usually achieved by representing the time series in terms of a small number of autoregressive and moving average terms, resulting in an Autoregressive Moving Average (ARMA) model.

Box and Jenkins (3) give several functional forms for describing a linear process. One such form is the "random shock" form given by:

$$\xi_t = \Psi(B)\epsilon_t = \epsilon_t + \psi_1\epsilon_{t-1} + \psi_2\epsilon_{t-2} + \cdots \qquad (15)$$

where: [PSI](B) is the transfer function of a Linear Filter, which transforms the white noise [epsilon.sub.t] into the stochastic process [xi.sub.t].

The defining equation for the general ARMA process is given by:

[PHI](B)[xi.sub.t] = [THETA](B)[epsilon.sub.t] (16)

where: [PHI](B) = Autoregressive Operator, assumed stationary (i.e. the roots of [PHI](B) = 0 lie outside the unit circle).

[THETA](B) = Moving Average Operator, assumed invertible (i.e. the roots of [THETA](B) = 0 lie outside the unit circle).

and B is the backward shift operator defined by:

B[x.sub.t] = [x.sub.t-1]; therefore [B.sup.m][x.sub.t] = [x.sub.t-m]

To find an expression for the [psi] weights, we operate on both sides of equation (15) with the autoregressive operator [PHI](B) and obtain:

[PHI](B)[xi.sub.t] = [PHI](B)[PSI](B)[epsilon.sub.t] (17)

Because of equation (16) we have:

[THETA](B) = [PHI](B)[PSI](B) (18)

Therefore, the [psi] weights may be obtained by equating coefficients of B in the expansion

(1-[theta.sub.1]B-[theta.sub.2][B.sup.2]-...-[theta.sub.q][B.sup.q]) =(1-[alpha.sub.1]B-[alpha.sub.2][B.sup.2]-...-[alpha.sub.p][B.sup.p]) (1+[psi.sub.1]B+[psi.sub.2][B.sup.2]+...) (19)
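Equating coefficients of B in equation (19) gives the recursion psi_j = alpha_1 psi_{j-1} + ... + alpha_p psi_{j-p} - theta_j (with psi_0 = 1 and theta_j = 0 for j > q), which is straightforward to implement; a sketch with illustrative parameter values of our choosing:

```python
def psi_weights(alphas, thetas, m):
    """First m+1 psi weights of an ARMA(p, q) model, from equating
    coefficients of B in equation (19):
    psi_j = alpha_1*psi_{j-1} + ... + alpha_p*psi_{j-p} - theta_j,
    with psi_0 = 1 and theta_j = 0 for j > q.
    """
    psi = [1.0]
    for j in range(1, m + 1):
        val = sum(alphas[i - 1] * psi[j - i]
                  for i in range(1, len(alphas) + 1) if j - i >= 0)
        if j <= len(thetas):
            val -= thetas[j - 1]
        psi.append(val)
    return psi

# ARMA(1,1) with illustrative alpha_1 = 0.8, theta_1 = 0.3:
# psi_j = alpha_1**(j-1) * (alpha_1 - theta_1) for j >= 1
print(psi_weights([0.8], [0.3], 4))
```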

Now that the [psi] weights have been defined in terms of the Autoregressive and Moving Average parameters, they will be used to express the auto covariance function [gamma.sub.k] of a generalized linear ARMA process in terms of the [psi] weights. One such expression, given by Box and Jenkins (3) is the following:

$$\gamma_k = E[\xi_t\,\xi_{t-k}] = \sigma_{\epsilon}^2 \sum_{j=0}^{\infty} \psi_j\,\psi_{j+k} \qquad (20)$$

The variance of the process is given by [gamma.sub.o], where

$$\gamma_0 = \sigma_{\epsilon}^2 \sum_{j=0}^{\infty} \psi_j^2 \qquad (21)$$

III. Proposed Methodology

To investigate the control of the model [x.sub.t] = [beta]+[xi.sub.t], the following procedure may be used to obtain the pertinent information required to determine whether or not the process is under control:

1. Use a correlation test (the Durbin and Watson d statistic could be used, for example) to determine the existence of serial correlation. (The definition of d and its uses are briefly explained at the end of the ten-step procedure).

2. If serial correlation does not exist, i.e. [xi.sub.t] = [epsilon.sub.t], disregard the remaining steps of this procedure and use the classical quality control factors to set the desired control limits.

3. If serial correlation exists, i.e. if [xi.sub.t] [not equal to] [epsilon.sub.t], use identification techniques to define the nature of [xi.sub.t].

4. Use estimating procedures to obtain estimates of the autoregressive parameters [alpha.sub.p] and the moving average parameters [theta.sub.q] of the generalized ARMA model.

5. Obtain the [psi] weights by equating coefficients of B in equation (19).

6. Use the [psi] weights and equations (20) and (21) to obtain the auto covariance [gamma.sub.k] of the process.

7. Use [gamma.sub.k] and equation (1) to find the variance [2.sigma over bar.x] of the sample mean.

8. Use [gamma.sub.k] and equations (4) and (5) to find E[s.sup.2] and V[s.sup.2] respectively.

9. Use equation (3) to find the expected value of the sample standard deviation, E(s).

10. Use the equation: Std(s) = [[E[s.sup.2] - [E.sup.2](s)].sup.1/2] to obtain the value for the standard deviation of the sample standard deviation, which is also needed when defining the control limits on the variability of the process. Expressions for E(s) and E[s.sup.2] are given by equations (3) and (4) respectively.

The Durbin-Watson d-statistic, mentioned in step 1 of the proposed methodology, is a convenient way to test for first-order serial correlation, and is given by the expression

$$d = \frac{\sum_{t=2}^{n}(\epsilon_t - \epsilon_{t-1})^2}{\sum_{t=1}^{n}\epsilon_t^2} \qquad (22)$$

It can easily be shown that d [approximately equal to] 2(1-r), where r is the sample correlation coefficient. Since -1 [less than or equal to] r [less than or equal to] 1, this implies 0 [less than or equal to] d [less than or equal to] 4, with d = 0 when r = +1, d = 4 when r = -1, and d = 2 when r = 0.
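A minimal sketch of equation (22) (the residual series are illustrative): alternating residuals (r near -1) push d toward 4, while a smooth, strongly positively correlated series pushes it toward 0:

```python
def durbin_watson(resid):
    """Durbin-Watson d statistic, per equation (22)."""
    num = sum((resid[t] - resid[t - 1]) ** 2 for t in range(1, len(resid)))
    den = sum(e * e for e in resid)
    return num / den

print(durbin_watson([1.0, -1.0] * 20))                  # alternating: d near 4
print(durbin_watson([float(t) for t in range(1, 11)]))  # smooth trend: d near 0
```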

The exact distribution of d depends on the sample size (n) and the number of independent variables (k) in the regression analysis. Given n and k, and assuming that [epsilon.sub.t] is normally distributed with mean equal to 0 and variance [sigma.sup.2], the distribution of d always lies between the distributions of two other statistics [d.sub.L] and [d.sub.u], which can be found in table form in many standard statistics texts [4] for given values of [alpha] ([alpha] = 0.05, for example). If we are testing the hypothesis

[H.sub.o]: [rho] = 0 against [H.sub.1]: [rho] [not equal to] 0, the decision rules are:

1. Reject [H.sub.o] if d < [d.sub.L] (indicative of positive serial correlation), or if d > 4 - [d.sub.L] (indicative of negative serial correlation).

2. Do not reject [H.sub.o] if [d.sub.u] < d < 4 - [d.sub.u] (indicative of no serial correlation).

3. The test is inconclusive if [d.sub.L] [less than or equal to] d [less than or equal to] [d.sub.u] or if 4 - [d.sub.u] [less than or equal to] d [less than or equal to] 4 - [d.sub.L].

The proposed methodology is shown in block diagram form in Figure 1, together with the existing methodology, where it can be seen that the existing methodology is a subset of the proposed methodology.

[FIGURE 1 OMITTED]

Then the control limits on the [bar.x] chart and s chart are given by:

[bar.x] [+ or -] 3Std([bar.x]) (23)

E(s) [+ or -] 3Std(s) (24)

IV. Examples

1) Suppose that the correlation test indicated the presence of serial correlation, and that identification techniques determined [xi.sub.t] to be an ARMA (1,1) process. Then the model is given by:

[xi.sub.t] - [alpha.sub.1][xi.sub.t-1] = [epsilon.sub.t] - [theta.sub.1][epsilon.sub.t-1] (25)

or

(1-[alpha.sub.1]B)[xi.sub.t] = (1-[theta.sub.1]B)[epsilon.sub.t] (26)

or

[PHI](B)[xi.sub.t] = [THETA](B)[epsilon.sub.t] (27)

where: [PHI](B) = 1-[alpha.sub.1]B and [THETA](B) = 1-[theta.sub.1]B

The [psi] weights are obtained from the expansion of equation (19) namely:

1-[theta.sub.1]B = (1-[alpha.sub.1]B)(1+[psi.sub.1]B+[psi.sub.2][B.sup.2]+...) (28)

or

1-[theta.sub.1]B = 1 + ([psi.sub.1]-[alpha.sub.1])B + ([psi.sub.2]-[alpha.sub.1][psi.sub.1])[B.sup.2] + ([psi.sub.3]-[alpha.sub.1][psi.sub.2])[B.sup.3] + ...

Therefore, by equating coefficients of B, we obtain:

[psi.sub.0] = 1

[psi.sub.1] = [alpha.sub.1] - [theta.sub.1]

[psi.sub.2] = [alpha.sub.1][psi.sub.1] = [alpha.sub.1]([alpha.sub.1] - [theta.sub.1])

[psi.sub.3] = [alpha.sub.1][psi.sub.2] = [[alpha.sub.1].sup.2]([alpha.sub.1] - [theta.sub.1])

[psi.sub.j] = [alpha.sub.1][psi.sub.j-1] = [[alpha.sub.1].sup.j-1]([alpha.sub.1] - [theta.sub.1]) (29)

Substituting equation (29) into equation (20) we obtain:

[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII] (30)

when k = 1, $\gamma_1 = \sigma_{\epsilon}^2\,\dfrac{(1 - \alpha_1\theta_1)(\alpha_1 - \theta_1)}{1 - \alpha_1^2}$ (31)

and for k [greater than or equal to] 2, [gamma.sub.k] =[alpha.sub.1] [gamma.sub.k-1] (32)

Similarly an expression for [gamma.sub.o] can be obtained by substituting equation (29) into equation (21):

$$\gamma_0 = \sigma_{\epsilon}^2\,\frac{1 + \theta_1^2 - 2\alpha_1\theta_1}{1 - \alpha_1^2} \qquad (33)$$

Therefore, the autocovariance function [gamma.sub.k] of the ARMA (1,1) model is given by equations (31) and (32), while the variance of the process, [gamma.sub.o], is given by equation (33). We are now ready to use equations (1) and (3) in order to find the required expressions for [sigma.sup.2]([bar.x]), E(s) and Std(s).
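As a numerical sanity check (our sketch; the parameter values are illustrative, not from the paper), the truncated psi-weight sum of equation (20) reproduces the ARMA(1,1) closed forms and the recursion [gamma.sub.k] = [alpha.sub.1][gamma.sub.k-1]:

```python
# Illustrative ARMA(1,1) parameters (ours, not the paper's)
alpha, theta, sig2 = 0.8, 0.3, 1.0
m = 2000  # truncation point for the infinite psi-weight sums

# psi weights from equation (29)
psi = [1.0] + [alpha ** (j - 1) * (alpha - theta) for j in range(1, m)]

def gamma(k):
    """Autocovariance at lag k via equation (20), truncated at m terms."""
    return sig2 * sum(psi[j] * psi[j + k] for j in range(m - k))

# Closed forms: gamma_0 from (33), gamma_1 from (31)
g0 = sig2 * (1 + theta ** 2 - 2 * alpha * theta) / (1 - alpha ** 2)
g1 = sig2 * (1 - alpha * theta) * (alpha - theta) / (1 - alpha ** 2)
print(gamma(0), g0)
print(gamma(1), g1)
print(gamma(2), alpha * gamma(1))  # recursion (32)
```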

2) Suppose [xi.sub.t] had been identified as the AR(2) = ARMA (2,0) model. Then the process is given by:

[PHI](B)[xi.sub.t] = [epsilon.sub.t] (34)

where: [PHI](B) = 1-[alpha.sub.1]B-[alpha.sub.2][B.sup.2] (35)

The [psi] weights are obtained from the expansion

(1-[alpha.sub.1]B-[alpha.sub.2][B.sup.2])(1+[psi.sub.1]B+[psi.sub.2][B.sup.2]+...) = 1

and they are:

[psi.sub.0] = 1

[psi.sub.1] = [alpha.sub.1]

[psi.sub.j] = [alpha.sub.1][psi.sub.j-1] + [alpha.sub.2][psi.sub.j-2], j [greater than or equal to] 2 (36)

When the [psi] weights are substituted in equations (20) and (21) they give rise to:

$$\gamma_0 = \frac{(1-\alpha_2)\,\sigma_{\epsilon}^2}{(1+\alpha_2)\left[(1-\alpha_2)^2 - \alpha_1^2\right]} \qquad (37)$$

$$\gamma_1 = \frac{\alpha_1}{1-\alpha_2}\,\gamma_0 \qquad (38)$$

[gamma.sub.k] = [alpha.sub.1][gamma.sub.k-1] + [alpha.sub.2][gamma.sub.k-2] for k[greater than or equal to]2 (39)

Depending on the nature of the roots [G.sub.1] and [G.sub.2] of the characteristic equation

[PHI](B) = 0 (i.e. 1-[alpha.sub.1]B-[alpha.sub.2][B.sup.2] = 0),

we obtain [8] for [sigma.sup.2]([bar.x]):

[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII] (40)

where: [lambda]([alpha.sub.1],[alpha.sub.2],n) and [c.sub.2]([alpha.sub.1],[alpha.sub.2],n) are auxiliary factors on which all of the modified quality control factors depend, as explained below.

The process [x.sub.t] = [beta]+[xi.sub.t] will be considered under control if the estimate of the mean and the estimate of the standard deviation of the process remain within prescribed control limits. This is accomplished by maintaining an [bar.x] chart and an s chart. The estimates [bar.x] and s with subgroups of size n are found from the following formulas:

$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i \qquad (41)$$

$$s = \left[\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2\right]^{1/2} \qquad (42)$$

If we let [bar.x] be the grand mean over several subgroups of size n, and [bar.s] be the pooled standard deviation, then the 3[sigma] quality control limits on [bar.x] are (for white noise only):

$$\bar{x} \pm \frac{3\sigma}{\sqrt{n}} = \bar{x} \pm A(n)\,\sigma \qquad (43)$$

or

$$\bar{x} \pm A_1(n)\,\bar{s} \qquad (44)$$

depending on whether the standard deviation is known or estimated. Similarly, for the quality control limits on the standard deviation

[c.sub.2](n)[sigma][+ or -]3[c.sub.3](n)[sigma] = ([B.sub.1](n),[B.sub.2](n))[sigma] (45)

or

$$\left(B_3(n),\, B_4(n)\right)\bar{s} \qquad (46)$$

depending, again, on whether [sigma] is known or estimated. The parameters introduced above are defined as follows:

[B.sub.1](n) = max(0, [c.sub.2](n) - 3[c.sub.3](n)) (47)

[B.sub.2](n) = [c.sub.2](n) + 3[c.sub.3](n) (48)

$$A(n) = \frac{3}{\sqrt{n}} \qquad (49)$$

$$A_1(n) = \frac{3}{c_2(n)\sqrt{n}} \qquad (50)$$

$$c_2(n) = \sqrt{\frac{2}{n}}\;\frac{\Gamma(n/2)}{\Gamma((n-1)/2)} \qquad (51)$$

$$c_3(n) = \left[1 - \frac{1}{n} - c_2^2(n)\right]^{1/2} \qquad (52)$$
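The classical factors are easy to compute directly; the sketch below assumes (as equation (52) implies, since 1 - 1/n - c2^2 must equal Var(s)/sigma^2) that c2(n) is the E(s)/sigma factor for the divisor-n standard deviation, c2(n) = sqrt(2/n) Gamma(n/2)/Gamma((n-1)/2). For n = 5 this reproduces the tabulated values c2 = 0.8407, B1 = 0, B2 = 1.756:

```python
import math

def c2(n):
    """Classical factor c2(n) = E(s)/sigma for the divisor-n sample
    standard deviation (assumed form: sqrt(2/n)*Gamma(n/2)/Gamma((n-1)/2))."""
    return math.sqrt(2.0 / n) * math.gamma(n / 2.0) / math.gamma((n - 1) / 2.0)

def c3(n):
    """Equation (52): c3(n) = [1 - 1/n - c2(n)**2]**0.5."""
    return math.sqrt(1.0 - 1.0 / n - c2(n) ** 2)

def B1(n):  # equation (47)
    return max(0.0, c2(n) - 3.0 * c3(n))

def B2(n):  # equation (48)
    return c2(n) + 3.0 * c3(n)

print(c2(5), c3(5), B1(5), B2(5))
```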

The quality control factors A(n), [A.sub.1](n), [B.sub.1](n), [B.sub.2](n), [B.sub.3](n) and [B.sub.4](n) are tabulated in the literature [6], when no serial correlation exists, as a function of n only. The modified methodology proposed in this paper is a procedure for modifying these control factors when serial correlation does exist. Reference [8] shows that when serial correlation manifests itself as a pure autoregressive process of order 1 or 2 (i.e. [xi.sub.t] = [alpha][xi.sub.t-1] + [epsilon.sub.t] or [xi.sub.t] = [alpha.sub.1][xi.sub.t-1] + [alpha.sub.2][xi.sub.t-2] + [epsilon.sub.t]), the modified factors A([alpha.sub.1],[alpha.sub.2],n), [A.sub.1]([alpha.sub.1],[alpha.sub.2],n), [B.sub.1]([alpha.sub.1],[alpha.sub.2],n), [B.sub.2]([alpha.sub.1],[alpha.sub.2],n), [B.sub.3]([alpha.sub.1],[alpha.sub.2],n), and [B.sub.4]([alpha.sub.1],[alpha.sub.2],n) are functions of the auxiliary factors [lambda]([alpha.sub.1],[alpha.sub.2],n) and [c.sub.2]([alpha.sub.1],[alpha.sub.2],n) and the corresponding classical quality control factors when no serial correlation exists.

Specifically:

A([alpha.sub.1],[alpha.sub.2],n) = [lambda.sup.1/2]([alpha.sub.1],[alpha.sub.2],n) A(0,0,n) = [lambda.sup.1/2]([alpha.sub.1],[alpha.sub.2],n)A(n) (53)

[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII] (54)

[B.sub.1]([alpha.sub.1],[alpha.sub.2],n) = max(0, [c.sub.2]([alpha.sub.1],[alpha.sub.2],n) - 3[c.sub.3]([alpha.sub.1],[alpha.sub.2],n)) (55)

[B.sub.2]([alpha.sub.1],[alpha.sub.2],n) = [c.sub.2]([alpha.sub.1],[alpha.sub.2],n) + 3[c.sub.3]([alpha.sub.1],[alpha.sub.2],n) (56)

[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII] (57)

[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII] (58)

The auxiliary factors [lambda]([alpha.sub.1],[alpha.sub.2],n) and [C.sub.2]([alpha.sub.1],[alpha.sub.2],n) depend on the auto covariance function [gamma.sub.k] of[xi.sub.t] and are derived in references [8] and [9], while [C.sub.3]([alpha.sub.1],[alpha.sub.2],n) is a function of [lambda]([alpha.sub.1],[alpha.sub.2],n) and [C.sub.2]([alpha.sub.1],[alpha.sub.2],n) and given by:

[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII] (59)

Note that when [alpha.sub.2]= 0, the AR(2) process becomes AR(1), while when [alpha.sub.1] = [alpha.sub.2]= 0 the modified factors are equal to the classical quality control factors, as they should be.

Then the 3[sigma] control limits on [bar.x] , when the standard deviation is known, are given [8] by:

[bar.x] [+ or -] A([alpha.sub.1],[alpha.sub.2],n)[sigma] (60)

where:

[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII] (61)

while when the standard deviation is not known, the 3[sigma] control limits on [bar.x] are given by:

[bar.x] [+ or -] [A.sub.1]([alpha.sub.1],[alpha.sub.2],n)[bar.s] (62)

If n = 5, and we were not aware that serial correlation existed, the control limits would be:

[bar.x] [+ or -] A(0,0,n)[sigma] = [bar.x] [+ or -] 1.342[sigma] (63)

On the other hand, if correlation was confirmed and the parameters [alpha.sub.1] and [alpha.sub.2] of the AR(2) process were estimated as [alpha.sub.1] = 1.2 and [alpha.sub.2] = -0.4, then [lambda.sup.1/2](1.2,-0.4,5) = 1.885 and the control limits become:

[bar.x] [+ or -] A([alpha.sub.1],[alpha.sub.2],n)[sigma] = [bar.x] [+ or -] (1.885)(1.342)[sigma] = [bar.x] [+ or -] 2.53[sigma] (64)

where the value of [lambda.sup.1/2](1.2,-0.4,5) is obtained from Figure 2, at the intersection of [alpha.sub.1] = 1.2 and [alpha.sub.2] = -0.4. Also from Figure 2, the value of [c.sub.2](1.2,-0.4,5) = 0.462.
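The 1.885 read from Figure 2 can also be reproduced numerically. The sketch below assumes (our inference from equation (53), not a statement made by the paper) that [lambda]([alpha.sub.1],[alpha.sub.2],n) = n[sigma.sup.2]([bar.x])/[gamma.sub.o], i.e. the factor by which autocorrelation inflates the variance of the sample mean relative to white noise:

```python
import math

a1, a2, n = 1.2, -0.4, 5

# AR(2) autocorrelations: rho_1 = a1/(1 - a2), then recursion (39)
rho = [1.0, a1 / (1.0 - a2)]
for k in range(2, n):
    rho.append(a1 * rho[k - 1] + a2 * rho[k - 2])

# Equation (1) divided by gamma_0: Var(x_bar) / gamma_0
ratio = 1.0 / n + (2.0 / n ** 2) * sum((n - k) * rho[k] for k in range(1, n))

lam_sqrt = math.sqrt(n * ratio)   # assumed definition of lambda**0.5
A_n = 3.0 / math.sqrt(n)          # classical A(5) = 1.342
print(lam_sqrt, lam_sqrt * A_n)   # close to 1.885 and 2.53, as in (63)-(64)
```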

Figure 2 (n = 5)

In this case we would erroneously attempt to find assignable causes more often than is warranted. Figure 3 gives [lambda.sup.1/2]([alpha.sub.1],[alpha.sub.2],n) and [c.sub.2]([alpha.sub.1],[alpha.sub.2],n) values when n = 7. This paper proposes a methodology which makes it possible to compute the modified quality control factors for any [xi.sub.t] whose autocovariance is given by [gamma.sub.k].

[FIGURE 3 OMITTED]

V. Numerical Illustration of the Proposed Methodology

In this section we provide a step-by-step illustration of the proposed methodology. The time series used for this illustration is the time series F in (3) and is repeated below for convenience (Table 1).

TABLE 1. Series F: Yields from a Batch Chemical Process (70 observations, read downwards). For this time series ([x.sub.t] = [beta] + [xi.sub.t]), [bar.x] = 51.13 and [sigma.sub.x] = 11.9.

47 44 50 62 68
64 80 71 44 38
23 55 56 64 50
71 37 74 43 60
38 74 50 52 39
64 51 58 38 59
55 57 45 59 40
41 50 54 55 57
59 60 36 41 54
48 45 54 53 23 (= [x.sub.70])
71 57 48 49
35 50 55 34
57 45 45 35
40 25 57 54
58 59 50 45

1. Calculate the Durbin-Watson d statistic

$$d = \frac{\sum_{t=2}^{70}(\epsilon_t - \epsilon_{t-1})^2}{\sum_{t=1}^{70}\epsilon_t^2} \qquad (65)$$

For this example:

$$\sum_{t=2}^{70}(\epsilon_t - \epsilon_{t-1})^2 = 26{,}329$$

and

$$\sum_{t=1}^{70}\epsilon_t^2 = 9{,}785.843$$

Therefore d = 2.69 which indicates that negative serial correlation exists in the time series since r[approximately equal to]1-d/2 [approximately equal to] -0.35 which, for [alpha] = 0.05 and n = 70, is significant [5]. Therefore [xi.sub.t][not equal to][epsilon.sub.t].
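The arithmetic can be checked directly (a trivial sketch using the two sums above):

```python
num = 26329.0      # numerator sum from the example
den = 9785.843     # denominator sum
d = num / den
r = 1.0 - d / 2.0  # approximate lag-1 correlation
print(round(d, 2), round(r, 2))  # 2.69 -0.35
```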

2. Since serial correlation exists, the proposed methodology should be used.

3. Use identification techniques to define the nature of [xi.sub.t]. The approach followed in this section is similar to the one suggested in [7]. We start by plotting the time series (Figure 4), which shows that the time series oscillates rapidly, but with no apparent upward or downward trend in the average over time. Dividing the time series into subsets of size n = 5 (a different subset size could have been selected, and the subset size could have been chosen before the calculation of the Durbin-Watson d-statistic without affecting the procedure), we obtain the means and standard deviations for the subsets shown in Table 2 below:

[FIGURE 4 OMITTED]

TABLE 2. Mean and Standard Deviation for Subsets of Size n = 5.

Subset   Mean   Unbiased Stand. Deviation
1-5      48.6   19.42
6-10     53.4    9.07
11-15    52.2   14.62
16-20    58.0   13.61
21-25    52.6    5.94
26-30    47.2   13.61
31-35    60.2   10.64
36-40    49.4    8.88
41-45    51.0    4.95
46-50    53.0    9.80
51-55    49.2    9.13
56-60    43.4    8.73
61-65    51.0   13.08
66-70    46.6   15.14

For Table 2, [bar.x] = 51.13 and [bar.s] = 11.55.

Figure 5 results when plotting the standard deviations against the means of Table 2 which indicates that this time series is stationary in the variance and the mean because the means and the standard deviations of the subsets are bounded [7].

[FIGURE 5 OMITTED]

The stationarity or non-stationarity of a stochastic process can also be determined by inspection of the sample autocorrelation function (ACF) (see TABLE 3 below), since the theoretical ACF of a stationary process tends to either die down quickly towards zero with increasing lag k or to cut off after a particular lag k=q (i.e. [P.sub.k] = 0 for k > q). However, since the sample auto correlations are only estimates of the theoretical auto correlations, the sample auto correlations [r.sub.k] are significantly different from 0 (and therefore [P.sub.k][not equal to]0) if [r.sub.k] lies outside the limits

TABLE 3. Estimated autocorrelation function of batch data

k   [r.sub.k]    k   [r.sub.k]    k   [r.sub.k]
1   -0.39        6   -0.05        11   0.11
2    0.30        7    0.04        12  -0.07
3   -0.17        8   -0.04        13   0.15
4    0.07        9   -0.01        14   0.04
5   -0.10       10    0.01        15  -0.01

$$\pm 2\,\sigma(r_k) \approx \pm\frac{2}{\sqrt{n}} \qquad (66)$$

A similar test is available for the PACF, with [r.sub.kk] being considered significantly different from 0, and therefore [p.sub.kk] [not equal to] 0, if [r.sub.kk] lies outside the limits

$$\pm 2\,\sigma(r_{kk}) \approx \pm\frac{2}{\sqrt{n}} \qquad (67)$$

The sample auto correlation function [r.sub.k] can be calculated from

$$r_k = \frac{\sum_{t=1}^{n-k}(x_t - \bar{x})(x_{t+k} - \bar{x})}{\sum_{t=1}^{n}(x_t - \bar{x})^2} \qquad (68)$$

and the first 15 sample auto correlations are given in Table 3:

Of these only [r.sub.1] and [r.sub.2] are significantly different from zero. The theoretical partial autocorrelation function (PACF) may be thought of as the autocorrelation between any two variables [x.sub.t] and [x.sub.t+k], separated by a lag of k time units, with the effect of the in-between variables [x.sub.t+1], [x.sub.t+2], ..., [x.sub.t+k-1] eliminated [7].

Then [r.sub.kk], the sample partial auto correlation function, is an estimate of [p.sub.kk], calculated from the observed time series values and satisfies the set of equations [7]:

$$\rho_j = p_{k1}\rho_{j-1} + p_{k2}\rho_{j-2} + \cdots + p_{kk}\rho_{j-k}, \qquad j = 1,2,\ldots,k \qquad (69)$$

leading to the Yule-Walker equations discussed previously [3]. In particular

$$r_{11} = r_1 \qquad (70)$$

$$r_{22} = \frac{r_2 - r_1^2}{1 - r_1^2} \qquad (71)$$

$$r_{33} = \frac{\begin{vmatrix} 1 & r_1 & r_1 \\ r_1 & 1 & r_2 \\ r_2 & r_1 & r_3 \end{vmatrix}}{\begin{vmatrix} 1 & r_1 & r_2 \\ r_1 & 1 & r_1 \\ r_2 & r_1 & 1 \end{vmatrix}} \qquad (72)$$

where | | indicates the determinant of the enclosed matrix, with more complicated expressions for [r.sub.44], [r.sub.55], .... The first 15 partial autocorrelations are given in Table 4:

TABLE 4. Estimated partial autocorrelation function for batch data

k   [r.sub.kk]   k   [r.sub.kk]   k   [r.sub.kk]
1   -0.40        6   -0.15        11   0.18
2    0.19        7    0.05        12  -0.05
3    0.01        8    0.00        13   0.09
4   -0.07        9   -0.10        14   0.18
5   -0.07       10    0.05        15   0.01

Of these only [r.sub.11] is significantly different from zero, and [r.sub.22] is "almost" significantly different from zero (i.e. it is just inside the [+ or -]2[sigma]([r.sub.kk]) = [+ or -](2/[square root of 70]) [approximately equal to] [+ or -]0.24 limits).

Since this time series is stationary (because of figure 5 and the fact that the ACF dies away quickly towards zero), the class of ARMA (p,q) = ARIMA (p,0,q) models (i.e. d = degree of differencing = 0) is a suitable class from which to choose a tentative model for the time series.

Examining Table 3 (Estimated Autocorrelation Function), we see that [r.sub.1] and [r.sub.2] are significant and [r.sub.3] is nearly significant. From Table 4 (Estimated Partial Autocorrelation Function), it is obvious that [r.sub.11] is significant and [r.sub.22] is nearly significant, but none of the other [r.sub.kk] are even close to being significant. Thus the sample PACF resembles that of a stochastic process for which the theoretical PACF cuts off after lag 2, suggesting an AR(2) model.

Since the sample mean of this process is large compared to the sample standard deviation ([bar.x] = 51.13, [sigma.sub.x] = 11.9), a constant term should be included in the model. Therefore, the tentatively identified model for this time series is:

[x.sub.t] = [beta] + [alpha.sub.1][x.sub.t-1] + [alpha.sub.2][x.sub.t-2] + [epsilon.sub.t] (73)

while the standard methodology assumes the model to be [x.sub.t] = [beta] + [epsilon.sub.t].

4. Estimate the parameters of the identified model

Since the model has been identified as an AR(2) model, initial estimates for the parameters [alpha.sub.1] and [alpha.sub.2] can be obtained by substituting the estimates [r.sub.j] for the theoretical autocorrelations [p.sub.j] in the formulas [3]:

[alpha.sub.1] = [r.sub.1](1 - [r.sub.2])/(1 - [r.sub.1.sup.2]) (74)

[alpha.sub.2] = ([r.sub.2] - [r.sub.1.sup.2])/(1 - [r.sub.1.sup.2]) (75)

which are obtained from the Yule-Walker equations. From Table 3, [r.sub.1] = -0.39 and [r.sub.2] = 0.30.

Therefore the suggested initial estimates for [alpha.sub.1] and [alpha.sub.2] are:

[alpha.sub.1] = -0.32197

[alpha.sub.2] = +0.17443
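These values can be verified directly from equations (74) and (75); a minimal sketch:

```python
# Yule-Walker initial estimates for an AR(2) model, equations (74)-(75)
r1, r2 = -0.39, 0.30        # sample autocorrelations from Table 3

alpha1 = r1 * (1 - r2) / (1 - r1**2)
alpha2 = (r2 - r1**2) / (1 - r1**2)

print(round(alpha1, 5))     # -0.32197
print(round(alpha2, 5))     # 0.17443
```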

The SPSS BOX-JENKINS procedure was used to fit the time series. Starting with initial values of

[beta]= overall constant = 58.292

[alpha.sub.1] = -0.32197

[alpha.sub.2] = 0.17443

and initial FUNCTION VALUE = 7900.6, the optimization search converged in 30 iterations, with the following optimal parameters:

[beta]= 58.402 with standard error = 9.9894 & t-ratio = 5.8463

[alpha.sub.1] = -0.34 with standard error = 0.11452 & t- ratio = -2.9689

[alpha.sub.2] = 0.20 with standard error = 0.11355 & t-ratio = 1.7605

Residual Variance = 97.404

(Standard Deviation = 9.8693)

Function Value = 7889.722

Therefore, the estimated model is:

[x.sub.t] = 58.402 - 0.34 [x.sub.t-1] + 0.20 [x.sub.t-2] + [epsilon.sub.t] (76)
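As a plausibility check on equation (76), the stationary mean and standard deviation implied by the fitted parameters can be compared with the sample statistics ([bar.x] = 51.13, [sigma.sub.x] = 11.9). The closed-form AR(2) expressions below are standard results (see [3]); the residual variance 97.404 is taken from the SPSS output above:

```python
from math import sqrt

beta, a1, a2 = 58.402, -0.34, 0.20   # fitted AR(2) parameters
var_eps = 97.404                     # residual variance from the fit

# Stationary mean of x_t = beta + a1*x_{t-1} + a2*x_{t-2} + eps_t
mu = beta / (1 - a1 - a2)

# Stationary variance of an AR(2) process (standard closed form)
gamma0 = var_eps * (1 - a2) / ((1 + a2) * ((1 - a2)**2 - a1**2))

print(round(mu, 2))            # ~51.23, close to the sample mean 51.13
print(round(sqrt(gamma0), 2))  # ~11.13, close to the sample sd 11.9
```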

To complete the analysis of a general model, steps 5-10 of the proposed methodology must also be completed. These steps, repeated here for convenience, are:

5. Obtain the [PSI] weights

6. Express [gamma.sub.o] and [gamma.sub.k] as functions of the [PSI] weights; i.e., find [gamma.sub.o] = [f.sub.1]([PSI]) and [gamma.sub.k] = [f.sub.2] ([PSI])

7. Express the variance of the sample mean, [sigma.sup.2.sub.[bar.x]], as a function of [gamma.sub.o] and [gamma.sub.k]; i.e., find V([bar.x]) = [sigma.sup.2.sub.[bar.x]] = [f.sub.3]([gamma.sub.o],[gamma.sub.k]) and write the control limits for the mean of the process ([bar.x] chart) as:

E([bar.x]) [+ or -] 3[sigma.sub.[bar.x]]

8. Express E([s.sup.2]) and V([s.sup.2]) as functions of [gamma.sub.o] and [gamma.sub.k]; i.e. find E([s.sup.2]) = [f.sub.4]([gamma.sub.o],[gamma.sub.k]) and V([s.sup.2]) = [f.sub.5]([gamma.sub.o],[gamma.sub.k]) in accordance with equations (4) and (5) respectively.

9. Express E(s) as a function of E([s.sup.2]) and V([s.sup.2]); i.e. find E(s) = [f.sub.6][E([s.sup.2]), V([s.sup.2])], using equation (3).

10. Find Std(s) from the equation Std(s) = [square root of (E([s.sup.2]) - [E(s).sup.2])] and write the control limits for the variability of the process (s chart) as: E(s) [+ or -] 3 Std(s)
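For the fitted AR(2), steps 5 and 6 can be sketched directly: the [PSI] weights obey the recursion [PSI.sub.0] = 1, [PSI.sub.1] = [alpha.sub.1], [PSI.sub.j] = [alpha.sub.1][PSI.sub.j-1] + [alpha.sub.2][PSI.sub.j-2], after which [gamma.sub.o] and [gamma.sub.k] are sums of products of the weights. A minimal sketch, truncating the (quickly converging) sums:

```python
# Steps 5-6: psi weights and autocovariances for the fitted AR(2)
a1, a2 = -0.34, 0.20
var_eps = 97.404        # residual variance from the fit
M = 200                 # truncation point; psi_j decays geometrically

# Step 5: psi weights via the AR(2) recursion
psi = [1.0, a1]
for j in range(2, M):
    psi.append(a1 * psi[j-1] + a2 * psi[j-2])

# Step 6: gamma_0 and gamma_k as functions of the psi weights
gamma0 = var_eps * sum(p * p for p in psi)
gamma1 = var_eps * sum(psi[j] * psi[j+1] for j in range(M - 1))

print(round(gamma0, 2))           # stationary variance, ~123.83
print(round(gamma1 / gamma0, 3))  # lag-1 autocorrelation, -0.425
```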

The calculations in steps 5-10, outlined above, are equivalent to the calculation of the modified parameters

A([alpha.sub.1],[alpha.sub.2],n), [A.sub.1]([alpha.sub.1],[alpha.sub.2],n)

[C.sub.2]([alpha.sub.1],[alpha.sub.2],n), [C.sub.3]([alpha.sub.1],[alpha.sub.2],n)

[B.sub.1]([alpha.sub.1],[alpha.sub.2],n), [B.sub.2]([alpha.sub.1],[alpha.sub.2],n)

[B.sub.3]([alpha.sub.1],[alpha.sub.2],n), [B.sub.4]([alpha.sub.1],[alpha.sub.2],n)

and they can be found without too much difficulty with the help of a computer. However, for the AR(1) and AR(2) models it is not necessary to perform steps 5-10 because this work has already been done in [9], and we can immediately write:

I) For the [bar.x] -chart

1) x [+ or -] A([alpha.sub.1],[alpha.sub.2],n)[sigma] if [sigma] is known (77)

2) x [+ or -] [A.sub.1]([alpha.sub.1],[alpha.sub.2],n)[bar.[sigma]] if [sigma] is not known (78)

II) For the s-chart

1) [[B.sub.1]([alpha.sub.1],[alpha.sub.2],n), [B.sub.2]([alpha.sub.1],[alpha.sub.2],n)][sigma] if [sigma] is known (79)

2) [[B.sub.3]([alpha.sub.1],[alpha.sub.2],n), [B.sub.4]([alpha.sub.1],[alpha.sub.2],n)][bar.[sigma]] if [sigma] is not known (80)

For our estimated model, given by equation (76), the modified and standard quality control factors are given by:

A([alpha.sub.1],[alpha.sub.2],n) = A(-0.34,0.2,5) = 1.065 versus

A(0,0,5) = 1.342

[A.sub.1]([alpha.sub.1],[alpha.sub.2],n) = [A.sub.1](-0.34,0.2,5) = 1.254 versus

[A.sub.1](0,0,5) = 1.596

[c.sub.2]([alpha.sub.1],[alpha.sub.2],n) = [c.sub.2](-0.34,0.2,5) = 0.849 versus

[c.sub.2](0,0,5) = 0.841

[c.sub.3]([alpha.sub.1],[alpha.sub.2],n) = [c.sub.3](-0.34,0.2,5) = 0.389 versus

[c.sub.3](0,0,5) = 0.305

[B.sub.1]([alpha.sub.1],[alpha.sub.2],n) = [B.sub.1](-0.34,0.2,5) = 0 versus [B.sub.1](0,0,5) = 0

[B.sub.2]([alpha.sub.1],[alpha.sub.2],n) = [B.sub.2](-0.34,0.2,5) = 2.016 versus

[B.sub.2](0,0,5) = 1.756

[B.sub.3]([alpha.sub.1],[alpha.sub.2],n) = [B.sub.3](-0.34,0.2,5) = 0 versus [B.sub.3](0,0,5) = 0

[B.sub.4]([alpha.sub.1],[alpha.sub.2],n) = [B.sub.4](-0.34,0.2,5) = 2.374 versus

[B.sub.4](0,0,5) = 2.089
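As an illustration of how such factors are computed, the A factor can be sketched directly: with V([bar.x]) = ([gamma.sub.o]/n)[1 + 2[summation](1 - k/n)[rho.sub.k]], the factor is A([alpha.sub.1],[alpha.sub.2],n) = 3[square root of (V([bar.x]))]/[sigma], which reduces to the classical 3/[square root of n] when all [rho.sub.k] = 0. This is a sketch of the approach, not the exact tabulation of [9]; with the rounded parameter values it reproduces the tabled A(-0.34, 0.2, 5) = 1.065 to within rounding:

```python
from math import sqrt

def A_factor(a1, a2, n):
    """Modified x-bar chart factor A(a1, a2, n) for an AR(2) process."""
    # Autocorrelations rho_k from the Yule-Walker recursion
    rho = [1.0, a1 / (1 - a2)]
    for k in range(2, n):
        rho.append(a1 * rho[k-1] + a2 * rho[k-2])
    # Variance of the subgroup mean relative to gamma_0
    ratio = (1 + 2 * sum((1 - k/n) * rho[k] for k in range(1, n))) / n
    return 3 * sqrt(ratio)

print(round(A_factor(0, 0, 5), 3))         # 1.342, the classical 3/sqrt(5)
print(round(A_factor(-0.34, 0.20, 5), 3))  # ~1.06, close to the tabled 1.065
```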

If we were not aware that serial correlation existed, the control limits would be:

I) When [sigma] is known (standard known)

1) For the [bar.x] -chart

x [+ or -] A(0,0,5)[sigma] = x [+ or -] 1.342[sigma] (81)

2) For the s-chart

[[B.sub.1](0,0,5),[B.sub.2](0,0,5)][sigma] = (0,1.756)[sigma] (82)

II) When [sigma] is not known (standard unknown)

1) For the [bar.x] -chart

x [+ or -] [A.sub.1](0,0,5)[bar.[sigma]] = x [+ or -] 1.596[bar.[sigma]] * (83)

2) For the s-chart

[[B.sub.3](0,0,5),[B.sub.4](0,0,5)] [bar.[sigma]] = (0,2.089)[bar.[sigma]] (84)

Since, however, serial correlation exists and [alpha.sub.1] = -0.34 and [alpha.sub.2] = 0.20, the control limits are:

I) When [sigma] is known (standard known)

1) For the [bar.x] -chart

x [+ or -] A(-0.34,0.2,5)[sigma] = x [+ or -] 1.065[sigma] (85)

2) For the s-chart

[[B.sub.1](-0.34,0.2,5),[B.sub.2](-0.34,0.2,5)][sigma] = (0,2.016)[sigma] (86)

II) When [sigma] is not known (standard unknown)

1) For the [bar.x] -chart

x [+ or -] [A.sub.1](-0.34,0.2,5)[bar.[sigma]] = x [+ or -] 1.254[bar.[sigma]] (87)

2) For the s-chart

[[B.sub.3](-0.34,0.2,5), [B.sub.4](-0.34,0.2,5)] [bar.[sigma]] = (0,2.374) [bar.[sigma]] (88)

Therefore, this particular combination of [alpha.sub.1] and [alpha.sub.2] values (i.e. [alpha.sub.1] = -0.34 and [alpha.sub.2]= 0.20), narrowed the control limits for [bar.x] and widened them for s. Ignoring the correlation would have been detrimental in either case. The narrowing of the [bar.x] control limits implies an erroneous attempt to find "assignable causes" more often than is warranted. On the other hand, the widening of the s control limits would result in not searching often enough for "assignable causes" when we should.

* where x = 51.13 and [bar.[sigma]] = 11.55, as can be seen from Table 2.

Summary and Conclusions

In quality control applications, the existing methodology for testing whether a given process is under control tests whether the estimate of the mean and the estimate of the standard deviation remain within prescribed control limits. This can be accomplished by maintaining an [bar.x] chart and an s chart. The control limits for the [bar.x] chart are functions of the variance of [bar.x], while the control limits on the s chart are functions of the expected value of s, E(s), and the standard deviation of s, Std(s). The basic assumption of the existing methodology is that the data to be analyzed are independent and normally distributed. However, when this assumption is violated, continuing to use the existing methodology produces large errors, which can be very costly.

It is possible to compensate for nonnormality of the data by appropriate data transformations, or by increasing the subgroup size; but this is not possible if the data are dependent. What is required is a new methodology, like the one described in this paper, which makes the existing methodology a subset of the new methodology.

The new methodology starts by using a test statistic to determine whether data dependence exists. When it does, identification techniques are used to identify the nature of the process, and estimation procedures are then used to obtain estimates of the new parameter values. These parameter values are used to calculate the variance and covariance functions, which determine the control limits of the modified control charts.

The example included in the paper explains in detail the steps required to implement the proposed methodology. Since data dependence is present, it is necessary to use the modified methodology. If we ignored the data dependence, the control limits on [bar.x] and s would be (x [+ or -] 1.596[bar.[sigma]]) and (0, 2.089[bar.[sigma]]), while if we take the data dependence into consideration, the control limits become (x [+ or -] 1.254[bar.[sigma]]) and (0, 2.374[bar.[sigma]]).
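With x = 51.13 and [bar.[sigma]] = 11.55 (from Table 2), the two sets of factors translate into concrete limits; a minimal arithmetic sketch of the comparison:

```python
xbar, sbar = 51.13, 11.55   # grand mean and average sigma from Table 2

# Classical (independence assumed) vs. modified (AR(2)) factors, n = 5
A1_std, A1_mod = 1.596, 1.254   # x-bar chart factors
B4_std, B4_mod = 2.089, 2.374   # s chart upper-limit factors

xbar_limits_std = (xbar - A1_std * sbar, xbar + A1_std * sbar)
xbar_limits_mod = (xbar - A1_mod * sbar, xbar + A1_mod * sbar)
s_limits_std = (0.0, B4_std * sbar)
s_limits_mod = (0.0, B4_mod * sbar)

print(xbar_limits_std)  # classical x-bar limits, roughly (32.70, 69.56)
print(xbar_limits_mod)  # modified x-bar limits, roughly (36.65, 65.61): narrower
print(s_limits_std)     # classical s limits, roughly (0, 24.13)
print(s_limits_mod)     # modified s limits, roughly (0, 27.42): wider
```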

Obviously, the presence of data dependence narrowed the control limits for [bar.x] and widened them for s. The narrowing of the [bar.x] control limits implies an erroneous attempt to find "assignable causes" more often than is warranted. On the other hand the widening of the s control limits would result in not searching often enough for "assignable causes" when we should.

Ignoring the data dependence would have been detrimental in either case.

References

[1] Anderson, T.W., "The Statistical Analysis of Time Series." Wiley, New York, 1971.

[2] Bartlett, M.S., "On the Theoretical Specification and Sampling Properties of Auto Correlated Time Series." Journal of the Royal Statistical Society B, Vol. 8, No. 27, 1946.

[3] Box, G.E.P., and Jenkins, G.M., "Time Series Analysis: Forecasting and Control." Holden-Day, San Francisco, 1976, Revised Edition.

[4] Chou, Y.L., "Statistical Analysis for Business and Economics." Elsevier Science Publishing Co., Inc., New York, 1989.

[5] Hoel, P.G., and Jensen, R.J., "Basic Statistics for Business and Economics." John Wiley and Sons, New York, 1982, Third Edition.

[6] Grant, E.L., and Leavenworth, R.S., "Statistical Quality Control." McGraw-Hill, New York, 1988.

[7] O'Donovan, T.M., "Short Term Forecasting." John Wiley and Sons, 1983.

[8] Vasilopoulos, A.V., and Stamboulis, A.P., "Modification of Control Chart Limits in the Presence of Data Correlation." Journal of Quality Technology, Vol. 10, No. 1, January 1978.

[9] Vasilopoulos, A.V., "Second Order Autoregressive Model Applied to Quality Control." Ph.D. dissertation, New York University, 1974.


A. Vasilopoulos, The Peter J. Tobin College of Business, St. John's University, New York

Publication: Review of Business, December 22, 2013.