# Dealing with collinearity: it can arise during regression model building and invalidate the results for individual regressors

William James once said: "We must be careful not to confuse data with the abstractions we use to analyze them." Collinearity is a problem that occurs during the creation of regression models. It is the presence of intercorrelation among predictor variables; in other words, it occurs when one regressor is (approximately) a linear combination of one or more of the other regressors. Although a model with collinearity still may have good predictive ability, the results for an individual regressor may not be valid. I dealt with model building in a previous article ("How to Select a Useful Model," Scientific Computing, February 2012). This article is a follow-up. I will divide the topic into indicators of collinearity, diagnostic tests for collinearity, and correction for collinearity.

Indicators of collinearity are:

- parameter tests that are insignificant for theoretically important parameters
- parameter tests that are insignificant while the whole-model test is significant
- large standard errors for regression coefficients
- extreme variability in parameters across samples
- large changes in parameters when changing the data or adding or removing other variables
- unexpected signs for parameters
- decreases in regression standard errors when a variable is removed

These can be determined from output tables from standard analyses, such as from SAS Proc Reg (Figure 1).

A visual indication can be found using Leverage Plots in the JMP Fit Model platform. Collinearity is seen as shrinkage of points in the X direction (Figure 2). X3 is involved in collinearity, since its values shrink toward the center of the plot; X4, shown for comparison, is not involved, since its values are dispersed along the X axis.

Diagnostic tests for collinearity are: VIF (variance inflation factor), the correlation matrix of the variables, eigenvalues and eigenvectors, condition indices, and variance proportions. VIF is calculated as 1/(1 − Rᵢ²), where Rᵢ² is the coefficient of determination from regressing the ith input variable on all the other input variables. It demonstrates how much collinearity inflates the instability of a coefficient estimate. It is available as an option under SAS Proc Reg (Figure 3) or in JMP. Using R² = 0.5411 from Figure 1, we can calculate a VIF = 2.1791. Variables with a VIF greater than this value (i.e. X2, X3, X6 and X7 in Figure 3) are more closely associated with the other X (independent) variables than with the Y (dependent) variable.
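The VIF calculation above can be sketched directly: regress each predictor on the others, take the R², and apply 1/(1 − Rᵢ²). The sketch below uses synthetic data (not the article's dataset) in which one predictor is deliberately built as a near-copy of another.

```python
# Minimal VIF sketch with synthetic data: x3 is constructed to be nearly
# collinear with x1, so x1 and x3 should show inflated VIFs.
import numpy as np

def vif(X):
    """Return the variance inflation factor for each column of X."""
    n, p = X.shape
    out = []
    for i in range(p):
        y = X[:, i]
        others = np.delete(X, i, axis=1)
        A = np.column_stack([np.ones(n), others])      # add intercept
        beta, *_ = np.linalg.lstsq(A, y, rcond=None)   # regress X_i on the rest
        resid = y - A @ beta
        r2 = 1 - (resid @ resid) / ((y - y.mean()) @ (y - y.mean()))
        out.append(1.0 / (1.0 - r2))                   # VIF_i = 1/(1 - R^2_i)
    return np.array(out)

rng = np.random.default_rng(0)
x1 = rng.normal(size=50)
x2 = rng.normal(size=50)
x3 = x1 + 0.1 * rng.normal(size=50)    # nearly collinear with x1
X = np.column_stack([x1, x2, x3])
print(vif(X))    # large values for columns 1 and 3, near 1 for column 2
```

As a check against the article's numbers, 1/(1 − 0.5411) does come out to 2.1791.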

The correlation matrix of variables also is available as an option under SAS Proc Reg (Figure 4) or in JMP. High correlations between variables (e.g. -0.8749 for X2 and X3; -0.8806 for X6 and X7) are indicators of collinearity.

Eigenvalues near zero in a principal component analysis are an indication of collinearity. These can be generated using JMP Principal Components Analysis on Correlations (see Figure 5; eigenvalue = 0.0810 for Principal Component 7). Eigenvectors may show which variables are involved if there are large "loads" (values) on several variables for principal components with low eigenvalues (Figure 5, Principal Component 7 has large loads on X6 and X7).
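The same diagnostic can be sketched by taking the eigenvalues of the predictors' correlation matrix: a near-zero eigenvalue flags a linear dependency, and large loadings in the matching eigenvector point to the variables involved. Again this uses synthetic data, not the article's, with one predictor built as a linear combination of two others.

```python
# Sketch: eigenvalues of the correlation matrix of the predictors.
# x3 is (almost) x1 - x2, so one eigenvalue should be near zero.
import numpy as np

rng = np.random.default_rng(1)
x1 = rng.normal(size=100)
x2 = rng.normal(size=100)
x3 = x1 - x2 + 0.05 * rng.normal(size=100)   # near-linear combination
X = np.column_stack([x1, x2, x3])

R = np.corrcoef(X, rowvar=False)     # 3x3 correlation matrix
eigvals, eigvecs = np.linalg.eigh(R) # eigenvalues in ascending order
print(eigvals)        # smallest eigenvalue is near zero
print(eigvecs[:, 0])  # large loads on all three variables: they are involved
```

Note that the eigenvalues of a correlation matrix always sum to the number of variables, so one near-zero eigenvalue forces others above 1.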

Condition indices are the square roots of the ratio of the largest eigenvalue to each individual eigenvalue. When condition indices are greater than 10, this is an indication that regression estimates may be affected. They can be generated using the collin option of SAS Proc Reg (which includes the intercept) or the collinoint option (which adjusts the intercept out first); Figure 6 uses the collinoint option. None of the condition indices indicate a collinearity problem.
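The condition-index arithmetic is simple enough to verify by hand from the eigenvalues SAS reports. The sketch below recomputes the indices from the Figure 6 eigenvalues and reproduces the largest one.

```python
# Recompute the condition indices from the eigenvalues in Figure 6:
# condition index k = sqrt(largest eigenvalue / eigenvalue k).
import numpy as np

eigenvalues = np.array([2.73705, 1.53387, 1.00628, 0.95770,
                        0.58180, 0.10228, 0.08102])   # from Figure 6
cond_index = np.sqrt(eigenvalues.max() / eigenvalues)
print(cond_index.round(5))
# The largest index is about 5.81 -- below the rule-of-thumb cutoff of 10,
# matching the article's conclusion that no problem is indicated.
```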

Variance proportions are the proportion of the variance of the estimates accounted for by each principal component. If there is a principal component with a high condition index that contributes significantly to the variance of at least two variables, this is an indication of collinearity (Figure 6). None of the principal components indicate a collinearity problem.

Correction of collinearity is more difficult than diagnosis. Methods for dealing with collinearity should begin with increasing the sample size, since this should decrease standard errors. If this is not feasible, removal of intercorrelated variables can be approached using some of the methods I discussed in "How to Select a Useful Model," Scientific Computing, February 2012, such as stepwise regression using SAS Proc Reg or JMP Fit Model. Ensure that interaction terms use centering, i.e. transformation by subtracting the mean. Redefine the variables by using an alternative form, such as a percentage or per capita. If these more straightforward approaches don't work, then more elaborate approaches may be needed, such as removing the variance in one of the intercorrelated variables by regressing the other variables on it, or analyzing the common variance as a separate variable.
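The benefit of centering interaction terms can be seen numerically: a raw product x1·x2 tends to be strongly correlated with x1 and x2 themselves (especially when the variables have nonzero means), while the product of mean-centered variables typically is not. A small sketch with synthetic data:

```python
# Sketch: centering before forming an interaction term reduces its
# correlation with the main effects. Synthetic data, not the article's.
import numpy as np

rng = np.random.default_rng(2)
x1 = rng.normal(loc=50, scale=5, size=200)   # nonzero means make the
x2 = rng.normal(loc=30, scale=3, size=200)   # raw-product problem worse

raw_inter = x1 * x2                                   # uncentered interaction
cent_inter = (x1 - x1.mean()) * (x2 - x2.mean())      # centered interaction

print(abs(np.corrcoef(x1, raw_inter)[0, 1]))   # typically large
print(abs(np.corrcoef(x1, cent_inter)[0, 1]))  # typically near zero
```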

The variables also can be transformed to principal components, where those with small eigenvalues can be eliminated, but the larger question is whether these principal components are interpretable. Ridge regression also can be considered where other options don't work. It introduces a small bias in exchange for a reduction in sampling variances.
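The bias-for-variance trade that ridge regression makes has a simple closed form: instead of the least-squares solution (X′X)⁻¹X′y, it uses (X′X + kI)⁻¹X′y with a small penalty k > 0 chosen by the analyst (k = 0 recovers ordinary least squares). The sketch below, on synthetic nearly collinear data, shows the stabilizing effect.

```python
# Sketch: ridge regression on nearly collinear predictors.
# With k = 0 (OLS) the individual coefficients are unstable; a small
# penalty k pulls both estimates toward their common, stable value.
import numpy as np

def ridge(X, y, k):
    """Ridge estimate (X'X + kI)^-1 X'y for centered X and y."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + k * np.eye(p), X.T @ y)

rng = np.random.default_rng(3)
x1 = rng.normal(size=40)
x2 = x1 + 0.01 * rng.normal(size=40)    # nearly collinear with x1
X = np.column_stack([x1, x2])
X = X - X.mean(axis=0)                  # center predictors
y = x1 + x2 + rng.normal(size=40)       # true coefficients are (1, 1)
y = y - y.mean()                        # center response

print(ridge(X, y, 0.0))   # OLS: individual coefficients can be wildly off
print(ridge(X, y, 1.0))   # ridge: both coefficients pulled near 1
```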

During the creation of regression models, collinearity can occur, causing invalid results for individual regressors, although the overall model still can have good predictive ability. There are several indicators of collinearity, but diagnostic tests, such as VIF, should be performed. Once identified, a strategy for dealing with collinearity should proceed from the simplest approach (increasing sample size) to more complex ones if needed (ridge regression).

Mark Anawis is a Principal R&D Scientist and ASQ Six Sigma Black Belt at Abbott. He may be reached at editor@ScientificComputing.com.
```

Figure 1

The SAS System            09:35 Friday, January 4, 2013 109

The REG Procedure
Model: MODEL1
Dependent Variable: y1

Number of Observations Used     30

Analysis of Variance

                             Sum of        Mean
Source             DF       Squares      Square    F Value    Pr > F

Model               7     559.90107    79.98587       3.71    0.0085
Error              22     474.84506    21.58387
Corrected Total    29    1034.74613

Root MSE           4.64514      R-Square   0.5411
Dependent Mean    47.02440      Adj R-Sq   0.3951
Coeff Var          9.87965

Parameter Estimates

Parameter    Standard
Variable    DF     Estimate     Error     t Value   Pr > |t|

Intercept   1     83.70755     18.87230    4.44     0.0002
x1          1     -0.10111      0.15537   -0.65     0.5219
x2          1     -0.19813      0.26369   -0.75     0.4604
x3          1      8.14219     11.07504    0.74     0.4700
x4          1     -1.86542      0.58301   -3.20     0.0041
x5          1     -0.08212      0.13194   -0.62     0.5401
x6          1     -0.35242      0.18633   -1.89     0.0718
x7          1      0.31406      0.21568    1.46     0.1595

Figure 3
Parameter Estimates

Parameter   Standard
Variable     DF    Estimate     Error

Intercept      1    83.70755   18.87230
x1             1    -0.10111    0.15537
x2             1    -0.19813    0.26369
x3             1     8.14219   11.07504
x4             1    -1.86542    0.58301
x5             1    -0.08212    0.13194
x6             1    -0.35242    0.18633
x7             1     0.31406    0.21568

Parameter Estimates

Variance
Variable     t Value   Pr > |t|   Inflation

Intercept       4.44     0.0002           0
x1             -0.65     0.5219     1.30029
x2             -0.75     0.4604     5.60944
x3              0.74     0.4700     4.65465
x4             -3.20     0.0041     1.69078
x5             -0.62     0.5401     1.44474
x6             -1.89     0.0718     6.00304
x7              1.46     0.1595     6.19107

Figure 4

Correlation of Estimates

Variable     Intercept         x1          x2          x3

Intercept      1.0000     -0.4480      0.0842     -0.2567
x1            -0.4480       1.000      0.1071     -0.0234
x2             0.0842      0.1071      1.0000     -0.8749
x3            -0.2567     -0.0234     -0.8749      1.0000
x4             0.0634      0.1326     -0.2753      0.1931
x5            -0.1758     -0.0220      0.1527     -0.1080
x6            -0.1211      0.4341      0.0182      0.0355
x7            -0.2129     -0.4006     -0.1891      0.0917

Correlation of Estimates

Variable           x4          x5          x6          x7

Intercept      0.0634     -0.1758     -0.1211     -0.2129
x1             0.1326     -0.0220      0.4341     -0.4006
x2            -0.2753      0.1527      0.0182     -0.1891
x3             0.1931     -0.1080      0.0355      0.0917
x4             1.0000     -0.4410      0.3594     -0.3877
x5            -0.4410      1.0000     -0.3744      0.2620
x6             0.3594     -0.3744      1.0000     -0.8806
x7            -0.3877      0.2620     -0.8806      1.0000

Figure 6

                      Condition
Number   Eigenvalue       Index        X1           X2

1      2.73705      1.00000   0.01007      0.01309
2      1.53387      1.33512   0.00307      0.02982
3      1.00628      1.64921   0.34351   0.00001201
4      0.95770      1.69055   0.33275   0.00001849
5      0.58180      2.16897   0.09436   0.00088996
6      0.10228      5.17311   0.00128      0.87891
7      0.08102      5.81227   0.21496      0.07726

Number          X3        X4        X5        X6        X7

1     0.010211   0.02835   0.01630   0.01256   0.01364
2      0.05043   0.00244   0.07124   0.01557   0.00735
3   0.00066453   0.03110   0.14605   0.01267   0.03300
4   0.00001584   0.21339   0.09728   0.01684   0.00143
5      0.02864   0.42410   0.48392   0.00353   0.00330
6      0.86928   0.15822   0.07451   0.06273   0.00768
7      0.04077   0.14241   0.11069   0.87610   0.93361
```