# Dealing with collinearity: during creation of regression models, collinearity can occur, which causes invalid results for individual regressors.

William James once said: "We must be careful not to confuse
data with the abstractions we use to analyze them." Collinearity is
a problem that occurs during the creation of regression models. It is
the presence of intercorrelation among predictor variables. In other
words, it occurs when a regressor is, exactly or approximately, a linear
combination of one or more other regressors. Although a model with
collinearity may still have good predictive ability, the results for an
individual regressor may not be valid. I dealt with model building in a
previous article ("How to Select a Useful Model," Scientific Computing,
February 2012). This article is a follow-up. I will divide the topic
into indicators of collinearity, diagnostic tests for collinearity, and
correction for collinearity.

Indicators of collinearity are: insignificant parameter tests for theoretically important parameters, insignificant parameter tests when the whole-model test is significant, large standard errors for regression coefficients, extreme variability in parameters across samples, large changes in parameters when the data change or when other variables are added or removed, unexpected signs for parameters, and decreases in regression standard errors when a variable is removed. These can be read from the output tables of a standard analysis, such as SAS Proc Reg (Figure 1).

A visual indication can be found using Leverage Plots in the JMP Fit Model platform. Collinearity is seen as shrinkage of points in the X direction (Figure 2). X3 is involved in collinearity, since its values shrink toward the center of the plot. X4, shown for comparison, is not involved in collinearity, since its values are dispersed along the X axis.

Diagnostic tests for collinearity are: VIF (variance inflation factor), correlation matrix of variables, eigenvalues and eigenvectors, condition indices, and variance proportions. VIF is calculated as $\mathrm{VIF}_i = 1/(1 - R_i^2)$, where $R_i^2$ is the coefficient of determination of the regression of the ith input variable on all other input variables. It measures how much collinearity inflates the variance of a coefficient estimate. It is available as an option under SAS Proc Reg (Figure 3) or in JMP. Using the model $R^2$ = 0.5411 from Figure 1 as a reference, we can calculate a VIF = 1/(1 - 0.5411) = 2.1791. Variables with a VIF greater than this value (X2, X3, X6 and X7 in Figure 3) are more closely associated with the other X (independent) variables than with the Y (dependent) variable.
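The VIF arithmetic above can be reproduced in a few lines. The following is a minimal sketch, not the SAS or JMP computation: the function names `vif_from_r2` and `r2_from_vif` are hypothetical, and the VIF values are copied from Figure 3. Inverting the formula shows the share of each predictor's variance that the other predictors explain.

```python
def vif_from_r2(r2: float) -> float:
    """VIF for a predictor whose regression on the other predictors
    has coefficient of determination r2."""
    if not 0.0 <= r2 < 1.0:
        raise ValueError("r2 must be in [0, 1)")
    return 1.0 / (1.0 - r2)


def r2_from_vif(vif: float) -> float:
    """Invert the formula: the R_i^2 implied by a reported VIF."""
    if vif < 1.0:
        raise ValueError("vif must be >= 1")
    return 1.0 - 1.0 / vif


# Reference value used in the text: model R-Square from Figure 1.
reference_vif = vif_from_r2(0.5411)
print(round(reference_vif, 4))  # 2.1791

# Variance inflation values as reported in Figure 3.
reported_vif = {
    "x1": 1.30029, "x2": 5.60944, "x3": 4.65465, "x4": 1.69078,
    "x5": 1.44474, "x6": 6.00304, "x7": 6.19107,
}

# Predictors whose VIF exceeds the reference value.
above = [name for name, v in reported_vif.items() if v > reference_vif]
print(above)

# Implied R_i^2 of each predictor regressed on the others.
for name, v in reported_vif.items():
    print(name, round(r2_from_vif(v), 3))
```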

The correlation matrix of variables also is available as an option under SAS Proc Reg (Figure 4) or in JMP. High correlations between variables (e.g. -0.8749 for X2 and X3; -0.8806 for X6 and X7) are indicators of collinearity.
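Scanning a correlation matrix for large off-diagonal entries is easy to automate. The sketch below uses the predictor rows and columns of Figure 4 as printed (intercept omitted). The helper name `high_pairs` and the cutoff of 0.8 are assumptions made for illustration, not values from the SAS or JMP output.

```python
from itertools import combinations

names = ["x1", "x2", "x3", "x4", "x5", "x6", "x7"]

# Predictor-by-predictor entries copied from Figure 4.
corr = [
    [1.0000, 0.1071, -0.0234, 0.1326, -0.0220, 0.4341, -0.4006],
    [0.1071, 1.0000, -0.8749, -0.2753, 0.1527, 0.0182, -0.1891],
    [-0.0234, -0.8749, 1.0000, 0.1931, -0.1080, 0.0355, 0.0917],
    [0.1326, -0.2753, 0.1931, 1.0000, -0.4410, 0.3594, -0.3877],
    [-0.0220, 0.1527, -0.1080, -0.4410, 1.0000, -0.3744, 0.2620],
    [0.4341, 0.0182, 0.0355, 0.3594, -0.3744, 1.0000, -0.8806],
    [-0.4006, -0.1891, 0.0917, -0.3877, 0.2620, -0.8806, 1.0000],
]


def high_pairs(names, corr, threshold=0.8):
    """Return (name_a, name_b, r) for every pair with |r| >= threshold.

    The default threshold is an illustrative assumption.
    """
    return [
        (names[i], names[j], corr[i][j])
        for i, j in combinations(range(len(names)), 2)
        if abs(corr[i][j]) >= threshold
    ]


flagged = high_pairs(names, corr)
for a, b, r in flagged:
    print(f"{a} ~ {b}: r = {r:+.4f}")
```

Pairwise correlations only catch collinearity between two variables at a time, which is one reason the remaining diagnostics are still worth running.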

Eigenvalues near zero from a principal component analysis are an indication of collinearity. They can be generated using JMP Principal Components Analysis on Correlations (see Figure 5, eigenvalue = 0.0810 for Principal Component 7). Eigenvectors may show which variables are involved if there are large "loads" (values) for several variables on principal components with low eigenvalues (in Figure 5, Principal Component 7 has large loads on X6 and X7).

Condition indices are the square roots of the ratio of the largest eigenvalue to each individual eigenvalue. Condition indices greater than 10 indicate that regression estimates may be affected. They can be generated using the SAS Proc Reg collin option, which includes the intercept, or the collinoint option, which adjusts the intercept out first (Figure 6 uses the collinoint option). In this example, none of the condition indices exceeds 10, so none indicates a collinearity problem.
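As a check on the definition, the condition indices can be recomputed from the eigenvalues reported in Figure 6. This is a minimal sketch; the function name `condition_indices` is hypothetical, and the recomputed values should agree with the printed column only up to rounding.

```python
import math

# Eigenvalues as reported in Figure 6 (intercept adjusted).
eigenvalues = [2.73705, 1.53387, 1.00628, 0.95770, 0.58180, 0.10228, 0.08102]


def condition_indices(eigs):
    """Square root of (largest eigenvalue / each eigenvalue)."""
    largest = max(eigs)
    return [math.sqrt(largest / e) for e in eigs]


indices = condition_indices(eigenvalues)
for number, (e, ci) in enumerate(zip(eigenvalues, indices), start=1):
    print(number, e, round(ci, 3))

# Rule of thumb from the text: an index above 10 suggests a problem.
print(any(ci > 10 for ci in indices))  # False
```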

Variance proportions are the proportions of the variance of the estimates accounted for by each principal component. If a principal component with a high condition index contributes significantly to the variance of at least two variables, this is an indication of collinearity (Figure 6). In Figure 6, components 6 and 7 each account for most of the variance of two variables (X2 and X3; X6 and X7), but their condition indices are below 10, so none of the principal components indicates a collinearity problem.
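The two-part rule (high condition index, plus a large variance proportion on at least two variables) can be applied mechanically to Figure 6. In the sketch below the numbers are copied from Figure 6 as printed. The function name `flag_components` and the 0.5 proportion cutoff are assumptions for illustration, and the second call lowers the condition-index cutoff only to show which variables would be implicated.

```python
variables = ["x1", "x2", "x3", "x4", "x5", "x6", "x7"]

# Condition indices and variance proportions as printed in Figure 6.
condition_index = [1.00000, 1.33512, 1.64921, 1.69055, 2.16897, 5.17311, 5.81227]
proportions = [
    [0.01007, 0.01309, 0.010211, 0.02835, 0.01630, 0.01256, 0.01364],
    [0.00307, 0.02982, 0.05043, 0.00244, 0.07124, 0.01557, 0.00735],
    [0.34351, 0.00001201, 0.00066453, 0.03110, 0.14605, 0.01267, 0.03300],
    [0.33275, 0.00001849, 0.00001584, 0.21339, 0.09728, 0.01684, 0.00143],
    [0.09436, 0.00088996, 0.02864, 0.42410, 0.48392, 0.00353, 0.00330],
    [0.00128, 0.87891, 0.86928, 0.15822, 0.07451, 0.06273, 0.00768],
    [0.21496, 0.07726, 0.04077, 0.14241, 0.11069, 0.87610, 0.93361],
]


def flag_components(ci_threshold=10.0, prop_threshold=0.5):
    """List (component number, variables) where the condition index exceeds
    ci_threshold and at least two variables have a variance proportion
    above prop_threshold. Both defaults are illustrative cutoffs."""
    flagged = []
    for number, (ci, row) in enumerate(zip(condition_index, proportions), start=1):
        involved = [v for v, p in zip(variables, row) if p > prop_threshold]
        if ci > ci_threshold and len(involved) >= 2:
            flagged.append((number, involved))
    return flagged


print(flag_components())                   # nothing exceeds an index of 10
print(flag_components(ci_threshold=5.0))   # relaxed cutoff, for illustration only
```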

Correction of collinearity is more difficult than diagnosis. Methods for dealing with collinearity should begin with increasing the sample size, since this should decrease the standard errors. If this is not feasible, removal of intercorrelated variables can be approached using some of the methods I discussed in "How to Select a Useful Model," Scientific Computing, February 2012, such as stepwise regression using SAS Proc Reg or JMP Fit Model. Ensure that interaction terms use centering, i.e. transformation by subtracting the mean. Redefine the variables by using an alternative form, such as a percentage or a per capita value. If these more straightforward approaches don't work, then more elaborate approaches may be needed, such as removing the variance attributable to one of the intercorrelated variables by regressing the others on that variable, or analyzing the common variance as a separate variable.
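The effect of centering on interaction terms can be seen on a small constructed example. The data below are hypothetical: a balanced 3 x 3 grid of two predictors, chosen so that the result is exact. With real, unbalanced data the reduction in correlation is usually partial rather than complete.

```python
import math
from itertools import product


def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    var_a = sum((u - mean_a) ** 2 for u in a)
    var_b = sum((v - mean_b) ** 2 for v in b)
    return cov / math.sqrt(var_a * var_b)


def center(values):
    """Subtract the mean from every value."""
    mean = sum(values) / len(values)
    return [v - mean for v in values]


# Hypothetical balanced grid: every x level paired with every z level.
grid = list(product([10.0, 12.0, 14.0], [20.0, 25.0, 30.0]))
x = [pair[0] for pair in grid]
z = [pair[1] for pair in grid]

# Raw interaction term is strongly correlated with its own component.
raw_interaction = [a * b for a, b in zip(x, z)]
r_raw = pearson(x, raw_interaction)

# Centering first removes that correlation on this balanced design.
xc, zc = center(x), center(z)
centered_interaction = [a * b for a, b in zip(xc, zc)]
r_centered = pearson(xc, centered_interaction)

print(round(r_raw, 3), round(r_centered, 3))
```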

The variables also can be transformed to principal components, where those with small eigenvalues can be eliminated, but the larger question is whether these principal components are interpretable. Ridge regression also can be considered where other options don't work. It introduces a small bias in exchange for a reduction in sampling variances.
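The bias-for-variance trade in ridge regression comes from adding a constant k to the diagonal of X'X before solving the normal equations. The sketch below works this out by hand for two standardized predictors. The data, the function names and the values of k are all hypothetical. In practice k would be chosen with a ridge trace or cross-validation, and this is not how SAS or JMP implement it.

```python
import math


def standardize(col):
    """Center a column and scale it to unit length (correlation form)."""
    mean = sum(col) / len(col)
    centered = [v - mean for v in col]
    norm = math.sqrt(sum(v * v for v in centered))
    return [v / norm for v in centered]


def ridge_two_predictors(x1, x2, y, k):
    """Solve (X'X + k*I) b = X'y for two standardized predictors.

    Coefficients are on the standardized scale; k = 0 gives least squares.
    """
    z1, z2 = standardize(x1), standardize(x2)
    mean_y = sum(y) / len(y)
    yc = [v - mean_y for v in y]
    r12 = sum(a * b for a, b in zip(z1, z2))  # correlation of the predictors
    g1 = sum(a * b for a, b in zip(z1, yc))
    g2 = sum(a * b for a, b in zip(z2, yc))
    diag = 1.0 + k                            # diagonal of X'X + k*I
    det = diag * diag - r12 * r12
    b1 = (diag * g1 - r12 * g2) / det
    b2 = (diag * g2 - r12 * g1) / det
    return b1, b2


# Hypothetical data: x2 is nearly a copy of x1.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
x2 = [1.1, 1.9, 3.2, 3.8, 5.1, 5.9, 7.2, 7.8]
y = [3.3, 4.6, 7.4, 8.5, 11.2, 12.7, 15.5, 16.4]

for k in [0.0, 0.01, 0.1, 1.0]:
    b1, b2 = ridge_two_predictors(x1, x2, y, k)
    print(k, round(b1, 3), round(b2, 3), round(math.hypot(b1, b2), 3))
```

As k grows, the length of the coefficient vector shrinks, which is the reduction in sampling variance bought with a small bias.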

During the creation of regression models, collinearity can occur, which causes invalid results for individual regressors, although the overall model can still have good predictive ability. There are several indicators of collinearity, but diagnostic tests, such as VIF, should be performed. Once collinearity is identified, a strategy for dealing with it should proceed from the simplest (sample size increase) to the more complex if needed (ridge regression).

Mark Anawis is a Principal R&D Scientist and ASQ Six Sigma Black Belt at Abbott. He may be reached at editor@ScientificComputing.com.

Figure 1. SAS Proc Reg output (The SAS System, 09:35 Friday, January 4, 2013). The REG Procedure, Model: MODEL1, Dependent Variable: y1. Number of observations read: 30. Number of observations used: 30.

Analysis of Variance

| Source | DF | Sum of Squares | Mean Square | F Value | Pr > F |
|---|---|---|---|---|---|
| Model | 7 | 559.90107 | 79.98587 | 3.71 | 0.0085 |
| Error | 22 | 474.84506 | 21.58387 | | |
| Corrected Total | 29 | 1034.74613 | | | |

| Statistic | Value | Statistic | Value |
|---|---|---|---|
| Root MSE | 4.64514 | R-Square | 0.5411 |
| Dependent Mean | 47.02440 | Adj R-Sq | 0.3951 |
| Coeff Var | 9.87965 | | |

Parameter Estimates

| Variable | DF | Parameter Estimate | Standard Error | t Value | Pr > \|t\| |
|---|---|---|---|---|---|
| Intercept | 1 | 83.70755 | 18.87230 | 4.44 | 0.0002 |
| x1 | 1 | -0.10111 | 0.15537 | -0.65 | 0.5219 |
| x2 | 1 | -0.19813 | 0.26369 | -0.75 | 0.4604 |
| x3 | 1 | 8.14219 | 11.07504 | 0.74 | 0.4700 |
| x4 | 1 | -1.86542 | 0.58301 | -3.20 | 0.0041 |
| x5 | 1 | -0.08212 | 0.13194 | -0.62 | 0.5401 |
| x6 | 1 | -0.35242 | 0.18633 | -1.89 | 0.0718 |
| x7 | 1 | 0.31406 | 0.21568 | 1.46 | 0.1595 |

Figure 3. Parameter Estimates with Variance Inflation

| Variable | DF | Parameter Estimate | Standard Error | t Value | Pr > \|t\| | Variance Inflation |
|---|---|---|---|---|---|---|
| Intercept | 1 | 83.70755 | 18.87230 | 4.44 | 0.0002 | 0 |
| x1 | 1 | -0.10111 | 0.15537 | -0.65 | 0.5219 | 1.30029 |
| x2 | 1 | -0.19813 | 0.26369 | -0.75 | 0.4604 | 5.60944 |
| x3 | 1 | 8.14219 | 11.07504 | 0.74 | 0.4700 | 4.65465 |
| x4 | 1 | -1.86542 | 0.58301 | -3.20 | 0.0041 | 1.69078 |
| x5 | 1 | -0.08212 | 0.13194 | -0.62 | 0.5401 | 1.44474 |
| x6 | 1 | -0.35242 | 0.18633 | -1.89 | 0.0718 | 6.00304 |
| x7 | 1 | 0.31406 | 0.21568 | 1.46 | 0.1595 | 6.19107 |

Figure 4. Correlation of Estimates

| Variable | Intercept | x1 | x2 | x3 | x4 | x5 | x6 | x7 |
|---|---|---|---|---|---|---|---|---|
| Intercept | 1.0000 | -0.4480 | 0.0842 | -0.2567 | 0.0634 | -0.1758 | -0.1211 | -0.2129 |
| x1 | -0.4480 | 1.0000 | 0.1071 | -0.0234 | 0.1326 | -0.0220 | 0.4341 | -0.4006 |
| x2 | 0.0842 | 0.1071 | 1.0000 | -0.8749 | -0.2753 | 0.1527 | 0.0182 | -0.1891 |
| x3 | -0.2567 | -0.0234 | -0.8749 | 1.0000 | 0.1931 | -0.1080 | 0.0355 | 0.0917 |
| x4 | 0.0634 | 0.1326 | -0.2753 | 0.1931 | 1.0000 | -0.4410 | 0.3594 | -0.3877 |
| x5 | -0.1758 | -0.0220 | 0.1527 | -0.1080 | -0.4410 | 1.0000 | -0.3744 | 0.2620 |
| x6 | -0.1211 | 0.4341 | 0.0182 | 0.0355 | 0.3594 | -0.3744 | 1.0000 | -0.8806 |
| x7 | -0.2129 | -0.4006 | -0.1891 | 0.0917 | -0.3877 | 0.2620 | -0.8806 | 1.0000 |

Figure 6. Collinearity Diagnostics (intercept adjusted). Columns X1 through X7 give the variance proportions.

| Number | Eigenvalue | Condition Index | X1 | X2 | X3 | X4 | X5 | X6 | X7 |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 2.73705 | 1.00000 | 0.01007 | 0.01309 | 0.010211 | 0.02835 | 0.01630 | 0.01256 | 0.01364 |
| 2 | 1.53387 | 1.33512 | 0.00307 | 0.02982 | 0.05043 | 0.00244 | 0.07124 | 0.01557 | 0.00735 |
| 3 | 1.00628 | 1.64921 | 0.34351 | 0.00001201 | 0.00066453 | 0.03110 | 0.14605 | 0.01267 | 0.03300 |
| 4 | 0.95770 | 1.69055 | 0.33275 | 0.00001849 | 0.00001584 | 0.21339 | 0.09728 | 0.01684 | 0.00143 |
| 5 | 0.58180 | 2.16897 | 0.09436 | 0.00088996 | 0.02864 | 0.42410 | 0.48392 | 0.00353 | 0.00330 |
| 6 | 0.10228 | 5.17311 | 0.00128 | 0.87891 | 0.86928 | 0.15822 | 0.07451 | 0.06273 | 0.00768 |
| 7 | 0.08102 | 5.81227 | 0.21496 | 0.07726 | 0.04077 | 0.14241 | 0.11069 | 0.87610 | 0.93361 |


Title Annotation: | DATA ANALYSIS |
---|---|

Author: | Anawis, Mark |

Publication: | Scientific Computing |

Geographic Code: | 1USA |

Date: | Mar 1, 2013 |

Words: | 1515 |
