
Dealing with collinearity: collinearity can occur during the creation of regression models, causing invalid results for individual regressors.

William James once said: "We must be careful not to confuse data with the abstractions we use to analyze them." Collinearity is a problem that occurs during the creation of regression models. It is the presence of intercorrelation among predictor variables. In other words, it occurs when a regressor is a linear combination of one or more other regressors. Although a model with collinearity still may have good predictive ability, the results for an individual regressor may not be valid. I dealt with model building in a previous article ("How to Select a Useful Model," Scientific Computing, February 2012); this article is a follow-up. I will divide the topic into indicators of collinearity, diagnostic tests for collinearity, and correction of collinearity.

Indicators of collinearity include:

- insignificant parameter tests for theoretically important parameters
- insignificant parameter tests while the whole-model test is significant
- large standard errors for regression coefficients
- extreme variability in parameters across samples
- large changes in parameters when the data change or when other variables are added or removed
- unexpected signs for parameters
- decreases in regression standard errors when a variable is removed

These can be determined from the output tables of a standard analysis, such as SAS Proc Reg (Figure 1); a minimal call is sketched below.
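
For reference, a call of this form would reproduce the whole-model and parameter-estimate tables in Figure 1; the data set name work.sample is a placeholder, since the article does not give one:

   /* Fit the Figure 1 model; work.sample is a hypothetical data set name */
   proc reg data=work.sample;
      model y1 = x1-x7;   /* y1 regressed on the seven predictors */
   run;
   quit;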

A visual indication can be found using Leverage Plots in the JMP Fit Model platform. Collinearity appears as shrinkage of points in the X direction (Figure 2). X3 is involved in collinearity, since its values shrink toward the center of the plot. X4, shown for comparison, is not involved, since its values are dispersed along the X axis.

Diagnostic tests for collinearity are: VIF (variance inflation factor), the correlation matrix of variables, eigenvalues and eigenvectors, condition indices, and variance proportions. VIF is calculated as 1 / (1 - Ri^2), where Ri^2 is the coefficient of determination of the regression of the ith input variable on all the other input variables. It shows how much collinearity inflates the variance of a coefficient estimate. It is available as an option under SAS Proc Reg (Figure 3) or in JMP. Using the whole-model R-Square = 0.5411 from Figure 1, we can calculate a benchmark VIF = 1 / (1 - 0.5411) = 2.1791. Variables with VIF greater than this value (i.e. X2, X3, X6 and X7) are more closely associated with the other X (independent) variables than with the Y (dependent) variable.
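
The same diagnostic in code form, with the benchmark worked out in a comment (again assuming the hypothetical work.sample data set):

   /* Request VIFs for each regressor (Figure 3) */
   proc reg data=work.sample;
      model y1 = x1-x7 / vif;
   run;
   quit;
   /* Benchmark from the whole-model fit: 1 / (1 - 0.5411) = 2.1791 */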

The correlation matrix also is available as an option under SAS Proc Reg (Figure 4, which shows the correlations of the parameter estimates) or in JMP. High correlations (e.g. -0.8749 for X2 and X3; -0.8806 for X6 and X7) are indicators of collinearity.
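
In code, the corrb option produces the Figure 4 table, while Proc Corr gives the correlation matrix of the regressors themselves (both sketches assume the hypothetical work.sample data set):

   /* Correlations of the parameter estimates (Figure 4) */
   proc reg data=work.sample;
      model y1 = x1-x7 / corrb;
   run;
   quit;

   /* Correlation matrix of the regressors */
   proc corr data=work.sample;
      var x1-x7;
   run;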

In principal component analysis, eigenvalues near zero are an indication of collinearity. These can be generated using JMP Principal Components Analysis on Correlations (see Figure 5, eigenvalue = 0.0810 for Principal Component 7). The eigenvectors may show which variables are involved if several variables have large "loads" (values) on a principal component with a low eigenvalue (Figure 5, Principal Component 7 has large loads on X6 and X7).
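
In SAS, Proc Princomp produces the equivalent eigenvalue and eigenvector tables from the correlation matrix (hypothetical work.sample data set):

   /* Eigenvalues and eigenvectors of the regressor correlation matrix */
   proc princomp data=work.sample;
      var x1-x7;
   run;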

[FIGURE 2 OMITTED]

Condition indices are the square roots of the ratio of the largest eigenvalue to each individual eigenvalue. Condition indices greater than 10 are an indication that regression estimates may be affected. These can be generated using the SAS Proc Reg collin option, which includes the intercept, or the collinoint option, which adjusts the intercept out first (Figure 6 uses the collinoint option). None of the condition indices indicate a collinearity problem.
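
In code form, both diagnostics come from the same options (hypothetical work.sample data set; the collinoint output is the Figure 6 table, which also contains the variance proportions discussed next):

   /* Condition indices and variance proportions,
      with and without intercept adjustment */
   proc reg data=work.sample;
      model y1 = x1-x7 / collin collinoint;
   run;
   quit;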

Variance proportions are the proportions of the variance of the estimates accounted for by each principal component. A principal component with a high condition index that contributes significantly to the variance of at least two variables is an indication of collinearity (Figure 6). None of the principal components indicate a collinearity problem.

Correction of collinearity is more difficult than diagnosis. Begin by increasing the sample size, since this should decrease the standard errors. If that is not feasible, removal of intercorrelated variables can be approached using some of the methods I discussed in "How to Select a Useful Model" (Scientific Computing, February 2012), such as stepwise regression using SAS Proc Reg or JMP Fit Model. Ensure that interaction terms are centered, i.e. transformed by subtracting the mean (see the sketch below). The variables also can be redefined in an alternative form, such as a percentage or per capita value. If these more straightforward approaches don't work, more elaborate approaches may be needed, such as removing the variance in one of the intercorrelated variables by regressing the others on that variable, or analyzing the common variance as a separate variable.
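
A minimal centering sketch (work.sample and the choice of x2 and x3 are placeholders for illustration):

   /* Center x2 and x3 by subtracting their means, then form the interaction */
   proc standard data=work.sample mean=0 out=centered;
      var x2 x3;
   run;

   data centered;
      set centered;
      x2x3 = x2 * x3;   /* interaction term built from centered variables */
   run;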

The variables also can be transformed to principal components, and those with small eigenvalues can be eliminated, but the larger question is whether the retained principal components are interpretable. Ridge regression also can be considered when other options don't work. It introduces a small bias in exchange for a reduction in sampling variances.
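
Proc Reg supports a ridge trace directly; the grid of ridge parameters below is an assumption for illustration, not a value from the article:

   /* Ridge regression over a small grid of ridge parameters k */
   proc reg data=work.sample outest=ridge_est outvif
            ridge=0 to 0.1 by 0.02;
      model y1 = x1-x7;
   run;
   quit;

   proc print data=ridge_est;   /* coefficients and VIFs at each k */
   run;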

During the creation of regression models, collinearity can occur, which invalidates results for individual regressors even though the overall model still can have good predictive ability. There are several indicators of collinearity, but diagnostic tests, such as VIF, should be performed. Once collinearity is identified, a strategy for dealing with it should proceed from the simplest approach (increasing the sample size) to the more complex as needed (ridge regression).

[FIGURE 5 OMITTED]

Mark Anawis is a Principal R&D Scientist and ASQ Six Sigma Black Belt at Abbott. He may be reached at editor@ScientificComputing.com.
Figure 1

The SAS System 09:35 Friday, January 4, 2013 109

 The REG Procedure
 Model: MODEL1
 Dependent Variable: y1

Number of observations Read 30
Number of observations Used 30

 Analysis of Variance

 Sum of Mean
Source DF Squares Square F Value Pr > F

Model 7 559.90107 79.98587 3.71 0.0085
Error 22 474.84506 21.58387
Corrected 29 1034.74613
 Total

Root MSE 4.64514 R-Square 0.5411
Dependent Mean 47.02440 Adj R-Sq 0.3951
Coeff Var 9.87965

 Parameter Estimates

 Parameter Standard
Variable DF Estimate Error t Value Pr > |t|

Intercept 1 83.70755 18.87230 4.44 0.0002
x1 1 -0.10111 0.15537 -0.65 0.5219
x2 1 -0.19813 0.26369 -0.75 0.4604
x3 1 8.14219 11.07504 0.74 0.4700
x4 1 -1.86542 0.58301 -3.20 0.0041
x5 1 -0.08212 0.13194 -0.62 0.5401
x6 1 -0.35242 0.18633 -1.89 0.0718
x7 1 0.31406 0.21568 1.46 0.1595

Figure 3
 Parameter Estimates

 Parameter Standard
Variable DF Estimate Error

Intercept 1 83.70755 18.87230
x1 1 -0.10111 0.15537
x2 1 -0.19813 0.26369
x3 1 8.14219 11.07504
x4 1 -1.86542 0.58301
x5 1 -0.08212 0.13194
x6 1 -0.35242 0.18633
x7 1 0.31406 0.21568

 Parameter Estimates

 Variance
Variable t Value Pr > |t| Inflation

Intercept 4.44 0.0002 0
x1 -0.65 0.5219 1.30029
x2 -0.75 0.4604 5.60944
x3 0.74 0.4700 4.65465
x4 -3.20 0.0041 1.69078
x5 -0.62 0.5401 1.44474
x6 -1.89 0.0718 6.00304
x7 1.46 0.1595 6.19107

Figure 4

 Correlation of Estimates

Variable Intercept x1 x2 x3

Intercept 1.0000 -0.4480 0.0842 -0.2567
x1 -0.4480 1.0000 0.1071 -0.0234
x2 0.0842 0.1071 1.0000 -0.8749
x3 -0.2567 -0.0234 -0.8749 1.0000
x4 0.0634 0.1326 -0.2753 0.1931
x5 -0.1758 -0.0220 0.1527 -0.1080
x6 -0.1211 0.4341 0.0182 0.0355
x7 -0.2129 -0.4006 -0.1891 0.0917

 Correlation of Estimates

Variable x4 x5 x6 x7

Intercept 0.0634 -0.1758 -0.1211 -0.2129
x1 0.1326 -0.0220 0.4341 -0.4006
x2 -0.2753 0.1527 0.0182 -0.1891
x3 0.1931 -0.1080 0.0355 0.0917
x4 1.0000 -0.4410 0.3594 -0.3877
x5 -0.4410 1.0000 -0.3744 0.2620
x6 0.3594 -0.3744 1.0000 -0.8806
x7 -0.3877 0.2620 -0.8806 1.0000

Figure 6

 Collinearity Diagnostics (intercept adjusted)

 Condition
Number Eigenvalue Index X1 X2

 1 2.73705 1.00000 0.01007 0.01309
 2 1.53387 1.33512 0.00307 0.02982
 3 1.00628 1.64921 0.34351 0.00001201
 4 0.95770 1.69055 0.33275 0.00001849
 5 0.58180 2.16897 0.09436 0.00088996
 6 0.10228 5.17311 0.00128 0.87891
 7 0.08102 5.81227 0.21496 0.07726

 Collinearity Diagnostics (intercept adjusted)

Number X3 X4 X5 X6 X7

 1 0.010211 0.02835 0.01630 0.01256 0.01364
 2 0.05043 0.00244 0.07124 0.01557 0.00735
 3 0.00066453 0.03110 0.14605 0.01267 0.03300
 4 0.00001584 0.21339 0.09728 0.01684 0.00143
 5 0.02864 0.42410 0.48392 0.00353 0.00330
 6 0.86928 0.15822 0.07451 0.06273 0.00768
 7 0.04077 0.14241 0.11069 0.87610 0.93361