
Two paradoxes in linear regression analysis.

Summary: Regression is one of the favorite tools in applied statistics. However, misuse and misinterpretation of results from regression analysis are common in biomedical research. In this paper we use statistical theory and simulation studies to clarify some paradoxes around this popular statistical method. In particular, we show that a widely used model selection procedure employed in many publications in top medical journals is wrong. Formal procedures based on solid statistical theory should be used in model selection.

Key words: forward selection; backward elimination; univariate regression; multiple regression

[Shanghai Arch Psychiatry. 2016; 28(6): 355-360.]

1. Introduction

Linear regression is the most widely used statistical model in data analysis. [1] The wide availability and ease of use of statistical software packages such as SAS, SPSS and R make linear regression accessible to people without any formal statistical training. Although wise use of statistical methods such as linear regression helps everyone, even novices, develop a better understanding of data and guides decisions, careless use also causes confusion in the interpretation of results and leads to paradoxical findings. For example, our biomedical collaborators often ask us questions like "When I run the univariate regression of Y on the predictor X, the p-value is very small. However, if I add some other predictors to the model, X is not significant anymore. Why?" The same problem also occurs in logistic regression for binary outcomes [2], log-linear regression for count data [2], and Cox proportional hazards regression for survival data. [3]

A simple answer to this question is that the univariate and multiple regression models make different assumptions. However, this answer means little to non-statisticians, so we discuss it in detail in Section 2.

In many medical studies, regression analysis involves a large number of independent variables, or predictors. Model selection is required to find the predictors that are significantly associated with an outcome, or dependent variable, of interest. Here is how the model selection was done in a recent paper published in JAMA Surgery [4]:

"The administrative database was then evaluated by means of univariate and multivariate logistic regression. First we identified variables that were associated (P < .20) with readmission, the dependent variable. These potential confounders were then entered in multivariate stepwise (backward elimination) logistic regression, with readmission as the dependent variable. A logistic regression model was constructed to identify patient factors associated with readmission."

This forward selection procedure, used as the first step to weed out "non-significant" predictors, has become almost the gold standard for variable selection and has been used in many papers published in top medical journals. [5-24] The key idea of this method is first to run a univariate regression on each predictor. If the p-value is less than some pre-specified level, for example 0.1, then the predictor is used in the multiple regression. Otherwise, the predictor is assumed to have no significant effect on the outcome. This method seems quite logical and intuitively meaningful. Indeed, it has been used and is still being used by the biomedical and other research communities. Is this a valid procedure?

In this paper we use linear regression analysis to show two paradoxes in regression analysis. In Section 2 we use some very basic theory to show how the univariate regression and multiple regression make different assumptions on the models. We use examples and simulation studies to show two paradoxes in regression analysis in Section 3. Section 4 briefly discusses the transitivity of correlation. Our results clearly invalidate the model selection procedure widely used in biomedical research.

2. Basic theory

Let $(Y, X_1, \ldots, X_p)$ be a random vector, where $X_1, \ldots, X_p$ are called the covariates (independent variables) and $Y$ is called the outcome (dependent variable). The regression of $Y$ on $(X_1, \ldots, X_p)$ is the conditional expectation of $Y$ given $(X_1, \ldots, X_p)$, denoted by $E[Y \mid X_1, \ldots, X_p]$, which is a measurable function of $(X_1, \ldots, X_p)$. Denote this function by $g(X_1, \ldots, X_p)$. Without knowing the joint distribution of $(X_1, \ldots, X_p, Y)$, the form of $g(X_1, \ldots, X_p)$ is in general unknown. In statistical analysis, we usually assume some mathematically tractable form of $g(X_1, \ldots, X_p)$. For example, linear regression analysis [1] assumes that

$$g(X_1, \ldots, X_p) = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p.$$

In the logistic regression analysis with 0-1 outcome [2], we assume that

$$g(X_1, \ldots, X_p) = \frac{\exp(\beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p)}{1 + \exp(\beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p)}.$$

In this paper we assume the outcome $Y$ is continuous. Let

$$\epsilon = Y - E[Y \mid X_1, \ldots, X_p].$$

It is obvious that $E[\epsilon \mid X_1, \ldots, X_p] = 0$. We consider a stronger form of the linear regression model

$$Y = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p + \epsilon \tag{1}$$

and assume that, given $X_1, \ldots, X_p$, the variance of $\epsilon$,

$$\mathrm{Var}[\epsilon \mid X_1, \ldots, X_p] = \mathrm{Var}[Y \mid X_1, \ldots, X_p] = \sigma^2,$$

does not depend on $(X_1, \ldots, X_p)$. This assumption is also used in most of the statistical literature on linear models. [1] We further assume that $X_k$, $k = 1, \ldots, p$, have finite second moments.

From (1) we have

$$E[Y \mid X_1] = \beta_0 + \beta_1 X_1 + \sum_{k=2}^{p} \beta_k E[X_k \mid X_1]. \tag{2}$$

Let $Z_k = E[X_k \mid X_1]$, $k = 1, \ldots, p$. (It is clear that $Z_1 = X_1$.) Then the regression of $Y$ on $X_1$ is

$$E[Y \mid X_1] = \beta_0 + \beta_1 Z_1 + \cdots + \beta_p Z_p,$$

which still has a linear form. Let $\eta = Y - E[Y \mid X_1]$. Then

$$Y = \beta_0 + \beta_1 Z_1 + \beta_2 Z_2 + \cdots + \beta_p Z_p + \eta. \tag{3}$$

Although (3) has the same form as (1), they are fundamentally different in the error terms. Note that $E[\eta \mid X_1] = 0$ and $\mathrm{Cov}(Z_k, \eta) = 0$, $k = 1, \ldots, p$. However, the conditional variance of $\eta$ given $X_1$ is

$$\mathrm{Var}[\eta \mid X_1] = \sigma^2 + \mathrm{Var}\Big[\sum_{k=2}^{p} \beta_k X_k \,\Big|\, X_1\Big].$$

Therefore, the conditional variance of $\eta$ given $X_1$ is in general no longer a constant. This violates the fundamental assumption used in the linear regression model. [1]
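The following derivation sketches why the conditional variance of $\eta$ depends on $X_1$. It uses only model (1), the definitions $Z_k = E[X_k \mid X_1]$ and $\eta = Y - E[Y \mid X_1]$, and the assumptions already stated; the intermediate steps are a sketch under those assumptions.

```latex
\begin{align*}
\eta &= Y - E[Y \mid X_1] = \sum_{k=2}^{p} \beta_k (X_k - Z_k) + \epsilon, \\
\mathrm{Var}[\eta \mid X_1]
  &= \mathrm{Var}\Big[\sum_{k=2}^{p} \beta_k X_k \,\Big|\, X_1\Big]
   + \mathrm{Var}[\epsilon \mid X_1]
   + 2\,\mathrm{Cov}\Big[\sum_{k=2}^{p} \beta_k X_k,\ \epsilon \,\Big|\, X_1\Big] \\
  &= \mathrm{Var}\Big[\sum_{k=2}^{p} \beta_k X_k \,\Big|\, X_1\Big] + \sigma^2 .
\end{align*}
```

The cross term vanishes because $E[\epsilon \mid X_1, \ldots, X_p] = 0$, and $\mathrm{Var}[\epsilon \mid X_1] = \sigma^2$ follows from the law of total variance. As a hypothetical illustration, take $p = 2$ and $X_2 = X_1 W$ with $W$ independent of $X_1$, $E[W] = 0$ and $\mathrm{Var}(W) = 1$; then $\mathrm{Var}[\eta \mid X_1] = \sigma^2 + \beta_2^2 X_1^2$, which clearly changes with $X_1$.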

The univariate linear regression of $Y$ on $X_1$ assumes the following form of the model:

$$Y = \gamma_0 + \gamma_1 X_1 + \zeta. \tag{4}$$

From (3) we know that, in general,

$$E[\zeta \mid X_1] \neq 0, \quad \mathrm{Cov}(\zeta, X_1) \neq 0.$$

Suppose $(Y_i, X_{i1}, \ldots, X_{ip})$, $i = 1, \ldots, n$, is a random sample from (1). Let

$$\bar{Y} = \frac{1}{n}\sum_{i=1}^{n} Y_i, \quad \bar{X}_1 = \frac{1}{n}\sum_{i=1}^{n} X_{i1}.$$

Let $\hat{\gamma}_1$ be the least squares estimate of the slope in the univariate regression of $Y_i$ on $X_{i1}$ in (4). Then

$$\hat{\gamma}_1 = \frac{\sum_{i=1}^{n} (X_{i1} - \bar{X}_1)(Y_i - \bar{Y})}{\sum_{i=1}^{n} (X_{i1} - \bar{X}_1)^2} \to \frac{\mathrm{Cov}(X_1, Y)}{\mathrm{Var}(X_1)} = \beta_1 + \sum_{k=2}^{p} \beta_k \frac{\mathrm{Cov}(X_1, X_k)}{\mathrm{Var}(X_1)} \tag{5}$$

as $n \to \infty$. Let $\hat{\beta}_1$ be the least squares estimator of $\beta_1$ in (1). It is well known that $E[\hat{\beta}_1] = \beta_1$ and $\hat{\beta}_1 \to \beta_1$. Hence the estimates from the univariate regression and the multiple regression usually converge to different limits. In the special case that $X_1$ and the other covariates are uncorrelated, the limits are the same.
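The two limits can be checked numerically. The sketch below is a minimal illustration, not the simulation code used for this paper: it assumes two covariates with unit variance and covariance `rho`, and the parameter values (`beta0`, `beta1`, `beta2`, `rho`, `n`) and helper names are hypothetical choices made only for this example. It compares the univariate slope with the value $\beta_1 + \beta_2 \mathrm{Cov}(X_1, X_2)/\mathrm{Var}(X_1)$ and with the multiple regression estimate of $\beta_1$.

```python
import numpy as np


def univariate_slope(x, y):
    """Least-squares slope of y on a single predictor x (intercept included)."""
    xc = x - x.mean()
    return float(xc @ (y - y.mean()) / (xc @ xc))


def multiple_coefs(predictors, y):
    """Least-squares coefficients of y on several predictors, intercept first."""
    design = np.column_stack([np.ones(len(y))] + list(predictors))
    coefs, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coefs


# Hypothetical parameter values, chosen only for illustration.
beta0, beta1, beta2 = 1.0, 2.0, -1.0
rho = 0.5        # Cov(X1, X2); both covariates have variance 1
n = 200_000

rng = np.random.default_rng(0)
x1 = rng.standard_normal(n)
x2 = rho * x1 + np.sqrt(1.0 - rho**2) * rng.standard_normal(n)
y = beta0 + beta1 * x1 + beta2 * x2 + rng.standard_normal(n)

gamma1_hat = univariate_slope(x1, y)            # univariate regression of Y on X1
beta1_hat = float(multiple_coefs([x1, x2], y)[1])  # coefficient of X1 in the full model
predicted_limit = beta1 + beta2 * rho / 1.0     # beta1 + beta2 * Cov(X1, X2) / Var(X1)

print(f"univariate slope of X1     : {gamma1_hat:.3f}")
print(f"limit predicted for it     : {predicted_limit:.3f}")
print(f"multiple-regression beta1  : {beta1_hat:.3f}")
```

With correlated covariates the univariate slope settles near the predicted limit rather than near $\beta_1$; setting `rho = 0` in the sketch makes the two estimates agree.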

3. Two paradoxes in linear regression analysis

In this section we show why the estimates of the coefficient of some covariates in the univariate regression and in the multiple regression do not match. More specifically, we show that in some cases, the estimate from the univariate regression is significant, but the result from the multiple regression is not. On the other hand, in some cases, the result is significant for the multiple regression but not for the univariate regression.

Suppose (1) is the true multiple regression model. The univariate regression uses model (4), assuming that $E[\zeta \mid X_1] = 0$. This assumption is generally wrong unless $E[X_k \mid X_1]$ is a constant ($k = 2, \ldots, p$). Hence, when the multiple regression model is correct, the estimate from the univariate analysis is based on a wrong model. This is the reason why the results from the univariate regression and the multiple regression do not match. Furthermore, result (5) shows that there is no clear interpretation of the estimate in the univariate analysis.

We discuss two paradoxes related to univariate and multiple regressions through both theoretical derivations and simulation studies.

3.1 Significant covariate effect in multiple regression but not in univariate regression

Let $X_2$, $X_3$, $X_4$ and $\epsilon$ be independent random variables with standard normal distributions. Consider the following model:

$$Y = \alpha_0 + \alpha_1 X_1 + \alpha_2 X_2 + \alpha_3 X_3 + \epsilon, \tag{6}$$

where $\alpha_k \neq 0$, $k = 0, 1, 2, 3$, and

$$X_1 = \beta_1 X_2 + \beta_2 X_4,$$

where $\beta_1 \beta_2 \neq 0$. Then

$$\mathrm{Cov}(X_1, Y) = \alpha_1(\beta_1^2 + \beta_2^2) + \alpha_2 \beta_1,$$

which is 0 if and only if

$$\alpha_1 = -\frac{\alpha_2 \beta_1}{\beta_1^2 + \beta_2^2}. \tag{7}$$

From (5) we know that if (7) is true, the least squares estimator $\hat{\gamma}_1$ of the coefficient in the univariate regression of $Y$ on $X_1$ converges to 0 and will generally not be significant, even though $X_1$ is necessary in specifying model (6).

Example 1. Let $\alpha_0 = 1$, $\alpha_1 = -3/5$, $\alpha_2 = 3$, $\alpha_3 = 4$, $\beta_1 = 1$, $\beta_2 = 2$ in (6) and in the definition of $X_1$. The true model is

$$Y = 1 - \tfrac{3}{5} X_1 + 3 X_2 + 4 X_3 + \epsilon. \tag{8}$$
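As a quick arithmetic check, assuming $X_1 = \beta_1 X_2 + \beta_2 X_4$ with $X_2$ and $X_4$ independent standard normal (the construction that agrees with Tables 1 and 2), these parameter values make $X_1$ uncorrelated with $Y$:

```latex
\begin{align*}
\mathrm{Var}(X_1) &= \beta_1^2 + \beta_2^2 = 1 + 4 = 5, \\
\mathrm{Cov}(X_1, Y) &= \alpha_1 \mathrm{Var}(X_1) + \alpha_2 \mathrm{Cov}(X_1, X_2)
  = -\tfrac{3}{5} \cdot 5 + 3 \cdot 1 = 0 .
\end{align*}
```

So the univariate slope of $Y$ on $X_1$ converges to $\mathrm{Cov}(X_1, Y)/\mathrm{Var}(X_1) = 0$, even though the coefficient of $X_1$ in the true model is $-3/5$.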

Table 1 shows the simulation result of the estimates and standard deviations of the coefficient of $X_1$ in both univariate and multiple regressions after 10,000 replications. For a wide range of sample sizes, the least squares estimator of the coefficient of $X_1$ in the multiple regression is very close to the true value, and the standard deviation decreases significantly with the sample size. However, the estimate of the coefficient in the univariate analysis is very close to 0 in all cases.

According to the practice in medical publications [4-24], $X_1$ will not enter the multiple regression. Table 2 shows the result of the least squares estimates of the coefficients of $X_2$ and $X_3$ after $X_1$ is removed from (8). It is easy to see that the estimate of the coefficient of $X_2$ is dramatically biased in the multiple regression after $X_1$ is removed due to the univariate analysis.
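The pattern in Tables 1 and 2 can be reproduced in outline with a single large simulated sample. The sketch below is not the authors' simulation code: it assumes the construction $X_1 = X_2 + 2X_4$ (that is, $\beta_1 = 1$, $\beta_2 = 2$), and the function names `ols` and `simulate_example1`, the sample size and the seed are hypothetical. Substituting this $X_1$ into (8) gives a coefficient of $3 - \tfrac{3}{5} \cdot 1 = 2.4$ on $X_2$ once $X_1$ is dropped, which is the value the estimates in Table 2 cluster around.

```python
import numpy as np


def ols(predictors, y):
    """Least-squares coefficients of y on the given predictors, intercept first."""
    design = np.column_stack([np.ones(len(y))] + list(predictors))
    coefs, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coefs


def simulate_example1(n, seed=0):
    """One simulated sample from model (8); returns the four slopes of interest."""
    rng = np.random.default_rng(seed)
    x2, x3, x4, eps = rng.standard_normal((4, n))
    x1 = 1.0 * x2 + 2.0 * x4                          # assumed: beta1 = 1, beta2 = 2
    y = 1.0 - 0.6 * x1 + 3.0 * x2 + 4.0 * x3 + eps    # model (8)
    reduced = ols([x2, x3], y)                        # model fitted after X1 is screened out
    return {
        "x1_univariate": float(ols([x1], y)[1]),
        "x1_multiple": float(ols([x1, x2, x3], y)[1]),
        "x2_without_x1": float(reduced[1]),
        "x3_without_x1": float(reduced[2]),
    }


result = simulate_example1(n=100_000)
for name, value in result.items():
    print(f"{name:>14s}: {value: .3f}")
```

A univariate screen would discard $X_1$ here because its slope is near 0, and the reduced model then reports a coefficient for $X_2$ near 2.4 instead of the true value 3, while the coefficient of $X_3$ is unaffected because $X_3$ is independent of $X_1$.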

3.2 Significant covariate effect in univariate regression but not in multiple regression

Suppose $X_1$, $X_2$, $X_3$ and $\epsilon$ are independent standard normal random variables, and $X_4 = \beta_1 X_1 + \beta_2 X_3$, where

$$\beta_1 \beta_2 \neq 0.$$

Suppose the true model is

$$Y = \alpha_0 + \alpha_1 X_1 + \alpha_2 X_2 + \epsilon. \tag{9}$$

If (9) is expanded to include $X_4$ and the expanded model still satisfies the conditions of the linear regression, then the regression equation becomes

$$Y = \delta_0 + \delta_1 X_1 + \delta_2 X_2 + \delta_3 X_4 + \epsilon' = \delta_0 + (\delta_1 + \delta_3 \beta_1) X_1 + \delta_2 X_2 + \delta_3 \beta_2 X_3 + \epsilon'. \tag{10}$$

From (9) and (10) we have

$$E[Y \mid X_1, X_2] = E[Y \mid X_1, X_2, X_3],$$

that is,

$$\alpha_0 + \alpha_1 X_1 + \alpha_2 X_2 = \delta_0 + (\delta_1 + \delta_3 \beta_1) X_1 + \delta_2 X_2 + \delta_3 \beta_2 X_3.$$

Since $\beta_2 \neq 0$, we must have $\delta_3 = 0$, which means that $X_4$ has no role in the multiple regression. Let $\hat{\gamma}$ be the least squares estimate of the coefficient in the univariate linear regression of $Y$ on $X_4$. Then

$$\hat{\gamma} \to \frac{\mathrm{Cov}(X_4, Y)}{\mathrm{Var}(X_4)} = \frac{\alpha_1 \beta_1}{\beta_1^2 + \beta_2^2}$$

as $n \to \infty$. Hence if $\alpha_1 \beta_1 \neq 0$, when the sample size is large enough, the result from the univariate regression is significant but the result from the multiple regression (where $\delta_3 = 0$) is not.

Example 2. Let $\alpha_0 = 0$, $\alpha_1 = 1$, $\alpha_2 = 2$ in (9) and $\beta_1 = \beta_2 = 1$. Table 3 shows the least squares estimates of the coefficient of $X_4$ in both the univariate and the multiple linear regressions after 10,000 replications. For all sample sizes, the univariate regression shows that $X_4$ has a very significant effect on $Y$. However, in the multiple regression, the effect is not significant.
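A single large simulated sample shows the same qualitative pattern. The sketch below is not the authors' code: it assumes $X_4 = \beta_1 X_1 + \beta_2 X_3$, and the names `ols` and `simulate_example2`, the sample size and the seed are hypothetical. With the parameter values as printed in Example 2, the univariate slope has limit $\alpha_1\beta_1/(\beta_1^2 + \beta_2^2) = 1/2$; estimates near 1, as reported in Table 3, would instead correspond to $\alpha_1 = 2$ with the same $\beta$ values, so the sketch should be read as an illustration of the pattern rather than a reproduction of the table.

```python
import numpy as np


def ols(predictors, y):
    """Least-squares coefficients of y on the given predictors, intercept first."""
    design = np.column_stack([np.ones(len(y))] + list(predictors))
    coefs, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coefs


def simulate_example2(n, alpha=(0.0, 1.0, 2.0), beta=(1.0, 1.0), seed=0):
    """Return (univariate slope of X4, coefficient of X4 given X1 and X2)."""
    rng = np.random.default_rng(seed)
    x1, x2, x3, eps = rng.standard_normal((4, n))
    x4 = beta[0] * x1 + beta[1] * x3                   # assumed construction of X4
    y = alpha[0] + alpha[1] * x1 + alpha[2] * x2 + eps  # true model (9): X4 is absent
    uni = float(ols([x4], y)[1])
    multi = float(ols([x1, x2, x4], y)[3])
    return uni, multi


uni_slope, multi_coef = simulate_example2(n=100_000)
print(f"univariate slope of X4          : {uni_slope:.3f}")
print(f"coefficient of X4 given X1, X2  : {multi_coef:.3f}")
```

The univariate slope is clearly away from 0 while the coefficient of $X_4$ in the multiple regression is near 0, because $X_4$ carries information about $Y$ only through $X_1$.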

4. Transitivity of correlation

Another issue around regression analysis is the transitivity of correlation in the interpretation. For example, some people may argue as follows: "Since factor A is highly correlated with outcome Y, and factor A and factor B are highly correlated, B should be correlated with Y." It seems very intuitive and reasonable that correlation is transitive. Unfortunately, this is not true. Here is a theoretical example. Suppose X and Z are independent standard normal random variables and Y = X + Z. It is clear that the correlation between X and Y and the correlation between Y and Z are both $1/\sqrt{2} \approx 0.707$. However, the correlation between X and Z is 0.
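The theoretical example is easy to check by simulation. The following is a minimal sketch; the sample size and seed are arbitrary choices made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

x = rng.standard_normal(n)
z = rng.standard_normal(n)   # independent of x
y = x + z

# Rows of the input are treated as variables, giving a 3 x 3 correlation matrix.
corr = np.corrcoef([x, y, z])
corr_xy, corr_yz, corr_xz = corr[0, 1], corr[1, 2], corr[0, 2]

print(f"corr(X, Y) = {corr_xy:.3f}")
print(f"corr(Y, Z) = {corr_yz:.3f}")
print(f"corr(X, Z) = {corr_xz:.3f}")
```

Both X and Z are strongly correlated with Y, yet the sample correlation between X and Z stays close to 0.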

In our Example 2, the correlation between $X_4$ and $X_1$ is 0.707 and the correlation between $X_1$ and $Y$ is 0.408. However, Section 3.2 shows that $X_4$ has no role in the multiple regression if $X_1$ and $X_2$ are in the model, although $X_4$ is not a linear combination of $X_1$ and $X_2$.

5. Discussion

Regression analysis in medical research usually involves many predictors (independent variables). Model selection is needed to pick the covariates that have a significant effect on the outcome. A widely used method in medical publications [4-24] is first to screen the covariates through univariate analysis. If a covariate is not significant in the univariate regression analysis, it will not enter the multiple regression analysis. The underlying assumption of this method is that a covariate is significant in the multiple regression only if it is significant in the univariate regression analysis. Our results indicate that this assumption is wrong. A covariate may be very significant in the univariate regression but have no role in the multiple regression (see Example 2 in Section 3). On the other hand, a covariate may be a necessary part of a multiple regression and yet be uncorrelated with the outcome (see Example 1 in Section 3). The initial univariate screening method totally ignores the correlation among covariates. There is no theoretical work to support this method. Our simulation results clearly show that the multiple regression results after the univariate screening may be dramatically biased and misleading. The biomedical community should stop using this procedure in their research and publications.



Conflict of interest statement

The authors report no conflict of interest related to this manuscript.

Author's contribution

Ge Feng and Changyong Feng: theoretical derivation and revision

Jing Peng, Dongke Tu, and Julia Z. Zheng: Simulation and manuscript drafting


[1.] Seber GAF, Lee AJ. Linear regression analysis (2nd ed). Hoboken, NJ: Wiley; 2003

[2.] Agresti A. Categorical data analysis (2nd ed). Hoboken, NJ: Wiley; 2002

[3.] Cox DR. Regression models and life-tables (with discussion). J R Stat Soc Series B. 1972; 34: 187-220. doi:

[4.] McIntyre LK, Arbabi S, Robinson EF, Maier RV. Analysis of Risk Factors for Patient Readmission 30 Days Following Discharge From General Surgery. JAMA Surgery. 2016; (Epub ahead of print). doi:

[5.] Bardia A, Sood A, Mahmood F, Orhurhu V, Mueller A, Montealegre-Gallegos M, et al. Combined epidural-general anesthesia vs general anesthesia alone for elective abdominal aortic aneurysm repair. JAMA Surgery. 2016; (Epub ahead of print), doi:

[6.] Barlesi F, Mazieres J, Merlio JP, Debieuvre D, Mosser J, Lena H, et al. Routine molecular profiling of patients with advanced non-small-cell lung cancer: results of a 1-year nationwide programme of the French Cooperative Thoracic Intergroup (IFCT). Lancet. 2016; 387: 1415-1426. doi:

[7.] Brooks GA, Kansagra AJ, Rao SR, Weitzman JI, Linden EA, Jacobson JO. A clinical prediction model to assess risk for chemotherapy-related hospitalization in patients initiating palliative chemotherapy. JAMA Oncology. 2015; 1(4): 441-447. doi:

[8.] Cronin PR, DeCoste L, Kimball AB. A multivariate analysis of dermatology missed appointment predictors. JAMA Dermatology. 2013; 149(12): 1435-1437. doi:

[9.] Fivez T, Kerklaan D, Mesotten D, Verbruggen S, Wouters PJ, Vanhorebeek I, et al. Early versus late parenteral nutrition in critically ill children. N Engl J Med. 2016; 374(12): 1111-1122. doi:

[10.] Geng E, Kreiswirth B, Burzynski J, Schluger NW. Clinical and radiographic correlates of primary and reactivation tuberculosis: a molecular epidemiology study. JAMA. 2005; 293(22): 2740-2745. doi:

[11.] Hole J, Hirsch M, Ball E, Meads C. Music as an aid for postoperative recovery in adults: a systematic review and meta-analysis. Lancet. 2015; 386: 1659-1671. doi:

[12.] International CLL-IPI working group. An international prognostic index for patients with chronic lymphocytic leukaemia (CLL-IPI): A meta-analysis of individual patient data. Lancet Oncology. 2016; 17(6): 779-790. doi:

[13.] Leon MB, Smith CR, Mack MJ, Makkar RR, Svensson LG, Kodali SK, et al. Transcatheter or surgical aortic-valve replacement in intermediate-risk patients. N Engl J Med. 2016; 374(17): 1609-1620. doi:

[14.] Li Y, Stocchi L, Cherla D, Liu X, Remzi FH. Association of preoperative narcotic use with postoperative complications and prolonged length of hospital stay in patients with Crohn disease. JAMA Surgery. 2016; 151(8): 726-734. doi:

[15.] Lorant V, Deliege D, Eaton W, Robert A, Philippot P, Ansseau M. Socioeconomic Inequalities in Depression: A Meta-Analysis. Am J Epidemiol. 2003; 157(2): 98-112. doi:

[16.] van der Meer AJ, Veldt BJ, Feld JJ, Wedemeyer H, Dufour JF, Lammert F, et al. Association between sustained virological response and all-cause mortality among patients with chronic hepatitis C and advanced hepatic fibrosis. JAMA. 2012; 308(24): 2584-2593. doi:

[17.] Mingrone G, Panunzi S, De Gaetano A, Guidone C, Iaconelli A, Nanni G, et al. Bariatric-metabolic surgery versus conventional medical treatment in obese patients with type 2 diabetes: 5 year follow-up of an open-label, single-centre, randomized controlled trial. Lancet. 2015; 386: 964-973. doi:

[18.] Nelson KB, Ellenberg JH. Antecedents of cerebral palsy: I. univariate analysis of risks. Am J Dis Child. 1985; 139(10): 1031-1038. doi:

[19.] Nelson KB, Ellenberg JH. Antecedents of cerebral palsy: Multivariate analysis of risk. N Engl J Med. 1986; 315(2): 81-86. doi:

[20.] NICE-SUGAR Study Investigators. Hypoglycemia and risk of death in critically ill patients. N Engl J Med. 2012; 367(12): 1108-1118. doi:

[21.] Pages F, Berger A, Camus M, Sanchez-Cabo F, Costes A, Molidor R, et al. Effector memory T cells, early metastasis, and survival in colorectal cancer. N Engl J Med. 2005; 353(25): 2654-2666. doi:

[22.] Schwed AC, Boggs MM, Pham XD, Watanabe DM, Bermudez MC, Kaji AH, et al. Association of admission laboratory values and the timing of endoscopic retrograde cholangiopancreatography with clinical outcomes in acute cholangitis. JAMA Surgery. 2016; (Epub ahead of print), doi:

[23.] Templin C, Ghadri JR, Diekmann J, Napp LC, Bataiosu DR, Jaguszewski M, et al. Clinical features and outcomes of takotsubo (stress) cardiomyopathy. N Engl J Med. 2015; 373(10): 929-938. doi:

[24.] Wood GC, Benotti PN, Lee CJ, Mirshahi T, Still CD, Gerhard GS, Lent MR. Evaluation of the association between preoperative clinical factors and long-term weight loss after Roux-en-Y gastric bypass. JAMA Surgery. 2016; (Epub ahead of print), doi:

Ge FENG (1), Jing PENG (2), Dongke TU (4), Julia Z. ZHENG (5), Changyong FENG (2,3*)

(1) School of Geophysics and Oil Resource, Yangtze University, Wuhan, China

(2) Department of Biostatistics & Computational Biology, University of Rochester, Rochester, NY, USA

(3) Department of Anesthesiology, University of Rochester, Rochester, NY, USA

(4) School of Philosophy, Wuhan University, Wuhan, China

(5) Department of Microbiology and Immunology, McGill University, Montreal, QC, Canada

(*) correspondence: Dr. Changyong Feng. Mailing address: Department of Biostatistics and Computational Biology, University of Rochester, 601 Elmwood Ave., Box 630, Rochester, NY, USA. Postcode: NY 14642. E-mail:


Ge Feng is a graduate student in the School of Geophysics and Oil Resources at Yangtze University, Wuhan, Hubei, China. His research interest includes statistical analysis in rock physics.
Table 1. Estimate of the regression coefficient of $X_1$

       Multiple regression  Univariate regression
  n      Estimate    SD       Estimate    SD

   30    -0.6010   0.0988     -0.0005   0.4225
   50    -0.6003   0.0748     -0.0016   0.3194
  100    -0.6003   0.0514     -0.0009   0.2226
  200    -0.6002   0.0357      0.0002   0.1585
  500    -0.6005   0.0226     -0.0005   0.0965
1,000    -0.6000   0.0160     -0.0002   0.0691

Table 2. Estimates of the regression coefficients of $X_2$ and
$X_3$ with $X_1$ being removed

       Coefficient of $X_2$      Coefficient of $X_3$
         ($\alpha_2 = 3$)          ($\alpha_3 = 4$)
  n         Estimate    SD           Estimate    SD

   30        2.4074   0.3030          4.0028   0.3047
   50        2.3990   0.2281          4.0014   0.2302
  100        2.4020   0.1611          3.9992   0.1581
  200        2.3999   0.1111          4.0019   0.1126
  500        2.4002   0.0703          4.0005   0.0705
1,000        2.4002   0.0498          3.9993   0.0492

Table 3. Estimate of the regression coefficient of $X_4$

       Univariate regression  Multiple regression
  n       Estimate    SD        Estimate    SD

   30      1.0024   0.4723       0.0038   0.2014
   50      0.9975   0.3564      -0.0008   0.1496
  100      0.9995   0.2469      -0.0015   0.1032
  200      0.9982   0.1733       0.0005   0.0723
  500      0.9999   0.1101       0.0005   0.0452
1,000      0.9995   0.0776       0.0004   0.0318
COPYRIGHT 2016 Shanghai Mental Health Center
No portion of this article can be reproduced without the express written permission from the copyright holder.
Copyright 2016 Gale, Cengage Learning. All rights reserved.

Article Details
Title Annotation:Biostatistics in psychiatry (36)
Author:Feng, Ge; Peng, Jing; Tu, Dongke; Zheng, Julia Z.; Feng, Changyong
Publication:Shanghai Archives of Psychiatry
Article Type:Report
Date:Dec 1, 2016