# Multiple linear regression

An extension of [[Simple Linear Regression|simple linear regression]] that incorporates more than one covariate/predictor.

$
\begin{align}
&E(Y \mid X_1, ..., X_p) = \beta_0 + \beta_1 X_1 + ... + \beta_p X_p \\
&\text{Var}(Y \mid X_1, ..., X_p) = \sigma^2
\end{align}
$

The [[Ordinary least square (OLS) estimators|OLS estimators]] are still used to estimate the model parameters for multiple linear regression, and their [[Statistical properties of the OLS estimators|statistical properties]] are still the same. [[Hypothesis testing for linear regression]] normally stems from being able to assume that the estimates are normally distributed.

- If the variances are actually heteroskedastic, consider the [[Dealing with heteroskedastic data (weighted least squares)|weighted least squares (WLS) model]].
- If your errors are both heteroskedastic *and* correlated, then the [[Generalized least squares (GLS) estimators|generalized least squares model]] might be more appropriate.

## Interpreting parameters

Parameter interpretations are similar to those in [[Simple Linear Regression|simple linear regression]], with an added nuance.

### Intercept

For $\beta_0$ to be isolated, *all* of the predictors in the model need to equal zero. There is usually some baseline group associated with this situation. Assume that we are using the following multiple linear regression model:

$
Y_i = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \varepsilon_i
$

>[!example]
>Let $X_1 = 1$ mean that someone is on treatment A and $X_1 = 0$ mean that someone is on placebo, and let $X_2$ represent the number of hours of exercise. Then $\beta_0$ represents the average outcome of someone in the placebo group with no exercise.

### Non-intercepts

To isolate $\beta_1$ or $\beta_2$, we take a similar strategy as in [[Simple Linear Regression|simple linear regression]]: take the difference of two equations, one with the covariate equal to one and the other with the covariate equal to zero. The extra nuance here is that the *other covariates* in the model must have the same value in order to isolate the parameter. For instance:

$
E(Y \mid X_1 = 1, X_2 = x_2) - E(Y \mid X_1 = 0, X_2 = x_2) = \beta_1
$

Therefore, we interpret the non-intercept as *the average change in the outcome* for a unit increase (+1) in the covariate, *holding the other covariates constant*. It is easy to forget this distinction when interpreting results in papers. Depending on the type of the covariate, this "unit increase" can have different interpretations.

>[!example]
>Using the above example, we would interpret $\beta_1$ as the average change in the outcome associated with being on treatment A (relative to placebo), adjusted for hours of exercise.

>[!example]
>In a similar vein, we would interpret $\beta_2$ as the average change in the outcome for each additional hour of exercise, for someone in the same treatment group.

See [[Simple Linear Regression|simple linear regression]] for how interpretations change when the outcome and/or the predictor are on the log-scale.

### Categorical Variables

Categorical variables (factors) are encoded as dummy variables, where the column takes the value 1 if an observation is in a given category and zero otherwise. One group/category must be selected as the reference, or the design matrix will not be full rank ([[High multicollinearity leads to high variance for the OLS estimates|multicollinearity problem]]). The coefficient associated with a dummy variable is interpreted as the change in the outcome associated with being in the given category, relative to the baseline group.

>[!example]
>Say we are dealing with 3 treatments in an experiment: A, B, and placebo. If we use placebo as the reference group, we'll get the following regression:
>$
>Y_i = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \varepsilon_i
>$
>If $X_1$ is a dummy variable for treatment A, then $\beta_1$ is interpreted as the average change in the outcome associated with being on A, relative to the placebo group. By extension, $X_2$ is the dummy variable for B, and $\beta_2$ has a similar interpretation, but for the B group.
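To make the dummy-variable encoding concrete, here is a minimal sketch in R; the data, variable names, and coefficient values are made up purely for illustration. `lm()` expands the factor into dummy columns automatically, and `relevel()` sets placebo as the reference group:

```r
# Minimal sketch: multiple linear regression with a categorical covariate.
# All data below are simulated for illustration only.
set.seed(1)
n <- 90
treatment <- factor(rep(c("placebo", "A", "B"), each = n / 3))
treatment <- relevel(treatment, ref = "placebo")  # placebo = reference group
exercise  <- runif(n, 0, 10)                      # hours of exercise
outcome   <- 5 + 2 * (treatment == "A") + 3 * (treatment == "B") +
  0.5 * exercise + rnorm(n)

# lm() creates dummy variables for treatments A and B behind the scenes;
# the intercept is the mean outcome for placebo with zero exercise, and
# `treatmentA`, `treatmentB` are average changes relative to placebo,
# holding exercise constant.
fit <- lm(outcome ~ treatment + exercise)
summary(fit)
```

In the fitted summary, the rows `treatmentA` and `treatmentB` play the roles of $\beta_1$ and $\beta_2$ from the example above; interaction terms like those discussed in the next section can be added to the formula with `:` or `*`.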
### Interactions

Interactions allow predictors to "interact": the effect of one predictor on the outcome can depend on the value of another, enabling *further change* in the outcome. Interactions are expressed as products of predictors in a regression. If $X_1$ is a dummy variable for being on treatment A, and $X_2$ is a dummy variable indicating someone is a woman, then we could model an interaction between these two variables as:

$
Y_i = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_1 X_2 + \varepsilon_i
$

$\beta_1$ and $\beta_2$ are referred to as the *main effects* for treatment A and sex, respectively, so $\beta_3$ represents the additional effect of their interaction. You can interpret it in different ways, depending on which main effect you're interested in. $\beta_3$ could be viewed as the *additional* average change in the outcome associated with the interaction between treatment A and being a woman. Alternatively, it could be viewed as the additional change associated with being a woman (relative to a man) who is taking treatment A. Most of the time in biostatistics, we are interested in the former: the effect of treatment.

Linear regressions with only main effects and interactions can be viewed as a reparameterization of the [[Analysis of Variance (ANOVA) Model|analysis of variance model]].

### Polynomial regression

If we include predictors that are powers of other predictors, then we have [[Polynomial regression|polynomial regression]].

## Splines

[[Spline regression]] is a reparameterization of polynomial regression that is more numerically stable and has a better spatial interpretation.

## Hypothesis Testing

With the right assumptions, it can be shown that the OLS coefficients come from a multivariate Normal distribution. Hypothesis tests for the coefficients, or for linear combinations of the coefficients, usually use a [[Wald test statistic]].

## Code implementation

- [[Linear regression in R]]

## Potential problems

- [[High multicollinearity leads to high variance for the OLS estimates]]
- [[Unobserved covariates can drastically change estimated coefficients]]
- [[Dealing with heteroskedastic data (weighted least squares)|The data is actually heteroskedastic]]
- [[Variable Selection Methods|You don't know what covariates to include in the model]]

---

# References

- [[Applied Linear Regression#3. Multiple Linear Regression]]
- [[Applied Linear Regression#4. Interpretation of Main Effects]]
- [[Applied Linear Regression#5. Complex Regressors]]