# Regression Diagnostic Tools

Each of the assumptions for [[Multiple linear regression|multiple linear regression]] plays a role in determining its statistical properties (bias, variance, p-values, etc.). Statisticians have developed tools that help us understand *the extent to which the assumptions are violated*. These tools do not give a yes-or-no answer; ultimately the statistician must make that judgment.

## Residual scatterplots

Residual scatterplots plot the covariate values against their associated estimated residuals. Under the homoskedasticity assumption, we should ideally see a band of roughly constant width: constant width indicates that the variance of the residuals is the same across covariate values.

Residual plots can indicate assumption violations if:

- the residuals have a curvilinear relationship with the covariate, which suggests that the linear model assumption is not suitable for the outcome-covariate relationship
- there are very large residuals, which can indicate outliers in the data (in either the covariate or the outcome)

## Leverage and influence

*Over repeated experiments*, the residuals have the following properties if the [[Ordinary least square (OLS) estimators|OLS model assumptions hold]]. The estimated residuals have zero expectation, conditional on the data:

$ E(\hat{e}\mid X) = 0 $

The variance of the estimated residuals has the following structure:

$ \text{Var}(\hat{e}\mid X) = \sigma^2 (\mathbf{I} - \mathbf{H}) $

where $\mathbf{H} = X(X'X)^{-1}X'$ is the "hat matrix" or "projection matrix". Let the $i$th diagonal element of $\mathbf{H}$ be denoted $h_i$. Then the variance of the $i$th residual is:

$ \text{Var}(\hat{e}_i) = \sigma^2(1 - h_i) $

$h_i$ is called the "leverage" of observation $x_i$. Outliers in the covariate space have $h_i$ close to one, so the variance of the corresponding residual is small: the fitted line is forced to pass close to that point, possibly affecting the residuals of all the other observations. Outliers in both the covariate and the outcome therefore have the potential to unduly influence the regression coefficient estimates.

## Outlier test

There is a hypothesis test for evaluating whether an observation in a regression is an outlier. The null hypothesis is that the expected value of the outcome is given by the model:

$ H_0: E(Y\mid x_i) = x_i\beta $

against the alternative that the expected value deviates from the fitted regression line by some amount $\delta$:

$ H_1: E(Y\mid x_i) = x_i\beta + \delta $

Here $x_i$ is the observation suspected of producing an outlier in the outcome. To perform the test, the regression is refit without $x_i$, and we then examine the estimated residuals for all the observations. The p-value for the test is the quantile at which the residual for $x_i$ falls within this new set of residuals; an extreme value suggests that including $x_i$ pulls the fitted values away from where they would otherwise be.

Warning: doing this for multiple observations creates a [[Multiple Testing Problem|multiple testing problem]], so you will need some sort of adjusted p-value (for example, a Bonferroni correction over the candidate observations).
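As a concrete illustration of leverage and the leave-one-out comparison behind the outlier test, here is a minimal numpy sketch. The function names (`leverage`, `loo_residual_quantile`) and the simulated data are my own, not from the reference, and the quantile computed at the end is one way of reading the procedure described above, not an adjusted p-value.

```python
import numpy as np

def leverage(X):
    """Diagonal of the hat matrix H = X (X'X)^{-1} X'."""
    H = X @ np.linalg.inv(X.T @ X) @ X.T
    return np.diag(H)

def loo_residual_quantile(X, y, i):
    """Refit without observation i, then locate that observation's
    residual among the residuals of the remaining observations."""
    mask = np.ones(len(y), dtype=bool)
    mask[i] = False
    beta = np.linalg.lstsq(X[mask], y[mask], rcond=None)[0]
    resid_rest = y[mask] - X[mask] @ beta   # residuals of the other observations
    resid_i = y[i] - X[i] @ beta            # held-out observation's residual
    # fraction of remaining residuals that are no larger in magnitude;
    # a value near 1 marks observation i as extreme
    return np.mean(np.abs(resid_rest) <= np.abs(resid_i))

# Simulated example with one contaminated outcome
rng = np.random.default_rng(0)
n = 50
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([1.0, 2.0]) + rng.normal(scale=0.5, size=n)
y[10] += 5.0   # inject an outlier in the outcome

print("leverage of obs 10:", leverage(X)[10])
print("LOO residual quantile of obs 10:", loo_residual_quantile(X, y, 10))
```

A similar idea can be seen with Cook's distance, discussed next.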
## Cook's Distance

Cook's distance is built on the same leave-one-out idea. It measures the change in the fitted values of all observations when observation $i$ is removed:

$ D_i = \frac{||\hat{\mathbf{y}} - \hat{\mathbf{y}}_{(-i)}||^2}{(p+1)\hat{\sigma}^2} $

- The numerator is the sum of squared differences between the fitted values from the full data and the fitted values from the fit that omits $x_i$.
- The denominator standardizes the distance by the number of parameters in the model and the estimated variance.

Cook's distance summarizes the total effect that one observation has on the fit of the model for all the other observations. As a rule of thumb, observations with $D_i > 1$ should be flagged and reviewed.

## QQ Plot

In linear regression, the errors are often assumed to be normally distributed. One way to check this is to plot the quantiles of the estimated residuals against the theoretical quantiles of the normal distribution. If the quantiles fall on a line with slope 1, the normality assumption is not badly violated; large deviations suggest that the residuals have heavier tails than a normal distribution would predict.

Since we usually need to estimate the variance of the error distribution, the estimated residuals are often "studentized" (divided by an estimate of each residual's standard deviation), which gives them a t-distribution:

$ e_{(i)} = \frac{\hat{\varepsilon}_i}{\hat{\sigma}_{(-i)}\sqrt{1 - h_i}} $

- The estimated residual is divided by an estimate of its standard deviation (see [[Regression Diagnostic Tools#Leverage and influence|the leverage and influence section]]).
- Note that $\hat{\sigma}_{(-i)}$ is estimated without using observation $i$, which makes the numerator and denominator independent.

A small numerical sketch of Cook's distance, the studentized residuals, and the QQ plot coordinates follows the references.

---

# References

[[Applied Linear Regression#9. Regression Diagnostics]]
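Below is a minimal numpy/scipy sketch of these last two sections. The function names (`ols_diagnostics`, `qq_points`) and the simulated heavy-tailed data are my own; the leave-one-out variance uses the standard closed-form identity rather than refitting the model $n$ times, and Cook's distance is computed from its equivalent algebraic form $D_i = \hat{e}_i^2 h_i / \big((p+1)\hat{\sigma}^2 (1-h_i)^2\big)$.

```python
import numpy as np
from scipy import stats

def ols_diagnostics(X, y):
    """Leverages, externally studentized residuals, and Cook's distance for an OLS fit."""
    n, p1 = X.shape                                  # p1 = p + 1 parameters (incl. intercept)
    H = X @ np.linalg.inv(X.T @ X) @ X.T
    h = np.diag(H)                                   # leverages h_i
    e = y - H @ y                                    # estimated residuals
    sigma2 = e @ e / (n - p1)                        # full-sample variance estimate
    # leave-one-out variance sigma_{(-i)}^2 via the standard closed-form identity
    sigma2_loo = ((n - p1) * sigma2 - e**2 / (1 - h)) / (n - p1 - 1)
    t_resid = e / np.sqrt(sigma2_loo * (1 - h))      # externally studentized residuals
    cooks = e**2 * h / (p1 * sigma2 * (1 - h)**2)    # Cook's distance D_i
    return h, t_resid, cooks

def qq_points(resid):
    """Coordinates for a QQ plot: theoretical normal quantiles vs. sorted residuals."""
    n = len(resid)
    probs = (np.arange(1, n + 1) - 0.5) / n
    return stats.norm.ppf(probs), np.sort(resid)

# Simulated data with heavy-tailed errors
rng = np.random.default_rng(1)
n = 100
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([0.5, -1.0]) + rng.standard_t(df=3, size=n)

h, t_resid, cooks = ols_diagnostics(X, y)
print("max Cook's distance:", cooks.max())           # rule of thumb: flag D_i > 1
theo_q, emp_q = qq_points(t_resid)                   # plot emp_q against theo_q
```

Plotting `emp_q` against `theo_q` gives the QQ plot; since the studentized residuals follow a t-distribution, `stats.t.ppf` could be used in place of `stats.norm.ppf` for the theoretical quantiles.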