# In-sample Variable Selection Criteria
These criteria can be used to evaluate the predictive ability of a model (not ideal) and to guide [[Variable Selection Methods|variable selection]] for predictive models when the number of predictors is large.
## Adjusted $R^2$
The adjusted $R^2$ is a modified [[Coefficient of determination|coefficient of determination]] that adjusts for the fact that adding variables will never *decrease* $R^2$. A penalty factor is added to account for this:
$
R^2_{\text{Adj}} = 1 - \left( \frac{n-1}{n-p-1} \right) \frac{SSE_{\text{model}}}{SST}
$
With the added penalty factor, this metric can decrease when a covariate is added to the model. The penalty factor grows with every added covariate, so a covariate that adds no predictive value drags the adjusted $R^2$ down; a good covariate must reduce the squared error by enough to offset the penalty.
The idea is to choose the model with the highest adjusted $R^2$.
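As a quick numpy sketch (the simulated data and the `adjusted_r2` helper are illustrative assumptions, not from the source), adding pure-noise predictors raises $R^2$ while the adjusted version can fall:
```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data: n observations, 5 candidate predictors, only 2 informative.
n = 100
X = rng.normal(size=(n, 5))
y = 1.0 + 2.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(size=n)

def adjusted_r2(X, y):
    """Fit OLS with an intercept; return (R^2, adjusted R^2)."""
    n, p = X.shape
    X1 = np.column_stack([np.ones(n), X])          # add intercept column
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)  # least-squares coefficients
    sse = np.sum((y - X1 @ beta) ** 2)             # sum of squared errors
    sst = np.sum((y - y.mean()) ** 2)              # total sum of squares
    return 1 - sse / sst, 1 - (n - 1) / (n - p - 1) * sse / sst

print(adjusted_r2(X[:, :2], y))  # informative predictors only
print(adjusted_r2(X, y))         # plus three noise predictors
```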
## Mallows' $C_p$
Mallows' $C_p$ is defined by the following:
$
C_p = \frac{SSE_{\text{subset}}}{\hat{\sigma}^2_{\text{full}}} + 2(p+1) - n
$
- The first term is a ratio:
    - The numerator is the sum of squared errors from a model fitted with some subset of $p$ predictors out of the full set
    - The denominator is the estimated error variance from the model fitted with the full set of predictors
- The remaining terms, $2(p+1) - n$, penalize the number of subset predictors relative to the sample size
Mallows' $C_p$ is designed as an estimate of the standardized total mean squared error of prediction in the observed data:
$
\frac{1}{\sigma^2}\sum^n_{i=1}E[(\hat{Y}_i - \mu_i)^2 \mid X]
$
where $\hat{Y}_i$ is the predicted value under the *fitted model* and $\mu_i$ is the expected value of the outcome under the true (unknown) model.
Smaller $C_p$ is preferable. It is often used with [[Stepwise regression|stepwise regression]].
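A minimal sketch of the subset scoring this supports, again on assumed simulated data (the `sse` helper and the design are illustrative): estimate $\hat{\sigma}^2_{\text{full}}$ from the full model, then compute $C_p$ for every predictor subset.
```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(1)
n, k = 100, 4
X = rng.normal(size=(n, k))  # full set of k candidate predictors
y = 1.0 + 2.0 * X[:, 0] - X[:, 1] + rng.normal(size=n)

def sse(X, y):
    """SSE from an OLS fit with an intercept."""
    X1 = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    return np.sum((y - X1 @ beta) ** 2)

# Error variance estimated from the full model: SSE_full / (n - k - 1).
sigma2_full = sse(X, y) / (n - k - 1)

# Score every non-empty subset; smaller C_p is better. For the full model,
# C_p reduces to exactly k + 1, a useful sanity check.
for r in range(1, k + 1):
    for subset in combinations(range(k), r):
        p = len(subset)
        cp = sse(X[:, list(subset)], y) / sigma2_full + 2 * (p + 1) - n
        print(subset, round(cp, 2))
```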
## Akaike Information Criterion (AIC)
The AIC is defined in general as:
$
\text{AIC} = -2 \cdot \max \log\mathcal{L}(\theta) + 2p
$
Thus, it can be defined for any model fit by maximum likelihood. $\mathcal{L}(\theta)$ is the likelihood of the model, and $p$ is the number of estimated parameters. In plain terms, the AIC is a function of the maximized log-likelihood plus a penalty on the number of parameters. Smaller AIC is preferable.
In the context of [[Multiple linear regression|linear regression]], it takes on the following value:
$
\text{AIC} = n \cdot \log \left( \frac{SSE}{n} \right) + 2p + c
$
where $c$ is a constant that comes out of the likelihood for the Normal distribution; it depends only on $n$, so it does not affect comparisons between models fit to the same data.
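A short numpy check (on assumed simulated data, illustrative only) that the general and linear-regression forms agree, with $c = n(1 + \log 2\pi)$ when $p$ counts the regression coefficients and $\sigma^2$ is profiled out at its MLE:
```python
import numpy as np

rng = np.random.default_rng(2)
n = 100
X = rng.normal(size=(n, 3))
y = 0.5 + X[:, 0] + rng.normal(size=n)

X1 = np.column_stack([np.ones(n), X])
beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
sse = np.sum((y - X1 @ beta) ** 2)
p = X1.shape[1]  # regression coefficients, intercept included

# General form: -2 * max log-likelihood + 2p, with sigma^2 at its MLE (SSE/n).
loglik = -n / 2 * (np.log(2 * np.pi) + np.log(sse / n) + 1)
aic_general = -2 * loglik + 2 * p

# Linear-regression form: n*log(SSE/n) + 2p + c, with c = n*(1 + log 2*pi).
c = n * (1 + np.log(2 * np.pi))
aic_regression = n * np.log(sse / n) + 2 * p + c

print(np.isclose(aic_general, aic_regression))  # True
```
Counting $\sigma^2$ as an estimated parameter would add a constant $2$ to every model's AIC and leave the rankings unchanged.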
## Others
- Bayesian Information Criterion
- Deviance Information Criterion
---
# References
[[Applied Linear Regression#10. Variable Selection]]