# In-sample Variable Selection Criteria
These criteria can be used to evaluate the predictive ability of a model (not ideal) and to guide [[Variable Selection Methods|variable selection]] for predictive models when the number of predictors is large.
## Adjusted $R^2$
The adjusted $R^2$ is a modified [[Coefficient of determination|coefficient of determination]] that adjusts for the fact that adding variables will never *decrease* $R^2$. A penalty factor is added to account for this:
$
R^2_{\text{Adj}} = 1 - \left( \frac{n-1}{n-p-1} \right) \frac{SSE_{\text{model}}}{SST}
$
With the added penalty factor, this metric can decrease when a covariate is added to the model. The penalty factor grows with every added covariate, so a covariate that adds no predictive value drags the adjusted $R^2$ down; a good covariate must reduce the squared error by enough to offset the penalty.
The idea is to choose the model with the highest adjusted $R^2$.
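As a quick numpy sketch (the simulated data and the `adjusted_r2` helper are illustrative assumptions, not from the source), adding pure-noise predictors raises $R^2$ while the adjusted version can fall:
```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data: n observations, 5 candidate predictors, only 2 informative.
n = 100
X = rng.normal(size=(n, 5))
y = 1.0 + 2.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(size=n)

def adjusted_r2(X, y):
    """Fit OLS with an intercept; return (R^2, adjusted R^2)."""
    n, p = X.shape
    X1 = np.column_stack([np.ones(n), X])          # add intercept column
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)  # least-squares coefficients
    sse = np.sum((y - X1 @ beta) ** 2)             # sum of squared errors
    sst = np.sum((y - y.mean()) ** 2)              # total sum of squares
    return 1 - sse / sst, 1 - (n - 1) / (n - p - 1) * sse / sst

print(adjusted_r2(X[:, :2], y))  # informative predictors only
print(adjusted_r2(X, y))         # plus three noise predictors
```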
## Mallows' $C_p$
Mallows' $C_p$ is defined by the following:
$
C_p = \frac{SSE_{\text{subset}}}{\hat{\sigma}^2_{\text{full}}} + 2(p+1) - n
$
- The first term is a ratio:
    - The numerator is the sum of squared errors from a model fitted with some subset of $p$ predictors out of the full set
    - The denominator is the estimated error variance from the model fitted with the full set of predictors
- The remaining terms, $2(p+1) - n$, penalize the number of subset predictors relative to the sample size
Mallows' $C_p$ is designed as an estimate of the standardized total mean squared error of prediction in the observed data:
$
\frac{1}{\sigma^2}\sum^n_{i=1}E[(\hat{Y}_i - \mu_i)^2 \mid X]
$
where $\hat{Y}_i$ is the predicted value under the *fitted model* and $\mu_i$ is the expected value of the outcome under the true (unknown) model.
Smaller $C_p$ is preferable. It is often used with [[Stepwise regression|stepwise regression]].
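A minimal sketch of the subset scoring this supports, again on assumed simulated data (the `sse` helper and the design are illustrative): estimate $\hat{\sigma}^2_{\text{full}}$ from the full model, then compute $C_p$ for every predictor subset.
```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(1)
n, k = 100, 4
X = rng.normal(size=(n, k))  # full set of k candidate predictors
y = 1.0 + 2.0 * X[:, 0] - X[:, 1] + rng.normal(size=n)

def sse(X, y):
    """SSE from an OLS fit with an intercept."""
    X1 = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    return np.sum((y - X1 @ beta) ** 2)

# Error variance estimated from the full model: SSE_full / (n - k - 1).
sigma2_full = sse(X, y) / (n - k - 1)

# Score every non-empty subset; smaller C_p is better. For the full model,
# C_p reduces to exactly k + 1, a useful sanity check.
for r in range(1, k + 1):
    for subset in combinations(range(k), r):
        p = len(subset)
        cp = sse(X[:, list(subset)], y) / sigma2_full + 2 * (p + 1) - n
        print(subset, round(cp, 2))
```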
## Akaike Information Criterion (AIC)
The AIC is defined in general as:
$
\text{AIC} = -2 \cdot \max \log\mathcal{L}(\theta) + 2p
$
Thus, it can be defined for any model fit by maximum likelihood. $\mathcal{L}(\theta)$ is the likelihood of the model, and $p$ is the number of estimated parameters. In plain terms, the AIC is a function of the maximized log-likelihood plus a penalty on the number of parameters. Smaller AIC is preferable.
In the context of [[Multiple linear regression|linear regression]], it takes on the following value:
$
\text{AIC} = n \cdot \log \left( \frac{SSE}{n} \right) + 2p + c
$
where $c$ is a constant that comes out of the likelihood for the Normal distribution; it depends only on $n$, so it does not affect comparisons between models fit to the same data.
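A short numpy check (on assumed simulated data, illustrative only) that the general and linear-regression forms agree, with $c = n(1 + \log 2\pi)$ when $p$ counts the regression coefficients and $\sigma^2$ is profiled out at its MLE:
```python
import numpy as np

rng = np.random.default_rng(2)
n = 100
X = rng.normal(size=(n, 3))
y = 0.5 + X[:, 0] + rng.normal(size=n)

X1 = np.column_stack([np.ones(n), X])
beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
sse = np.sum((y - X1 @ beta) ** 2)
p = X1.shape[1]  # regression coefficients, intercept included

# General form: -2 * max log-likelihood + 2p, with sigma^2 at its MLE (SSE/n).
loglik = -n / 2 * (np.log(2 * np.pi) + np.log(sse / n) + 1)
aic_general = -2 * loglik + 2 * p

# Linear-regression form: n*log(SSE/n) + 2p + c, with c = n*(1 + log 2*pi).
c = n * (1 + np.log(2 * np.pi))
aic_regression = n * np.log(sse / n) + 2 * p + c

print(np.isclose(aic_general, aic_regression))  # True
```
Counting $\sigma^2$ as an estimated parameter would add a constant $2$ to every model's AIC and leave the rankings unchanged.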
## Others
- Bayesian Information Criterion
- Deviance Information Criterion
---
# References
[[Applied Linear Regression#10. Variable Selection]]