# Variable Selection Methods

Variable selection is the process of choosing which predictors to include in a model, typically from among many candidates. Leaving a predictor out effectively sets its coefficient to zero. Why do this?

- To reduce [[Confounders|confounding variables]], even when we do not know, or lack evidence for, which variables are the confounders
- To discover potentially important predictor/covariate relationships we are not yet aware of
- To build a better predictive model

In controlled experiments, variable selection is less crucial: the central relationship we care about is between the independent and dependent variables. In observational studies, however, the number of potentially predictive variables is large, so model selection matters more in that context.

## Bias-Variance Trade-Off

Adding more predictors makes a model inherently more complex, so there is a trade-off between simplicity and complexity. More complex models can reduce bias and confounding and capture richer relationships between variables. However, additional predictors can also increase the variance of the estimation process (see: [[High multicollinearity leads to high variance for the OLS estimates|collinearity problem]]) and make interpretation harder (e.g. deep neural networks). Estimating more parameters also uses up more degrees of freedom, leading to slightly higher variances.

## Evaluating variable selection methods

1. Out-of-sample criteria: criteria computed on data the model was not fit on
	- [[Holdout methods]]
2. In-sample criteria: criteria computed from the fitted model itself
	- Adjusted [[Coefficient of determination]] ($R^2$)
	- Hypothesis-test criteria (p-values)
	- In-sample estimates of prediction error

A sketch comparing these criteria on a set of candidate models appears after the references.

---

# References

[[Applied Linear Regression#10. Variable Selection]]
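
---

A minimal sketch (not from the source note) of how the in-sample and out-of-sample criteria above can be compared across candidate predictor subsets. It assumes `statsmodels` and `scikit-learn` are available; the synthetic data and the column names `x1`–`x4` are placeholders for illustration only.

```python
# Compare candidate predictor subsets using in-sample criteria
# (adjusted R^2, AIC) and an out-of-sample criterion (cross-validated MSE).
import itertools

import numpy as np
import pandas as pd
import statsmodels.api as sm
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n = 200
X = pd.DataFrame(rng.normal(size=(n, 4)), columns=["x1", "x2", "x3", "x4"])
# Only x1 and x2 actually drive the response; x3 and x4 are noise predictors.
y = 2.0 * X["x1"] - 1.0 * X["x2"] + rng.normal(scale=1.0, size=n)

rows = []
for k in range(1, 5):
    for subset in itertools.combinations(X.columns, k):
        Xs = X[list(subset)]

        # In-sample criteria from a fitted OLS model.
        fit = sm.OLS(y, sm.add_constant(Xs)).fit()

        # Out-of-sample criterion: 5-fold cross-validated MSE (a holdout method).
        cv_mse = -cross_val_score(
            LinearRegression(), Xs, y,
            scoring="neg_mean_squared_error", cv=5,
        ).mean()

        rows.append({
            "predictors": "+".join(subset),
            "adj_R2": fit.rsquared_adj,
            "AIC": fit.aic,
            "cv_MSE": cv_mse,
        })

results = pd.DataFrame(rows).sort_values("cv_MSE")
print(results.to_string(index=False))
```

With data like this, the subsets containing `x1` and `x2` should come out ahead on all three criteria, while adding the noise predictors `x3`/`x4` tends to leave adjusted $R^2$ roughly flat but worsen AIC and the cross-validated error, illustrating the bias-variance trade-off discussed above.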