# Dealing with heteroskedastic data (weighted least squares)

One of the assumptions made by the [[Ordinary least square (OLS) estimators|OLS estimators]] is that the data is *homoskedastic*, i.e. the variance of the outcome is the same regardless of the covariate value. When this assumption is violated (the data is *heteroskedastic*), we can use a modified version of the OLS estimators called the weighted least squares (WLS) estimators.

First, we need to change our model specification for [[Simple Linear Regression|simple linear regression]]:

$$
\begin{align}
&E(Y_i\mid X = x_i) = \mathbf{x}_i'\beta \\
&\text{Var}(Y_i\mid X = x_i) = \sigma^2 / w_i, \quad w_i > 0
\end{align}
$$

The notable change is in how the variance is specified. Each observation $i$ is now paired with a (known) weight $w_i$. This weight term allows the variance to differ across covariate values, accounting for heteroskedasticity.

The original OLS estimators minimize the sum of squared errors:

$$
\sum^n_{i=1}\varepsilon^2_i = \sum^n_{i=1} (y_i - \mathbf{x}_i'\beta)^2
$$

The idea behind the weighted least squares estimator is to normalize each squared error by its variance, so that every error contributes on the same scale:

$$
\sum^n_{i=1} \frac{\varepsilon_i^2}{\text{Var}(\varepsilon_i)} = \frac{1}{\sigma^2} \sum^n_{i=1} w_i(y_i - \mathbf{x}_i'\beta)^2
$$

We then estimate the regression coefficients by minimizing this expression instead. The $1/\sigma^2$ term can be ignored because it is just a proportionality constant here.

The WLS estimator for the coefficients is given by:

$$
\hat{\beta}_{\text{WLS}} = (X'WX)^{-1}X'WY
$$

and the WLS estimator for the variance is given by:

$$
\hat{\sigma}^2_{\text{WLS}} = \frac{1}{n-2} \sum^n_{i=1} w_i \hat{\varepsilon}^2_i
$$

In the case of [[Multiple linear regression|multiple linear regression]], the covariance matrix is given by:

$$
\text{Cov}(\hat{\beta}\mid X) = \hat{\sigma}^2_{\text{WLS}} (X'WX)^{-1}
$$

where $W$ is a diagonal matrix containing the weights. Observations with lower variance (larger $w_i$) are given more weight in the estimation, and vice-versa. From these expressions you can see that the [[Ordinary least square (OLS) estimators|OLS estimate]] is a special case of the WLS estimate: setting every $w_i = 1$ (i.e. $W = I$) recovers the OLS formulas. (A small numerical sketch of these estimators is given at the end of this note.)

## Use cases

Examples where heteroskedastic data may appear are:

- Repeated observations for each covariate group, where each "observation" is the average of the repeated measurements (the variance of an average of $n_i$ measurements is $\sigma^2 / n_i$, so the natural weight is $w_i = n_i$)
- Stratified sampling, where a population is split into different subsets, each with a different number of samples

## Unknown weights

Above, the weights are assumed to be known, but realistically we may not know their true values. If we still use OLS to estimate the coefficients, the estimates remain unbiased, but the usual covariance estimate is biased. Under this misspecification (recall $\text{Cov}(Y\mid X) = \sigma^2 W^{-1}$), the OLS estimates actually have the following covariance:

$$
\text{Cov}(\hat{\beta}_{\text{OLS}}\mid X) = \sigma^2 (X'X)^{-1}(X'W^{-1}X)(X'X)^{-1}
$$

To improve the covariance estimate, we can estimate the per-observation variances that make up the unknown middle matrix. MacKinnon and White (1985) proposed several such estimators. One example, the HC3 estimator, uses the squared residual and the leverage of each observation:

$$
\widehat{\text{Cov}}(\hat{\beta}_{\text{OLS}}\mid X) = (X'X)^{-1}\left(X'\,\text{diag}\!\left( \frac{\hat{\varepsilon}_i^2}{(1 - h_i)^2} \right)X\right)(X'X)^{-1}
$$

- each diagonal entry estimates the variance of observation $i$: its squared residual, inflated by a function of the leverage $h_i = \mathbf{x}_i'(X'X)^{-1}\mathbf{x}_i$
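To make the known-weights estimators concrete, here is a minimal numpy sketch of the closed-form WLS fit. The function name `wls_fit` and the simulated data are illustrative, not from the source; the variance estimate uses $n - p$ degrees of freedom, which reduces to the $n - 2$ above in simple linear regression.

```python
import numpy as np

def wls_fit(X, y, w):
    """Closed-form WLS fit; X is (n, p) with an intercept column, w_i > 0."""
    W = np.diag(w)                                   # diagonal weight matrix
    XtWX = X.T @ W @ X
    beta_hat = np.linalg.solve(XtWX, X.T @ W @ y)    # (X'WX)^{-1} X'W y
    resid = y - X @ beta_hat
    n, p = X.shape
    sigma2_hat = (w * resid**2).sum() / (n - p)      # weighted residual variance (n - 2 when p = 2)
    cov_beta = sigma2_hat * np.linalg.inv(XtWX)      # Cov(beta_hat | X) = sigma^2 (X'WX)^{-1}
    return beta_hat, sigma2_hat, cov_beta

# Illustrative data: y_i is the mean of w_i raw measurements,
# so Var(y_i | x_i) = sigma^2 / w_i and the group size is the natural weight.
rng = np.random.default_rng(0)
n = 200
x = rng.uniform(0, 10, size=n)
w = rng.integers(1, 20, size=n).astype(float)
y = 1.0 + 2.0 * x + rng.normal(0.0, 1.0 / np.sqrt(w))
X = np.column_stack([np.ones(n), x])

beta_hat, sigma2_hat, cov_beta = wls_fit(X, y, w)
print(beta_hat)                       # should be close to [1.0, 2.0]
print(np.sqrt(np.diag(cov_beta)))     # standard errors of the coefficients
```

For real work, statsmodels' `WLS` model implements the same estimator; the sketch is mainly useful for seeing the formulas in one place.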
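Similarly, here is a minimal sketch of the HC3 sandwich covariance for the unknown-weights case, again with illustrative names (`hc3_cov`). It plugs the leverage-inflated squared residuals into the middle of the sandwich in place of the unknown per-observation variances.

```python
import numpy as np

def hc3_cov(X, y):
    """HC3 sandwich covariance estimate for the OLS coefficients."""
    XtX_inv = np.linalg.inv(X.T @ X)
    beta_ols = XtX_inv @ X.T @ y                    # OLS fit: (X'X)^{-1} X'y
    resid = y - X @ beta_ols
    h = np.einsum("ij,jk,ik->i", X, XtX_inv, X)     # leverages h_i = x_i'(X'X)^{-1} x_i
    omega = resid**2 / (1.0 - h)**2                 # HC3 estimate of each observation's variance
    meat = (X * omega[:, None]).T @ X               # X' diag(omega) X
    return XtX_inv @ meat @ XtX_inv                 # (X'X)^{-1} X' diag(omega) X (X'X)^{-1}
```

In practice, libraries such as statsmodels expose HC3-robust covariances for fitted OLS models, so this rarely needs to be computed by hand.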
---
# References

- [[Applied Linear Regression#7. Variances]]
- MacKinnon, James G., and Halbert White. 1985. "Some Heteroskedasticity-Consistent Covariance Matrix Estimators with Improved Finite Sample Properties." Journal of Econometrics. https://doi.org/10.1016/0304-4076(85)90158-7.