# Missing Data
Most classical statistical models assume that every observation contains all the variables of interest.
In real data, some variables are often missing for some observations, for example when a person does not answer every question.
# Missingness Mechanisms
Rubin (1976) described three mechanisms by which data can be missing:
- **Missing completely at random**: the probability that the data is missing is the same for all the observations (i.e. nothing is causing someone to be more likely to be missing than another).
    - Example: in a simple random sample, the members of the population who were not sampled are "missing", but every member had the same probability of being in the sample.
- **Missing at random**: the probability that the data is missing is the same *within the groups defined in the observed data*.
    - This is the assumption made by most methods that handle missingness, such as multiple imputation.
- **Missing not at random**: the probability that the data is missing is related to the unobserved value of that variable itself. For example, a treatment may cause severe symptoms in participants, leading them to leave the trial or die before their outcome is recorded.
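
The three mechanisms above can be compared in a small simulation. This is a minimal sketch using only the Python standard library; the variables (`older`, `income`), the missingness probabilities, and the helper functions are all hypothetical choices made for illustration, not part of Rubin's formulation.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is reproducible

n = 20_000
# Hypothetical data: age group is always observed, income may be missing.
older = [random.random() < 0.5 for _ in range(n)]
income = [random.gauss(60.0 if o else 40.0, 10.0) for o in older]

def missing_mask(p_missing):
    """Draw one True/False 'is missing' flag per observation."""
    return [random.random() < p for p in p_missing]

def mean_observed(values, mask):
    """Mean of the values whose flag says they were observed."""
    return statistics.fmean([v for v, m in zip(values, mask) if not m])

# MCAR: the same missingness probability for every observation.
mcar = missing_mask([0.3] * n)
# MAR: the probability depends only on the observed age group.
mar = missing_mask([0.5 if o else 0.1 for o in older])
# MNAR: the probability depends on the (possibly unobserved) income itself.
mnar = missing_mask([0.5 if v > 50.0 else 0.1 for v in income])

true_mean = statistics.fmean(income)
mean_mcar = mean_observed(income, mcar)
mean_mar = mean_observed(income, mar)
mean_mnar = mean_observed(income, mnar)

# Under MAR, the mean within an observed group is still recoverable.
older_income = [v for v, o in zip(income, older) if o]
older_mar = [m for m, o in zip(mar, older) if o]
true_older_mean = statistics.fmean(older_income)
mar_older_mean = mean_observed(older_income, older_mar)

print(f"true {true_mean:.1f} | MCAR {mean_mcar:.1f} | "
      f"MAR {mean_mar:.1f} | MNAR {mean_mnar:.1f}")
print(f"older group: true {true_older_mean:.1f} | MAR {mar_older_mean:.1f}")
```

In this setup the observed mean stays close to the true mean under MCAR, but is pulled downward under MAR and MNAR because high incomes are more likely to be missing. Under MAR the bias disappears once the comparison is made within the observed age group, which is what makes that mechanism workable for imputation.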
# In Survival Analysis
In survival analysis, missingness also appears as censoring. Human subjects commonly leave a trial before it is finished, and a trial may end before some subjects experience the event of interest. In both cases the event time is not observed; it is known only to be later than the subject's last follow-up time.
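
Both censoring cases reduce to the same recorded data: a follow-up time and a flag saying whether the event was seen. A minimal sketch, assuming a hypothetical `observe` helper, made-up subject times, and a made-up study length:

```python
import math

def observe(event_time, dropout_time, study_end):
    """Return (observed_time, event_observed) for one subject.

    The event is seen only if it happens before the subject drops out
    and before the study ends; otherwise the subject is censored.
    """
    censor_time = min(dropout_time, study_end)
    if event_time <= censor_time:
        return event_time, True
    return censor_time, False

study_end = 24.0  # hypothetical trial length in months

# (true event time, dropout time); math.inf means the subject never drops out.
subjects = [
    (5.0, math.inf),   # event occurs during the trial
    (12.0, 8.0),       # subject leaves before the event
    (30.0, math.inf),  # trial ends before the event
]

observed = [observe(e, d, study_end) for e, d in subjects]
print(observed)  # [(5.0, True), (8.0, False), (24.0, False)]
```

For the two censored subjects, the recorded time is a lower bound on the true event time rather than the event time itself.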
# Handling Missing Data
1. Complete-case analysis: ignore incomplete cases (may discard too much of the data)
2. Imputation: predict the missing values from the other variables
3. Simulation: draw values from a distribution fitted to the data to fill in the missing values
4. Modeling: model the missingness mechanism explicitly
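
The first three strategies can be sketched on a toy data set. The records below are made up, and the choice of a least-squares line for imputation and Gaussian residual noise for simulation is an assumption made for illustration; explicit modeling of the missingness (strategy 4) is not covered here.

```python
import random
import statistics

random.seed(1)  # fixed seed so the simulated draws are reproducible

# Hypothetical records (x, y); y is missing (None) for two of them.
records = [(1.0, 2.1), (2.0, None), (3.0, 6.2),
           (4.0, 7.9), (5.0, None), (6.0, 12.1)]

# 1. Complete-case analysis: keep only the fully observed records.
complete = [(x, y) for x, y in records if y is not None]

# 2. Imputation: fit a least-squares line y = intercept + slope * x
#    on the complete cases and predict the missing y values from x.
xs = [x for x, _ in complete]
ys = [y for _, y in complete]
mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
slope = (sum((x - mean_x) * (y - mean_y) for x, y in complete)
         / sum((x - mean_x) ** 2 for x in xs))
intercept = mean_y - slope * mean_x
imputed = [(x, y if y is not None else intercept + slope * x)
           for x, y in records]

# 3. Simulation: draw each missing y from a normal distribution centred
#    on the prediction, with the spread of the observed residuals.
residuals = [y - (intercept + slope * x) for x, y in complete]
residual_sd = statistics.stdev(residuals)
simulated = [(x, y if y is not None
              else random.gauss(intercept + slope * x, residual_sd))
             for x, y in records]

print(len(complete), "complete cases out of", len(records))
print("imputed:  ", [round(y, 2) for _, y in imputed])
print("simulated:", [round(y, 2) for _, y in simulated])
```

Deterministic imputation puts every filled-in value exactly on the fitted line, which understates the variability of the data; drawing from a distribution restores some of that spread, at the cost of results that change from draw to draw.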
---
# References
- [[Applied Linear Regression#5. Complex Regressors]]
- Rubin, Donald B. “Inference and Missing Data.” _Biometrika_ 63, no. 3 (1976): 581–92. https://doi.org/10.2307/2335739.
- [[Flexible Imputation of Missing Data]]