# Missing Data

Most classical statistical models assume that every observation contains values for all the variables of interest. In real data, some values are often missing for some subjects.

# Missingness Mechanisms

Rubin (1976) described several mechanisms by which data can be missing:

- **Missing completely at random** (MCAR): the probability that a value is missing is the same for every observation (i.e. nothing makes one observation more likely to be missing than another).
    - For example, in a simple random sample, some members of the population are not present, but they all had the same probability of being in the sample.
- **Missing at random** (MAR): the probability that a value is missing is the same *within the groups defined by the observed data*. It may depend on observed variables, but not on the unobserved values themselves.
    - This is the assumption that most methods for handling missing data make.
- **Missing not at random** (MNAR): the probability that a value is missing is related to the value of that variable itself, even after accounting for the observed data. For example, a treatment may cause severe symptoms in subjects, leading them to leave the trial or die.

# In Survival Analysis

In survival analysis, missingness also arises through censoring. Subjects commonly leave a trial before it finishes, and a trial may end before some subjects experience the event of interest. In both cases, the event of interest is not observed, so the event time is effectively missing: it is known only to exceed the last observed time.

# Handling Missing Data

1. Ignore incomplete cases (complete-case analysis); this may discard too much data
2. Imputation: predict the missing values from other variables
3. Simulation: draw values from a distribution fitted to the data to fill in the missing values
4. Modeling: model the missingness explicitly

---

# References

- [[Applied Linear Regression#5. Complex Regressors]]
- Rubin, Donald B. “Inference and Missing Data.” _Biometrika_ 63, no. 3 (1976): 581–92. https://doi.org/10.2307/2335739.
- [[Flexible Imputation of Missing Data]]
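
To make the three missingness mechanisms and the first two handling strategies in this note more concrete, here is a minimal simulation sketch. Everything in it is an illustrative assumption rather than something taken from the references: the function names (`simulate_missingness`, `complete_case_mean`, `regression_imputed_mean`), the linear data-generating model, the 30% MCAR rate, and the logistic missingness probabilities. The variable `y` has a true mean of 0, so any estimate far from 0 indicates bias.

```python
import numpy as np


def simulate_missingness(n=20_000, seed=0):
    """Simulate (x, y) and three hypothetical missingness masks for y."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=n)            # covariate, always observed
    y = 2.0 * x + rng.normal(size=n)  # variable that may be missing; true mean is 0

    masks = {
        # MCAR: every observation has the same 30% chance of being missing.
        "MCAR": rng.random(n) < 0.3,
        # MAR: missingness depends only on the observed covariate x.
        "MAR": rng.random(n) < 1.0 / (1.0 + np.exp(-2.0 * x)),
        # MNAR: missingness depends on the (possibly unobserved) value of y itself.
        "MNAR": rng.random(n) < 1.0 / (1.0 + np.exp(-2.0 * y)),
    }
    return x, y, masks


def complete_case_mean(y, missing):
    """Strategy 1: drop incomplete cases and average what is left."""
    return float(y[~missing].mean())


def regression_imputed_mean(x, y, missing):
    """Strategy 2: fit y ~ x on complete cases, then predict the missing y."""
    slope, intercept = np.polyfit(x[~missing], y[~missing], deg=1)
    y_filled = np.where(missing, intercept + slope * x, y)
    return float(y_filled.mean())


x, y, masks = simulate_missingness()
for name, missing in masks.items():
    cc = complete_case_mean(y, missing)
    imp = regression_imputed_mean(x, y, missing)
    print(f"{name}: complete-case mean = {cc:.3f}, imputed mean = {imp:.3f}")
```

In this setup, the complete-case mean should stay close to 0 only under MCAR. Regression imputation on `x` should roughly recover the mean under MAR, because the missingness is fully explained by an observed variable. Under MNAR both estimates remain biased, which is why that case calls for modeling the missingness explicitly.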