# Likelihood Function

The likelihood function describes how well a statistical model, at a given value of its parameter, fits the observed data. If we model $X$ with some parametric random variable, then it will have a probability mass/density function $p_\theta(x)$. Assuming that we observe multiple realizations $(x_1, ..., x_n)$ of this random variable and that they are [[Independent and Identically Distributed (IID) Assumption|independent and identically distributed]], we can write the likelihood $\mathcal{L}(\theta)$ as the product of the PMF/PDFs evaluated at the observations:

$
\mathcal{L}(\theta) = \prod^n_{i=1} p_\theta(x_i)
$

## Example: Bernoulli data

If $X$ is a binary random variable, then we can model it with a Bernoulli random variable with some parameter $\pi$, which represents the probability of observing a 1 (aka an event). The PMF of a Bernoulli distribution is given by:

$
p_\pi(x) = \pi^x (1 - \pi)^{1-x}
$

So the corresponding likelihood would be:

$
\begin{align}
\mathcal{L}(\pi) &= \prod^n_{i=1} \pi^{x_i} (1 - \pi)^{1-x_i}
\end{align}
$

Notice that the likelihood is a function of the parameter $\pi$, whereas the probability mass function is a function of the data with the parameter held fixed. Under [[Likelihood regularity conditions|regularity conditions]], the likelihood can be used to produce estimates with useful asymptotic properties via [[Maximum likelihood estimation (general)|maximum likelihood estimation]].

## Important functions derived from the likelihood

### Log-likelihood

Like its name suggests, the log-likelihood is just the log of the likelihood. It's useful because it's generally easier to work with: analytically, it's easier to optimize because it turns products into sums, and computationally, it avoids underflow from multiplying many extremely small numbers.

$
l(\theta) = \log \mathcal{L}(\theta)
$

Since the log is a monotone increasing function, it does not alter where the maxima of the likelihood are.

### The score function

The score function is the derivative of the log-likelihood with respect to the parameter.

$
S(\theta) = \frac{\partial}{\partial \theta} \, l(\theta)
$

It can be used to estimate the parameter, since the score is zero where the likelihood reaches a maximum: setting $S(\theta) = 0$ and solving for $\theta$ gives the maximum likelihood estimate. For distributions with multiple parameters, the score is a vector of functions, one element for the partial derivative with respect to each parameter. The score function can also be used to produce a hypothesis test (the score test).

## Fisher Information

The Fisher Information is the variance of the score function.

$
\begin{align}
\mathcal{I}(\theta) &= \text{Var}(S(\theta)) \\
&= E[S(\theta)^2] - E[S(\theta)]^2 \\
&= E[S(\theta)^2] \\
&= E\left[ \left( \frac{\partial}{\partial \theta} \log \mathcal{L}(\theta) \right)^2 \right]
\end{align}
$

- The second line expresses the variance in terms of the first and second moments
- The third line follows because the expected value of the score is zero: under the regularity conditions, $E\left[ \frac{\partial}{\partial \theta} \log p_\theta(X) \right] = \int \frac{\partial_\theta p_\theta(x)}{p_\theta(x)} \, p_\theta(x) \, dx = \frac{\partial}{\partial \theta} \int p_\theta(x) \, dx = 0$ for each observation, so the squared-mean term drops out

The Fisher Information is a numerical measure of how much "information" the data give us about the $\theta$ that generated them. If the data produce a wide, flat likelihood, they do not tell us much about the true parameter (the resulting estimate has high variance), and vice versa.
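## Worked example: the Bernoulli case

To tie these definitions together, the Bernoulli example from above can be worked out in closed form. This is a standard derivation (intermediate algebra omitted); the log-likelihood, score, and Fisher information are:

$
\begin{align}
l(\pi) &= \sum^n_{i=1} \left[ x_i \log \pi + (1 - x_i) \log (1 - \pi) \right] \\
S(\pi) &= \frac{\sum_i x_i}{\pi} - \frac{n - \sum_i x_i}{1 - \pi} \\
\mathcal{I}(\pi) &= \frac{n}{\pi (1 - \pi)}
\end{align}
$

Setting $S(\pi) = 0$ and solving gives $\hat{\pi} = \frac{1}{n} \sum_i x_i$, the sample proportion.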
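The same quantities can also be checked numerically. Below is a minimal Python sketch (the function names and the simulated dataset are illustrative, not from any particular library or from the source text) that evaluates the Bernoulli log-likelihood, score, and Fisher information on simulated data and confirms that the score is approximately zero at the sample proportion.

```python
import numpy as np

def bernoulli_log_likelihood(pi, x):
    """l(pi) = sum_i [ x_i log(pi) + (1 - x_i) log(1 - pi) ]"""
    return np.sum(x * np.log(pi) + (1 - x) * np.log(1 - pi))

def bernoulli_score(pi, x):
    """S(pi) = sum(x)/pi - (n - sum(x))/(1 - pi)"""
    n = len(x)
    return np.sum(x) / pi - (n - np.sum(x)) / (1 - pi)

def bernoulli_fisher_information(pi, n):
    """I(pi) = n / (pi (1 - pi)) for n iid Bernoulli observations"""
    return n / (pi * (1 - pi))

# Simulate n = 1000 Bernoulli(0.3) observations (illustrative data).
rng = np.random.default_rng(0)
x = rng.binomial(n=1, p=0.3, size=1000)

pi_hat = x.mean()  # MLE: the sample proportion
print("MLE:", pi_hat)
print("Score at the MLE (should be ~0):", bernoulli_score(pi_hat, x))
print("Fisher information at the MLE:", bernoulli_fisher_information(pi_hat, len(x)))

# The log-likelihood evaluated over a grid of pi values peaks at the sample proportion.
grid = np.linspace(0.01, 0.99, 99)
log_liks = [bernoulli_log_likelihood(p, x) for p in grid]
print("Grid maximiser:", grid[int(np.argmax(log_liks))])
```

An approximate standard error for $\hat{\pi}$ is $1 / \sqrt{\mathcal{I}(\hat{\pi})} = \sqrt{\hat{\pi}(1 - \hat{\pi}) / n}$, which is one of the asymptotic properties alluded to above.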