The Connection Between MLE and MAP

1 minute read

Recently I’ve been interested in learning more about the Bayesian approach to statistics. I was watching a video on some general Bayesian principles, and it briefly mentioned a connection to Maximum Likelihood Estimation. So, I wanted to dig a bit further into this!

Generally speaking, density estimation is about selecting a probability distribution function, and a set of parameters for it, that best describe the joint probability distribution of some observed data. There are two key components: choosing the form of the distribution, and determining its parameters.

Maximum Likelihood Estimation (MLE) is a form of density estimation. With MLE, we try to maximize the probability of observing the data, given a specific probability distribution and its parameters. Formally:

\[\max_{\theta} \; L(X ; \theta)\]

Here, \(\theta\) is the set of parameters and \(X\) is the observed data.

When deriving the MLE, we make no assumptions about a prior distribution over the parameters; the data alone drives the estimate.
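
As a quick illustration (my own toy example, not from the video), here is a sketch in Python assuming a Bernoulli model for a handful of coin flips. The brute-force grid search just confirms the closed-form result that the Bernoulli MLE is the sample proportion of heads.

```python
import numpy as np

# Hypothetical data: 10 coin flips, 7 heads (1 = heads, 0 = tails)
X = np.array([1, 1, 0, 1, 1, 0, 1, 1, 0, 1])

# Log-likelihood of X under a Bernoulli(theta) model
# (working in log space doesn't change where the maximum is)
def log_likelihood(theta, X):
    return np.sum(X * np.log(theta) + (1 - X) * np.log(1 - theta))

# Maximize L(X; theta) by brute force over a grid of candidate values
thetas = np.linspace(0.001, 0.999, 999)
theta_mle = thetas[np.argmax([log_likelihood(t, X) for t in thetas])]

print(theta_mle)   # ~0.7
print(X.mean())    # the closed-form Bernoulli MLE is just the sample proportion
```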

Maximum A Posteriori (MAP) estimation, in contrast, explicitly allows prior beliefs to be included, via Bayes' Theorem. Bayes' Theorem lets us calculate the conditional probability of one outcome given another by using the inverse conditional probability:

\[P(A \mid B) = \frac{P(B \mid A) \, P(A)}{P(B)}\]

Here we are calculating the posterior probability of A given B, where \(P(A)\) is the prior probability of A. Because \(P(B)\) is just a normalizing constant that does not depend on A, we can drop it and represent the posterior as proportional to the probability of B given A multiplied by the prior:

\[P(A \mid B) \propto P(B \mid A) \, P(A)\]

Since we are interested in optimizing this quantity rather than computing exact probabilities, dropping the normalizing constant is not a problem, and we can write the posterior over \(\theta\) as:

\[P(\theta \mid X) \propto P(X \mid \theta) \, P(\theta)\]

Just like with MLE, we want to maximize this quantity with respect to \(\theta\):

\[\max_{\theta} \; P(X \mid \theta) \, P(\theta)\]
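
To make this concrete, here is a minimal sketch continuing the coin-flip example above; the data, the Beta(2, 2) prior, and the grid search are all my own choices for illustration. Instead of maximizing the likelihood alone, we maximize the likelihood weighted by a prior over \(\theta\).

```python
import numpy as np
from scipy.stats import beta

# Same hypothetical data: 10 coin flips, 7 heads
X = np.array([1, 1, 0, 1, 1, 0, 1, 1, 0, 1])

def log_likelihood(theta, X):
    return np.sum(X * np.log(theta) + (1 - X) * np.log(1 - theta))

# Assumed prior belief that theta is probably near 0.5: a Beta(2, 2) prior
a, b = 2.0, 2.0

# log[ P(X | theta) * P(theta) ] -- the normalizing constant is dropped
def log_unnormalized_posterior(theta, X):
    return log_likelihood(theta, X) + beta.logpdf(theta, a, b)

thetas = np.linspace(0.001, 0.999, 999)
theta_map = thetas[np.argmax([log_unnormalized_posterior(t, X) for t in thetas])]

print(theta_map)                                 # ~0.667, pulled toward 0.5 by the prior
print((X.sum() + a - 1) / (len(X) + a + b - 2))  # closed-form Beta-Bernoulli MAP estimate
```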

Comparing the MLE and MAP objectives, the only difference is the inclusion of \(P(\theta)\) in MAP: the likelihood is now weighted by information coming from the prior. If we assume a uniform prior over \(\theta\), the two estimates coincide!
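
One way to see this: a uniform prior means \(P(\theta)\) is a constant \(c\) wherever \(\theta\) is allowed, and a constant factor does not change where the maximum occurs:

\[\arg\max_{\theta} \; P(X \mid \theta) \, P(\theta) = \arg\max_{\theta} \; P(X \mid \theta) \, c = \arg\max_{\theta} \; P(X \mid \theta)\]

In the sketch above, setting the prior's parameters to \(a = b = 1\) makes the Beta prior uniform on \([0, 1]\), and the MAP estimate it prints matches the MLE.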
