Metropolis-Hastings Algorithm

Premise

According to Mullachery, Khera, and Husain, there are three steps to consider for probabilistic modeling: model specification, model inference, and model checking. Metropolis-Hastings (MH) is helpful in the model inference phase, where we want to compute the posterior distribution of the parameters.

Recall Bayes' theorem: $p(\theta \mid D) = \frac{p(D \mid \theta)\, p(\theta)}{p(D)}$. Typically, when we want to sample from a distribution dependently (when the next sample depends on the current sample), we use the posterior. MH allows us to use Bayes' theorem without the normalizing denominator (without $p(D)$). This makes MH very useful, since calculating $p(D)$ is often difficult in practice.
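To see why dropping the denominator is harmless, consider comparing two candidate parameter values $\theta$ and $\theta'$ (using $D$ for the data): the normalizing constant $p(D)$ is the same for both, so it cancels out of the ratio MH cares about.

$$\frac{p(\theta' \mid D)}{p(\theta \mid D)} = \frac{p(D \mid \theta')\, p(\theta') \,/\, p(D)}{p(D \mid \theta)\, p(\theta) \,/\, p(D)} = \frac{p(D \mid \theta')\, p(\theta')}{p(D \mid \theta)\, p(\theta)}$$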

MH is a Markov chain Monte Carlo (MCMC) method, since the decision on the next state/sample depends solely on the current state/sample (Markov chain) and it uses pseudo-random numbers in constructing its walk (Monte Carlo).

TIP

Monte Carlo is a famous casino in the principality of Monaco (hence the randomness metaphor), and Monégasques (citizens of Monaco) are not allowed to work or gamble in the casino.

The whole idea of MCMC algorithms is to draw samples from some probability distribution without having to know the exact value ("height") of the distribution. The algorithm essentially wanders around and probabilistically visits parts of the distribution with higher probabilities more often. In other words, given only an unnormalized distribution, we can approximate the properties of the normalized distribution (including its shape). This can be very useful by itself or in other applications, such as exploring states in deep reinforcement learning.

Visual Intuition

[Image: a mountain, from Pexels]

Say you have a mountain, but you don't know its shape. You have a friend who enjoys mountaineering, particularly when the altitude is high. Your friend also has the superpower to teleport; your friend wants to go where the mountain is high, but since they only recently discovered their teleporting abilities, they cannot fully control them yet. As a result, your friend teleports all over the mountain, but more often to higher altitudes, since they are aiming to reach the peaks. If you strap a GPS to your friend and let them teleport for long enough, you should in theory be able to map the shape of the mountain. This is because while your friend cannot fully control where they teleport on the mountain, they prefer places with higher altitude. They will thus try to teleport to those higher places (the mountain peaks); they may not get there every time, but they will teleport to the peaks more frequently than to places with lower altitude.

MCMC is basically your teleporting friend; it will try to get to the peak(s) of the distribution (places where the probability is high). While it may not get there each time, it will be there more frequently than anywhere else.

Algorithm

Say we want to determine the shape of the posterior $p(\theta \mid D)$, but we do not know the normalizing constant (the denominator in Bayes' theorem); maybe the normalizing constant is difficult to integrate. We can only use $g(\theta)$, where $g$ is the unnormalized version of $p(\theta \mid D)$; in other words, $p(\theta \mid D) \propto g(\theta) = p(D \mid \theta)\, p(\theta)$.
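As a concrete illustration, here is a minimal sketch of such an unnormalized function in Python. The coin-flip data, the uniform prior, and the name `g` are hypothetical choices for this example, not part of the algorithm itself:

```python
import numpy as np

# Hypothetical data: 1 = heads, 0 = tails (6 heads, 2 tails)
data = np.array([1, 0, 1, 1, 0, 1, 1, 1])

def g(theta):
    """Unnormalized posterior g(theta) = likelihood * prior.

    The normalizing constant p(D) is never computed; MH only needs
    g up to that constant.
    """
    if theta <= 0 or theta >= 1:
        return 0.0                                  # prior is zero outside (0, 1)
    heads = data.sum()
    tails = len(data) - heads
    likelihood = theta**heads * (1 - theta)**tails  # Bernoulli likelihood
    prior = 1.0                                     # uniform prior over (0, 1)
    return likelihood * prior
```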

The actual algorithm is as follows:

First we initialize $\theta_0$, the first state in our Markov chain, by drawing it from some arbitrary initial distribution $\pi_0$ (preferably something similar to the posterior distribution).

$$\theta_0 \sim \pi_0(\theta)$$

Then, for each iteration $t$, up to a certain number of iterations, we do the following:

We first draw our next potential state $\theta'$ from the proposal distribution $q$.

$$\theta' \sim q(\theta' \mid \theta_t)$$

In the Random Walk version of the algorithm, we draw the next value in our chain from a distribution centered on the previous value. For example, we can use the normal distribution: $\theta' \sim \mathcal{N}(\theta_t, \sigma^2)$. In general, however, $q$ can be any distribution and does not have to depend on the previous state; preferably we should choose a $q$ that is similar to the posterior.
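Continuing the sketch above, a Random Walk proposal is only a couple of lines; the step size of 0.1 is an assumed tuning parameter, not a prescribed value:

```python
rng = np.random.default_rng(0)  # pseudo-random number generator (the "Monte Carlo" part)
step_size = 0.1                 # hypothetical proposal standard deviation

def propose(theta_current):
    """Random Walk proposal: a normal distribution centered on the current state."""
    return rng.normal(loc=theta_current, scale=step_size)
```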

We can also enforce the property that $q$ is symmetric, meaning $q(\theta' \mid \theta_t) = q(\theta_t \mid \theta')$. In other words, proposing a move from $\theta_t$ to $\theta'$ is exactly as probable as proposing a move from $\theta'$ to $\theta_t$.

We now calculate our acceptance ratio $\alpha$.

$$\alpha = \frac{p(\theta' \mid D)\, q(\theta_t \mid \theta')}{p(\theta_t \mid D)\, q(\theta' \mid \theta_t)} = \frac{g(\theta')\, q(\theta_t \mid \theta')}{g(\theta_t)\, q(\theta' \mid \theta_t)}$$

$\alpha$ tells us how likely $\theta'$ is relative to $\theta_t$ under the function $g$. For Random Walk, since $q(\theta' \mid \theta_t) = q(\theta_t \mid \theta')$, we can just calculate $\alpha = g(\theta') / g(\theta_t)$.

Once we have $\alpha$, we can determine whether to take the potential state $\theta'$ or not.

If $\alpha \geq 1$, then we will accept $\theta'$ and set $\theta_{t+1}$ to $\theta'$.

If $\alpha < 1$, then we will accept $\theta'$ with probability $\alpha$. We will generate a random value $u$ between 0 and 1; if $\alpha$ is greater than $u$, then we set $\theta_{t+1}$ to $\theta'$; otherwise we set $\theta_{t+1}$ to $\theta_t$.
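Put together, one accept/reject decision looks like the sketch below, which continues the hypothetical coin-flip example and uses the symmetric Random Walk shortcut $\alpha = g(\theta')/g(\theta_t)$:

```python
def mh_step(theta_current):
    """One iteration: propose a candidate, compute alpha, accept or reject."""
    theta_prop = propose(theta_current)
    alpha = g(theta_prop) / g(theta_current)  # q cancels because the proposal is symmetric
    if alpha >= 1:
        return theta_prop                     # alpha >= 1: always accept
    if rng.uniform(0, 1) < alpha:
        return theta_prop                     # alpha < 1: accept with probability alpha
    return theta_current                      # otherwise, stay at the current state
```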

How often we accept candidates depends on what type of algorithm we're using. For instance, if we're using $q$ to approximate the posterior directly, then accepting many candidates is a good thing, as it indicates that we're approximating the posterior well. One may still, however, want some extra variance in $q$ to ensure that the entire space is being covered. In Random Walk Metropolis, on the other hand, it may not be ideal to accept too many states: if almost every step is accepted, the steps were probably too small, and it will take too long to fully explore the posterior distribution. If Random Walk accepts too few states, however, draws (and hence compute power) are being wasted, so balance is key.

We then iterate again with the new $\theta_{t+1}$.
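Tying the pieces together, here is a minimal Random Walk Metropolis loop for the running example; the iteration count, starting value, and step size are all assumptions you would tune in practice:

```python
def metropolis(n_iters=10_000, theta_init=0.5):
    """Run Random Walk Metropolis and return the chain of samples."""
    chain = np.empty(n_iters)
    theta = theta_init                  # arbitrary start inside (0, 1)
    accepted = 0
    for t in range(n_iters):
        theta_new = mh_step(theta)
        accepted += (theta_new != theta)
        theta = theta_new
        chain[t] = theta
    print(f"acceptance rate: {accepted / n_iters:.2f}")
    return chain

samples = metropolis()
# For this data and prior, the true posterior is Beta(7, 3) with mean 0.7,
# so the sample mean of the chain should land close to that.
print("estimated posterior mean:", samples.mean())
```

Printing the acceptance rate ties back to the point above: if it is very high, try a larger step_size; if it is very low, try a smaller one.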

Conclusion

I've always liked probabilistic algorithms, especially how, given enough time, they can eventually uncover patterns. Ben Lambert released an excellent video that visually illustrates how Random Walk Metropolis maps the distribution over time.