Hamiltonian Monte Carlo

8 min read

Please check out my blog post on Metropolis-Hastings before reading this one.

Premise

Hamiltonian Monte Carlo (HMC) is a version of Metropolis-Hastings built on a physical analogy; the main objective is the same: to determine the shape of a posterior density distribution through probabilistic sampling. HMC exploits the geometric properties of the posterior distribution, and rather than exploring the distribution through a random walk (as with Metropolis-Hastings), HMC selects samples based on the position (the original parameter, $\theta$) and the momentum.

TIP

HMC was initially used mainly in computational physics and chemistry, but now is also widely used in Bayesian machine learning. Hamiltonian mechanics, for people who stopped taking physics after high school (myself included), can be conceptualized as follows:

Newtonian mechanics states that force is equal to the rate of change of linear momentum (for constant mass, mass times acceleration); this is the foundation for solving many problems, particularly those which can be expressed in Cartesian form (a rectangular coordinate system). Things become more complicated, however, when it is difficult to account for all constraint forces. For instance, while it's possible to model with Newtonian mechanics, say, a pendulum swinging from the bob of another pendulum, you would have to deal with many transformation matrices.

Lagrangian mechanics aims to bypass the requirement to consider all constraint forces to make using mixed coordinate systems easier. The Lagrangian is the difference between the total kinetic energy of the system and the potential energy of the system. Intuitively, instead of tracking the force acting on each pendulum, you can instead describe the system through other means, such as the angles of the pendulums and the speed and direction of the pendulum weights; in other words, Lagrangian mechanics can be described through a generalized set of coordinates and velocities.

Hamiltonian mechanics is like Lagrangian mechanics, but momenta are used instead of velocities. The idea is to make numerical integration easier; the trade-off, as with Lagrangian mechanics, is that Hamiltonian mechanics does not lend itself well to encapsulating dissipative forces such as friction and drag (this is fine, since we don't consider those forces in HMC).

In HMC, we are concerned with potential energy and kinetic energy. We imagine a particle physically on the distribution; the potential energy can be proxied by the location of the particle, $\theta$ (oftentimes the variable $q$ is also used), while the kinetic energy can be proxied by the momentum of the particle, $m$ (oftentimes the variable $p$ is used). Both $\theta$ and $m$ are vectors. Potential energy is $U(\theta)$ and kinetic energy is $K(m)$; the Hamiltonian is thus $H(\theta, m) = U(\theta) + K(m)$. The goal is to update $\theta$ each iteration, such that $\theta$ is drawn from the correct distribution.
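As a minimal sketch of these quantities (my own illustration in Python, assuming a standard normal target and an identity mass matrix, neither of which comes from the discussion above):

```python
import numpy as np

# Potential energy U(theta) = -log p(theta); for a standard normal
# target this is 0.5 * theta^T theta, up to an additive constant.
def U(theta):
    return 0.5 * theta @ theta

# Kinetic energy K(m), here with an identity mass matrix.
def K(m):
    return 0.5 * m @ m

# The Hamiltonian: total energy of the imagined particle.
def H(theta, m):
    return U(theta) + K(m)
```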

Intuition

(Image of a mountain, from Pexels)

Recall our mountain example from Metropolis-Hastings: the algorithm approximates the shape of the mountain (the shape of the distribution) by sampling different places on the mountain, with the probability of sampling a place being proportional to its height (the higher the altitude of a place on the mountain, the more likely the algorithm is to sample it).

(Image of a flipped mountain, from Pexels)

For HMC, we flip our mountain upside down, so the peak becomes the bottom of a trough (this flipped surface corresponds to the potential energy, the negative log density). Imagine a particle on the side of this trough; each step the particle takes brings it closer to the bottom of the trough (the inverted peak of the mountain). At each iteration, we randomly sample the amount of momentum (kinetic energy) we push the particle with. Please note that the objective is similar to Metropolis-Hastings: the particle wants to get to the peak of the mountain, but now the mountain is inverted, so the particle wants to get to the bottom of the trough. Since the particle is drawn toward the bottom of the trough, it ends up staying there longer; periodically we give the particle a kick so it gets out of the trough and can keep exploring. Since the particle spends longer near the bottom of the trough, we can take the frequencies of where the particle tends to be, flip the picture back over, and approximate the actual distribution.

Algorithm

We start with $U(\theta)$, $K(m)$, $N$, $\epsilon$, and $L$.

Where $U(\theta)$ is the potential energy, which is proportional to the negative of the log probability density of the parameters. Often we can only evaluate $\tilde{p}(\theta)$, the unnormalized version of $p(\theta)$; in other words, $U(\theta) = -\log \tilde{p}(\theta)$ (the missing normalizing constant only shifts $U$ by a constant, which cancels in the acceptance step). $K(m)$ is the kinetic energy, derived from the physical formula for momentum; $N$ is the number of iterations; $\epsilon$ is the step size; and $L$ is the number of leapfrog steps ($\epsilon$ and $L$ are hyperparameters which can be tuned).
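To make $U(\theta) = -\log \tilde{p}(\theta)$ concrete, here is a hedged sketch; the Gaussian $\tilde{p}$ and its analytic gradient are placeholder choices of mine (the gradient will be needed by the leapfrog integrator later):

```python
import numpy as np

# Unnormalized density p~(theta): a Gaussian without its normalizing
# constant. Any density known only up to a constant works the same way.
def p_tilde(theta):
    return np.exp(-0.5 * theta @ theta)

# U(theta) = -log p~(theta); the dropped constant only shifts U,
# which cancels in the acceptance ratio.
def U(theta):
    return -np.log(p_tilde(theta))

# Gradient of U, available analytically here (for this Gaussian it
# is simply theta itself).
def grad_U(theta):
    return theta
```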

The actual algorithm is as follows:

As with Metropolis-Hastings, first we initialize $\theta_0$, our first state in our Markov chain, by drawing from an arbitrary initial distribution $\pi_0$ (preferably something similar to the posterior distribution):

$$\theta_0 \sim \pi_0(\theta)$$

For each iteration $i$ between 1 and $N$, do:

We sample the momentum from a multivariate normal distribution:

Sample $m_0 \sim \text{MultiNorm}(0, M)$, where the covariance $M$ (often called the mass matrix) can be chosen as a Euclidean metric to make sampling more efficient. The momentum does not persist across iterations.

This gives the kinetic energy:

$$K(m) = \frac{1}{2} m^T M^{-1} m$$
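As a small illustration in NumPy (the diagonal mass matrix here is an arbitrary choice of mine), the draw and its kinetic energy:

```python
import numpy as np

rng = np.random.default_rng(0)
M = np.diag([1.0, 2.0])                        # example mass matrix
m0 = rng.multivariate_normal(np.zeros(2), M)   # fresh momentum draw per iteration
K = 0.5 * m0 @ np.linalg.solve(M, m0)          # kinetic energy 0.5 * m^T M^{-1} m
```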

We then use the Hamiltonian equations:

Recall that the Hamiltonian is the sum of the potential and kinetic energies:

$$H(\theta, m) = U(\theta) + K(m)$$

We thus get the pair of time derivatives:

$$\frac{d\theta}{dt} = \frac{\partial H}{\partial m} = \frac{\partial K}{\partial m}, \qquad \frac{dm}{dt} = -\frac{\partial H}{\partial \theta} = -\frac{\partial U}{\partial \theta}$$

Then the leapfrog integrator is used $L$ times (we simulate the dynamics for $L\epsilon$ units of time):

Starting from the current position $\theta_{i-1}$ and the freshly sampled momentum $m_0$, leapfrog alternates half-step updates of the momentum and full-step updates of the position:

$$m_{t + \epsilon/2} = m_t - \frac{\epsilon}{2} \frac{\partial U}{\partial \theta}(\theta_t)$$

$$\theta_{t + \epsilon} = \theta_t + \epsilon \, M^{-1} m_{t + \epsilon/2}$$

$$m_{t + \epsilon} = m_{t + \epsilon/2} - \frac{\epsilon}{2} \frac{\partial U}{\partial \theta}(\theta_{t + \epsilon})$$
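Under the assumptions above (identity mass matrix, a `grad_U` function like the one sketched earlier), a minimal Python version of the integrator might look like this; the function name and signature are my own:

```python
def leapfrog(theta, m, grad_U, epsilon, L):
    """Simulate L leapfrog steps of size epsilon; returns the proposal."""
    m = m - 0.5 * epsilon * grad_U(theta)      # initial half-step for momentum
    for step in range(L):
        theta = theta + epsilon * m            # full-step for position (M = identity)
        if step != L - 1:
            m = m - epsilon * grad_U(theta)    # full momentum steps between positions
    m = m - 0.5 * epsilon * grad_U(theta)      # final half-step for momentum
    return theta, m
```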

Finally, we apply the Metropolis acceptance step:

Let $(\theta', m')$ be the position and momentum at the end of the trajectory, and let $r = \exp\big(H(\theta_{i-1}, m_0) - H(\theta', m')\big)$ be the acceptance ratio.

If $r \geq 1$, then we will accept and set $\theta_i$ to $\theta'$.

If $r < 1$, then we will accept with probability $r$: we generate a random value $u$ between 0 and 1, and if $r$ is greater than $u$, then we set $\theta_i$ to $\theta'$; otherwise we set $\theta_i$ to $\theta_{i-1}$.
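Putting the pieces together, here is a minimal end-to-end sketch of the loop above, reusing `U`, `grad_U`, and `leapfrog` from the earlier sketches; the hyperparameter values are placeholders, not recommendations:

```python
import numpy as np  # U, grad_U, leapfrog as defined in the sketches above

rng = np.random.default_rng(42)
epsilon, L, N = 0.1, 20, 5000      # step size, leapfrog steps, iterations
d = 2                              # dimensionality of theta
samples = np.empty((N, d))
theta = rng.normal(size=d)         # arbitrary initialization of theta_0

for i in range(N):
    m0 = rng.normal(size=d)        # m0 ~ MultiNorm(0, I); identity mass matrix
    theta_new, m_new = leapfrog(theta, m0, grad_U, epsilon, L)
    # Metropolis step: accept with probability min(1, exp(H_old - H_new)).
    log_r = (U(theta) + 0.5 * m0 @ m0) - (U(theta_new) + 0.5 * m_new @ m_new)
    if np.log(rng.uniform()) < log_r:
        theta = theta_new          # accept the proposal
    samples[i] = theta             # on rejection the chain stays put
```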

No-U-Turn Sampler

No-U-Turn Sampler (NUTS) is an adaptive extension of HMC used in popular modeling libraries such as Stan. The main idea is to automatically select an appropriate number of leapfrog steps in each iteration to avoid doing unnecessary work. Once momentum is drawn from the multivariate normal distribution, NUTS builds a balanced binary tree of states by repeatedly doubling the trajectory, going both forwards and backwards in time; the iteration terminates either when the NUTS criterion is met (the trajectory of the particle makes a U-turn) or when the binary tree reaches the maximum allowable depth. The intuition here is that we want to stop extending the particle's path once it starts circling back towards where it came from, since it would then re-cover ground it has already explored and waste computational power. More can be read in the paper by Hoffman and Gelman.
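The U-turn criterion itself is compact; here is a sketch of the check, for a trajectory segment with backward endpoint $(\theta^-, m^-)$ and forward endpoint $(\theta^+, m^+)$ (variable names are mine):

```python
import numpy as np

def is_u_turn(theta_minus, m_minus, theta_plus, m_plus):
    """True when the trajectory starts doubling back on itself:
    the momentum at either endpoint points against the displacement
    between the endpoints (the stopping criterion of Hoffman & Gelman, 2014)."""
    dtheta = theta_plus - theta_minus
    return (dtheta @ m_minus < 0) or (dtheta @ m_plus < 0)
```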

Conclusion

I thought this was a very cool approach to sampling, where one of my favourite subjects, physics, is simulated to sample data. This algorithm is not only creative but also very practical, being actively used in many statistical libraries. I would recommend reading Betancourt's paper to learn more about the practical side of HMC, and watching Ben Lambert's tutorial on how to intuitively derive the algorithm.