8 min read
Please check out my blog post on Metropolis-Hastings before reading this blog.
Hamiltonian Monte Carlo (HMC) is a version of Metropolis-Hastings built on a physical analogy; the main objective is the same: to determine the shape of a posterior density distribution through probabilistic sampling. HMC exploits the geometric properties of the posterior distribution, and rather than exploring the distribution through a random walk (as with Metropolis-Hastings), HMC selects samples based on the position (the original parameter values) and the momentum (an auxiliary variable) of a simulated particle.
TIP
HMC was initially used mainly in computational physics and chemistry, but now is also widely used in Bayesian machine learning. Hamiltonian mechanics, for people who stopped taking physics after high school (myself included), can be conceptualized as follows:
Newtonian mechanics states that force is equal to the rate of change of linear momentum; this is the foundation for solving many problems, particularly those which can be expressed in Cartesian form (a rectangular coordinate system). The approach becomes more complicated, however, when it is difficult to account for all constraint forces. For instance, while it's possible to model, say, a pendulum swinging from the weight of another pendulum with Newtonian mechanics, you would have to deal with many transformation matrices.
Lagrangian mechanics aims to bypass the requirement to consider all constraint forces to make using mixed coordinate systems easier. The Lagrangian is the difference between the total kinetic energy of the system and the potential energy of the system. Intuitively, instead of tracking the force acting on each pendulum, you can instead describe the system through other means, such as the angles of the pendulums and the speed and direction of the pendulum weights; in other words, Lagrangian mechanics can be described through a generalized set of coordinates and velocities.
Hamiltonian mechanics is like Lagrangian mechanics, but momenta are used instead of velocities. The idea is to make numerical integration easier; the trade-off, as with Lagrangian mechanics, is that the Hamiltonian formulation does not lend itself well to encapsulating dissipative forces such as friction and drag (this is fine, since we don't consider those forces in HMC).
In HMC, we are concerned with potential energy and kinetic energy. We imagine a particle sitting on the distribution itself; the potential energy can be proxied by the location of the particle, and the kinetic energy by the momentum we give it.
Image of mountain from Pexels
Recall our mountain example from Metropolis-Hastings; the algorithm approximates the shape of the mountain (the shape of the distribution) through sampling different places on the mountain, with probability of samples being proportional to the height at the different places (the higher the altitude of a place on the mountain, the more likely the algorithm will sample said place).
Image of flipped mountain from Pexels
For HMC, we flip our mountain so that the distribution becomes a concave trough. Imagine a particle starting at the rim of the trough (what used to be the base of the mountain); each step the particle takes brings it closer to the bottom of the trough (the inverted peak of the mountain). For each step, we randomly sample the amount of kinetic energy (momentum) we push the particle with. Note that the objective is similar to Metropolis-Hastings: the particle wants to get to the peak of the mountain, but since the mountain is inverted, it wants to get to the bottom of the trough. Because the particle is drawn to the bottom, it ends up spending more time there; periodically we give the particle a kick so it gets out of the trough and keeps exploring. Since the particle spends longer near the bottom, we can take the frequencies of where the particle tends to be and use them to approximate the actual distribution (flipping the trough back into the mountain).
We start with the canonical distribution over position and momentum:

$$P(\theta, p) \propto \exp\big(-H(\theta, p)\big)$$

Where $\theta$ is the position (our parameters of interest), $p$ is the auxiliary momentum variable, and $H(\theta, p)$ is the Hamiltonian, the total energy of the system.
The actual algorithm is as follows:
As with Metropolis-Hastings, first we initialize the starting position $\theta_0$ (for example, by drawing from the prior or picking an arbitrary value).
For iteration number i between 1 and N, do:
We sample the momentum from a multivariate normal distribution:

$$p_i \sim \mathcal{N}(0, M)$$

Where $M$ is the mass matrix (often simply the identity matrix).
We then use Hamilton's equations:

$$\frac{d\theta}{dt} = \frac{\partial H}{\partial p}, \qquad \frac{dp}{dt} = -\frac{\partial H}{\partial \theta}$$

Recall that the Hamiltonian is the sum of the potential and kinetic energies:

$$H(\theta, p) = U(\theta) + K(p), \qquad U(\theta) = -\log P(\theta), \qquad K(p) = \frac{1}{2} p^\top M^{-1} p$$

We thus get the pair of time derivatives:

$$\frac{d\theta}{dt} = M^{-1} p, \qquad \frac{dp}{dt} = -\frac{\partial U}{\partial \theta}$$
Then the leapfrog integrator is used $L$ times (we simulate $L$ discrete steps of size $\epsilon$ along the particle's trajectory).
We sample a new momentum at the start of each trajectory:

$$p \sim \mathcal{N}(0, M)$$
Leapfrog then alternates half-step updates of the momentum and full-step updates of the position:

$$p_{t+\epsilon/2} = p_t - \frac{\epsilon}{2} \frac{\partial U}{\partial \theta}(\theta_t)$$

$$\theta_{t+\epsilon} = \theta_t + \epsilon\, M^{-1} p_{t+\epsilon/2}$$

$$p_{t+\epsilon} = p_{t+\epsilon/2} - \frac{\epsilon}{2} \frac{\partial U}{\partial \theta}(\theta_{t+\epsilon})$$
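As a concrete sketch of one leapfrog step (assuming, for illustration, a standard normal target, so $U(\theta) = \theta^2/2$ and $\partial U / \partial \theta = \theta$, with an identity mass matrix):

```python
def grad_U(theta):
    # Gradient of the potential U(theta) = theta**2 / 2,
    # i.e. a standard normal target (an illustrative assumption).
    return theta

def leapfrog_step(theta, p, eps):
    """One leapfrog step: half-step momentum, full-step position,
    half-step momentum (identity mass matrix assumed)."""
    p = p - 0.5 * eps * grad_U(theta)   # half-step momentum update
    theta = theta + eps * p             # full-step position update
    p = p - 0.5 * eps * grad_U(theta)   # half-step momentum update
    return theta, p
```

Applied $L$ times with a small $\epsilon$, this scheme approximately conserves the total energy $H$, which is exactly why it is favoured for simulating Hamiltonian dynamics.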
Finally, we apply the Metropolis acceptance step. With the proposed position and momentum $(\theta^*, p^*)$ at the end of the trajectory, the acceptance probability is

$$\alpha = \min\Big(1, \exp\big(H(\theta_{i-1}, p_i) - H(\theta^*, p^*)\big)\Big)$$

If $u \le \alpha$ for $u \sim \text{Uniform}(0, 1)$, we accept the proposal and set $\theta_i = \theta^*$.

If $u > \alpha$, we reject the proposal and keep the current position: $\theta_i = \theta_{i-1}$.
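The full loop can be sketched in a few lines. This is a minimal, illustrative implementation, again assuming a standard normal target and an identity mass matrix; the function name, step size, and trajectory length are all hypothetical choices, not part of any particular library:

```python
import numpy as np

def hmc_sample(n_iters, theta0, eps=0.1, L=20, seed=0):
    """Minimal HMC sketch for a standard normal target:
    U(theta) = theta**2 / 2, so grad U = theta; identity mass matrix."""
    rng = np.random.default_rng(seed)
    U = lambda th: 0.5 * th**2
    grad_U = lambda th: th
    theta = theta0
    samples = []
    for _ in range(n_iters):
        p = rng.normal()                      # fresh momentum ~ N(0, 1)
        theta_new, p_new = theta, p
        # L leapfrog steps (half-step / full-step / half-step merged)
        p_new = p_new - 0.5 * eps * grad_U(theta_new)
        for step in range(L):
            theta_new = theta_new + eps * p_new
            if step != L - 1:
                p_new = p_new - eps * grad_U(theta_new)
        p_new = p_new - 0.5 * eps * grad_U(theta_new)
        # Metropolis acceptance on the total energy H = U + K
        H_old = U(theta) + 0.5 * p**2
        H_new = U(theta_new) + 0.5 * p_new**2
        if rng.uniform() < np.exp(H_old - H_new):
            theta = theta_new                 # accept the proposal
        samples.append(theta)                 # else keep current position
    return np.array(samples)
```

Running this for a few thousand iterations should yield samples whose empirical mean and standard deviation are close to the target's ($0$ and $1$ respectively).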
No-U-Turn Sampler (NUTS) is an adaptive extension of HMC used in popular modeling libraries such as Stan. The main idea is to automatically select an appropriate number of leapfrog steps in each iteration to avoid doing unnecessary work. Once momentum is drawn from the multivariate normal distribution, NUTS tries to form a balanced binary tree by both going forwards and backwards in time; the NUTS iteration terminates when either the NUTS criterion is met (a U-turn made by the trajectory of the path of the particle) or when the binary tree reaches the max allowable depth. The intuition here is that we want to stop calculating our particle's path if it starts circling back towards the direction it came from, since then it would cover the same parts it has already covered and would waste computational power. More can be read in a paper by Hoffman and Gelman.
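The U-turn criterion itself is simple to state in code. A sketch of the check (the function name and arguments are my own; `theta_minus`/`theta_plus` are the two endpoints of the trajectory, `p_minus`/`p_plus` their momenta):

```python
import numpy as np

def is_u_turn(theta_minus, theta_plus, p_minus, p_plus):
    """NUTS stopping check (sketch): the trajectory has made a U-turn
    when continuing to integrate in either direction would shrink the
    distance between the trajectory's two endpoints."""
    dtheta = theta_plus - theta_minus
    # If either endpoint's momentum points back along the endpoint-to-
    # endpoint vector, the particle is doubling back on itself.
    return np.dot(dtheta, p_minus) < 0 or np.dot(dtheta, p_plus) < 0
```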
I thought this was a very cool approach to sampling, where one of my favourite topics, physics, is simulated to sample data. This algorithm is not only creative but also very practical, being actively used in many statistical libraries. I would recommend reading Betancourt's paper to learn more about the practical side of HMC, and watching Ben Lambert's tutorial on how to intuitively derive the algorithm.