Any type of data can be modeled by assuming that it follows an underlying probability distribution. Specifically, if we observe a set of random variables $\mathbf{X}$, then $\mathbf{X} \sim p(\mathbf{X} | \boldsymbol{\theta})$, where $\boldsymbol{\theta}$ refers to the parameters. The parameters may themselves have their own hyperparameters, i.e. $\boldsymbol{\theta}\sim p(\boldsymbol{\theta} | \boldsymbol{\alpha})$.

Bayesian inference: Real world high-dimensional data (observations) is typically assumed to come from a low-dimensional latent distribution on some random variables $\mathbf{z}$, which forms the well known manifold hypothesis. For example, a dataset containing images of apples and oranges may arise from a low-dimensional Bernoulli distribution. The joint distribution of observations and the latent variables are related as:

$$p(\mathbf{X}, \mathbf{z}) = p(\mathbf{X}|\mathbf{z}) \cdot p(\mathbf{z})$$

The goal of inference is to compute the distribution $p(\mathbf{z}|\mathbf{X})$, which can be obtained using the Bayes rule:

$$p(\mathbf{z}|\mathbf{X}) = \frac{p(\mathbf{X}|\mathbf{z}) \cdot p(\mathbf{z})}{p(\mathbf{X})} = \frac{p(\mathbf{X}|\mathbf{z}) \cdot p(\mathbf{z})}{\int p(\mathbf{X},\mathbf{z})d\mathbf{z}}$$

To accurately understand the posterior, we need to compute the integral in the denominator (evidence). Because of the nature of $\mathbf{z}$, this integration is not easy to compute. Traditionally, therefore we use Monte Carlo techniques to approximate the posterior.

MCMC: To approximate the posterior $p(\mathbf{z}|\mathbf{X})$, MCMC techniques only use $p(\mathbf{X}|\mathbf{z}) \cdot p(\mathbf{z})$ and avoid computing the denominator.

Let’s assume that there are a bunch of $\mathbf{z}$ samples which “potentially” represent a particular target distribution. For example, if $\mathbf{z}$ was 1-dimensional and could only have values between $[0, 1]$ and we had 20 $\mathbf{z}$ samples, 10 of which are $\mathbf{z} = {0.75}$ and the rest of which are $\mathbf{z} = {0.25}$, then our $p(\mathbf{z} = {0.25}) = 0.25$, $p(\mathbf{z} = {0.75}) = 0.5$ and the rest of $p(\mathbf{z}) = 0$.

MCMC constructs $\mathbf{z}$ samples that form an ergodic Markov chain. That is, an initial estimate $\mathbf{z}^{(0)}$ is chosen, and further estimates $\mathbf{z}^{(1)}, \mathbf{z}^{(2)} \cdots$ and so on are constructed, requiring only $\mathbf{z}^{(t)}$ to construct $\mathbf{z}^{(t+1)}$ (Markov assumption). The stationary state distribution of this Markov chain represents an approximation of the posterior.

There are three popular strategies for MCMC posterior approximation, Metropolis Hastings, Gibbs Sampling and Hamiltonian

Metropolis Hastings: The idea for MH goes as follows. The initial estimate is sampled from the prior $\mathbf{z}^{(0)} \sim p(\mathbf{z})$. Given any $\mathbf{z}^{(t)}$, $\mathbf{z}^{(t+1)}$ is constructed based on a simple rule. The proposed $\mathbf{z}^{(t+1)}$ is computed through sampling from a symmetric distribution $g$ centered on $\mathbf{z}^{(t)}$. Given the proposed location, a ratio $r$ can be computed that tells whether the posterior is more probable at the proposed location or not.

$$\tilde{\mathbf{z}}^{(t+1)} \sim g(\mathbf{z}^{(t)}, \cdot)\ r := \frac{p(\tilde{\mathbf{z}}^{(t+1)}|\mathbf{X})}{p(\mathbf{z}^{(t)}|\mathbf{X})} = \frac{p(\mathbf{X}|\tilde{\mathbf{z}}^{(t+1)}) \cdot p(\tilde{\mathbf{z}}^{(t+1)})}{p(\mathbf{X}|\mathbf{z}^{(t)})\cdot p(\mathbf{z}^{(t)})}$$

MH uses $\mathbf{z}^{(t+1)} := \tilde{\mathbf{z}}^{(t+1)}$ whenever $r \geq \epsilon$, where $0 \leq \epsilon \leq 1$ is uniformly randomly chosen in $[0, 1]$; otherwise no update is made. Note that $\mathbf{z}$ values with a higher posterior probability always lead to a new estimate, and therefore this intuitively represents the approximate posterior.

Gibbs Sampler: Unlike MH, GS estimates the components of the $\mathbf{z}$ estimate one by one. Let there be $K$ components in $\mathbf{z}^{(t)}$, that is, $z^{(t)}_1, z^{(t)}_2 \cdots z^{(t)}_K$. GS requires that the class conditional $p(z _i | z _1, z _2 \cdots z _{i-1}, z _{i+1}, \cdots z _K, \mathbf{X})$ is known, which may not always be feasible.

Like in MH, the initial estimate is sampled from the prior $\mathbf{z}^{(0)} \sim p(\mathbf{z})$. Now we visit the indices $1,2,\cdots,K$ in a fixed order and update as follows:

$$z_1^{(t+1)} \sim p(z_1|z_2^{(t)}\cdots z_{i-1}^{(t)}, z_{i+1}^{(t)}, \cdots z_K^{(t)}, \mathbf{X}) \\ z_2^{(t+1)} \sim p(z_2|z_1^{(t+1)}\cdots z_{i-1}^{(t)}, z_{i+1}^{(t)}, \cdots z_K^{(t)}, \mathbf{X}) \\ \vdots \\ z_K^{(t+1)} \sim p(z_K|z_1^{(t+1)}\cdots z_{i-1}^{(t+1)}, z_{i+1}^{(t+1)}, \cdots z_{K-1}^{(t+1)}, \mathbf{X})$$

Practically, GS is typically faster because it doesn’t reject any proposed changes, unlike MH. However, GS suffers from slow convergence when the features of $\mathbf{z}$ are correlated.