Variational Inference

The goal of Bayesian inference is to compute the posterior distribution $p(\mathbf{z}|\mathbf{X})$ of the latent variables $\mathbf{z}$ given the observed data $\mathbf{X}$, which can be obtained using Bayes' rule:

\[ p(\mathbf{z}|\mathbf{X}) = \frac{p(\mathbf{X}|\mathbf{z}) \cdot p(\mathbf{z})}{p(\mathbf{X})} = \frac{p(\mathbf{X}|\mathbf{z}) \cdot p(\mathbf{z})}{\int p(\mathbf{X},\mathbf{z})d\mathbf{z}} \]

Computing the integral in the denominator (the evidence) is generally intractable: $\mathbf{z}$ is typically continuous and high-dimensional, so the integral has no closed form and cannot be evaluated numerically at reasonable cost. The posterior is therefore usually approximated.
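As a concrete (and entirely made-up) illustration, the sketch below evaluates the evidence of a small non-conjugate model with a single scalar latent $z$ by numerical quadrature. In one dimension this is easy; the same grid-based strategy scales exponentially with the dimension of $\mathbf{z}$, which is exactly the problem variational inference sidesteps.

```python
# Toy illustration (assumed model, not from the text above):
#   z ~ N(0, 1),   x_i | z ~ Bernoulli(sigmoid(z))
# The evidence p(X) = ∫ p(X|z) p(z) dz has no closed form here, but with a
# single scalar latent it can still be computed by numerical quadrature.
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

x = np.array([1, 1, 0, 1, 1])            # observed binary data (made up)

def joint(z):
    """p(X|z) * p(z) for a scalar latent z."""
    p = 1.0 / (1.0 + np.exp(-z))          # Bernoulli parameter: sigmoid(z)
    likelihood = np.prod(p ** x * (1 - p) ** (1 - x))
    return likelihood * norm.pdf(z)       # standard normal prior on z

evidence, _ = quad(joint, -10.0, 10.0)    # p(X), feasible only because dim(z) = 1
print("p(X) ≈", evidence)
```

A grid or quadrature rule over $d$ dimensions needs a number of evaluations exponential in $d$, so this approach breaks down as soon as $\mathbf{z}$ has more than a few components.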

Variational inference approximates the posterior by turning inference into an optimization problem. Since the posterior $p(\mathbf{z}|\mathbf{X})$ is a probability distribution over $\mathbf{z}$, we can choose a family $F$ of candidate distributions $q(\mathbf{z}|\mathbf{X}) \in F$ and look for the member $q^*(\mathbf{z}|\mathbf{X}) \in F$ that is most similar to the posterior. Measuring similarity with the KL divergence, this becomes:

\[ q^*(\mathbf{z}|\mathbf{X}) = \arg\min_{q(\mathbf{z}|\mathbf{X}) \in F} KL(q(\mathbf{z}|\mathbf{X}) || p(\mathbf{z}|\mathbf{X})) \]

Note that the direction of the divergence is not arbitrary, but necessary: $KL(q || p)$ involves an expectation under $q$, which we choose and can sample from, whereas the reverse direction would require an expectation under the very posterior we are unable to compute, as written out below.
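Writing both directions out makes the difference explicit:

\[ \begin{align*} KL(q(\mathbf{z}|\mathbf{X}) || p(\mathbf{z}|\mathbf{X})) &= \mathbb{E}_{\mathbf{z} \sim q(\mathbf{z}|\mathbf{X})}\bigg[\log \frac{q(\mathbf{z}|\mathbf{X})}{p(\mathbf{z}|\mathbf{X})}\bigg] \\ KL(p(\mathbf{z}|\mathbf{X}) || q(\mathbf{z}|\mathbf{X})) &= \mathbb{E}_{\mathbf{z} \sim p(\mathbf{z}|\mathbf{X})}\bigg[\log \frac{p(\mathbf{z}|\mathbf{X})}{q(\mathbf{z}|\mathbf{X})}\bigg] \end{align*} \]

The first expectation is taken under $q$, which we control; the second is taken under the intractable posterior itself, which is precisely what we lack.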

Minimizing this KL divergence directly is still not easy, because expanding it with $p(\mathbf{z}|\mathbf{X}) = p(\mathbf{z}, \mathbf{X}) / p(\mathbf{X})$ shows that it contains the very evidence term we cannot compute:

\[ \begin{align*} KL(q(\mathbf{z}|\mathbf{X}) || p(\mathbf{z}|\mathbf{X})) &= \mathbb{E}_{\mathbf{z} \sim q(\mathbf{z}|\mathbf{X})}\bigg[\log \frac{q(\mathbf{z}|\mathbf{X})}{p(\mathbf{z} | \mathbf{X})}\bigg] \\ &= \mathbb{E}_{\mathbf{z} \sim q(\mathbf{z}|\mathbf{X})}\bigg[\log \frac{q(\mathbf{z}|\mathbf{X})}{p(\mathbf{z}, \mathbf{X})}\bigg] + \log p(\mathbf{X}) \end{align*} \]

However, since the KL divergence is always non-negative,

\[ \log p(\mathbf{X}) \geq - \mathbb{E}_{\mathbf{z} \sim q(\mathbf{z}|\mathbf{X})}\bigg[\log \frac{q(\mathbf{z}|\mathbf{X})}{p(\mathbf{z}, \mathbf{X})}\bigg] =: \mathcal{L}(q) \]

The right-hand side $\mathcal{L}(q)$ is called the evidence lower bound, or ELBO. Although we cannot evaluate the KL term itself (it contains the unknown $\log p(\mathbf{X})$), the evidence is a constant with respect to $q$: rearranging the decomposition above gives

\[ \log p(\mathbf{X}) = \mathcal{L}(q) + KL(q(\mathbf{z}|\mathbf{X}) || p(\mathbf{z}|\mathbf{X})) \]

so any increase in the ELBO corresponds to an equal decrease in the KL divergence. Minimizing the KL term over the family $F$ is therefore equivalent to maximizing the ELBO, which involves only the tractable joint $p(\mathbf{z}, \mathbf{X})$:

\[ \arg\min_{q \in F} KL(q(\mathbf{z}|\mathbf{X}) || p(\mathbf{z}|\mathbf{X})) = \arg\min_{q \in F} \mathbb{E}_{\mathbf{z} \sim q(\mathbf{z}|\mathbf{X})}\bigg[\log \frac{q(\mathbf{z}|\mathbf{X})}{p(\mathbf{z}, \mathbf{X})}\bigg] = \arg\max_{q \in F} \mathcal{L}(q) \]
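To see the bound and the equivalence numerically, the following sketch uses a toy conjugate model (standard normal prior on a scalar $z$, Gaussian likelihood, made-up numbers) so that the exact evidence and posterior are known, and estimates the ELBO by Monte Carlo sampling from $q$. The ELBO matches $\log p(\mathbf{X})$ when $q$ equals the true posterior and falls below it otherwise.

```python
# ELBO sanity check on an assumed toy model: z ~ N(0, 1), x | z ~ N(z, sigma^2).
# Conjugacy gives the exact evidence and posterior, so the Monte Carlo ELBO
# estimate can be compared against log p(x) directly.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
sigma = 0.5                      # likelihood standard deviation (illustrative)
x = 1.3                          # a single made-up observation

# Exact log evidence and exact posterior parameters.
log_evidence = norm.logpdf(x, loc=0.0, scale=np.sqrt(1.0 + sigma**2))
post_var = 1.0 / (1.0 + 1.0 / sigma**2)
post_mean = post_var * x / sigma**2

def elbo(q_mean, q_std, n_samples=200_000):
    """Monte Carlo estimate of E_q[log p(x|z) + log p(z) - log q(z)]."""
    z = rng.normal(q_mean, q_std, size=n_samples)
    log_joint = norm.logpdf(x, loc=z, scale=sigma) + norm.logpdf(z, 0.0, 1.0)
    return np.mean(log_joint - norm.logpdf(z, q_mean, q_std))

print("log p(x)               :", log_evidence)
print("ELBO at exact posterior :", elbo(post_mean, np.sqrt(post_var)))  # ≈ log p(x)
print("ELBO at q = N(0, 1)     :", elbo(0.0, 1.0))                      # strictly smaller
```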

Computing the ELBO: using $p(\mathbf{z}, \mathbf{X}) = p(\mathbf{X}|\mathbf{z}) \, p(\mathbf{z})$, $\mathcal{L}(q)$ can be decomposed further:

\[ \begin{align*} \mathcal{L}(q) &= \mathbb{E}_{\mathbf{z} \sim q(\mathbf{z}|\mathbf{X})}\bigg[\log \frac{p(\mathbf{z}, \mathbf{X})}{q(\mathbf{z}|\mathbf{X})}\bigg] \\ &= \mathbb{E}_{\mathbf{z} \sim q(\mathbf{z}|\mathbf{X})}[\log p(\mathbf{X}|\mathbf{z})] + \mathbb{E}_{\mathbf{z} \sim q(\mathbf{z}|\mathbf{X})}\bigg[\log \frac{p(\mathbf{z})}{q(\mathbf{z}|\mathbf{X})}\bigg] \\ &= \mathbb{E}_{\mathbf{z} \sim q(\mathbf{z}|\mathbf{X})}[\log p(\mathbf{X}|\mathbf{z})] - KL(q(\mathbf{z}|\mathbf{X}) || p(\mathbf{z})) \end{align*} \]

This structure resembles an autoencoder: the first term rewards reconstruction of $\mathbf{X}$, with $p(\mathbf{X}|\mathbf{z})$ realized by a decoder and $q(\mathbf{z}|\mathbf{X})$ by an encoder, while the KL term keeps the encoder's distribution close to the prior $p(\mathbf{z})$. In practice, autoencoders trained with the negative ELBO as the loss are known as variational autoencoders; a minimal sketch of such a loss follows.
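The sketch below is one minimal way such a loss can be written in PyTorch. The data dimensionality, layer sizes, Bernoulli likelihood, and diagonal Gaussian $q(\mathbf{z}|\mathbf{X})$ with a standard normal prior are all assumptions made for illustration; they are not prescribed by the derivation above.

```python
# Minimal VAE loss sketch (assumed architecture: binary data of dimension 784,
# 2-D latent, diagonal Gaussian q(z|x), standard normal prior, Bernoulli p(x|z)).
import torch
import torch.nn as nn
import torch.nn.functional as F

class VAE(nn.Module):
    def __init__(self, x_dim=784, z_dim=2, h_dim=256):
        super().__init__()
        # Encoder: maps x to the mean and log-variance of q(z|x).
        self.enc = nn.Sequential(nn.Linear(x_dim, h_dim), nn.ReLU())
        self.enc_mu = nn.Linear(h_dim, z_dim)
        self.enc_logvar = nn.Linear(h_dim, z_dim)
        # Decoder: maps z to the logits of a Bernoulli p(x|z).
        self.dec = nn.Sequential(nn.Linear(z_dim, h_dim), nn.ReLU(),
                                 nn.Linear(h_dim, x_dim))

    def forward(self, x):
        h = self.enc(x)
        mu, logvar = self.enc_mu(h), self.enc_logvar(h)
        # Reparameterization: z = mu + sigma * eps keeps sampling differentiable.
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
        logits = self.dec(z)

        # E_q[log p(x|z)], estimated with a single sample (Bernoulli log-likelihood).
        recon = -F.binary_cross_entropy_with_logits(logits, x, reduction="sum")
        # KL(q(z|x) || N(0, I)) in closed form for a diagonal Gaussian q.
        kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
        return -(recon - kl)  # negative ELBO, to be minimized

# Usage sketch: loss = VAE()(x_batch); loss.backward()   # x_batch: (B, 784) in [0, 1]
```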
