The goal of Bayesian inference is to compute the posterior distribution $p(\mathbf{z}|\mathbf{X})$, which can be obtained via Bayes' rule:

$p(\mathbf{z}|\mathbf{X}) = \frac{p(\mathbf{X}|\mathbf{z}) \cdot p(\mathbf{z})}{p(\mathbf{X})} = \frac{p(\mathbf{X}|\mathbf{z}) \cdot p(\mathbf{z})}{\int p(\mathbf{X},\mathbf{z})d\mathbf{z}}$

Computing the integral in the denominator, the evidence $p(\mathbf{X})$, is generally intractable: $\mathbf{z}$ is typically high-dimensional and the integral has no closed form, so in practice the posterior must be approximated.
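To make this concrete, in one dimension the evidence integral can simply be brute-forced on a grid. The sketch below does exactly that for a hypothetical conjugate Gaussian model with made-up data; the same grid approach costs exponentially more as the dimension of $\mathbf{z}$ grows, which is why the integral is intractable in realistic models.

```python
import numpy as np

# Hypothetical 1-D conjugate model with made-up data: prior z ~ N(0, 1),
# likelihood x_i | z ~ N(z, 1).  In one dimension the evidence integral can be
# brute-forced on a grid; the cost grows exponentially with dim(z).
rng = np.random.default_rng(0)
X = rng.normal(loc=1.5, scale=1.0, size=20)   # observed data

z_grid = np.linspace(-10, 10, 10001)          # discretized z
dz = z_grid[1] - z_grid[0]

prior = np.exp(-0.5 * z_grid**2) / np.sqrt(2 * np.pi)        # p(z)
log_lik = (-0.5 * (X[:, None] - z_grid) ** 2
           - 0.5 * np.log(2 * np.pi)).sum(axis=0)            # log p(X | z)
joint = prior * np.exp(log_lik)                              # p(X, z)

evidence = joint.sum() * dz                   # p(X) = ∫ p(X, z) dz
posterior = joint / evidence                  # p(z | X) by Bayes' rule

print(posterior.sum() * dz)                   # integrates to 1: a valid density
```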

Variational inference takes a different strategy to approximate the posterior. Since the posterior $p(\mathbf{z}|\mathbf{X})$ is a probability distribution over $\mathbf{z}$, we can choose a family $F$ of tractable distributions $q(\mathbf{z}|\mathbf{X}) \in F$ and search for the member $q^*(\mathbf{z}|\mathbf{X}) \in F$ that is most similar to the posterior. Using the KL divergence as the measure of similarity:

$q^*(\mathbf{z}|\mathbf{X}) = \arg\min_{q(\mathbf{z}|\mathbf{X}) \in F} KL(q(\mathbf{z}|\mathbf{X}) || p(\mathbf{z}|\mathbf{X}))$

Note that the direction of the divergence is not arbitrary, but necessary: $KL(q || p)$ takes an expectation under $q$, which we can sample from and evaluate, whereas the reverse direction $KL(p || q)$ would require expectations under the very posterior we cannot compute.

Minimizing this KL divergence directly is still not easy, because it depends on the evidence:

\begin{align*} KL(q(\mathbf{z}|\mathbf{X}) || p(\mathbf{z}|\mathbf{X})) &= \mathbb{E}_{\mathbf{z} \sim q(\mathbf{z}|\mathbf{X})}\bigg[\log \frac{q(\mathbf{z}|\mathbf{X})}{p(\mathbf{z} | \mathbf{X})}\bigg] \\ &= \mathbb{E}_{\mathbf{z} \sim q(\mathbf{z}|\mathbf{X})}\bigg[\log \frac{q(\mathbf{z}|\mathbf{X})}{p(\mathbf{z}, \mathbf{X})}\bigg] + \log p(\mathbf{X}) \end{align*}

However, since the KL divergence is always non-negative,

$\log p(\mathbf{X}) \geq - \mathbb{E}_{\mathbf{z} \sim q(\mathbf{z}|\mathbf{X})}\bigg[\log \frac{q(\mathbf{z}|\mathbf{X})}{p(\mathbf{z}, \mathbf{X})}\bigg] =: \mathcal{L}(q)$
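Both the decomposition and the bound can be checked numerically on a toy discrete model in which every distribution is a small array (all numbers below are made up for illustration):

```python
import numpy as np

# Toy discrete model (all numbers made up): z takes three values, X is fixed,
# and every distribution is a small array, so each quantity is exactly computable.
p_z = np.array([0.5, 0.3, 0.2])             # prior p(z)
p_X_given_z = np.array([0.1, 0.6, 0.3])     # likelihood p(X | z) at the observed X

joint = p_z * p_X_given_z                   # p(X, z)
evidence = joint.sum()                      # p(X)
posterior = joint / evidence                # p(z | X)

q = np.array([0.2, 0.5, 0.3])               # some variational q(z | X), q != posterior

kl = (q * np.log(q / posterior)).sum()      # KL(q(z|X) || p(z|X))
elbo = (q * np.log(joint / q)).sum()        # L(q)

assert np.isclose(kl, -elbo + np.log(evidence))  # KL = -L(q) + log p(X)
assert elbo < np.log(evidence)                   # the bound, strict since q != posterior
```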

This lower bound on the evidence is called the evidence lower bound, or ELBO. While the KL term in the objective for $q^*(\mathbf{z}|\mathbf{X})$ cannot be evaluated directly, since it contains the intractable posterior, the decomposition above shows that for a fixed $\mathbf{X}$ we can still search the family $F$ for a good candidate $q(\mathbf{z}|\mathbf{X})$: the ELBO involves only the joint $p(\mathbf{z}, \mathbf{X})$, which is tractable.

Therefore, since the evidence term $\log p(\mathbf{X})$ does not depend on $q$, minimizing the KL term is equivalent to maximizing the ELBO:

$\min_{q} KL(q(\mathbf{z}|\mathbf{X}) || p(\mathbf{z}|\mathbf{X})) = \min_q \mathbb{E}_{\mathbf{z} \sim q(\mathbf{z}|\mathbf{X})}\bigg[\log \frac{q(\mathbf{z}|\mathbf{X})}{p(\mathbf{z}, \mathbf{X})}\bigg] = \max_q \mathcal{L}(q)$
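As a sketch of this optimization, the snippet below fits the family $F = \{N(m, s^2)\}$ to a conjugate 1-D Gaussian model by brute-force maximization of a closed-form ELBO. The model, data, and grid search are assumptions made for the example, not a practical VI algorithm:

```python
import numpy as np

# Illustrative sketch: fit F = {N(m, s^2)} to a conjugate 1-D model
# (prior z ~ N(0,1), likelihood x_i | z ~ N(z,1)) by maximizing the ELBO.
rng = np.random.default_rng(1)
X = rng.normal(loc=2.0, scale=1.0, size=10)
n = len(X)

def elbo(m, s):
    """Closed-form L(q) = E_q[log p(X|z)] + E_q[log p(z)] - E_q[log q(z)]."""
    exp_log_lik = -0.5 * n * np.log(2 * np.pi) - 0.5 * (((X - m) ** 2).sum() + n * s**2)
    exp_log_prior = -0.5 * np.log(2 * np.pi) - 0.5 * (m**2 + s**2)
    entropy = 0.5 * np.log(2 * np.pi * np.e * s**2)
    return exp_log_lik + exp_log_prior + entropy

# Brute-force grid search over the variational parameters (m, s).
ms = np.linspace(-3.0, 3.0, 301)
ss = np.linspace(0.05, 2.0, 100)
vals = np.array([[elbo(m, s) for s in ss] for m in ms])
i, j = np.unravel_index(vals.argmax(), vals.shape)
m_star, s_star = ms[i], ss[j]

# In the conjugate case the exact posterior N(sum(X)/(n+1), 1/(n+1)) lies
# inside F, so the ELBO maximizer should recover it up to grid resolution.
print(m_star, X.sum() / (n + 1))
print(s_star, 1.0 / np.sqrt(n + 1))
```

Because the true posterior happens to lie inside $F$ here, maximizing the ELBO recovers it exactly; in general $F$ only contains approximations and $q^*$ is the closest member.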

Computing the ELBO: $\mathcal{L}(q)$ can itself be decomposed further:

\begin{align*} \mathcal{L}(q) &= \mathbb{E}_{\mathbf{z} \sim q(\mathbf{z}|\mathbf{X})}\bigg[\log \frac{p(\mathbf{z}, \mathbf{X})}{q(\mathbf{z}|\mathbf{X})}\bigg] \\ &= \mathbb{E}_{\mathbf{z} \sim q(\mathbf{z}|\mathbf{X})}[\log p(\mathbf{X}|\mathbf{z})] - KL(q(\mathbf{z}|\mathbf{X}) || p(\mathbf{z})) \end{align*}

This structure resembles an autoencoder: the likelihood $p(\mathbf{X}|\mathbf{z})$ can be realized by a decoder and $q(\mathbf{z}|\mathbf{X})$ by an encoder, with the first term acting as a reconstruction objective and the KL term as a regularizer pulling $q$ toward the prior. Autoencoders trained with the (negative) ELBO as the loss are known as variational autoencoders.
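A minimal sketch of how the two ELBO terms are assembled into a VAE loss, assuming a diagonal-Gaussian encoder $q(\mathbf{z}|\mathbf{X}) = N(\boldsymbol{\mu}, \mathrm{diag}(\boldsymbol{\sigma}^2))$, a standard-normal prior $p(\mathbf{z}) = N(0, I)$, and a Bernoulli decoder. The encoder and decoder networks are mocked out with random arrays, so only the loss structure is meaningful:

```python
import numpy as np

# Sketch of the negative-ELBO loss of a VAE.  The encoder and decoder
# networks are mocked with random arrays; only the loss structure is real.
rng = np.random.default_rng(2)

mu = rng.normal(size=4)                   # mock encoder output: mean of q(z|X)
log_var = rng.normal(size=4)              # mock encoder output: log sigma^2

# Closed-form KL(q(z|X) || p(z)) between a diagonal Gaussian and N(0, I).
kl = 0.5 * (np.exp(log_var) + mu**2 - 1.0 - log_var).sum()

# Reparameterization trick: z = mu + sigma * eps with eps ~ N(0, I), which is
# what keeps the sample differentiable w.r.t. (mu, log_var) in a real VAE.
eps = rng.normal(size=4)
z = mu + np.exp(0.5 * log_var) * eps

# E_q[log p(X|z)] is estimated from the decoder's reconstruction of the input;
# here random logits over a toy 8-pixel binary input stand in for decoder(z).
x = rng.integers(0, 2, size=8).astype(float)
logits = rng.normal(size=8)               # stand-in for decoder(z)
probs = 1.0 / (1.0 + np.exp(-logits))
recon_log_lik = (x * np.log(probs) + (1 - x) * np.log(1 - probs)).sum()

neg_elbo = -(recon_log_lik - kl)          # the VAE training loss
print(neg_elbo)
```

In a real implementation `mu`, `log_var`, and `logits` would be network outputs and the loss would be averaged over a minibatch, but the two terms combine exactly as above.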
