Value functions and the Bellman Loss
Problem
In reinforcement learning, an agent interacts with an environment, which may or may not be fully observable. Given an observation $o_t$ or a state $s_t$ at a discrete timestep $t$, the agent chooses an action $a\in \mathcal{A}$. The goal is to choose a sequence of actions that maximizes the total reward.
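As a rough sketch of this interaction loop (the `env` and `agent` objects below are hypothetical, loosely following a common reset/step convention rather than any particular library), a single episode looks like:

```python
# A minimal sketch of the agent-environment loop; `env` and `agent` are
# hypothetical objects, not from any specific library.
def run_episode(env, agent, max_steps=1000):
    obs = env.reset()                         # initial observation o_0 (or state s_0)
    total_reward = 0.0
    for t in range(max_steps):
        action = agent.act(obs)               # choose a_t from the action space A
        obs, reward, done = env.step(action)  # environment returns o_{t+1}, r_t, done
        total_reward += reward
        if done:                              # terminal state reached
            break
    return total_reward
```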
Reward functions
If the reward at each timestep $t$ is $r_t$, then we want to maximize $\mathbb{E}[R_t]$, where
\[ R_t = \sum_{k=t}^{\infty}r_k \]
This sum of total reward has no guarantee of convergence over an infinite horizon, so it is not a reliable measure of how much good or bad has happened to the agent.
This motivates the discounted return, which looks like:
\[ R_t = \sum_{k=t}^{\infty}\gamma^{k-t} r_k \mbox{ where } 0\leq \gamma \lt 1 \]
With a nonzero $\gamma$, we value the immediate reward the most ($r_t$ has a coefficient of $1$), but we also account for possible future rewards, albeit with geometrically diminishing weight.
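As a small concrete sketch, the discounted return of a finite reward sequence can be computed with the backward recursion $R_t = r_t + \gamma R_{t+1}$ (the rewards and $\gamma$ below are just illustrative):

```python
# Compute the discounted return R_t for a finite list [r_t, r_{t+1}, ...].
def discounted_return(rewards, gamma=0.99):
    R = 0.0
    # Walk backwards so each earlier reward picks up one more factor of gamma:
    # R_t = r_t + gamma * R_{t+1}
    for r in reversed(rewards):
        R = r + gamma * R
    return R

print(discounted_return([1.0, 1.0, 1.0], gamma=0.9))  # 1 + 0.9 + 0.81 = 2.71
```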
Policies
A (deterministic) policy $\pi: \mathcal{S} \rightarrow \mathcal{A}$ is a function that tells us which action to take in a given state $s\in \mathcal{S}$. Policies can also be stochastic, in which case $\pi(\cdot \mid s)$ is a probability distribution over actions for each state $s$.
Our goal is to find the best policy $\pi^* $ that maximizes the expected total discounted reward from a given state $s$.
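As an illustration (the states and actions below are made up), a deterministic policy can be represented as a simple state-to-action lookup, while a stochastic policy samples an action from a per-state distribution:

```python
import random

# Deterministic policy: a plain mapping from states to actions.
deterministic_policy = {"s0": "left", "s1": "right"}

# Stochastic policy: a per-state distribution over actions, sampled at act time.
action_probs = {"s0": {"left": 0.3, "right": 0.7},
                "s1": {"left": 0.9, "right": 0.1}}

def stochastic_policy(state):
    probs = action_probs[state]
    actions, weights = zip(*probs.items())
    return random.choices(actions, weights=weights)[0]

print(deterministic_policy["s0"], stochastic_policy("s0"))
```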
Value functions and $Q$ functions
Consider this function:
\[V^\pi(s) = \mathbb{E}[R_t \mid s_t = s]\]
Since it gives the expected discounted return obtained by starting from state $s$ and following the policy $\pi$, it is a good measure of the utility of a state.
However, it is often more useful to quantify state-action pairs instead of states, since that measures the utility of taking a particular action in a state. Then, to find the best action, we can simply pick the action with the highest $Q$ value in a given state. The $Q$ function is defined similarly to the $V$ function:
\[Q^\pi(s,a)=\mathbb{E}[R_t \mid s_t = s, a_t = a]\]
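As a toy example (the states, actions, and values below are entirely made up), with a tabular $Q$ function the best action in a state is simply the one with the largest $Q$ value:

```python
# A toy tabular Q-function over two states and two actions.
Q = {
    ("s0", "left"): 1.2,
    ("s0", "right"): 2.5,
    ("s1", "left"): 0.3,
    ("s1", "right"): -0.1,
}

def greedy_action(Q, state, actions=("left", "right")):
    # Pick the action with the highest Q value in the given state.
    return max(actions, key=lambda a: Q[(state, a)])

print(greedy_action(Q, "s0"))  # "right"
```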
The Bellman Loss
The definitions of the $Q$ and $V$ functions are both recursive when we use discounted rewards. So, if we eventually converge to an optimal $Q^* $ (or $V^* $) function, then $Q^* $ equals the immediate reward plus $\gamma$ times the best $Q^* $ value from the next state:
\[Q^*(s, a) = \mathbb{E}\left[r_t + \gamma \max_{a'} Q^*(s', a')\right]\]
So, the quantity $\{r_t+\gamma\cdot \max_{a'} Q(s', a')\}-Q(s, a)$ measures how far we are from convergence, and its square can be used as a loss when training function approximators like neural networks.
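Below is a minimal sketch of how this quantity could be turned into a training loss for a neural network, assuming PyTorch and a small fully-connected Q-network over 4-dimensional states with 2 discrete actions; the architecture, batch format, and hyperparameters are illustrative assumptions, not something prescribed above:

```python
import torch
import torch.nn as nn

# Illustrative Q-network: 4-dimensional state in, one Q value per action out.
q_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))

def bellman_loss(states, actions, rewards, next_states, dones, gamma=0.99):
    # Q(s, a) for the actions that were actually taken.
    q_sa = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        # max_a' Q(s', a'); zeroed out for terminal transitions.
        max_q_next = q_net(next_states).max(dim=1).values * (1.0 - dones)
    target = rewards + gamma * max_q_next
    # Squared version of {r_t + gamma * max_a' Q(s', a')} - Q(s, a), averaged.
    return ((target - q_sa) ** 2).mean()

# Example call with a batch of 8 random transitions.
loss = bellman_loss(torch.randn(8, 4), torch.randint(0, 2, (8,)),
                    torch.randn(8), torch.randn(8, 4), torch.zeros(8))
loss.backward()
```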