### Problem

In reinforcement learning, an agent interacts with an environment, which may or may not be fully observable. Given an observation $o_t$ (if the environment is only partially observable) or a state $s_t$ (if it is fully observable) at a discrete timestep $t$, the agent chooses an action $a_t\in \mathbb{A}$. The goal is to choose a sequence of actions that gives the maximum total reward.

### Reward functions

If the reward at each timestep $t$ is $r_t$, then we want to maximize $\mathbb{E}[R_t]$, where

$R_t = \sum_{k=0}^{\infty}r_{t+k}$

This undiscounted total has no guarantee of convergence: over an infinite horizon the sum can grow without bound (a constant reward of $1$ per timestep already diverges), so it is not a reliable way to measure how many good or bad things have happened to the agent.

This motivates the discounted reward:

$R_t = \sum_{k=0}^{\infty}\gamma^k r_{t+k} \text{ where } 0\leq \gamma \lt 1$

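With a nonzero $\gamma$, we value the current reward the most ($r_t$ has a coefficient of $\gamma^0 = 1$), but we also take possible future rewards into account, each weighted down by one more factor of $\gamma$ per timestep. If the rewards are bounded, the discounted sum is bounded too, which removes the convergence problem.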
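As a concrete illustration of the discounted sum, here is a minimal Python sketch that evaluates it for a finite list of rewards. The function name `discounted_return` and the sample rewards are made up for this example, and truncating the infinite sum to a finite list is an assumption of the sketch.

```python
def discounted_return(rewards, gamma):
    """Return sum_k gamma**k * rewards[k] for a finite list of rewards."""
    if not 0.0 <= gamma < 1.0:
        raise ValueError("gamma must satisfy 0 <= gamma < 1")
    total = 0.0
    # Walk backwards so that each step applies R_k = r_k + gamma * R_{k+1}.
    for r in reversed(rewards):
        total = r + gamma * total
    return total


rewards = [1.0, 1.0, 1.0]  # hypothetical rewards r_t, r_{t+1}, r_{t+2}
print(discounted_return(rewards, 0.5))  # 1 + 0.5 + 0.25 = 1.75
```

With a constant reward of $1$ the value approaches $1/(1-\gamma)$ as the list grows, so it stays finite even though the undiscounted sum would not.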

### Policies

A (deterministic) policy $\pi: \mathbb{S} \rightarrow \mathbb{A}$ is a function that tells us which action to take in a given state $s\in \mathbb{S}$. A policy can also be stochastic, in which case it provides a probability distribution $\pi(a \mid s)$ over actions for each state $s$.
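The two kinds of policy can be sketched in Python for a small, finite problem. Everything here is hypothetical: the state names `s0` and `s1`, the action names, the probabilities, and the helper functions are chosen only for illustration.

```python
import random

ACTIONS = ["left", "right"]  # hypothetical action set

# Deterministic policy: exactly one action per state.
deterministic_policy = {"s0": "right", "s1": "left"}

# Stochastic policy: a probability for each action in each state.
stochastic_policy = {
    "s0": {"left": 0.2, "right": 0.8},
    "s1": {"left": 0.5, "right": 0.5},
}


def act_deterministic(state):
    """Look up the single action the policy prescribes for `state`."""
    return deterministic_policy[state]


def act_stochastic(state, rng=random):
    """Sample an action from the policy's distribution for `state`."""
    probs = stochastic_policy[state]
    actions = list(probs)
    weights = [probs[a] for a in actions]
    # choices() draws one action according to the given weights.
    return rng.choices(actions, weights=weights, k=1)[0]


print(act_deterministic("s0"))                 # always "right"
print(act_stochastic("s0", random.Random(0)))  # "left" or "right"
```

A deterministic policy is the special case of a stochastic one in which a single action has probability $1$ in every state.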

Our goal is to find the best policy $\pi^*$ that maximises the expected total discounted reward from a given state $s$.

### Value functions and $Q$ functions

Consider this function:

$V^\pi(s) = \mathbb{E}_\pi[R_t \mid s_t = s]$

Since it returns the expected discounted reward obtained by starting from a given state $s$ and following the policy $\pi$ from then on, it is a good measure of the utility of a state.

However, it is often more useful to score state-action pairs instead of states, since that gives the utility of taking a particular action in a state. Then, to find the best action in a given state, we can just pick the action with the largest $Q$ value. It is defined similarly to the $V$ function:

$Q^\pi(s,a)=\mathbb{E}_\pi[R_t \mid s_t = s, a_t = a]$
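Picking the best action from $Q$ values can be sketched for a small table. This is a minimal illustration that assumes finite state and action sets and stores $Q$ in a dictionary keyed by `(state, action)`. The numbers and the helper names `greedy_action` and `greedy_value` are invented for the example.

```python
# Hypothetical Q-values for two states and two actions.
ACTIONS = ["left", "right"]
Q = {
    ("s0", "left"): 0.1,
    ("s0", "right"): 0.7,
    ("s1", "left"): 0.4,
    ("s1", "right"): 0.3,
}


def greedy_action(Q, state, actions):
    """Return the action with the largest Q-value in `state`."""
    return max(actions, key=lambda a: Q[(state, a)])


def greedy_value(Q, state, actions):
    """Return max over actions of Q(state, action)."""
    return max(Q[(state, a)] for a in actions)


print(greedy_action(Q, "s0", ACTIONS))  # right
print(greedy_value(Q, "s1", ACTIONS))   # 0.4
```

Note that no model of the environment is needed for this step, only the table itself.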

### The Bellman loss

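The definitions of the $Q$ and $V$ functions are both recursive when we use discounted rewards: the discounted reward from a state is the immediate reward plus $\gamma$ times the discounted reward from the next state. So, if we eventually converge to the optimal $Q^*$ or $V^*$ function, then $Q^*(s,a)$ equals the expected immediate reward plus $\gamma$ times the largest $Q^*$ value available in the next state $s'$.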
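To see where the recursion comes from, it may help to split the first reward off the discounted sum. This is a short sketch, assuming the discounted reward is indexed as $R_t=\sum_{k=0}^{\infty}\gamma^k r_{t+k}$ and writing $s'$ for the next state and $a'$ for an action taken there:

$R_t = r_t + \gamma\sum_{k=0}^{\infty}\gamma^k r_{t+1+k} = r_t + \gamma R_{t+1}$

Taking the expectation under a policy $\pi$ gives the recursive form of the value function:

$V^\pi(s) = \mathbb{E}_\pi[r_t + \gamma V^\pi(s') \mid s_t = s]$

For the optimal $Q$ function, where the next action is chosen greedily, the analogous relation is:

$Q^*(s,a) = \mathbb{E}[r_t + \gamma \max_{a'} Q^*(s', a') \mid s_t = s, a_t = a]$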

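So, the quantity $\{r_t+\gamma\cdot \max_{a'} Q(s', a')\}-Q(s, a)$, where $s'$ is the state reached after taking action $a$ in state $s$, measures how far we are from convergence: it is zero in expectation when $Q = Q^*$. Its square, for example, can be used as a loss in loss-based function approximators such as neural networks.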
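The error term can be computed directly for a single transition. The sketch below is tabular rather than neural, and the update rule is in the style of Q-learning, which the text above does not itself describe. The learning rate `lr`, the state and action names, and the function names are assumptions for illustration, and terminal states are not handled.

```python
def td_error(Q, s, a, r, s_next, actions, gamma):
    """Bellman error for one transition (s, a, r, s_next)."""
    target = r + gamma * max(Q[(s_next, a2)] for a2 in actions)
    return target - Q[(s, a)]


def q_learning_step(Q, s, a, r, s_next, actions, gamma=0.9, lr=0.5):
    """Move Q(s, a) a fraction `lr` of the way toward the target, in place."""
    delta = td_error(Q, s, a, r, s_next, actions, gamma)
    Q[(s, a)] += lr * delta
    return delta


ACTIONS = ["left", "right"]  # hypothetical action set
Q = {(s, a): 0.0 for s in ["s0", "s1"] for a in ACTIONS}  # start from all zeros

delta = q_learning_step(Q, "s0", "right", r=1.0, s_next="s1", actions=ACTIONS)
print(delta, Q[("s0", "right")])  # 1.0 0.5
```

With a neural network in place of the table, the same `delta` would typically be squared and minimized by gradient descent instead of being applied as a direct update.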