# RUDDER: Return Decomposition for Delayed Rewards

In reinforcement learning, we often want to do credit assignment, that is, assign credit for a received reward to past actions. Formally, if we have a trajectory $S_1,A_1,S_2,A_2,\cdots,S_T$ and a terminal reward $R_T$, we want to redistribute this sparse, delayed, non-Markov reward to earlier timesteps in the trajectory so that the reward becomes immediate and non-sparse. This paper \cite{arjona2018rudder} deals with doing exactly that, i.e., identifying which trajectory actions were instrumental in achieving the reward outcome.

### Sequence Markov Decision Processes (SDPs)

A finite MDP $M$ is a 5-tuple $M=(\mathcal S, \mathcal A, \mathcal R, p, \gamma)$ where $\mathcal S$ is the state space, $\mathcal A$ is the action space, $\mathcal R$ is the reward function, $p(S_{t+1}, R_t | S_t, A_t)$ is the transition reward distribution, and $\gamma \in [0, 1]$ is the discount factor. Markov decision processes usually come equipped with Markov policies $\pi(A_t|S_t)$.

A **sequence MDP** or **SDP** is a decision process with a Markov transition reward distribution and a Markov policy, but a reward that is not required to be Markov. That means the reward may depend on the history of states ($\mathcal R(S_{1:t})$) rather than just the immediate state ($\mathcal R(S_t)$).

Note that all MDPs are SDPs by definition.
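To make the non-Markov reward concrete, here is a toy illustration (my own, not from the paper): two episodes end in the *same* terminal state yet receive different terminal rewards, because the reward is a function of the whole state history.

```python
def terminal_reward(state_history):
    """Non-Markov reward R(S_{1:T}): 1 if the agent ever visited
    'key_room' earlier in the episode, else 0. The state names are
    hypothetical placeholders."""
    return 1.0 if "key_room" in state_history else 0.0

traj_a = ["start", "key_room", "door"]   # picked up the key on the way
traj_b = ["start", "hallway", "door"]    # reached the door without it

# Same final state 'door', different reward -- so the reward cannot be
# written as a function R(S_T) of the last state alone.
print(terminal_reward(traj_a))  # 1.0
print(terminal_reward(traj_b))  # 0.0
```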

### Return Equivalent SDPs

Now we define **return equivalence**, which will come in handy later. In reward redistribution, we want to convert our original SDP $M$ with delayed reward into an SDP $M'$ with immediate rewards. So, we define return equivalence as follows: two SDPs are return equivalent if they differ only in their reward distributions but, for every policy $\pi$, have the same expected return at $t=1$: $(v_1^\pi)_ M = (v _1^\pi)_{M'}$.

It is easy to see that return equivalent SDPs conserve optimal policies. An optimal policy is defined as a policy that maximizes the expected return at $t=1$. Since the expected return at $t=1$ is the same across two return equivalent SDPs, $$ \mathbb E_\pi[(G_1)_M|s_1] = \mathbb E_\pi[(G_1)_{M'}|s_1], $$ a policy that is optimal in one is optimal in the other. This means that we can convert our original SDP with a delayed reward into an SDP with immediate rewards while the optimal policy stays the same.
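As a quick numeric sanity check (with made-up rewards and $\gamma=1$): moving a delayed terminal reward to earlier timesteps, while keeping the total, leaves the return $G_1$ unchanged.

```python
# Two hypothetical SDPs over the same 3-step episode (gamma = 1).
# M pays everything at the final step; M' pays the same total out
# immediately. G_1 is identical, so policy ordering is preserved.

def ret(rewards, gamma=1.0):
    """Discounted return G_1 of one episode's reward sequence."""
    return sum(gamma**t * r for t, r in enumerate(rewards))

rewards_M  = [0.0, 0.0, 5.0]   # delayed: everything at t = T
rewards_M2 = [2.0, 2.0, 1.0]   # redistributed: immediate rewards

assert ret(rewards_M) == ret(rewards_M2) == 5.0
```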

### Strictly Return Equivalent SDPs

Based on the concept of return equivalent SDPs, we can define **strictly return equivalent SDPs**. Return equivalent SDPs which have the same expected return for *every episode* and every policy are called strictly return equivalent SDPs. That is,
$$
\mathbb E_\pi[(G_1)_{M}|s_1,a_1,\cdots,s_T,a_T] = \mathbb E_\pi[(G_1)_{M'}|s_1,a_1,\cdots,s_T,a_T]
$$

### Reward Redistribution

Given all these formalisms, let’s recall the objective. Our goal is to do **reward redistribution**. More specifically, for a trajectory $s_1,a_1,s_2,a_2,\cdots,s_T, a_T$ we wish to redistribute the return $G_1$ (or the expected return) along the trajectory. Since such a redistribution conserves expected return at time $t=1$, we are essentially converting our SDP with the given reward into another SDP with a different reward distribution.

If we do reward redistribution for every trajectory, we are converting our SDP to a strictly return equivalent SDP.
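A minimal sketch of this per-trajectory operation: any redistribution whose per-step rewards sum back to the original return yields a strictly return equivalent SDP. Uniform spreading, shown below, is one valid (though not optimal) choice.

```python
import math

def redistribute_uniform(terminal_return, T):
    """Spread the delayed return evenly over the T steps of one
    trajectory. The new rewards sum back to the original return, so
    applying this to every trajectory gives a strictly return
    equivalent SDP. (Illustrative only; RUDDER's optimal
    redistribution is not uniform.)"""
    return [terminal_return / T] * T

new_rewards = redistribute_uniform(6.0, 4)
assert math.isclose(sum(new_rewards), 6.0)
```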

### Optimal Reward Redistribution

How should we do our reward redistribution? This is the main idea of the paper. Given an SDP $M$ with a non-Markov delayed reward, we wish to obtain a strictly return equivalent SDP $M'$ with immediate rewards. In this new SDP, we want to ensure that at every timestep, the expected future reward is $0$. More technically, recall that $$ Q^\pi (s_t,a_t) = r(s_t,a_t) + \mathbb E_\pi[G_{t+1}|s_t,a_t] $$ If we require $\mathbb E_\pi[G_{t+1}|s_t,a_t] = 0$, then all the reward for $(s_t,a_t)$ is given immediately, and estimating the $Q$ value reduces to estimating the expected immediate reward $\mathbb E_\pi[R_t|s_t,a_t]$.

Considering a second-order Markov SDP, the paper proves that $$ \mathbb E_\pi[(R_t)_{M'}|s_{t-1},a_{t-1},s_t,a_t] = (Q^\pi(s_t,a_t)-Q^\pi(s_{t-1},a_{t-1}))_M $$ That is, the expected immediate reward in the new SDP equals the difference of successive $Q$ values in the old SDP.
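A small numeric sketch of why this redistribution conserves the return: with assumed (not learned) $Q$ values along one trajectory, the $Q$-differences telescope, so the new rewards sum to the final $Q$ value. The convention that the first step keeps $Q^\pi(s_1,a_1)$ in full is an assumption made here for illustration.

```python
# Assumed Q^pi(s_t, a_t) values along one trajectory of the old SDP.
q_values = [0.1, 0.4, 0.4, 0.9, 1.0]

# r'_t = Q(s_t, a_t) - Q(s_{t-1}, a_{t-1}); the first step keeps
# Q(s_1, a_1) itself (illustrative convention).
new_rewards = [q_values[0]] + [q_values[t] - q_values[t - 1]
                               for t in range(1, len(q_values))]

# Telescoping: the redistributed rewards sum to Q(s_T, a_T), the
# expected return that was previously delayed to the end.
assert abs(sum(new_rewards) - q_values[-1]) < 1e-12
```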

### Novel Learning Algorithms

This difference in $Q$ values gives us a way to estimate immediate rewards and convert to the new SDP. In the new SDP, since the reward redistribution is optimal, the $Q$ values reduce to the expected immediate reward and are therefore simple to compute. We can perform on-policy direct $Q$ estimation using: $$ (Q^\pi(s_ t,a_ t))_ {M'} = (Q^\pi(s_ t,a_ t))_ M - \mathbb E_{s _{t-1}, a_ {t-1}|s_ t}[(Q^\pi(s_{t-1}, a_ {t-1}))_ M] $$ And we can do off-policy $Q$ estimation with a behaviour policy $\pi'$ as follows: $$ (Q^\pi(s_t,a_t))_{M'} = (Q^\pi(s_t,a_t))_M - \mathbb E_{s_{t-1}, a_{t-1},\pi'|s_t}[(Q^\pi(s_{t-1}, a_{t-1}))_M] $$ Another approach is policy gradients, where $Q^\pi$ can likewise be computed directly from the immediate reward.

### Explaining Away Problem

In doing all of this, the **explaining away** problem may arise: the true causes of a reward outcome may receive no credit, because a later state can also explain away the reward outcome.

For example, consider a door-key environment where the objective is to pick up a key, and say the pick-up happens at timestep $t_0$. However, the information of having picked up the key is also present in every state from $t_0$ until the last step. In such a case, reward redistribution can attribute the reward to any state-action pair after $t_0$: "we see a key in this state, and hence we get a reward", which is incorrect reasoning.

To solve this problem, the authors propose taking a difference of states to only capture the changes in the environment. Specifically, given a difference function $\Delta$, we compute $\Delta_{1:T}$, which are assumed to be mutually independent of each other: $$ \Delta_{1:T} = (\Delta(s_{0},a_{0},s_1,a_1),\cdots,\Delta(s_{T-1},a_{T-1},s_T,a_T)) $$
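A minimal sketch of one possible $\Delta$ (the concrete states and features here are made up): representing states as feature dicts and keeping only the features that changed between consecutive steps. The "picked up key" event then appears exactly once, at the pickup step, instead of in every later state.

```python
def delta(prev_state, state):
    """Difference function: keep only the features whose value
    changed from the previous state."""
    return {k: v for k, v in state.items() if prev_state.get(k) != v}

states = [
    {"pos": 0, "has_key": False},
    {"pos": 1, "has_key": False},
    {"pos": 1, "has_key": True},   # key picked up here
    {"pos": 2, "has_key": True},
]
diffs = [delta(a, b) for a, b in zip(states, states[1:])]

# Only the pickup step carries the 'has_key' change, so later states
# can no longer "explain away" the reward:
assert [d for d in diffs if "has_key" in d] == [{"has_key": True}]
```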

### Differences in Q values

An LSTM function $g$ predicts the return at every timestep. Here the contributions $h$ are calculated using the differences of return predictions: $$ h(\Delta_{1:t}) = g(\Delta_{1:t})-g(\Delta_{1:t-1}) $$ Finally, all the contributions must sum up to the final delayed return in the old SDP. $$ g(\Delta_{1:T}) = \sum_t h(\Delta_{1:t}) = \mathbb E[(R_T)_M|s_T,a_T] $$ Intuitively, for a binary delayed reward, in the new SDP, the immediate rewards represent the change in success probability, which corresponds to the difference in $Q$ values in the old SDP.
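The contribution analysis can be sketched as follows; here `g` is just a stand-in lookup for the trained LSTM return predictor $g(\Delta_{1:t})$ (any sequence model works, as long as the empty-prefix prediction is $0$).

```python
def g(prefix):
    """Hypothetical return predictor: a fixed lookup by prefix length,
    standing in for the paper's LSTM g(Delta_{1:t})."""
    predictions = [0.0, 0.2, 0.2, 0.9, 1.0]  # g(Delta_{1:0}) .. g(Delta_{1:4})
    return predictions[len(prefix)]

deltas = ["d1", "d2", "d3", "d4"]  # placeholder Delta_t values

# h(Delta_{1:t}) = g(Delta_{1:t}) - g(Delta_{1:t-1})
h = [g(deltas[:t]) - g(deltas[:t - 1]) for t in range(1, len(deltas) + 1)]

# The contributions telescope back to the full-sequence prediction,
# i.e. the estimate of the delayed return in the old SDP:
assert abs(sum(h) - g(deltas)) < 1e-12
```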

### Integrated Gradients

The second method the paper uses is **integrated gradients** (IG) \cite{sundararajan2017axiomatic}. IG is an explainable AI technique used to explain a model’s prediction in terms of its input features. So in our case, we can use it to explain the predicted returns by constructing contribution values.

Our input to the model is $x=\Delta_{1:T}$. We interpolate between a baseline $\tilde x = 0$ and our input $x = \Delta_{1:T}$, and compute gradients of the model prediction w.r.t. the inputs at each interpolation point. Then, we can use the Riemann-approximation formula $$ IG_i(x) := (x_i-\tilde x_i) \times \sum_{k=1}^m \frac{1}{m} \times \frac{\partial F\left(\tilde x + \frac{k}{m}(x-\tilde x)\right)}{\partial x_i} $$ For more information about IG, consider going through the TensorFlow tutorials.
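Here is a generic numpy sketch of that Riemann sum (not RUDDER's exact implementation), using a toy function with an analytic gradient so no autodiff framework is needed. The test is IG's completeness axiom: the attributions sum to $F(x)-F(\tilde x)$.

```python
import numpy as np

def F(x):
    """Toy model standing in for the return predictor."""
    return np.sum(x ** 2)

def grad_F(x):
    return 2.0 * x  # analytic gradient of F

def integrated_gradients(x, baseline, m=1000):
    """Riemann approximation of IG with m interpolation steps."""
    avg_grad = np.zeros_like(x)
    for k in range(1, m + 1):
        interp = baseline + (k / m) * (x - baseline)  # point on the path
        avg_grad += grad_F(interp) / m                # average the gradients
    return (x - baseline) * avg_grad

x = np.array([1.0, -2.0, 3.0])
baseline = np.zeros_like(x)
ig = integrated_gradients(x, baseline)

# Completeness: attributions sum (approximately) to F(x) - F(baseline).
assert abs(ig.sum() - (F(x) - F(baseline))) < 0.05
```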

### Layerwise Relevance Propagation

The third and final method the paper focuses on is called **Layerwise Relevance Propagation** (LRP) \cite{bach2015pixel}. LRP recursively propagates back relevance values, much like backpropagation. It starts at the output and redistributes the relevance values of every layer to its lower layer. The redistribution satisfies a relevance conservation principle, in that every node distributes all the relevance it receives from higher layers to lower layers.
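The conservation principle can be seen in a minimal LRP-0 sketch for a two-layer ReLU network with no biases (the weights below are made up for illustration); with no bias terms, the input relevances sum exactly to the output score.

```python
import numpy as np

a0 = np.array([1.0, 2.0])                  # input activations
W1 = np.array([[1.0, 0.5], [0.5, 1.0]])    # layer 1 weights (j -> k)
W2 = np.array([[1.0], [1.0]])              # layer 2 weights

# Forward pass (all pre-activations happen to be positive, so ReLU
# acts as the identity here)
z1 = a0 @ W1
a1 = np.maximum(z1, 0.0)
z2 = a1 @ W2
output = z2[0]

def lrp_layer(a, W, z, R):
    """LRP-0 rule: neuron j receives relevance in proportion to its
    contribution a_j * w_jk to each higher-layer pre-activation z_k."""
    return a * (W @ (R / z))

R1 = lrp_layer(a1, W2, z2, np.array([output]))  # relevance at layer 1
R0 = lrp_layer(a0, W1, z1, R1)                  # relevance at the input

# Relevance conservation: every layer's relevances sum to the output.
assert abs(R1.sum() - output) < 1e-9
assert abs(R0.sum() - output) < 1e-9
```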