Hindsight Experience Replay

Reward engineering is a difficult problem in reinforcement learning. While humans learn a great deal from both successes and failures, reinforcement learning algorithms typically extract far more learning signal from successes than from failures. Yet failures also contain useful information, and exploiting it can help the agent converge faster. This is the idea behind hindsight experience replay (HER), which can be used with any off-policy RL algorithm.

Let's first see how this can be applied in a multi-goal setting. Schaul et al. \cite{schaul2015universal} describe a neural-network-based function approximator that learns a general-purpose $Q$ function over both states and goals.

After experiencing an episode $s_0, s_1, s_2, \ldots, s_T$ whose goal is $g$, we would typically store the transitions $(s_t \rightarrow s_{t+1}, g)$ in the experience buffer. To also learn from the episode in hindsight, we store additional copies of the same transitions paired with other goals for which the transition may prove useful.
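The storage step above can be sketched as follows. This is a minimal illustration, not the HER paper's implementation: states and goals are plain hashable values, the transitions omit actions, and the sparse `reward` function is a hypothetical choice (1 when the transition reaches the goal, 0 otherwise). The key point is that the reward must be recomputed for each substituted goal.

```python
# Hypothetical sparse reward, assumed for illustration: 1 when the
# transition reaches the goal, 0 otherwise. HER recomputes this
# reward for every substituted goal.
def reward(next_state, goal):
    return 1.0 if next_state == goal else 0.0

def store_episode(buffer, states, goal, extra_goals):
    """Append each transition once with the original goal and once per extra goal."""
    for t in range(len(states) - 1):
        s, s_next = states[t], states[t + 1]
        buffer.append((s, s_next, goal, reward(s_next, goal)))
        for g in extra_goals:
            buffer.append((s, s_next, g, reward(s_next, g)))
```

For an episode of $T$ transitions and $k$ extra goals, the buffer gains $T(k+1)$ tuples, of which the relabeled ones supply nonzero reward whenever the episode happened to pass through a substituted goal.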

There are several strategies for generating these additional tuples.

If goals are represented the same way as states, an easy strategy is to treat the episode's final state $s_T$ as the goal for every transition in the episode and append $(s_t \rightarrow s_{t+1}, s_T)$ to the buffer. The HER paper calls this strategy final.
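A minimal sketch of the final strategy, under the same simplified tuple format as above (states as plain values, actions omitted; the function name `her_final` is made up for illustration):

```python
def her_final(states):
    """Relabel every transition in the episode with the last state s_T as the goal."""
    g = states[-1]
    return [(states[t], states[t + 1], g) for t in range(len(states) - 1)]
```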

Another strategy is to sample goals from states visited later in the same episode, i.e., after the transition being replayed. This is the future strategy. Empirically it is a better strategy than final, showing the best performance.
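The future strategy can be sketched like this, again in the simplified tuple format; `k` (the number of sampled goals per transition) and the function name are illustrative assumptions:

```python
import random

def her_future(states, k=4, rng=random):
    """For each transition, sample up to k goals from states visited
    later in the same episode (s_{t+1}, ..., s_T)."""
    tuples = []
    for t in range(len(states) - 1):
        later = states[t + 1:]
        for g in rng.sample(later, min(k, len(later))):
            tuples.append((states[t], states[t + 1], g))
    return tuples
```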

There is also a random strategy, in which goals are random states encountered anywhere in training so far, and an episode strategy, in which goals are random states encountered anywhere in the replayed episode.
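Both sampling schemes can be sketched in the same simplified format; the function names and the `seen_states` parameter (a collection of all states observed in training so far) are illustrative assumptions:

```python
import random

def her_episode(states, k=4, rng=random):
    """Goals are random states from anywhere in the replayed episode."""
    return [(states[t], states[t + 1], rng.choice(states))
            for t in range(len(states) - 1)
            for _ in range(k)]

def her_random(states, seen_states, k=4, rng=random):
    """Goals are random states encountered anywhere in training so far."""
    return [(states[t], states[t + 1], rng.choice(seen_states))
            for t in range(len(states) - 1)
            for _ in range(k)]
```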

The single-goal setting can be treated as a special case of the multi-goal setting in which goals are represented the same way as states and the concatenation of state and goal is the input to the $Q$ network. Any of the above strategies can then be used.
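The concatenated input can be sketched as below. The linear $Q$ head, the dimensions, and all names here are illustrative assumptions standing in for a real neural network; only the $[state; goal]$ concatenation is the point.

```python
import numpy as np

rng = np.random.default_rng(0)

def q_input(state, goal):
    # The Q network receives the concatenation [state; goal] as its input.
    return np.concatenate([np.asarray(state, dtype=np.float32),
                           np.asarray(goal, dtype=np.float32)])

# Hypothetical linear Q head over the concatenated input, one value per action.
state_dim, goal_dim, n_actions = 4, 4, 3
W = rng.normal(size=(n_actions, state_dim + goal_dim)).astype(np.float32)

def q_values(state, goal):
    return W @ q_input(state, goal)
```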