Uber's Go Explore

The community is quite excited about Uber's Go-Explore algorithm. Hype or not, they claim to achieve some great results on Montezuma's Revenge, which is considered a hard game.

The algorithm's major hypothesis is detachment. Good RL algorithms fail because they are typically implemented using NNs, which suffer from catastrophic forgetting. So, once a section in the state space has been explored, future exploration may lead the NN to forget about it, howsoever promising it may be.

The algorithm works by constructing a repository of interesting states and the trajectories that lead to them. Every iteration, an interesting state is chosen and the trajectory leading up to it is roughly played out. Then, some exploration is done. The exploration may lead to a new interesting state or show a more rewarding trajectory to an existing interesting state. In both cases, the repository is updated.

A major part of the strategy is to be able to play out a trajectory to reach an interesting state. The play out would work perfectly in deterministic environments, but in stochastic environments, we can't play out a trajectory exactly, since the environment is choosing the next state stochastically. To be able to do this, the authors suggest we use goal based learning, just like in hindsight experience replay.

And finally, once we can solve the game this way, we can make the policy more robust with imitation learning. Overall, this approach claims to narrowly beat the human performance on the game.