Deep reinforcement learning setups to reach a goal usually have the goal embedded somewhere in the reward function. However, this makes the learning specific to that goal.

It is possible to use a different setup and make the learning more general, that is, good for any goal.

To do this, the state representation can be changed to pass in the output of a Siamese network. This network takes both the current state image and the target state image as an input and produces a discriminatory embedding that tells how different the current state and the target state are.

With this setup, the actor critic part learns through embeddings; the reward function can simply be a reward on the disagreement. If the disagreement is high, we penalize, and if it is low, we reward the setup, because we're getting close to the goal.