##### Motivation¶

Humans can easily adapt to new environments and learn new tasks by setting their own goals. In the reinforcement learning framework, however, we must manually design a reward function that orients the agent toward the goal of a given task. For example, to train a robot to pick up a package and deliver it to a destination, we might set the reward based on the distance covered. Alongside the delivery task, there may be other tasks, such as adjusting the robot arm to grasp the package based on its shape and size, or placing the package at the destination without dropping it. We can design a specific reward function for each of these tasks, but this is neither practical nor scalable for real-world problems where an agent has to solve many tasks simultaneously.

##### Contribution¶

The authors propose a reinforcement learning framework in which an agent learns general-purpose goal-conditioned policies by setting its own synthetic goals and learning to achieve them, without human intervention.

They refer to this framework as "reinforcement learning with imagined goals" (RIG).

###### Synthetic Goals¶

Initially, the agent generates a set of synthetic goals by exploring with a random policy. Both state observations and goals are images (for example, in robot navigation). The agent executes random actions in the environment, and the resulting trajectories of state observations are stored for later use.

During the policy training phase, the agent can randomly fetch these stored observations to serve as initial states or as goals.
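The exploration-and-storage step above can be sketched as a plain replay buffer. This is an illustrative toy, not the paper's code: the "environment" is a 10-position ring with two actions, and observations are integers standing in for images.

```python
import random
from collections import deque

# Toy sketch: a random policy fills a replay buffer with observations;
# during training we sample stored observations to reuse as initial
# states or as synthetic goals. All names here are illustrative.
buffer = deque(maxlen=10_000)
rng = random.Random(0)

def random_policy(_obs):
    return rng.choice([0, 1])  # hypothetical 2-action environment

obs = 0  # stand-in for an image observation
for _ in range(100):
    a = random_policy(obs)
    obs = (obs + (1 if a else -1)) % 10  # toy ring dynamics
    buffer.append(obs)

# Later: sample a stored observation to serve as a synthetic goal.
goal = rng.choice(list(buffer))
print(goal in buffer)  # True
```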

Now we have all the information needed to train a goal-conditioned agent. The authors use a Q-learning agent $Q(s, a, g)$, where $s$ is the state, $a$ is the action, and $g$ is the goal to be achieved by executing action $a$. The optimal policy can then be derived as: $\pi(s, g) = \underset{a}{\arg\max}\, Q(s, a, g)$
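To make the goal-conditioned Q-function concrete, here is a minimal tabular sketch (not the paper's image-based setup): Q-learning with a goal index on a 1-D chain of 5 states, with reward 0 on reaching the goal and -1 otherwise. The environment and hyperparameters are assumptions purely for illustration.

```python
import numpy as np

# Tabular goal-conditioned Q-learning on a 1-D chain of 5 states.
# Actions: 0 = left, 1 = right. Reward: 0 at the goal state, -1 otherwise.
n_states, n_actions = 5, 2
Q = np.zeros((n_states, n_actions, n_states))  # Q[s, a, g]
rng = np.random.default_rng(0)
alpha, gamma = 0.5, 0.9

def step(s, a):
    # Deterministic chain dynamics, clipped at the ends.
    return max(0, min(n_states - 1, s + (1 if a == 1 else -1)))

for _ in range(2000):
    s, g = rng.integers(n_states), rng.integers(n_states)
    for _ in range(10):
        a = rng.integers(n_actions)  # random exploration
        s2 = step(s, a)
        r = 0.0 if s2 == g else -1.0
        # Bootstrap unless the goal was reached (terminal).
        target = r + gamma * Q[s2, :, g].max() * (s2 != g)
        Q[s, a, g] += alpha * (target - Q[s, a, g])
        s = s2

def policy(s, g):
    """Greedy goal-conditioned policy: pi(s, g) = argmax_a Q(s, a, g)."""
    return int(Q[s, :, g].argmax())

print(policy(0, 4))  # 1: move right toward goal 4
print(policy(4, 0))  # 0: move left toward goal 0
```

The only change from standard Q-learning is the extra goal index `g`, which is exactly what lets one Q-function serve many tasks.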

In order to train this policy, two main issues need to be addressed:

a. How do we design the reward function? The distance between images is one possible reward for navigation, but pixel-wise distance does not carry the semantic meaning of the actual distance between states, and it is also computationally expensive.

b. How do we represent goals as a distribution so that we can sample goals for training?

The authors resolve these issues by using a variational autoencoder (VAE) to learn an encoded representation of the images. The VAE takes raw images ($x$) as input and generates low-dimensional latent representations ($z$). Using these latent representations, we now have latent states ($z$) and latent goals ($z_g$).
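Working in the latent space lets the reward be a simple distance between embeddings. A minimal sketch follows; the `encode` function here is a fixed random linear map standing in for a trained VAE encoder's mean output, and the image/latent sizes are made up for illustration.

```python
import numpy as np

# Sketch of a latent-space reward: negative Euclidean distance between
# the current latent state and the latent goal. `encode` is a stand-in
# for a trained VAE encoder (here just a fixed random projection).
rng = np.random.default_rng(0)
W = rng.standard_normal((4, 16))  # hypothetical 16-pixel image -> 4-d latent

def encode(x):
    return W @ x

def latent_reward(z, z_g):
    # Closer to the goal embedding means higher (less negative) reward.
    return -np.linalg.norm(z - z_g)

x, x_goal = rng.random(16), rng.random(16)
z, z_g = encode(x), encode(x_goal)
print(latent_reward(z, z_g) <= 0.0)  # True
print(latent_reward(z_g, z_g))       # 0.0 at the goal
```

Because the VAE compresses images into a semantically meaningful low-dimensional space, this distance is both cheaper to compute and more meaningful than pixel-wise distance.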

The working algorithm can be summarized as:

a. Initially, the agent explores the environment using a random policy, and the state observations are stored.

b. A VAE is trained on the raw images from (a) to learn latent representations of all state observations.

c. Initial latent states ($z$) and latent goals ($z_g$) are sampled using the encoder from (b).

d. A goal-conditioned Q-function $Q(z, a, z_g)$ is trained on data from (c), and the policy $\pi_{\theta}(z, z_g)$ is learned in the latent space.
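Steps (a) through (d) can be tied together in a heavily simplified end-to-end sketch. Everything here is an illustrative assumption, not the paper's code: "images" are one-hot vectors over 5 positions, and the "trained VAE encoder" is a fixed linear map; the actual Q-learning update in step (d) is elided.

```python
import numpy as np

# End-to-end toy sketch of steps (a)-(d), with heavy simplifications.
rng = np.random.default_rng(0)
n_pos, n_actions = 5, 2

def render(pos):
    # Raw "image" observation: a one-hot vector over positions.
    x = np.zeros(n_pos)
    x[pos] = 1.0
    return x

def encode(x):
    # (b) Stand-in for a trained VAE encoder (identity projection here).
    return np.eye(n_pos) @ x

# (a) Exploration with a random policy, storing observations.
states, pos = [], 0
for _ in range(200):
    a = rng.integers(n_actions)
    pos = max(0, min(n_pos - 1, pos + (1 if a else -1)))
    states.append(render(pos))

# (c) Sample a latent initial state and a latent goal from the buffer.
z0 = encode(states[rng.integers(len(states))])
zg = encode(states[rng.integers(len(states))])

# (d) The latent-space reward that would drive Q-learning of Q(z, a, z_g).
reward = -np.linalg.norm(z0 - zg)
print(reward <= 0.0)  # True
```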
