##### Motivation

The policy-learning process in Reinforcement Learning (RL) usually suffers from delayed/sparse rewards. The reward is the direct signal an agent uses to evaluate 'how good the current action selection is'. Since reward collection takes time, deriving the optimal policy also takes longer. Another factor that influences the learning process is the human-designed reward function. Such reward functions may not provide optimal guidance for the agent's learning, and may not scale to real-world problems. We need a way to overcome reward sparsity while also improving the agent's exploration, making learning more robust.

The human learning process is not only guided by the final goal or achievement, but also driven by the motivation or curiosity of the learner. Curiosity adds exploratory behaviour to the agent, allowing it to acquire new skills and gain new knowledge about the environment. It also pushes the agent to perform actions that reduce uncertainty about its own behaviour, i.e. to capture the consequences of its own actions.

##### Contribution

In this paper, the authors propose curiosity-driven learning using an agent-intrinsic reward (i.e. a reward the agent generates itself by understanding the current environment and the possible changes in state as it navigates). To quantify curiosity, they introduce the "Intrinsic Curiosity Module".

###### Intrinsic Curiosity Module (ICM)

The output of the ICM is the state-prediction error, which serves as the intrinsic reward for curiosity. The module has two sub-components, each represented by a neural network.

a. Inverse Model:

This model learns a feature space using self-supervision. The new feature space is learnt so as to discard features/information that are irrelevant to the agent while navigating. Learning the feature space involves two sub-modules:

i) The first module encodes the raw input state ($s_t$) into a feature vector ($\phi(s_t)$).

ii) The second module takes $\phi(s_t)$ and $\phi(s_{t+1})$ as encoded feature inputs and predicts the action $\hat{a_t}$ that the agent might take to go from $s_t$ to $s_{t+1}$.

$\hat{a_t} = g( \phi(s_t), \phi(s_{t+1}); \theta_I )$

Here the function $g$ is a neural network and $\hat{a_t}$ is the estimated action. The learnable parameters $\theta_I$ are trained with a loss function $L_I( \hat{a_t}, a_t)$ that measures the discrepancy between the predicted and the actual action (a cross-entropy loss when actions are discrete).
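As a minimal sketch of this idea (not the paper's architecture: the dimensions, the single-layer encoder, and the random weights below are illustrative assumptions), the encoder $\phi$ and inverse model $g$ can be written as:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes, chosen only for illustration.
STATE_DIM, FEAT_DIM, N_ACTIONS = 8, 16, 4

def encode(state, W_enc):
    """Toy feature encoder phi: a single linear layer with ReLU."""
    return np.maximum(0.0, state @ W_enc)

def inverse_model(phi_t, phi_t1, W_inv):
    """g(phi(s_t), phi(s_{t+1})): a distribution over discrete actions."""
    logits = np.concatenate([phi_t, phi_t1]) @ W_inv
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()  # softmax

def inverse_loss(probs, a_t):
    """L_I: cross-entropy between the predicted and the actual action."""
    return -np.log(probs[a_t] + 1e-12)

# Usage on random data
W_enc = rng.normal(size=(STATE_DIM, FEAT_DIM))
W_inv = rng.normal(size=(2 * FEAT_DIM, N_ACTIONS))
s_t, s_t1 = rng.normal(size=STATE_DIM), rng.normal(size=STATE_DIM)
probs = inverse_model(encode(s_t, W_enc), encode(s_t1, W_enc), W_inv)
L_I = inverse_loss(probs, a_t=2)
```

Minimizing $L_I$ by gradient descent would shape both `W_inv` and the encoder weights `W_enc`, which is exactly how the feature space ends up ignoring state information that the agent's actions cannot influence.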

b. Forward Model:

This is a neural network that predicts the feature encoding of the next state, $\phi(s_{t+1})$, from the inputs $\phi(s_t)$ and the action executed at $s_t$.

$\hat{\phi(s_{t+1})} = f( \phi(s_t), a_t, \theta_F)$

$\hat{\phi(s_{t+1})}$ is the predicted estimate of $\phi(s_{t+1})$ and $\theta_F$ represents the trainable parameters, with loss function:

$L_F ( \phi(s_{t+1}), \hat{\phi(s_{t+1})}) = \frac{1}{2} \left\| \hat{\phi(s_{t+1})} - \phi(s_{t+1}) \right\|^2$

The curiosity reward is this prediction error scaled by a factor $\eta > 0$, i.e. $r_t = \eta L_F$.
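A matching sketch of the forward model and the resulting curiosity reward, under the same illustrative assumptions as above (the sizes, the linear model, and the value of the scaling factor are all made up):

```python
import numpy as np

rng = np.random.default_rng(1)
FEAT_DIM, N_ACTIONS = 16, 4  # hypothetical sizes
ETA = 0.01                   # reward scaling factor eta (a free hyperparameter)

def forward_model(phi_t, a_t, W_fwd):
    """f(phi(s_t), a_t): predict the feature encoding of the next state."""
    a_onehot = np.eye(N_ACTIONS)[a_t]
    return np.concatenate([phi_t, a_onehot]) @ W_fwd

def forward_loss(phi_t1, phi_t1_hat):
    """L_F = 1/2 * ||phi_hat - phi||^2 (squared error in feature space)."""
    return 0.5 * np.sum((phi_t1_hat - phi_t1) ** 2)

# The intrinsic (curiosity) reward is the scaled prediction error.
W_fwd = rng.normal(size=(FEAT_DIM + N_ACTIONS, FEAT_DIM))
phi_t, phi_t1 = rng.normal(size=FEAT_DIM), rng.normal(size=FEAT_DIM)
L_F = forward_loss(phi_t1, forward_model(phi_t, a_t=1, W_fwd=W_fwd))
r_intrinsic = ETA * L_F
```

Transitions the forward model predicts poorly yield a large `r_intrinsic`, so the agent is rewarded for visiting states it cannot yet model well.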

Both losses are jointly minimized:

$\underset{\theta_I, \theta_F}{\min} \left[ (1-\beta) L_I + \beta L_F \right]$

where $\beta \in [0, 1]$ weighs the forward-model loss against the inverse-model loss.
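Numerically, the joint objective is just a convex combination of the two losses; the paper sets $\beta = 0.2$ (the loss values below are made up for illustration):

```python
BETA = 0.2          # weighs forward-model loss against inverse-model loss
L_I, L_F = 1.5, 0.8  # hypothetical loss values, for illustration only

# Joint ICM objective, minimized over theta_I and theta_F.
icm_loss = (1 - BETA) * L_I + BETA * L_F
```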

NOTE:

The ICM works with two coupled models: the inverse model (which learns the feature representation of the current and next states) and the forward model (which predicts the feature representation of the next state). Curiosity is then measured as the difference between the forward model's output, $\hat{\phi(s_{t+1})}$, and the feature encoding of the actual next state, $\phi(s_{t+1})$.
