Summary of the paper "Proximal Policy Optimization Algorithms"

##### Motivation¶

Deep Q learning, 'Vanilla' Policy Gradient, REINFORCE are the examples of approaches for function approximation in RL. When it comes to RL, robustness and sample efficiency are the measures that defines effectivenss of the applied algorithm.

In RL formulation, the agent needs to solve a task/problem in an envronment. Agent countinously interacts with the environment, which provides rewards to the agent, and thereafter agent learns a policy to navigate and tackle the problem. In every time step, RL agent has to make a decision by selecting preferrable action. To do so, agent fully relies on the information of current state and accumulated knowledge (history of rewards) up to current time step. Once the action is performed, the next state/observation is defined by some (stochastic) transition probability model. Also, reward will be signaled to the agent based on this new state information and performed action to get there. In overall, the goal of the agent is to maximize expected cumulative rewards.

In terms of RL, this goal can be formulated as finding a policy $ \pi (a_t | s_t) $ such that expected reward $ \mathbb{E_{\pi_{\theta, \tau}}} [G_t] $ is maximized.

In high dimensional/ countinous action space, policy gradient method can be used to solve this problem. In "vanilla" policy gradient method, the policy is parameterized by some parameter $ \theta $ .i.e parametric policy $ \pi_{\theta} (a_t | s_t) $ and we directly optimize the policy by finding the optimal parameter $ \theta $.

Even though 'Vanilla' policy gradient/ REINFORCE are simple/easier to implement, they come with some learning issues:

PROBLEM : Usually give rise to high variance while estimating gradient. This is because, the objective function

$ J(\theta) = \mathbb{E_{\pi_{\theta, \tau}}} (G_t) $

contains expectation; so we can't directly compute the exact gradient. We use stochatic gradient estimates such as REINFORCE based on some batch of trajectories. This sampling approximation adds some variance. That means, we need large number of trajectories to get the best estimation. In addition, we can see that collecting trajectories could be a problem in complex environments-might take long time to run.

##### Contribution¶

Authors have introduced a family of policy optimization methods which is build up on the basis of work of Trust-Region Policy optimization. Two main ideas:

a. Clipped Surrogate Objective Function : It avoids large deviations of learned policy $ \pi_{\theta} $ from old policy $ \pi_{\theta old} $; which is formulated as:

$ L^{clip}(\theta) = \mathbb{E}_t [ min (r_t (\theta) \hat{A_t}, clip(r_t(\theta), 1 - \epsilon, 1 + \epsilon) \hat{A_t}) ] $

Here $ r_t(\theta) = \frac{\pi_{\theta}(a_t | s_t)}{\pi_{\theta old}(a_t | s_t)} $

$ \epsilon $ is hyperparameter, which restricts the new policy from being too far from old policy. $ \hat{A_t} $ can be discounted return or advantage function.

The clipping ensures the updates take place in "trust-region". Also, introduces less variance than vanilla gradient methods

b. Multiple Epochs for policy update:

PPO allows to run multiple epochs on the same trajectories and optimize the objective $ L^{clip}(\theta) $. This also reduces the sample inefficieny while learning. In order to collect data, PPO runs policy with parallel actors and then samples mini-batches of the data for training k-epochs using the objective function above.

Let's observe the behaviour of the objective function based on changes in advantage function.

CASE I : When $ \hat{A_t} $ is +ve :

The objective function can be written as :

$ L^{clip} (\theta) = min ( r_t(\theta), 1 + \epsilon ) \hat{A_t} $

Since $\hat{A_t} $ is +ve, when the action occurrence likelihood increases (i.e. $ \pi_{\theta}(a_t | s_t) $), the whole objective value will also increase. The min operator limits the increasing objective value. So, when $ \pi_{\theta}(a_t | s_t) ) > (1 + \epsilon) \pi_{\theta old}(a_t | s_t) $, ceiling occurs at $ (1 + \epsilon) \hat{A_t} $

CASE II : When $ \hat{A_t} $ is -ve :

The objective function can be written as :

$ L^{clip} (\theta) = max ( r_t(\theta), 1 - \epsilon ) \hat{A_t} $

Now, the objective function value only increases when the likelihood of the action is less likely (i.e. $ if \pi_{\theta}(a_t | s_t) $ decreases, $ L^{clip} (\theta) $ increases ) In this case, when $ (1 - \epsilon) \pi_{\theta old}(a_t | s_t) > \pi_{\theta}(a_t | s_t) ) $, max operator limits the value at $ (1 - \epsilon) \pi_{\theta old}(a_t | s_t) \hat{A_t} $

```
```