##### Motivation

Deep neural networks (DNNs) were introduced into the reinforcement learning (RL) framework to make function approximation easier and more scalable for problems with large state spaces. A DNN, however, tends to overfit because the data collected while navigating the environment is correlated (e.g. when we play a game, consecutive moves of a player within a short time frame look similar and contribute little to learning). To avoid this, people started using experience replay: the navigation experience (e.g. screenshots in games) is stored in a buffer and reused later when training/updating the policy/model parameters.
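As a rough illustration of the buffer described above, here is a minimal sketch. The class name `ReplayBuffer`, its methods, and the `(state, action, reward, next_state, done)` tuple layout are assumptions made for this example, not something taken from the paper.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-size store of (state, action, reward, next_state, done) tuples."""

    def __init__(self, capacity):
        # A deque with maxlen drops the oldest transition once the buffer is full.
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform sampling without replacement mixes old and new transitions,
        # which weakens the correlation between consecutive samples.
        return random.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)
```

The extra memory mentioned in this section is visible here: every transition has to be kept around until it is evicted.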

This works well, but only for off-policy algorithms such as Q-learning. How can on-policy algorithms such as SARSA be trained stably with a DNN? Experience replay also adds memory requirements and computational delay to each update and to each real interaction with the environment.

**NOTE:**

• On-policy: the training data is generated by the same policy that is being trained, e.g. REINFORCE.
• Off-policy: training data generated by another policy can be used to train the current policy, e.g. Q-learning.

##### Contribution

In this paper, the authors introduce an asynchronous training process that runs multiple agents in parallel, each in its own instance of the same environment, on multiple CPU cores. The agents run as threads and update the global model parameters asynchronously in an online fashion. The authors report that this approach gives stable learning and faster convergence.
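The update pattern can be sketched with Python threads and a shared NumPy array. This is only an illustration under stated assumptions: a toy quadratic loss stands in for the RL loss, the function names `worker` and `train_async` are hypothetical, and Python's global interpreter lock means these threads do not give real CPU parallelism.

```python
import threading

import numpy as np


def worker(shared_theta, target, n_steps, lr, seed):
    """One actor-learner: computes a noisy gradient, then updates the shared parameters."""
    rng = np.random.default_rng(seed)  # per-thread RNG stands in for a private environment copy
    for _ in range(n_steps):
        # Gradient of the toy loss 0.5 * ||theta - target||^2, plus noise.
        grad = (shared_theta - target) + rng.normal(0.0, 0.1, size=target.shape)
        # In-place update of the shared array, without any locking.
        shared_theta -= lr * grad


def train_async(n_workers=4, n_steps=500, lr=0.05):
    target = np.array([1.0, -2.0, 0.5])
    shared_theta = np.zeros_like(target)
    threads = [
        threading.Thread(target=worker, args=(shared_theta, target, n_steps, lr, seed))
        for seed in range(n_workers)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return shared_theta, target
```

Calling `train_async()` returns the shared parameters after all workers finish; they should end up close to `target` even though no worker waits for the others.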

They introduce asynchronous variants of SARSA, 1-step/n-step Q-learning, and the advantage actor-critic algorithm. Let's discuss some details of asynchronous Q-learning and asynchronous advantage actor-critic (A3C).

###### Asynchronous Q-Learning

In deep Q-learning (DQN), a neural network (NN) approximates the Q-value function $Q(s, a; \theta)$, with the loss formulated as:

$$L_i(\theta_i) = \mathbb{E}\left[\left(r + \gamma \max_{a'} Q(s', a'; \theta_{i-1}) - Q(s, a; \theta_i)\right)^2\right] \quad \text{(i)}$$
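For a batch of transitions, loss (i) can be computed as in the sketch below. The one-step Q-learning target takes the maximum over next actions under the older parameters. The function name, the array layout, and the `dones` mask that switches off bootstrapping at terminal states are assumptions of this example.

```python
import numpy as np


def q_learning_loss(q_values, next_q_values_old, actions, rewards, dones, gamma=0.99):
    """Mean squared one-step TD error for a batch.

    q_values:          Q(s, .; theta_i),      shape (batch, n_actions)
    next_q_values_old: Q(s', .; theta_{i-1}), shape (batch, n_actions)
    """
    # Target: r + gamma * max_a' Q(s', a'; theta_{i-1}); no bootstrap when done == 1.
    targets = rewards + gamma * (1.0 - dones) * next_q_values_old.max(axis=1)
    # Q(s, a; theta_i) for the actions that were actually taken.
    chosen = q_values[np.arange(len(actions)), actions]
    td_error = targets - chosen
    return np.mean(td_error ** 2)
```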

In asynchronous 1-step Q-learning, each thread maintains its own copy of the environment, and its agent traverses that environment using an $\epsilon$-greedy policy. At each step, the thread computes the gradient of the loss (i), and gradients are accumulated over multiple timesteps before the parameters are updated.
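A single thread's loop might look like the sketch below. To keep it self-contained, a table of Q-values plays the role of the network parameters and a tiny chain environment (`chain_step`) replaces a real game; both are hypothetical stand-ins, as are the default hyperparameters.

```python
import numpy as np


def epsilon_greedy(q_row, epsilon, rng):
    """Pick a random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_row)))
    return int(np.argmax(q_row))


def chain_step(state, action, n_states):
    """Toy environment: action 1 moves right, action 0 moves left.
    Reaching the last state gives reward 1 and ends the episode."""
    next_state = min(state + 1, n_states - 1) if action == 1 else max(state - 1, 0)
    done = next_state == n_states - 1
    return next_state, (1.0 if done else 0.0), done


def run_thread(q, n_updates, accumulate_steps=5, gamma=0.9, lr=0.1, epsilon=0.3, seed=0):
    """One thread: accumulate gradients for several steps, then apply them once."""
    rng = np.random.default_rng(seed)
    n_states = q.shape[0]
    state = 0
    for _ in range(n_updates):
        grad = np.zeros_like(q)
        for _ in range(accumulate_steps):
            action = epsilon_greedy(q[state], epsilon, rng)
            next_state, reward, done = chain_step(state, action, n_states)
            target = reward if done else reward + gamma * q[next_state].max()
            # d/dQ[s, a] of 0.5 * (target - Q[s, a])^2, with the target held constant.
            grad[state, action] += -(target - q[state, action])
            state = 0 if done else next_state
        q -= lr * grad  # one parameter update per accumulated batch of steps
    return q
```

In the asynchronous setting, `q` would be the shared parameters and several such loops would run in separate threads.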

###### Asynchronous Advantage Actor-Critic (A3C)

The actor-critic method combines value-based and policy-based methods.

It has a policy $\pi(a_t | s_t; \theta)$ and a value function $V(s_t; \theta_v)$ to be learned. It uses the "forward view": the agent selects actions according to its policy $\pi(a_t | s_t; \theta)$ for up to $t_{max}$ steps into the future, collecting up to $t_{max}$ rewards since the last update.
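The collected rewards are turned into returns $R_t$ by working backwards through the rollout. The sketch below assumes the return is bootstrapped from the value estimate of the last state reached (or from 0 if the episode ended); the function name and arguments are hypothetical.

```python
def n_step_returns(rewards, bootstrap_value, gamma=0.99):
    """Discounted returns R_t for a rollout of up to t_max steps.

    bootstrap_value: V of the state after the last reward, or 0.0 if the episode ended.
    """
    returns = []
    running = bootstrap_value
    # Work backwards: R_t = r_t + gamma * R_{t+1}.
    for reward in reversed(rewards):
        running = reward + gamma * running
        returns.append(running)
    returns.reverse()
    return returns
```

For example, `n_step_returns([1.0, 0.0, 2.0], 4.0, gamma=0.5)` gives `[2.0, 2.0, 4.0]`.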

Now, the policy and value functions are updated after every $t_{max}$ actions as:

• Policy Network : $\nabla_{\theta} \log \pi(a_t | s_t; \theta) (R_t - V(s_t; \theta_v))$
• Value Network : $\nabla_{\theta_v} (R_t - V(s_t; \theta_v))^2$
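The two gradients above can be written out by hand for a small model. The sketch below assumes a linear-softmax policy and a linear value function, which are simplifications chosen so that no autodiff library is needed; the paper itself uses neural networks, and the names here are hypothetical.

```python
import numpy as np


def softmax(logits):
    z = logits - logits.max()  # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()


def actor_critic_gradients(states, actions, returns, policy_w, value_w):
    """Accumulate the policy and value gradients over one rollout.

    policy_w: (n_actions, n_features) weights of a linear-softmax policy
    value_w:  (n_features,) weights of a linear value function
    """
    d_policy = np.zeros_like(policy_w)
    d_value = np.zeros_like(value_w)
    for s, a, R in zip(states, actions, returns):
        probs = softmax(policy_w @ s)
        advantage = R - value_w @ s  # R_t - V(s_t; theta_v)
        # Gradient of log pi(a|s) for a linear-softmax policy: (onehot(a) - probs) outer s.
        one_hot = np.zeros_like(probs)
        one_hot[a] = 1.0
        d_policy += np.outer(one_hot - probs, s) * advantage
        # Gradient of (R - V(s))^2 with respect to value_w.
        d_value += -2.0 * advantage * s
    return d_policy, d_value
```

The policy parameters would then take a step along `d_policy` (ascent), while the value parameters take a step against `d_value` (descent on the squared error).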

In this learning framework, parallel actor-learners update a shared model, which makes the learning process more robust and stable.
