Motivation

Reinforcement Learning (RL) solves the problem of learning through interaction with a (dynamic) environment. The learner's objective is to find an optimal policy that can guide the agent through the environment. This optimal policy is formulated in terms of maximizing the agent's future reward. The value function $V_{\pi}(s)$ and the action-value function $Q_{\pi}(s,a)$ are measures of potential future rewards.

• $V_{\pi}(s)$ : a measure of how good it is to be in state $s$ and then follow policy $\pi$
• $Q_{\pi}(s,a)$ : a measure of how good it is to be in state $s$, perform action $a$, and then follow policy $\pi$

NOTE

• Both $V_{\pi}(s)$ and $Q_{\pi}(s,a)$ are defined as expectations of the discounted future rewards, and their values are maintained in a lookup table.
• Goal : find the value of each state or (state, action) pair in a given environment, so that the agent can follow an optimal path that collects maximum reward.
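The lookup-table idea above can be sketched in a few lines. This is a minimal, hypothetical example (the states, actions, and Q-values are illustrative, not from the paper): Q-values live in a plain dictionary, and the agent acts greedily with respect to them.

```python
# A minimal sketch of tabular Q-values for a toy environment.
# States, actions, and values here are illustrative assumptions.
Q = {}  # lookup table mapping (state, action) -> expected discounted future reward

def get_q(state, action):
    # Unseen (state, action) entries default to 0.0
    return Q.get((state, action), 0.0)

def greedy_action(state, actions):
    # Follow the "optimal path" by picking the highest-valued action
    return max(actions, key=lambda a: get_q(state, a))

Q[("s0", "left")] = 0.5
Q[("s0", "right")] = 1.2
print(greedy_action("s0", ["left", "right"]))  # -> right
```

Every distinct state-action pair needs its own table entry, which is exactly why this representation blows up on large state spaces.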

In large-scale RL problems, maintaining a lookup table leads to the 'curse of dimensionality'. Currently, this problem is solved using function approximation: the approximator generalizes the estimate of the state value or state-action value from a set of features of the given state/observation. Most existing approaches approximate the value function and then derive a policy from it. The authors point out two major limitations of this approach:

a. It is oriented toward finding a deterministic policy, which might not be optimal for complex problems/environments.

b. A small variation in the value estimate can cause a different action to be selected; the derived policy is sensitive.

Contribution

The authors propose an alternative: approximate the policy directly with a parameterized function. Instead of storing Q-values in a table, the policy itself is learnt by a function approximator. For example, the policy can be represented by a Neural Network (NN) that takes the state as input and outputs a probability distribution for action selection. With $\theta$ as the parameters of the NN representing the policy and $\rho$ as its performance measure (which can serve as a loss function), the parameters $\theta$ are updated as:

$\theta_{t+1} \gets \theta_t + \alpha \frac{\partial {\rho}}{ \partial{\theta}}$
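A concrete sketch of such a parameterized policy, assuming (for brevity) a single softmax layer rather than a deep NN, with illustrative feature and action counts:

```python
import numpy as np

# Sketch: a softmax policy parameterized by theta (stand-in for an NN).
# Feature/action dimensions and the gradient estimate are illustrative.
rng = np.random.default_rng(0)
n_features, n_actions = 4, 3
theta = rng.normal(scale=0.1, size=(n_features, n_actions))

def policy(state_features, theta):
    # Map state features to a probability distribution over actions
    logits = state_features @ theta
    logits = logits - logits.max()   # numerical stability
    p = np.exp(logits)
    return p / p.sum()

s = rng.normal(size=n_features)
probs = policy(s, theta)             # valid distribution: positive, sums to 1

# Gradient-ascent update theta <- theta + alpha * d(rho)/d(theta);
# grad_rho is a placeholder for an estimated performance gradient.
alpha = 0.01
grad_rho = rng.normal(size=theta.shape)
theta = theta + alpha * grad_rho
```

The rest of the paper is about how to estimate that performance gradient $\partial \rho / \partial \theta$ from experience.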

For any Markov Decision Process (MDP),

$\nabla_{\theta} J(\theta) = \frac{\partial \rho(\pi)}{\partial \theta} = \sum_{s} d^{\pi}(s) \sum_{a} \frac{\partial \pi(s,a)}{\partial \theta} Q^{\pi}(s,a)$ ----------(a)

Here $\rho(\pi)$ is the average reward under the current policy $\pi$, and $d^{\pi}(s)$ is the stationary distribution of states under $\pi$.

• The problem with the above formulation is how to obtain $Q^{\pi}(s,a)$ -> it must be estimated.

Notably, the gradient expression does not involve the derivative of the state distribution $d^{\pi}(s)$ with respect to the policy parameters $\theta$. Since the gradient does not depend on knowledge of the MDP dynamics, this allows model-free learning in RL. Estimating the policy gradient with Monte-Carlo sampling gives the REINFORCE algorithm.
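The Monte-Carlo estimate can be sketched on a deliberately tiny problem. This is a toy sketch, not the paper's experiment: a single-state, two-action environment (a bandit, with an assumed reward function) so that the return of each one-step "trajectory" is just the immediate reward.

```python
import numpy as np

# REINFORCE sketch on a hypothetical 1-state, 2-action problem.
# The reward function and all constants are illustrative assumptions.
rng = np.random.default_rng(1)
theta = np.zeros(2)                   # one logit per action
alpha = 0.1

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def reward(action):                   # toy environment: action 1 pays off
    return 1.0 if action == 1 else 0.0

for episode in range(500):
    p = softmax(theta)
    a = rng.choice(2, p=p)            # sample a trajectory from pi
    G = reward(a)                     # Monte-Carlo return of this trajectory
    grad_log_pi = -p
    grad_log_pi[a] += 1.0             # d log pi(a) / d theta for a softmax
    theta += alpha * G * grad_log_pi  # REINFORCE update

print(softmax(theta))                 # the policy now strongly prefers action 1
```

Because each update is weighted by a single sampled return $G$, the estimate is unbiased but noisy, which is exactly the high-variance issue discussed next.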

In Monte-Carlo sampling, we run N trajectories with the current policy $\pi$ and collect the returns. However, these returns have high variance, and we might need many episodes for smooth convergence. The variance arises because we cannot collect the same trajectories multiple times (i.e., the agent's movement is also dynamic) when our stochastic policy acts in a stochastic environment.

• QUESTION : How do we estimate the Q-value in equation (a)?

The authors use a function approximator $f_w(s,a)$ with parameters $w$ to estimate $Q^{\pi}(s,a)$:

$\nabla_{\theta} J(\theta) = \sum_{s} d^{\pi}(s) \sum_{a} \frac{\partial \pi(s,a)}{\partial \theta} f_w(s,a)$ --------- (b)

Here $f_w(s,a)$ is learnt by following $\pi$ and updating $w$ to minimize the mean-squared error between Q-values, $[Q^{\pi}(s,a) - f_w(s,a)]^2$. The approximator predicts a Q-value for a given state/action, while the agent's experience in the environment provides a sample estimate of that same Q-value; the algorithm drives the difference between the two as close to zero as possible.
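A minimal sketch of that critic update, assuming a linear approximator $f_w(s,a) = w^\top \phi(s,a)$ and a stand-in target in place of sampled returns (the features and target weights are illustrative):

```python
import numpy as np

# Critic sketch: fit f_w by stochastic gradient descent on [Q - f_w]^2.
# Features and the "true" Q target are illustrative assumptions.
rng = np.random.default_rng(2)
n_features = 4
w = np.zeros(n_features)
alpha = 0.05
w_true = np.array([1.0, -2.0, 0.5, 0.0])   # stand-in for Q^pi

def f(phi, w):
    return phi @ w                          # critic's Q-value estimate

for step in range(500):
    phi = rng.normal(size=n_features)       # features of a sampled (s, a)
    q_target = phi @ w_true                 # sampled estimate of Q^pi(s, a)
    error = q_target - f(phi, w)
    # Gradient step on [Q - f_w]^2:  w <- w + alpha * error * d f_w / d w
    w += alpha * error * phi

print(np.round(w, 2))
```

In practice the target would come from observed returns (or bootstrapped estimates) rather than a known weight vector, but the MSE-driven update has the same shape.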

The resulting formulation (b) gives the idea of actor-critic architecture for RL where

i. $\pi(s,a)$ is the actor, which learns to approximate the policy by maximizing (b)

ii. $f_w(s,a)$ is the critic, which learns to estimate the Q-values by minimizing the MSE between its estimates and the sampled Q-values.
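Putting the two pieces together, here is a toy actor-critic loop on a hypothetical single-state, two-action problem (rewards, learning rates, and the tabular critic are all illustrative assumptions, not the paper's setup):

```python
import numpy as np

# Actor-critic sketch: softmax actor pi(a; theta), tabular critic f_w(a) = w[a].
# The environment and constants below are illustrative assumptions.
rng = np.random.default_rng(3)
theta = np.zeros(2)                   # actor parameters (action logits)
w = np.zeros(2)                       # critic parameters, one Q-estimate per action
alpha_actor, alpha_critic = 0.05, 0.2

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def reward(action):                   # toy environment: action 1 pays more
    return 1.0 if action == 1 else 0.2

for step in range(1000):
    p = softmax(theta)
    a = rng.choice(2, p=p)
    r = reward(a)
    # Critic: gradient step on [Q - f_w]^2, moving w[a] toward the sample return
    w[a] += alpha_critic * (r - w[a])
    # Actor: gradient of log pi(a) weighted by the critic's estimate, as in (b)
    grad_log_pi = -p
    grad_log_pi[a] += 1.0
    theta += alpha_actor * w[a] * grad_log_pi

print(softmax(theta))                 # the actor comes to prefer action 1
```

The actor never sees raw returns directly; it is guided by the critic's lower-variance estimate $f_w$, which is the point of formulation (b).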
