What is machine learning?

In machine learning, learning means discovering and extracting hidden patterns or features from data. Conventional systems are built from well-defined rules written by people. A machine learning system, by contrast, learns from experience rather than from explicitly supplied logic. To capture complicated patterns in the inputs (x), a machine learning algorithm uses the outputs (y) as a feedback signal. The final product of the ML system is thus an intelligent program.

We usually solve a problem with hand-written logic: we break the task into several smaller tasks and solve each one with its own reasoning. For extremely complicated tasks, such as stock price prediction, the patterns keep changing, and that affects the results. To solve such a problem with hand-written logic, we would have to adjust the logic for every new change in the outputs. Machine learning (ML), on the other hand, builds a model from a large amount of data. The data supplies the model with historical experience, which helps it capture the pattern. Whenever the data changes, we simply retrain the model on fresh examples.
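As a minimal sketch of "retrain instead of rewrite", the snippet below fits a straight line to (x, y) pairs by least squares and then refits it when the pattern changes. The function name `fit_line` and the toy numbers are made up for illustration; real problems such as stock prices need far richer models and data.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = w * x + b to paired samples."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    w = cov / var
    b = mean_y - w * mean_x
    return w, b


# Toy data (made up): the outputs follow y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
w, b = fit_line(xs, ys)

# The pattern changes to y = -x + 4. No rule is rewritten by hand;
# the same training code is simply run again on the fresh examples.
new_ys = [4.0, 3.0, 2.0, 1.0]
w_new, b_new = fit_line(xs, new_ys)

print(w, b)          # slope and intercept learned from the old data
print(w_new, b_new)  # slope and intercept learned from the new data
```

The outputs (y) act as the feedback signal here: the fit is whatever line makes the squared error against them smallest.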

Paper Summary: Playing Atari With Deep Reinforcement Learning

Motivation

Deep Learning (DL) has proven to work well when a large amount of data is available. Unlike the supervised DL setup, Reinforcement Learning (RL) does not have direct access to targets/labels. An RL agent usually gets only delayed and sparse rewards as a signal for understanding the environment and learning a policy for it. Another challenge is the distribution of the inputs. In supervised learning, each batch in the training loop is drawn at random, which keeps the samples approximately independent and prevents the parameter updates from being biased toward a specific direction/class in the data. In RL, inputs are usually correlated. For example, in consecutive frames collected from a video game, most pixels barely change. Many samples therefore look alike, which can lead to poor learning and locally optimal solutions. A further problem is the non-stationarity of the target: the target keeps changing across episodes as the agent learns new behaviour from the environment and adapts to it.

Contribution

The authors propose the 'Deep Q Network' (DQN) learning algorithm with experience replay. This approach addresses both the correlated-input and the non-stationarity problems.

They use a CNN with a variant of the Q-learning algorithm and train it with stochastic gradient descent (SGD). While the agent navigates the environment, they store its transitions in a buffer called the 'experience replay'. During SGD training, samples drawn from this buffer form the mini-batches used to train the neural network (NN). They refer to this NN as a Q-network with parameters $ \theta $, trained by minimizing a sequence of loss functions $ L_i (\theta_i) $:

$ L_i(\theta_i) = \mathbb{E}_{s,a \sim \rho(\cdot)} \left[ (y_i - Q(s, a; \theta_i))^2 \right] $

where $ y_i = \mathbb{E}_{s' \sim \mathcal{E}} \left[ r + \gamma \max_{a'} Q(s', a'; \theta_{i-1}) \mid s, a \right] $ is the target for iteration $i$.
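The replay buffer and the loss above can be sketched in plain Python. This is an illustration under stated assumptions, not the authors' implementation: the names `ReplayBuffer`, `td_target`, and `minibatch_loss` are hypothetical, the discount value is arbitrary, a lookup table stands in for the Q-network, and the target is set to the reward alone at terminal states, as in the paper's algorithm listing.

```python
import random
from collections import deque

GAMMA = 0.99  # discount factor (illustrative value)


class ReplayBuffer:
    """Fixed-capacity store of (s, a, r, s_next, done) transitions."""

    def __init__(self, capacity):
        # Once the buffer is full, the oldest transition is dropped.
        self.transitions = deque(maxlen=capacity)

    def add(self, s, a, r, s_next, done):
        self.transitions.append((s, a, r, s_next, done))

    def sample(self, batch_size, rng=random):
        # Uniform sampling breaks up the correlation between consecutive steps.
        return rng.sample(list(self.transitions), batch_size)


def td_target(r, s_next, done, q_prev, actions):
    """y = r + gamma * max_a' Q(s', a'; theta_{i-1}); just r at a terminal state."""
    if done:
        return r
    return r + GAMMA * max(q_prev(s_next, a) for a in actions)


def minibatch_loss(batch, q, q_prev, actions):
    """Mean of (y - Q(s, a; theta_i))^2 over a sampled mini-batch."""
    errors = [
        (td_target(r, s_next, done, q_prev, actions) - q(s, a)) ** 2
        for s, a, r, s_next, done in batch
    ]
    return sum(errors) / len(errors)


# Toy usage: a lookup table stands in for the Q-network (hypothetical values).
q_table = {("s0", 0): 1.0, ("s0", 1): 2.0, ("s1", 0): 0.5, ("s1", 1): 3.0}


def q(state, action):
    return q_table[(state, action)]


buffer = ReplayBuffer(capacity=1000)
buffer.add("s0", 0, 1.0, "s1", False)
buffer.add("s1", 1, 0.0, "s0", True)
batch = buffer.sample(batch_size=2)
loss = minibatch_loss(batch, q=q, q_prev=q, actions=[0, 1])
```

In this toy usage `q` and `q_prev` are the same function. In training, `q_prev` would evaluate the network with the previous iteration's parameters, and the gradient of the loss would be taken with respect to the current parameters only.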

They use the parameter values from the previous iteration ($ \theta_{i-1} $) to calculate the target ($y_i$). These parameters are held fixed while the loss $ L_i(\theta_i) $ is optimized, so the target does not move during the update and training is smoother. They also feed a concatenation of four video frames as input to the CNN to reduce the partial-observability constraint on learning. From four frames, the CNN can infer the direction and speed of moving objects.
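The four-frame input can be sketched with a bounded queue. The class name `FrameStack` is hypothetical, and repeating the first frame at the start of an episode is an assumption made here so that the stack is always full; frames are shown as opaque values, whereas a real pipeline would hold image arrays.

```python
from collections import deque


class FrameStack:
    """Keeps the most recent k frames and returns them as one observation."""

    def __init__(self, k=4):
        self.k = k
        self.frames = deque(maxlen=k)

    def reset(self, first_frame):
        # Assumption: at episode start, repeat the first frame to fill the stack.
        self.frames.clear()
        for _ in range(self.k):
            self.frames.append(first_frame)
        return self.observation()

    def push(self, frame):
        # Appending to a full deque drops the oldest frame automatically.
        self.frames.append(frame)
        return self.observation()

    def observation(self):
        # Oldest frame first, newest frame last.
        return list(self.frames)


stack = FrameStack(k=4)
obs = stack.reset("frame0")
obs = stack.push("frame1")  # ["frame0", "frame0", "frame0", "frame1"]
```

Because the observation always spans several time steps, differences between neighbouring frames carry the motion information that a single frame lacks.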

DQN is trained on Atari 2600 games. The observations are video frames from the emulator, and the agent acts through discrete actions (up, down, left, right, ...). The network consists of two convolutional layers and two fully connected layers. The last layer outputs an estimated Q-value for each possible action.
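To make the layer sizes concrete, the arithmetic below traces the spatial dimensions through the two convolutional layers. The specific numbers (84x84 input, 16 filters of 8x8 with stride 4, 32 filters of 4x4 with stride 2) are taken from the original DQN paper rather than from this summary, and the helper names are hypothetical. The `greedy_action` helper assumes the final layer yields one value per action and picks the largest.

```python
def conv_output_size(size, kernel, stride):
    """Spatial output size of a convolution without padding."""
    return (size - kernel) // stride + 1


def greedy_action(q_values):
    """Index of the action with the highest estimated value."""
    return max(range(len(q_values)), key=lambda a: q_values[a])


side = 84                                            # input is 84 x 84 x 4
side1 = conv_output_size(side, kernel=8, stride=4)   # 20 -> 20 x 20 x 16
side2 = conv_output_size(side1, kernel=4, stride=2)  # 9  -> 9 x 9 x 32
flat_features = side2 * side2 * 32                   # 2592 inputs to the first FC layer

print(side1, side2, flat_features)  # 20 9 2592
```

The flattened features feed the first fully connected layer, and the second fully connected layer has one output unit per action available in the game.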

Sijan Bhandari on