##### Motivation¶

Deep Learning (DL) has proven to work well when we have large amount of data. Unline supervised DL algorithm setup, Reinforcement Learning (RL) doesn't have direct access to the targets/labels. RL agent usually get "delayed and sparsed" rewards as a signal to understand about the environment and learn policy for a given environment. Another challenge is about the distribution of the inputs. In supervised learning, each batch in training loop is drawn randomly which make sure each inputs/samples are independent and the parameter updates won't overfit to some specific direction/class in the data. In case of RL, inputs are usually correlated. For example, when you collect image inputs/frames of video of games, their pixel positions won't change much. Therefore, many samples will look alike and this might lead to poor learning and local optimal solution. Another problem is the non-stationarity of the target. The target will be changing throughout the episodes when the agent learns new behaviour from the environment, or adopting well.

##### Contribution¶

Authors proposed 'Deep Q Network' (DQN) learning algorithm with experience replay. This approach solves both the correlated inputs and non-stationarity problems.

They uses CNN with a variant of Q-learning algorithm, and uses stochastic gradient descent (SGD) for the training. They maintained a buffer named - 'Experience Replay' of the transitions while the agent nagivates through the environment. While SGD training process, samples from this stored buffer is used to create mini-batches and used for the training of the NN. This refer this NN as Q-network with parameter, $\theta$, which minimizes the sequences of loss functions $L_i (\theta_i)$ :

$L_i(\theta_i)$ = $\mathbb{E_{s,a \sim \rho(.)}} [ (y_i - Q(s, a; \theta_i)^2 ]$

Where $y_i = \mathbb{E_{s' \sim \varepsilon}} [ r + \gamma \underset{a'} max (s', a', \theta_{i-}) | s,a]$

is the target for iteration i.

They used the previous iteration parameter value ($\theta_{i-1}$) in order to calculate the target ($y_i$). The parameter ($\theta_{i-1}$) from previous iteration won't change for some long future iterations, which makes it stationary and training will be smooth. They also feed concatenation of four video frames as an input to the CNN in order to avoid the partial observation contraints in the learning. Using four frames, CNN will be able locate the movement direction, speed of the objects in the frames.

DQN is used to train on Atari 2600 games. The video frames from emulator are the observations based on discrete actions (up, down, left, rigth..) of the agent in the environment. The network consists of two convolutional layers and two fully connected layers. The last layer outputs the distribution over possible actions.

In [ ]:


Sijan Bhandari on
##### Motivation¶

Reinforcement Learning (RL) solves the problem of learning through experiments in the (dynamic) environments. The learner objective is to find an optimal policy which can guide the agent for the nagivation. This optimal policy is formulated in terms of maximizing future reward of the agent. Value-function $V_{\pi} (s)$ and action-value function $Q_{\pi}(s,a)$ are the measure of potential future rewards.

• $V_{\pi} (s)$ : Goodness measure to be in a state s and then following policy $\pi$
• $Q_{\pi}(s,a)$ : Goodness measure to be in a state s, perform action a and then follow policy $\pi$

NOTE

• Both $V_{\pi} (s)$ and $Q_{\pi}(s,a)$ are related to rewards in terms of expectation of the discounted future reards and their values are maintained on a lookup table.
• Goal : We want to find the value of (state) or (state,action) in a given environment, so that the agent can follow an optimal path, collecting maximum rewards.

In a large scale RL problem, maintaining lookup table will lead to the the problem of 'curse of dimensionality'. Currently, this problem is solved using function approximation. The function approximation tries to generalize the estimation of value of state or state-action value based on a set of features in a given state/observations. Most of the existing approaches follow the idea of approximating the value function and then deriving policy out of it. Authors have pointed out two major limitations of this approach:

a. This approach focused towards finding deterministic policy, which might not be the case for complex problems/environments. b. Small variation in the value estimation might cause different action selection; derived policy is sensitive.

##### Contribution¶

Authors proposed an alternative way to approximate policy directly using parameterized function. So, we won't be storing any Q-values in a table, but, learnt using a function approximator. For an example, the policy can be represented by a Neural Network (NN) where we can feed state as input and get probability distribution for action selection as output. Considering $\theta$ as parameters of the NN, representing the policy and $\rho$ as its performance measure (which can be a loss function), then the parameter $\theta$ will be updated as:

$\theta_{t+1} \gets \theta_t + \alpha \frac{\partial {\rho}}{ \partial{\theta}}$

For any Markov Decision Process (MDP),

$\nabla_{\theta} J(\theta) = \frac{\partial {\rho(\pi)}}{ \partial{\theta}} = \underset{s} \sum d^{\pi} (s) \underset{a} \sum \frac{\partial{\pi(s,a)}}{\partial(\theta)} Q^{\pi}(s,a)$ ----------(a)

Here $\rho(\pi)$ , the average rewards under current policy ($\pi$) and $d^{\pi}(s)$, stationary distribution of states under $\pi$

• The problem with the above formulation is 'how to get Q(s,a) ?' -> Q(s,a) must be estimated.

We can see that the state distribution is independent of policy parameter $\theta$. Since, gradient is independent of MDP dynamics, it allows model-free learning in RL. If we estimate the policy gradient using Monte-Carlo sampling, it will give REINFORCE algorithm.

In Monte-Carlo sampling, we take N trajectories using current policy $\pi$ and collect the returns. However, these returns hae high variance and we might need many episodes for the smooth convergence. The variance is introduced due to the fact that we won't be able to collect same trajectories multiple times(.i.e movement of agent is also dynamic) using out stochastic policies in the stochastic environment.

• QUESTION : How to estimate Q-value in equation (a) ?

Authors used a function approximation $f_w (s,a)$ with parameters 'w' to estimate $Q^{\pi} (s,a)$ as :

$\nabla_{\theta} J(\theta) = \underset{s} \sum d^{\pi(s)} (s) \underset{a} \sum \frac{\partial{\pi(s,a)}}{\partial(\theta)} f_w(s,a)$ --------- (b)

Here $f_w(s,a)$ is learnt by following $\pi$ and updating 'w' by minimizing mean-square error between Q-values $[ Q^{\pi}(s,a) - f_w(s,a) ]^2$. The neural network/policy will predict some Q-value and also when agent take some action in the environment, we predict Q-value for a given state/action. Algorithm will try to make sure difference between these two remains as close as possible.

The resulting formulation (b) gives the idea of actor-critic architecture for RL where

i. $\pi(s,a)$ is the actor which is learning to approximate the policy by maximixing (b)

ii. The critic $f_w(s,a)$ learning to estimate the policy by minimizing MSE with estimated and true Q-values.

In [ ]:


Sijan Bhandari on
##### Motivation¶

Deep Q learning, 'Vanilla' Policy Gradient, REINFORCE are the examples of approaches for function approximation in RL. When it comes to RL, robustness and sample efficiency are the measures that defines effectivenss of the applied algorithm.

In RL formulation, the agent needs to solve a task/problem in an envronment. Agent countinously interacts with the environment, which provides rewards to the agent, and thereafter agent learns a policy to navigate and tackle the problem. In every time step, RL agent has to make a decision by selecting preferrable action. To do so, agent fully relies on the information of current state and accumulated knowledge (history of rewards) up to current time step. Once the action is performed, the next state/observation is defined by some (stochastic) transition probability model. Also, reward will be signaled to the agent based on this new state information and performed action to get there. In overall, the goal of the agent is to maximize expected cumulative rewards.

In terms of RL, this goal can be formulated as finding a policy $\pi (a_t | s_t)$ such that expected reward $\mathbb{E_{\pi_{\theta, \tau}}} [G_t]$ is maximized.

In high dimensional/ countinous action space, policy gradient method can be used to solve this problem. In "vanilla" policy gradient method, the policy is parameterized by some parameter $\theta$ .i.e parametric policy $\pi_{\theta} (a_t | s_t)$ and we directly optimize the policy by finding the optimal parameter $\theta$.

Even though 'Vanilla' policy gradient/ REINFORCE are simple/easier to implement, they come with some learning issues:

PROBLEM : Usually give rise to high variance while estimating gradient. This is because, the objective function

$J(\theta) = \mathbb{E_{\pi_{\theta, \tau}}} (G_t)$

contains expectation; so we can't directly compute the exact gradient. We use stochatic gradient estimates such as REINFORCE based on some batch of trajectories. This sampling approximation adds some variance. That means, we need large number of trajectories to get the best estimation. In addition, we can see that collecting trajectories could be a problem in complex environments-might take long time to run.

##### Contribution¶

Authors have introduced a family of policy optimization methods which is build up on the basis of work of Trust-Region Policy optimization. Two main ideas:

a. Clipped Surrogate Objective Function : It avoids large deviations of learned policy $\pi_{\theta}$ from old policy $\pi_{\theta old}$; which is formulated as:

$L^{clip}(\theta) = \mathbb{E}_t [ min (r_t (\theta) \hat{A_t}, clip(r_t(\theta), 1 - \epsilon, 1 + \epsilon) \hat{A_t}) ]$

Here $r_t(\theta) = \frac{\pi_{\theta}(a_t | s_t)}{\pi_{\theta old}(a_t | s_t)}$

$\epsilon$ is hyperparameter, which restricts the new policy from being too far from old policy. $\hat{A_t}$ can be discounted return or advantage function.

The clipping ensures the updates take place in "trust-region". Also, introduces less variance than vanilla gradient methods

b. Multiple Epochs for policy update:

PPO allows to run multiple epochs on the same trajectories and optimize the objective $L^{clip}(\theta)$. This also reduces the sample inefficieny while learning. In order to collect data, PPO runs policy with parallel actors and then samples mini-batches of the data for training k-epochs using the objective function above.

Let's observe the behaviour of the objective function based on changes in advantage function.

CASE I : When $\hat{A_t}$ is +ve :

The objective function can be written as :

$L^{clip} (\theta) = min ( r_t(\theta), 1 + \epsilon ) \hat{A_t}$

Since $\hat{A_t}$ is +ve, when the action occurrence likelihood increases (i.e. $\pi_{\theta}(a_t | s_t)$), the whole objective value will also increase. The min operator limits the increasing objective value. So, when $\pi_{\theta}(a_t | s_t) ) > (1 + \epsilon) \pi_{\theta old}(a_t | s_t)$, ceiling occurs at $(1 + \epsilon) \hat{A_t}$

CASE II : When $\hat{A_t}$ is -ve :

The objective function can be written as :

$L^{clip} (\theta) = max ( r_t(\theta), 1 - \epsilon ) \hat{A_t}$

Now, the objective function value only increases when the likelihood of the action is less likely (i.e. $if \pi_{\theta}(a_t | s_t)$ decreases, $L^{clip} (\theta)$ increases ) In this case, when $(1 - \epsilon) \pi_{\theta old}(a_t | s_t) > \pi_{\theta}(a_t | s_t) )$, max operator limits the value at $(1 - \epsilon) \pi_{\theta old}(a_t | s_t) \hat{A_t}$

In [ ]:


Sijan Bhandari on

Summary of the paper "Learning What Data to Learn"

##### Motivation¶

The performance of learning algorithms based on Machine Learning or Deep Learning rely on amount of training data. Having more data points also has benefit of learning more generalized models and avoiding overfitting. However, collecting data is a painstalking work. Instead, we can learn automatic and adaptive data selection in the training process and make learning faster with minimal data points.

##### Contribution¶

In this paper, authors have introduced Neural Data Filter (NDF) as an adaptive framework which can learn data selection policy using deep reinforcement learning(DRL) algorithm 'Policy Gradient'. Two important aspects of this framework:

a. NDF filter the data instances from randomly fetched mini-batches of data during training process. b. Training loop provides feedback to NDF policy based on reward signal (e.g. calculated in validation set) and NDF policy is trained using DRL.

###### NDF in detail¶

NDF is designed to filter out some portion of training data based on some quality measure. The filtered high-quality data points speed up the convergence of the model.

In order to formulate Markov Decision Process (MDP) in NDF, authors used 'SGD-MDP' with following tuple: <s, a, P, r, $\gamma$>

• s : representing mini-batch data and current state of training model (weights/biases) as a state
• a : binary filtering actions; $a = {\{a_m\}}_{m=1}^M \in (0, 1)^M$, M-batch size and $a_m \in \{0,1\}$ indicating whether a particular data instance in minibatch will be selected or not.
• P : P(s| s, a) is a transition probability
• r = r(s,a), reward signal based on performance of the current model under consideration (e.g. validation accuracy),
• $\gamma \in [0,1]$, discounting factor

The NDF policy A(s,a, $\Theta$) can be represented by a binary classification algorithm such as logistic regression or deep NN, where $\Theta$ is policy parameter and it is updated as:

$\Theta \gets \Theta + \alpha V_t \sum_m \frac{\partial log P_{\Theta} (a_m|s_m)}{\partial \Theta}$

and, $V_t$ is the sampled estimation of reward $R(s_t, a_t)$ from one episode.

In [ ]:


Sijan Bhandari on
##### Motivation¶

The policy learning process in Reinforcement Learning (RL) is usually suffered due to delayed/sparse rewards. Reward is a direct signal for an agent to evaluate 'how good the current action selection is'. Since reward collection takes time, learning optimal policy also takes longer time to derive. Another factor that influence the learning process is human-designed reward function. These reward functions might not represent the optimal guidance for learning of the agent or won't be scalable to real world problems. We need a way to overcome reward sparsity and also improves exploration of the agent to make learning more robust.

Human learning process is not only guided by the final goal or achievement, but also driven by motivation or curiosity of the being. Curiosity adds exploratory behaviour to the agent, allowing it to acquire new skills and gain new knowledge about the environment. It also makes agent robust to perform actions which ultimately reduces uncertaintly on it's behaviours to capture the consequences of it's own action.

##### Contribution¶

The authors, in this paper, proposed curiosity-driven learning by uing agent-intrinsic reward (.i.e a reward which is learnt by agent itself by understanding the current environment or possible changes in the states while navigation). In order to quantify curiosity, they have introduced "Intrinsic Curiosity Module".

###### Intrinsic Curiosity Module (ICM)¶

The output of ICM is the state prediction error, which serves as reward for curiosity. This module has two sub-components, each represented by neural networks.

a. Inverse Model :

This model learns feature space using self-supervision. This new feature space is learnt in order to avoid features/information which are irrelevant to the agent while nagivation. Learning feature space is completed within two sub-modules:

i) First module encodes the raw input state ($s_t$) into a feature vector ($\phi(s_t)$) ii) Second module takes $\phi(s_t)$ and $\phi(s_{t+1})$) as encoded feature inputs and predicts action $\hat{a_t}$ that agent might take to go to $s_{t+1}$ from $s_t$

$\hat{a_t} = g( \phi(s_t), \phi(s_{t+1}), \theta_i )$

Here function g represents NN and $\hat{a_t}$ is estimated action. The learnable parameters $\theta_i$ are trained with loss function representing difference between predicted action and actual action. i.e $L_I( \hat{a_t}, a_t)$

b. Forward Model :

This is a NN which predicts the next state ($s_{t+1}$) with inputs $\phi(s_t)$ and action executed at $s_t$.

$\hat{\phi(s_{t+1})} = f( \phi(s_t), a_t, \theta_F)$

$\hat{\phi(s_{t+1})}$ is the predicted estimation of $\phi(s_{t+1})$ and $\theta_F$ represents trainable parameters, with loss function as:

$L_F ( \phi(s_{t+1}), \hat{\phi(s_{t+1})}) = \frac{1}{2} || \hat{\phi(s_{t+1})} - \phi(s_{t+1}) ||^2 = \eta L_F$

Both losses can be jointly expressed as :

$\underset{\theta_i, \theta_F} {max} [ (1-\beta) L_I + \beta L_F ]$

NOTE:

ICM worked with two connected modules - inverse model (which learnt the feature representation of state and next state) and forward model ( which predicts the feature representation of the next state) Curiosity can be calculated by the difference between output of forward model i.e $\hat{\phi(s_{t+1})}$ and output of the inverse model $\phi(s_{t+1})$.

In [ ]:


Sijan Bhandari on
##### Motivation¶

Deep Neural Network (DNN) is introduced to Reinforcement Learning (RL) framework in order to make function approximation easier/scable for large state-space problems. DNN itself suffers from overfitting because of the correlated data while nagivating through the environments (e.g. when we play a game, each consecutive moves of a player withing smaller timeframes looks similar, which won't contribute much for learning). In order to avoid it, people started using experience replay, where we have to store navigation experience (e.g. screenshots in games) as a buffer and we can use them later while training/updating policy/model parameters.

This works well, but only for off-policy algorithms like Q-learning. How to use on-policy algorithms like SARSA and make it stable learning using DNN ? Also, using experience replay introduces extra memory requirements/ computatonal delay for each update and real interaction with environment.

**NOTE :

• On-policy : The training data is generated by the same policy being trained. e.g : Reinforce
• Off-Policy : The training data generated from another policy can be used to train the current policy. e.g : Q-learning
##### Contribution¶

In this paper, authors introduced an asynchronous training process by executing multiple agents in parallel in different instances of the same environment using multiple CPU cores. It uses multithreading to run those agents and update the global model parameters asynchronously in online fashion. It is reported that this approach enables stable learning and faster convergence speed.

They have introduced asynchronous variants of SARSA, 1-step/n-step Q learning and advantage actor-critic algorithm. Let's discuss some details on Asynchronous Q-learning and Async. Advantage Actor critic (A3C) algorithms.

###### Asynchronous Q-Learning¶

In Deep Q-Learning (DQN), the neural network (NN) approximates the Q-value function Q(s,a, $\theta$) with loss formulated as: $$$L_{i} (\theta_i) = \mathbb{E} [ r + \gamma Q (s^, a^, \theta_{i-1}) - Q(s, a, \theta_i) ]^2 ..........(i)$$$

In Async 1-step Q learning, each thread maintains it's own copy of environment and agent traverse through the environment with the help of $\epsilon$ - greedy policy. At each step, we compute teh gradient of the loss (i) and collect gradients over multiple timesteps before updating the parameters.

Actor-critic method combines both value-pased and policy based methods.

It has a policy $\pi(a_t | s_t; \theta)$ and value function $V (s_t; \theta_t)$ to be learnt. It uses "forward-view", i.e. selecting actions based on its exploration strategy $\pi (a_t | s_t; \theta)$ up to some $t_{max}$ steps in the future, to collect up to $t_{max}$ rewards since last update.

Now, policy and value functions are updated after every $t_max$ actions as:

• Policy Network : $\bigtriangledown_{\theta} log \pi(a_t | s_t; \theta) (R_t - V(s_t, \theta_v))$
• Value Network : $\bigtriangledown_{v} (R - V(s_t; \theta_v))^2$

In this learning framework, parallel actor-learners updates a shared model and make learning process more robust and stable

In [ ]:


Sijan Bhandari on
##### Motivation¶

Humans can easily adjust themselves or adopt to new environments and learn new tasks by setting their own goals. In case of Reinforcement Learning framework, we have to manually design the reward function which gives an orientation towards the goal of a given task. For example, if we have to train a robot to pick a package and deliver to a destination, we have to set reward based on its distance-covered. Along with delivery task, there might be other tasks like adjusting robot-arm to pick the package based on it's shape/size or placing the package at the destination without throwing it on the ground. For each of these tasks, we can design specific-reward functions.But, it won't be practical or scalable for real-world problems where an agent has to solve many tasks synchronously.

##### Contribution¶

Authors proposed a reinforcement learning framework where an agent can learn general-purpose goal-conditioned polices by setting it's own synthetic goals and learning tasks to achieve those goals, without human intervention.

They referred this framework as "reinforcement with imagined goals" (RIG).

###### Synthetic Goals¶

Initially, the agent itself generate a set of synthetic goals by exploration through a random policy. Both state observations and goals are the image data (for example in case of robot navigation). By random policy, agent executes some random actions in the environment and the trajectories consisting of state observations are stored for later use.

During policy training phase, agent can randomly fetch those stored observations as a set of initial states or set of goals.

Now, we have all the information to train a goal-conditioned agent. Authors used Q-learning agent - Q(s,a,g), where s - states, a- actions and g-goals to be achieved by executing action 'a'. And, the optimal policy can be derived as : $\pi (s,g) = \underset{a} max Q(s,a, g)$

In order to train this policy, two main issues need to be addressed:

a. How to design reward function ? Distance between images while nagivation is one possible reward. But, pixel-wise distance won't carry semantic meaning of actual distance between states and this will be also computationally involved. b. How to represent the goal as a distribution so that we sample goals for the training?

Authorse resolved these issues by using Variational Autoencoders (VAE), to learn encoded representation of images. The VAE takes raw images (x) as input and generate low-dimensional latent representation (z). Using these latent representation, we have now latent states (z) and latent goals ($z_g$).

The working algorithm can be summarized as :

a. Initially, agent explores environment using random policy and the state observations will be stored.

b. VAE will be trained using raw images from (a) to learn latent representation of all state observations.

c. Initial states (z) and goals ($z_g$) are sampled from (b)

d. Goal-conditioned Q-function Q(z,a, $z_g$) is trained using data from (c) and policy $\pi_{\theta} (z, z_g)$ will be learnt in the latent space.

In [ ]:


In [ ]:


Sijan Bhandari on
In [30]:
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')

%matplotlib inline

####  making results reproducible
np.random.seed(42)
# utils import
from fuzzywuzzy import fuzz

# https://grouplens.org/datasets/movielens/latest/

In [4]:
# configure file path
DATA_FOLDER = '/home/lenovo/workspace/prepare/data/ml-latest-small/'
# data_path = os.path.join(DATA_FOLDER, 'MovieLens')
movies_filename = 'movies.csv'
ratings_filename = 'ratings.csv'
os.path.join(DATA_FOLDER, movies_filename),
usecols=['movieId', 'title'],
dtype={'movieId': 'int32', 'title': 'str'})

os.path.join(DATA_FOLDER, ratings_filename),
usecols=['userId', 'movieId', 'rating'],
dtype={'userId': 'int32', 'movieId': 'int32', 'rating': 'float32'})

In [5]:
# data = pd.read_csv("/home/lenovo/workspace/prepare/data/ml-latest-small/")

In [6]:
df_movies.head()

Out[6]:
movieId title
0 1 Toy Story (1995)
1 2 Jumanji (1995)
2 3 Grumpier Old Men (1995)
3 4 Waiting to Exhale (1995)
4 5 Father of the Bride Part II (1995)
In [7]:
df_ratings.head()

Out[7]:
userId movieId rating
0 1 1 4.0
1 1 3 4.0
2 1 6 4.0
3 1 47 5.0
4 1 50 5.0
In [8]:
from sklearn.neighbors import NearestNeighbors
model_knn = NearestNeighbors(metric='cosine', algorithm='brute', n_neighbors=20, n_jobs=-1)

In [11]:
num_users = len(df_ratings.userId.unique())
num_items = len(df_ratings.movieId.unique())
print('There are {} unique users and {} unique movies in this data set'.format(num_users, num_items))

There are 610 unique users and 9724 unique movies in this data set

In [12]:
df_ratings_cnt_tmp = pd.DataFrame(df_ratings.groupby('rating').size(), columns=['count'])
df_ratings_cnt_tmp

Out[12]:
count
rating
0.5 1370
1.0 2811
1.5 1791
2.0 7551
2.5 5550
3.0 20047
3.5 13136
4.0 26818
4.5 8551
5.0 13211
In [13]:
# there are a lot more counts in rating of zero
total_cnt = num_users * num_items
rating_zero_cnt = total_cnt - df_ratings.shape[0]
# append counts of zero rating to df_ratings_cnt
df_ratings_cnt = df_ratings_cnt_tmp.append(
pd.DataFrame({'count': rating_zero_cnt}, index=[0.0]),
verify_integrity=True,
).sort_index()
df_ratings_cnt

Out[13]:
count
0.0 5830804
0.5 1370
1.0 2811
1.5 1791
2.0 7551
2.5 5550
3.0 20047
3.5 13136
4.0 26818
4.5 8551
5.0 13211
In [14]:
# add log count
df_ratings_cnt['log_count'] = np.log(df_ratings_cnt['count'])
df_ratings_cnt

Out[14]:
count log_count
0.0 5830804 15.578665
0.5 1370 7.222566
1.0 2811 7.941296
1.5 1791 7.490529
2.0 7551 8.929435
2.5 5550 8.621553
3.0 20047 9.905835
3.5 13136 9.483112
4.0 26818 10.196829
4.5 8551 9.053804
5.0 13211 9.488805
In [15]:
ax = df_ratings_cnt[['count']].reset_index().rename(columns={'index': 'rating score'}).plot(
x='rating score',
y='count',
kind='bar',
figsize=(12, 8),
title='Count for Each Rating Score (in Log Scale)',
logy=True,
fontsize=12,
)
ax.set_xlabel("movie rating score")
ax.set_ylabel("number of ratings")

Out[15]:
Text(0, 0.5, 'number of ratings')
In [16]:
df_ratings.head()

Out[16]:
userId movieId rating
0 1 1 4.0
1 1 3 4.0
2 1 6 4.0
3 1 47 5.0
4 1 50 5.0
In [17]:
# get rating frequency
df_movies_cnt = pd.DataFrame(df_ratings.groupby('movieId').size(), columns=['count'])

Out[17]:
count
movieId
1 215
2 110
3 52
4 7
5 49
In [18]:
# plot rating frequency of all movies
ax = df_movies_cnt \
.sort_values('count', ascending=False) \
.reset_index(drop=True) \
.plot(
figsize=(12, 8),
title='Rating Frequency of All Movies',
fontsize=12
)
ax.set_xlabel("movie Id")
ax.set_ylabel("number of ratings")

Out[18]:
Text(0, 0.5, 'number of ratings')
In [19]:
df_movies_cnt['count'].quantile(np.arange(1, 0.6, -0.05))

Out[19]:
1.00    329.0
0.95     47.0
0.90     27.0
0.85     17.0
0.80     12.0
0.75      9.0
0.70      7.0
0.65      5.0
Name: count, dtype: float64
In [20]:
# filter data
popularity_thres = 50
popular_movies = list(set(df_movies_cnt.query('count >= @popularity_thres').index))
df_ratings_drop_movies = df_ratings[df_ratings.movieId.isin(popular_movies)]
print('shape of original ratings data: ', df_ratings.shape)
print('shape of ratings data after dropping unpopular movies: ', df_ratings_drop_movies.shape)

shape of original ratings data:  (100836, 3)
shape of ratings data after dropping unpopular movies:  (41360, 3)

In [21]:
# get number of ratings given by every user
df_users_cnt = pd.DataFrame(df_ratings_drop_movies.groupby('userId').size(), columns=['count'])

Out[21]:
count
userId
1 117
2 15
3 6
4 84
5 34
In [22]:
# plot rating frequency of all movies
ax = df_users_cnt \
.sort_values('count', ascending=False) \
.reset_index(drop=True) \
.plot(
figsize=(12, 8),
title='Rating Frequency of All Users',
fontsize=12
)
ax.set_xlabel("user Id")
ax.set_ylabel("number of ratings")

Out[22]:
Text(0, 0.5, 'number of ratings')
In [23]:
df_users_cnt['count'].quantile(np.arange(1, 0.5, -0.05))

Out[23]:
1.00    429.00
0.95    223.50
0.90    166.00
0.85    134.25
0.80    105.00
0.75     85.00
0.70     74.00
0.65     62.25
0.60     56.00
0.55     48.00
Name: count, dtype: float64
In [24]:
# filter data
ratings_thres = 50
active_users = list(set(df_users_cnt.query('count >= @ratings_thres').index))
df_ratings_drop_users = df_ratings_drop_movies[df_ratings_drop_movies.userId.isin(active_users)]
print('shape of original ratings data: ', df_ratings.shape)
print('shape of ratings data after dropping both unpopular movies and inactive users: ', df_ratings_drop_users.shape)

shape of original ratings data:  (100836, 3)
shape of ratings data after dropping both unpopular movies and inactive users:  (32999, 3)

In [25]:
# pivot and create movie-user matrix
movie_user_mat = df_ratings_drop_users.pivot(index='movieId', columns='userId', values='rating').fillna(0)
# create mapper from movie title to index
movie_to_idx = {
movie: i for i, movie in
enumerate(list(df_movies.set_index('movieId').loc[movie_user_mat.index].title))
}
# transform matrix to scipy sparse matrix
movie_user_mat_sparse = csr_matrix(movie_user_mat.values)

In [35]:
movie_user_mat_sparse

Out[35]:
<450x268 sparse matrix of type '<class 'numpy.float32'>'
with 32999 stored elements in Compressed Sparse Row format>
In [26]:
%env JOBLIB_TEMP_FOLDER=/tmp
# define model
model_knn = NearestNeighbors(metric='cosine', algorithm='brute', n_neighbors=20, n_jobs=-1)
# fit
model_knn.fit(movie_user_mat_sparse)

env: JOBLIB_TEMP_FOLDER=/tmp

Out[26]:
NearestNeighbors(algorithm='brute', leaf_size=30, metric='cosine',
metric_params=None, n_jobs=-1, n_neighbors=20, p=2,
radius=1.0)
In [36]:
def fuzzy_matching(mapper, fav_movie, verbose=True):
"""
return the closest match via fuzzy ratio. If no match found, return None

Parameters
----------
mapper: dict, map movie title name to index of the movie in data

fav_movie: str, name of user input movie

verbose: bool, print log if True

Return
------
index of the closest match
"""
match_tuple = []
# get match
for title, idx in mapper.items():
ratio = fuzz.ratio(title.lower(), fav_movie.lower())
if ratio >= 60:
match_tuple.append((title, idx, ratio))
# sort
match_tuple = sorted(match_tuple, key=lambda x: x[2])[::-1]
if not match_tuple:
print('Oops! No match is found')
return
if verbose:
print('Found possible matches in our database: {0}\n'.format([x[0] for x in match_tuple]))
return match_tuple[0][1]

def make_recommendation(model_knn, data, mapper, fav_movie, n_recommendations):
"""

Parameters
----------
model_knn: sklearn model, knn model

data: movie-user matrix

mapper: dict, map movie title name to index of the movie in data

fav_movie: str, name of user input movie

n_recommendations: int, top n recommendations

Return
------
list of top n similar movie recommendations
"""
# fit
model_knn.fit(data)
# get input movie index
print('You have input movie:', fav_movie)
idx = fuzzy_matching(mapper, fav_movie, verbose=True)
print('idx is', idx)
# inference
print('Recommendation system start to make inference')
print('......\n')
distances, indices = model_knn.kneighbors(data[idx], n_neighbors=n_recommendations+1)
# get list of raw idx of recommendations
raw_recommends = \
sorted(list(zip(indices.squeeze().tolist(), distances.squeeze().tolist())), key=lambda x: x[1])[:0:-1]
# get reverse mapper
reverse_mapper = {v: k for k, v in mapper.items()}
# print recommendations
print('Recommendations for {}:'.format(fav_movie))
for i, (idx, dist) in enumerate(raw_recommends):
print('{0}: {1}, with distance of {2}'.format(i+1, reverse_mapper[idx], dist))

In [37]:
my_favorite = 'Iron Man'

make_recommendation(
model_knn=model_knn,
data=movie_user_mat_sparse,
fav_movie=my_favorite,
mapper=movie_to_idx,
n_recommendations=10)

You have input movie: Iron Man
Found possible matches in our database: ['Iron Man (2008)']

idx is 419
Recommendation system start to make inference
......

Recommendations for Iron Man:
1: Batman Begins (2005), with distance of 0.3474416136741638
2: Sherlock Holmes (2009), with distance of 0.34635400772094727
3: Kung Fu Panda (2008), with distance of 0.3432350754737854
4: Inception (2010), with distance of 0.3307400345802307
5: District 9 (2009), with distance of 0.31877219676971436
6: Up (2009), with distance of 0.31706738471984863
7: WALL·E (2008), with distance of 0.27033132314682007
8: Avengers, The (2012), with distance of 0.26102906465530396
9: Avatar (2009), with distance of 0.25990235805511475
10: Dark Knight, The (2008), with distance of 0.24018973112106323

In [ ]:


Sijan Bhandari on

In this post, we will optimize our kNN implementation from previous post using Numpy and Numba.

For previous post, you can follow:

How kNN works ?

K-Nearest Neighbors Algorithm using Python and Scikit-Learn?

Out of sample accuracy estimation using cv in knn

Tuning Hyperparameter in kNN

In [1]:
import math
from collections import Counter
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

import warnings
warnings.filterwarnings('ignore')

import numba

%matplotlib inline

# making results reproducible
np.random.seed(42)


#### Teaser¶

Let us first see a small example. We will calculate the sum of inverse square root between 1 to 10000, using basic pure python method and using Numpy.

In [2]:
n = 10000

In [3]:
%timeit -n 100 sum([1./math.sqrt(i) for i in range(1,n)])

1.87 ms ± 48.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [4]:
%timeit -n 100 np.sum(1./np.sqrt(np.arange(1,n)))

52.7 µs ± 8.09 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [85]:
'Numpy vectorized calculation is {:.0f} times faster than pure python in this example'.format(1.87*1000/52.7)

Out[85]:
'Numpy vectorized calculation is 35 times faster than pure python in this example'

#### Using wine dataset for kNN¶

In [7]:
df = pd.read_csv(

df.columns = ['CLASS', 'ALCOHOL_LEVEL', 'MALIC_ACID', 'ASH', 'ALCALINITY','MAGNESIUM', 'PHENOLS',
'FLAVANOIDS', 'NON_FLAVANOID_PHENOL', 'PROANTHOCYANINS', 'COLOR_INTENSITY',
'HUE', 'OD280/OD315_DILUTED','PROLINE']

# Let us use only two features : 'ALCOHOL_LEVEL', 'MALIC_ACID' for this problem
df = df[['CLASS', 'ALCOHOL_LEVEL', 'MALIC_ACID']]

Out[7]:
CLASS ALCOHOL_LEVEL MALIC_ACID
0 1 14.23 1.71
1 1 13.20 1.78
2 1 13.16 2.36
3 1 14.37 1.95
4 1 13.24 2.59
In [8]:
# we are using 10% of the data for the testing purpose

train_sample_idx = np.random.choice(df.index, size=int(df.shape[0]*0.9), replace=False)
train_data, test_data = df.iloc[train_sample_idx], df.drop(train_sample_idx)

# get features and label from train/test data
X_train, y_train = train_data.drop('CLASS', axis=1), train_data['CLASS']
X_test, y_test = test_data.drop('CLASS', axis=1), test_data['CLASS']


#### 1. kNN using Python from scratch¶

In [9]:
def euclidean_distance(vector1, vector2):
'''calculate the euclidean distance, core python method
input: numpy.arrays or lists
return: euclidean distance
'''
dist = [(a - b)**2 for a, b in zip(vector1, vector2)]
dist = math.sqrt(sum(dist))
return dist

In [10]:
def predict_instance(inputs, labels, test_instance, k):
inputs['distance'] = inputs.apply(euclidean_distance, vector2=test_instance, axis=1)

# concatenate inputs and labels before sorting the distances
inputs = pd.concat([inputs, labels], axis=1)
# sort based on distance
inputs = inputs.sort_values('distance', ascending=True)
# pick k neighbors

# get list from dataframe column
classes = neighbors['CLASS'].tolist()

# create counter of labels
majority_count = Counter(classes)
return majority_count.most_common(1).pop()[0]

def knn(X_train, y_train, X_test, k):
"""
Calculate k-nn for given k.

"""
predictions = np.zeros(X_test.shape[0])
X_test.reset_index(drop=True, inplace=True)
for index, row in X_test.iterrows():
predictions[index] = predict_instance(X_train.copy(), y_train.copy(), row, k)
return predictions

In [11]:
# knn = KNN(3)
predictions = knn(X_train, y_train, X_test, 3)

In [12]:
true_values = y_test.to_numpy()
accuracy = np.mean(predictions == true_values)
accuracy

Out[12]:
0.7222222222222222

#### a. %timeit magic command to observe execution time for core python functions¶

In [13]:
# %%timeit
# -n execute the function 10 times in a loop
%timeit -n 10 knn(X_train, y_train, X_test, 3)

144 ms ± 11.3 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


#### b. Line profiling to see which functions/call has higher contribution on execution time¶

In [14]:
%load_ext line_profiler

In [15]:
# We are profiling function knn, we supply the name using -f
%lprun -f knn knn(X_train, y_train, X_test, 3)

Timer unit: 1e-06 s

Total time: 0.312758 s
File: <ipython-input-115-bd6cbd244598>
Function: knn at line 18

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
18                                           def knn(X_train, y_train, X_test, k):
19                                               """
20                                               Calculate k-nn for given k.
21
22                                               """
23         1         55.0     55.0      0.0      predictions = np.zeros(X_test.shape[0])
24         1        586.0    586.0      0.2      X_test.reset_index(drop=True, inplace=True)
25        19       3188.0    167.8      1.0      for index, row in X_test.iterrows():
26        18     308928.0  17162.7     98.8          predictions[index] = predict_instance(X_train.copy(), y_train.copy(), row, k)
27         1          1.0      1.0      0.0      return predictions

NOTE:

1. Time is in microseconds at mentioned at the top of the cell.
2. (Hits) shows number of times that particular line in code executed.
3. (Time) total microseconds for executing that line. (total amount of time)
4. (per Hit) = (Time)/(Hits), average time for a single execution.
5. (% Time) fraction of time spent on that line relative to total amount of time.

We can see the line number 26 is expensive one. We will improve it using numpy below.

#### 2. Improving kNN with numpy¶

In [54]:
def predict_instance_numpy(inputs, labels, test_instance, k):
# calculate L2 norm between all training points and given test_point
inputs['distance'] = np.linalg.norm(inputs.values-test_instance.values, axis=1)

# concatenate inputs and labels before sorting the distances
inputs = pd.concat([inputs, labels], axis=1)
# sort based on distance
inputs = inputs.sort_values('distance', ascending=True)

# pick k neighbors

# get list from dataframe column
classes = neighbors['CLASS'].tolist()

# create counter of labels
majority_count = Counter(classes)
return majority_count.most_common(1).pop()[0]

def knn_numpy(X_train, y_train, X_test, k):
"""
Calculate k-nn for given k.

"""
predictions = np.zeros(X_test.shape[0])
X_test.reset_index(drop=True, inplace=True)
for index, row in X_test.iterrows():
predictions[index] = predict_instance_numpy(X_train.copy(), y_train.copy(), row, k)
return predictions

In [55]:
# knn improved with distance calculation using np.linalg.norm
predictions = knn_numpy(X_train, y_train, X_test, 3)
true_values = y_test.to_numpy()
accuracy = np.mean(predictions == true_values)
accuracy

Out[55]:
0.7222222222222222

#### Observing execution time and line profiling after optimizing with numpy¶

In [56]:
# %%timeit
# -n execute the function 10 times in a loop
%timeit -n 10 knn_numpy(X_train, y_train, X_test, 3)

40.6 ms ± 2.41 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


### The total execution has reduced from 135 ms to 41.2 ms ¶

In [57]:
# We are profiling function knn, we supply the name using -f
%lprun -f knn_numpy knn_numpy(X_train, y_train, X_test, 3)

Timer unit: 1e-06 s

Total time: 0.094862 s
File: <ipython-input-121-a0da67624daf>
Function: knn_numpy at line 22

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
22                                           def knn_numpy(X_train, y_train, X_test, k):
23                                               """
24                                               Calculate k-nn for given k.
25
26                                               """
27         1         65.0     65.0      0.1      predictions = np.zeros(X_test.shape[0])
28         1        258.0    258.0      0.3      X_test.reset_index(drop=True, inplace=True)
29        19       3696.0    194.5      3.9      for index, row in X_test.iterrows():
30        18      90843.0   5046.8     95.8          predictions[index] = predict_instance_numpy(X_train.copy(), y_train.copy(), row, k)
31         1          0.0      0.0      0.0      return predictions

### The per Hit time has reduced from 17162.7 µs to 5046.8 µs ¶

#### 3. Improving kNN with numba¶

Numba does it's best work when we do operations over numpy arrays. For the implemention below, I have converted pandas dataframe to numpy array and perform relevant operations.

In [79]:
import operator
import numba

@numba.jit(nopython=True)
def euclidean_distance_numba(vector1, vector2):
'''calculate the euclidean distance,
'''
dist = np.linalg.norm(vector1-vector2)
return dist

@numba.jit(nopython=True)
def predict_instance_numba(inputs, labels, test_instance, k):
distances = np.zeros((inputs.shape[0], 1))
# calculate distance between test point and training points
for i in np.arange(inputs.shape[0]):
distances[i] = euclidean_distance_numba(inputs[i], test_instance)
labels = labels.reshape((labels.shape[0],1))

inputs = np.hstack((inputs, labels))

inputs = np.hstack((inputs, distances))

# sort based on distance column
inputs = inputs[inputs[:,3].argsort()]
# 2nd columns contains classes, select first k values
neighbor_classes = inputs[:, 2][:k]
counter = {}
for item in neighbor_classes:
if item in counter:
counter[item] = counter.get(item) + 1
else:
counter[item] = 1
counter_sorted = sorted(counter)
return counter_sorted[0]

# @numba.jit(nopython=True)
def knn_numba(X_train, y_train, X_test, k):
"""
Calculate k-nn for given k.

"""
predictions = np.zeros(X_test.shape[0])
for i in np.arange(X_test.shape[0]):
predictions[i] = predict_instance_numba(X_train.copy(), y_train.copy(), X_test[i], k)
return predictions

In [80]:
# knn improved with distance calculation using np.linalg.norm
predictions = knn_numba(X_train.values, y_train.values, X_test.values, 3)
true_values = y_test.to_numpy()
accuracy = np.mean(predictions == true_values)
accuracy

Out[80]:
0.7222222222222222

#### Observing execution time and line profiling after optimizing with numba¶

In [81]:
# %%timeit
# -n execute the function 10 times in a loop
%timeit -n 10 knn_numba(X_train.values, y_train.values, X_test.values, 3)

1.65 ms ± 362 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)


### The total execution has reduced from 41.2 ms to 1.65 ms ¶

In [84]:
# We are profiling function knn_numba, we supply the name using -f
%lprun -f knn_numba knn_numba(X_train.values, y_train.values, X_test.values, 3)

Timer unit: 1e-06 s

Total time: 0.002649 s
Function: knn_numba at line 36

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
36                                           def knn_numba(X_train, y_train, X_test, k):
37                                               """
38                                               Calculate k-nn for given k.
39
40                                               """
41         1         11.0     11.0      0.4      predictions = np.zeros(X_test.shape[0])
42        19         44.0      2.3      1.7      for i in np.arange(X_test.shape[0]):
43        18       2593.0    144.1     97.9          predictions[i] = predict_instance_numba(X_train.copy(), y_train.copy(), X_test[i], k)
44         1          1.0      1.0      0.0      return predictions

### The per Hit time has reduced from 5046.8 µs 144.1 µs¶

In [ ]:



In this post, we will go through an approach to get optimal/tuned model for the final prediction. First, we will see how to select best 'k' in kNN using simple python example. We will then jump to using sklearn apis to explore different options for hyperparameter tuning.

For previous post, you can follow:

How kNN works ?

K-Nearest Neighbors Algorithm using Python and Scikit-Learn?

Out of sample accuracy estimation using cv in knn

In [1]:
import os
import math
from collections import Counter
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')

%matplotlib inline

# making results reproducible
np.random.seed(42)

In [2]:
df = pd.read_csv(

df.columns = ['CLASS', 'ALCOHOL_LEVEL', 'MALIC_ACID', 'ASH', 'ALCALINITY','MAGNESIUM', 'PHENOLS',
'FLAVANOIDS', 'NON_FLAVANOID_PHENOL', 'PROANTHOCYANINS', 'COLOR_INTENSITY',
'HUE', 'OD280/OD315_DILUTED','PROLINE']

# Let us use only two features : 'ALCOHOL_LEVEL', 'MALIC_ACID' for this problem
df = df[['CLASS', 'ALCOHOL_LEVEL', 'MALIC_ACID']]

Out[2]:
CLASS ALCOHOL_LEVEL MALIC_ACID
0 1 14.23 1.71
1 1 13.20 1.78
2 1 13.16 2.36
3 1 14.37 1.95
4 1 13.24 2.59

#### 1. kNN and Cross validation using Python from Scratch¶

In [3]:
class KNN:
def __init__(self, K):
self.K = K
self.X_train = None
self.y_train = None

def fit(self, X_train, y_train):
self.X_train = X_train
self.y_train = y_train

def predict_instance(self, test_instance):
inputs = self.X_train.copy()
# calculate L2 norm between all training points and given test_point
inputs['distance'] = np.linalg.norm(inputs.values-test_instance.values, axis=1)

# concatenate inputs and labels before sorting the distances
inputs = pd.concat([inputs, self.y_train], axis=1)

# sort based on distance
inputs = inputs.sort_values('distance', ascending=True)

# pick k neighbors

# get list from dataframe column
classes = neighbors['CLASS'].tolist()

# create counter of labels
majority_count = Counter(classes)

return majority_count.most_common(1).pop()[0]

def predict(self, X_test):
predictions = np.zeros(X_test.shape[0])
# we want out index to be start from 0
X_test.reset_index(drop=True, inplace=True)
for index, row in X_test.iterrows():
predictions[index] = self.predict_instance(row)
return predictions

def cross_validation(n, k, data, n_neighbors):
"""
n : # iterations
k : k-fold size
data: training data
n_neighbors: k in knn
"""
accuracies = []

for _ in range(0, n):
# data shuffle
data.sample(frac=1)

fold=int(data.shape[0]/k)

for j in range(k):
test = data[j*fold:j*fold+fold]
train = data[~data.index.isin(test.index)]
X_train, y_train = train.drop('CLASS', axis=1), train['CLASS']
X_test, y_test = test.drop('CLASS', axis=1), test['CLASS']

knn = KNN(n_neighbors)
knn.fit(X_train, y_train)

predictions = knn.predict(X_test)
true_values = y_test.to_numpy()
accuracy = np.mean(predictions == true_values)

accuracies.append(accuracy)
return np.array(accuracies).mean()

In [4]:
# We will be using following settings for all the cases below
k_values = np.arange(1, 16)
cross_validation_fold = 10
accuracies = []


#### 2. Finding optimal k value for kNN¶

In [5]:
for k in k_values:
# run cross-validation with given neighbor size k
accuracy = cross_validation(1, cross_validation_fold, df, k)
accuracies.append(accuracy)
print(accuracies)

[0.6411764705882353, 0.6647058823529411, 0.7470588235294118, 0.7529411764705881, 0.7588235294117647, 0.7529411764705882, 0.7529411764705881, 0.7294117647058822, 0.7823529411764707, 0.7647058823529412, 0.7705882352941177, 0.7764705882352941, 0.7705882352941176, 0.7529411764705881, 0.7411764705882352]

In [6]:
fig = plt.figure()
plt.plot(k_values, accuracies)
plt.xlabel('k in kNN')
plt.ylabel('CV-Accuracy')
fig.suptitle('kNN hyperparameter (k) tuning with python alone', fontsize=20)

Out[6]:
Text(0.5, 0.98, 'kNN hyperparameter (k) tuning with python alone')`

We can see that k=9 seems a good choice for our dataset.

Sijan Bhandari on #kNN,