Summary of the paper "Learning What Data to Learn"

##### Motivation

The performance of machine learning and deep learning algorithms depends on the amount of training data. More data points also help a model generalize and avoid overfitting. However, collecting data is painstaking work. Instead, we can learn an automatic, adaptive data selection policy during training, so that the model learns faster from fewer data points.

##### Contribution

In this paper, the authors introduce the Neural Data Filter (NDF), an adaptive framework that learns a data selection policy using the deep reinforcement learning (DRL) algorithm 'Policy Gradient'. Two aspects of this framework are important:

a. NDF filters data instances from the randomly fetched mini-batches of data during the training process.

b. The training loop provides feedback to the NDF policy through a reward signal (e.g. calculated on the validation set), and the NDF policy is trained using DRL.

###### NDF in detail

NDF is designed to filter out a portion of the training data based on some quality measure. Training only on the high-quality data points that pass the filter speeds up the convergence of the model.

To formulate a Markov Decision Process (MDP) for NDF, the authors use 'SGD-MDP', defined by the tuple $ \langle s, a, P, r, \gamma \rangle $:

- s : the state, consisting of the mini-batch data and the current state of the model being trained (weights/biases)
- a : binary filtering actions; $ a = {\{a_m\}}_{m=1}^M \in \{0, 1\}^M $, where M is the batch size and $ a_m \in \{0,1\} $ indicates whether a particular data instance in the mini-batch is selected or not
- P : $ P(s' \mid s, a) $, the transition probability
- r : $ r = r(s,a) $, the reward signal based on the performance of the current model under consideration (e.g. validation accuracy)
- $ \gamma \in [0,1] $ : the discount factor
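
To make the action space concrete, here is a minimal sketch of how binary filtering actions could be sampled for one mini-batch. It assumes a logistic-regression policy over per-instance feature vectors. The function `sample_filter_actions`, the feature matrix, and the zero-initialized parameters are hypothetical and are not taken from the paper.

```python
import numpy as np

def sample_filter_actions(features, theta, bias, rng):
    """Sample a_m in {0, 1} for each of the M instances in a mini-batch.

    features: (M, d) array, one hypothetical state feature vector s_m per instance.
    theta, bias: parameters of an assumed logistic-regression policy.
    """
    logits = features @ theta + bias
    keep_prob = 1.0 / (1.0 + np.exp(-logits))  # P(a_m = 1 | s_m)
    # Bernoulli draw per instance: 1 = keep the instance, 0 = filter it out.
    actions = (rng.random(keep_prob.shape) < keep_prob).astype(np.int64)
    return actions, keep_prob

rng = np.random.default_rng(0)
M, d = 8, 3                          # batch size and feature dimension (illustrative)
features = rng.normal(size=(M, d))   # stand-in for real state features
theta = np.zeros(d)                  # untrained policy: every keep probability is 0.5
actions, keep_prob = sample_filter_actions(features, theta, 0.0, rng)
selected = features[actions == 1]    # only these instances would reach the SGD step
```

With an untrained (all-zero) policy each instance is kept with probability 0.5; training the policy shifts these probabilities per instance.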

The NDF policy $ A(s, a; \Theta) $ can be represented by a binary classifier such as logistic regression or a deep neural network, where $ \Theta $ denotes the policy parameters, which are updated as:

$ \Theta \gets \Theta + \alpha V_t \sum_m \frac{\partial \log P_{\Theta} (a_m \mid s_m)}{\partial \Theta} $

where $ \alpha $ is the step size and $ V_t $ is the sampled estimate of the reward $ R(s_t, a_t) $ from one episode.
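
As an illustration of this update rule, the sketch below applies one policy-gradient step for an assumed logistic-regression policy without a bias term. For a Bernoulli policy with $ P_{\Theta}(a_m = 1 \mid s_m) = \sigma(\Theta^\top s_m) $, the gradient of the log-probability is $ (a_m - p_m)\, s_m $. The function name `policy_gradient_step` and the toy values are hypothetical, not the authors' implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def policy_gradient_step(theta, features, actions, v_t, alpha):
    """One update: theta <- theta + alpha * v_t * sum_m d log P(a_m | s_m) / d theta.

    Assumes a logistic-regression policy, for which
    d log P(a_m | s_m) / d theta = (a_m - p_m) * s_m.
    """
    p = sigmoid(features @ theta)              # keep probabilities p_m
    grad_log_prob = (actions - p) @ features   # sum over the mini-batch
    return theta + alpha * v_t * grad_log_prob

theta = np.zeros(2)
features = np.array([[1.0, 0.0],     # instance 1 (kept)
                     [0.0, 1.0]])    # instance 2 (filtered out)
actions = np.array([1, 0])
new_theta = policy_gradient_step(theta, features, actions, v_t=1.0, alpha=0.1)
```

In this toy case a positive reward estimate moves the parameters to `[0.05, -0.05]`, which raises the keep probability of the instance that was kept and lowers it for the one that was filtered out.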
