What are the different types of gradient descent algorithms in machine learning?

There exist three distinct types of gradient descent learning algorithms: batch gradient descent, stochastic gradient descent, and mini-batch gradient descent.

Batch Gradient Descent (BGD)

In Batch Gradient Descent, the term 'batch' signifies the use of the entire training dataset during each iteration of the learning process. Because every update is computed from all training examples, Batch Gradient Descent produces stable error gradients and a consistent trajectory towards the optimum. Processing the whole dataset at once permits efficient vectorized computation, but for large training datasets it leads to long iteration times and requires the entire dataset to be held in memory. While Batch Gradient Descent typically yields a stable error gradient and reliable convergence, it occasionally converges to a local minimum rather than the global optimum.
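A minimal sketch of batch gradient descent for linear regression with a mean-squared-error cost (the data, learning rate, and iteration count below are illustrative choices, not prescriptions):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))            # 100 examples, 3 features
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + rng.normal(scale=0.1, size=100)

w = np.zeros(3)                          # parameters to learn
lr = 0.1                                 # learning rate
for _ in range(200):
    # Gradient of the MSE computed over the ENTIRE dataset each iteration.
    grad = 2.0 / len(X) * X.T @ (X @ w - y)
    w -= lr * grad                       # step against the gradient

print(w)                                 # close to true_w
```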

Stochastic Gradient Descent (SGD)

Stochastic Gradient Descent (SGD) updates the parameters using a single training example at each iteration. It runs through the dataset one example at a time, updating the parameters after each one, so its memory requirements are minimal: only a single training example needs to be stored at any given time. These frequent updates provide detailed and rapid adjustments, but they forgo the vectorized efficiency of batch gradient descent and produce noisy gradients. The noise is not purely a drawback: it can help the algorithm escape local minima, thereby aiding the pursuit of a global minimum. The term 'stochastic' reflects the random selection of the example used at each step. Given sufficient iterations, SGD proves effective, albeit noisy.
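The same regression problem solved with stochastic updates, one randomly chosen example per step (again, all values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + rng.normal(scale=0.1, size=100)

w = np.zeros(3)
lr = 0.01
for epoch in range(20):
    for i in rng.permutation(len(X)):        # visit examples in random order
        x_i, y_i = X[i], y[i]
        grad = 2.0 * x_i * (x_i @ w - y_i)   # gradient from a SINGLE example
        w -= lr * grad                       # frequent, noisy update

print(w)                                     # close to true_w, with more jitter
```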

Mini-Batch Gradient Descent

Mini-batch gradient descent combines ideas from both batch gradient descent and stochastic gradient descent. It partitions the training dataset into small batches and performs an update on each of these batches, striking a balance between the computational efficiency of batch gradient descent and the speed of stochastic gradient descent. As with stochastic gradient descent, the average cost over epochs fluctuates, because each update averages over only a small number of examples at a time.
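A sketch of the mini-batch variant; note that batch_size = 1 recovers SGD and batch_size = len(X) recovers batch gradient descent (all values illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + rng.normal(scale=0.1, size=100)

w = np.zeros(3)
lr, batch_size = 0.05, 16
for epoch in range(50):
    idx = rng.permutation(len(X))            # reshuffle every epoch
    for start in range(0, len(X), batch_size):
        batch = idx[start:start + batch_size]
        Xb, yb = X[batch], y[batch]
        # Gradient averaged over one small batch.
        grad = 2.0 / len(batch) * Xb.T @ (Xb @ w - yb)
        w -= lr * grad

print(w)
```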

What is underfitting in Machine Learning?

Underfitting occurs in machine learning / data science when a model fails to capture the relationship between the input and output variables accurately, resulting in high error rates on both the training set and unseen data. It typically arises when the model is trained for too short a time or when the input variables are not informative enough to establish a meaningful relationship with the output. As a model learns, its bias diminishes, but its variance may increase, leading to overfitting. The objective in model fitting is to find the optimal balance between underfitting and overfitting (i.e., the sweet spot), allowing the model to capture the dominant trend in the training data and generalize effectively to new datasets.

Important details:

  1. A high-bias (underfitted) model is not able to learn even the basic, important patterns in the training data.

  2. Adding more data or making your model simpler will not help avoid underfitting.

  3. One should try more sophisticated models (e.g., a decision tree in comparison to kNN) or add complexity to the current model.

  4. Using more complex models (e.g., polynomial regression rather than linear regression) may help capture the relevant patterns in the training data; see the sketch after this list.

  5. Adding more features (or features derived from existing ones) also increases the model's capacity and helps avoid underfitting.

  6. If you see unacceptably high training error and test error, the model is underfitted.

  7. High bias and low variance are the indicators of underfitting models.

  8. Underfitting is easier to track than overfitting, since performance can be measured during the training phase.
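As a concrete illustration of point 4, a minimal sketch (using scikit-learn; the synthetic data and the degree-2 choice are purely illustrative) of how adding polynomial features can fix an underfitted linear model:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data with a quadratic relationship; a plain linear model underfits it.
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = X[:, 0] ** 2 + rng.normal(scale=0.3, size=200)

linear = LinearRegression().fit(X, y)
poly = make_pipeline(PolynomialFeatures(degree=2), LinearRegression()).fit(X, y)

# The underfitted linear model shows a much higher error even on its own training data.
print("linear MSE:", mean_squared_error(y, linear.predict(X)))
print("poly   MSE:", mean_squared_error(y, poly.predict(X)))
```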

What is overfitting in Machine Learning?

Overfitting occurs when the model attempts to match the training set too closely. On fresh data, the overfitted model is unable to produce accurate predictions.

Important details:

  1. The model will match the data too closely and pick up on noise in the data when the training dataset is small or the model is overly complex.

  2. An overfitted model picks up patterns that are unique to the training set and overlooks the generic patterns.

  3. Regularization can reduce overfitting (see the sketch after this list).

  4. Overfitting can also be reduced by training on a large and diverse set of training data points.

  5. Overfitting can be detected through high variance, i.e., when the test data has a high error rate while the training data has a low error rate.

  6. A high variance model will overfit the data and is flexible in capturing every detail—relevant or not—and noise in the data.

  7. A high variance model is also indicated as: Training error << Validation error.

  8. More training data improves the generalization of the model and helps avoid overfitting.
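A minimal sketch of point 3 (using scikit-learn; the synthetic data, polynomial degree, and alpha value are illustrative): an unregularized high-degree polynomial fit chases the noise, while the same model with ridge (L2) regularization generalizes better.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Noisy, essentially linear data fitted with a degree-10 polynomial.
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(60, 1))
y = X[:, 0] + rng.normal(scale=0.5, size=60)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for name, reg in [("no regularization", LinearRegression()),
                  ("ridge, alpha=1.0  ", Ridge(alpha=1.0))]:
    model = make_pipeline(PolynomialFeatures(degree=10), reg).fit(X_train, y_train)
    print(name,
          "| train MSE:", round(mean_squared_error(y_train, model.predict(X_train)), 3),
          "| test MSE:", round(mean_squared_error(y_test, model.predict(X_test)), 3))
```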

What is gradient descent?

Short:

Gradient descent is an optimization algorithm used to determine the coefficients / parameters of a function (f) that minimize a cost function.


Detail:

Gradient descent is a widely used optimization approach for training machine learning models and neural networks. Optimization is the process of minimizing or maximizing an objective function. For gradient descent, this entails calculating the gradient (the partial derivatives) of the cost function with respect to each parameter (weights and biases). To do this, the model is fed training data iteratively, and the gradient is evaluated at the current parameter values. The gradient always points in the direction of the steepest increase in the loss function, so the algorithm takes a step in the direction of the negative gradient to reduce the loss as efficiently as possible.
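In symbols, writing $ \eta $ for the learning rate and $ L(w) $ for the loss, each iteration performs the update

$ w \leftarrow w - \eta \, \nabla_w L(w) $

where the step moves against the gradient and its size is controlled by $ \eta $.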

  • The selection of the learning rate has a substantial influence on the effectiveness of gradient descent. An excessively high learning rate may cause the algorithm to overshoot the minimum, while an excessively low learning rate may result in prolonged convergence times.

  • The loss function L(w) of a neural network is generally non-convex and may have multiple local minima. When multiple local minima are present, the algorithm may fail to converge to the global minimum; local minima can therefore cause the training process to stall rather than progress towards the global minimum.

  • Gradient descent, despite being a widely utilized optimization algorithm, does not ensure convergence to an optimum in all scenarios. A notable obstacle is saddle points: in high-dimensional spaces, gradient descent may become trapped at points where the gradient is zero but which do not correspond to a minimum (a minimal sketch follows this list).

  • Saddle points in a multivariable function are critical points where the function does not achieve either a local maximum or a local minimum value.

  • A common issue with both local minima and saddle points is the presence of plateaus with low curvature in the error landscape. Although gradient descent dynamics are repelled from a saddle point towards lower error by following directions of negative curvature, this repulsion can be slow due to the plateau.

  • Stochastic Gradient Descent (SGD) can occasionally escape simple saddle points if fluctuations occur in different directions and the step size is sufficiently large to overcome the flatness. However, saddle regions can sometimes be quite complex.

  • The gradient of the error, defined over the difference between the actual and predicted outputs, approaches zero at a local minimum, causing progress to stall: the weight-correction steps are proportional to the gradient's magnitude, which is near zero at a minimum. Techniques such as random weight initialization can be employed to mitigate this issue.
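A minimal sketch of the saddle-point problem mentioned above (the function and starting point are chosen purely for illustration): for f(x, y) = x² − y², the origin is a saddle, and plain gradient descent started on the x-axis converges straight to it.

```python
import numpy as np

# f(x, y) = x^2 - y^2 has a saddle at the origin: a minimum along x,
# a maximum along y. On the x-axis the y-component of the gradient is
# exactly zero, so nothing ever pushes the iterate off the saddle.
def grad(p):
    x, y = p
    return np.array([2 * x, -2 * y])

p = np.array([1.0, 0.0])   # start exactly on the x-axis
lr = 0.1
for _ in range(100):
    p = p - lr * grad(p)

print(p)  # ~[0, 0]: stuck at the saddle, which is not a minimum
```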

What is Inductive Bias in Machine Learning?

Inductive bias is an explicit or implicit assumption, or prior information built into a model, that permits it to generalize beyond the training data.

Examples of inductive bias:

  1. In decision tree learning, shorter trees are preferred over longer ones.

  2. In linear regression, the response variable (y) is assumed to vary linearly with the predictors (X).

  3. In general, the belief that the simplest hypothesis is more likely to be accurate than a more complicated one (Occam's razor).

What are the model training steps in machine learning?

There may be many possible models for the problem at hand. Depending on your modeling decision, there are usually two different ways to complete the machine learning lifecycle.

  • 1st scenario. Training a single model with a training dataset and final evaluation with the test set.

  • 2nd scenario. Training multiple models with training/validation dataset and final evaluation with the test set.

In the 1st scenario, you follow this approach (a minimal sketch follows the list):

  • Divide the data into training and test sets (usually a 70/30 split).

  • Select your preferred model.

  • Train it on the training dataset.

  • Assess the trained model on the test set (no separate validation step is needed).
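Sketched with scikit-learn (the dataset and model choices here are illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# 70/30 train/test split, as described above.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

model = DecisionTreeClassifier(random_state=0)   # the chosen model
model.fit(X_train, y_train)                      # train on the training set

# Final evaluation on the held-out test set.
print("test accuracy:", accuracy_score(y_test, model.predict(X_test)))
```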

In the 2nd scenario, you follow this approach (a minimal sketch follows the list):

  • Divide the data into training, validation, and test sets (usually a 50/25/25 split).

  • Select the initial model/architecture.

  • Train the model on the training dataset.

  • Evaluate the model on the validation dataset.

  • Repeat the previous three steps (select, train, evaluate) for different models or training parameters.

  • Select the best model based on the validation results and retrain it on the combined (training + validation) dataset.

  • Assess the final model on the test set.
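Sketched with scikit-learn (the dataset and the two candidate models are illustrative):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# 50/25/25 split: hold out 50% for training, then halve the remainder.
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.5, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

# Train each candidate on the training set and score it on the validation set.
candidates = {"tree": DecisionTreeClassifier(random_state=0),
              "knn": KNeighborsClassifier(n_neighbors=5)}
scores = {name: m.fit(X_train, y_train).score(X_val, y_val)
          for name, m in candidates.items()}
best = max(scores, key=scores.get)

# Retrain the winner on training + validation, then evaluate once on the test set.
final = candidates[best].fit(np.concatenate([X_train, X_val]),
                             np.concatenate([y_train, y_val]))
print(best, "test accuracy:", final.score(X_test, y_test))
```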

What is model training in machine learning?

A machine learning model is represented by its model parameters, which are the learnable parameters. Learning happens when these parameters are updated with suitable values so that the model can solve the given task. Training is the process of feeding a training dataset to your model. The training process uses an objective function (for example, MSE) to obtain feedback in each iteration. Since we are trying to improve the accuracy of the model on a given input and lower the error between the model's prediction and the actual output, the training process is also called model optimization.
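For reference, a minimal sketch of the MSE objective mentioned above (the numbers are illustrative):

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error: the average squared gap between prediction and truth."""
    return np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2)

# Lower MSE means the model's predictions are closer to the actual outputs.
print(mse([1.0, 2.0, 3.0], [1.1, 1.9, 3.2]))  # 0.02
```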

What is machine learning?

Understanding and extracting hidden patterns or features from data is the learning process in machine learning. Instead of relying on explicit logic supplied by people, machine learning has the capacity to learn from experience. Conventional systems are created with well-defined, human-set rules. Machine learning algorithms, in order to learn complicated patterns from inputs (x), use the outputs (y) as a feedback signal. Thus, an intelligent program is the ML system's final product.

We often use a logical method to solve a problem: we break the task up into several smaller tasks and solve each one with a distinct piece of logic. When dealing with extremely complicated jobs, like stock price prediction, the patterns are always changing, which has an impact on the results. That implies that, in order to solve such a problem logically, we must adjust our handwritten logic for each new change in the outputs. Machine learning (ML), on the other hand, builds the model from a vast amount of data. The data gives the model all of its historical experience, which helps it better understand the pattern. Whenever the data changes, we simply retrain the model with fresh instances.

Paper Summary: Playing Atari with Deep Reinforcement Learning

Motivation

Deep Learning (DL) has proven to work well when we have a large amount of data. Unlike the supervised DL setup, Reinforcement Learning (RL) doesn't have direct access to targets/labels. An RL agent usually gets "delayed and sparse" rewards as the signal for understanding the environment and learning a policy for it. Another challenge is the distribution of the inputs. In supervised learning, each batch in the training loop is drawn randomly, which ensures the samples are independent and the parameter updates won't overfit to some specific direction/class in the data. In RL, inputs are usually correlated: for example, when you collect image frames from a video game, their pixel contents change little from frame to frame. Many samples therefore look alike, which can lead to poor learning and locally optimal solutions. A further problem is the non-stationarity of the target: the target keeps changing across episodes as the agent learns new behaviour from the environment and adapts.

Contribution

The authors propose the 'Deep Q-Network' (DQN) learning algorithm with experience replay. This approach addresses both the correlated-inputs problem and the non-stationarity problem.

They use a CNN with a variant of the Q-learning algorithm and train it with stochastic gradient descent (SGD). They maintain a buffer, called 'experience replay', of the transitions observed while the agent navigates the environment. During SGD training, samples drawn from this stored buffer form the mini-batches used to train the network. They refer to this network as the Q-network, with parameters $ \theta $, which minimizes the sequence of loss functions $ L_i(\theta_i) $:

$ L_i(\theta_i) = \mathbb{E}_{s, a \sim \rho(\cdot)} \left[ \left( y_i - Q(s, a; \theta_i) \right)^2 \right] $

where

$ y_i = \mathbb{E}_{s' \sim \varepsilon} \left[ r + \gamma \max_{a'} Q(s', a'; \theta_{i-1}) \mid s, a \right] $

is the target for iteration $ i $.

They use the parameter values from the previous iteration ($ \theta_{i-1} $) to calculate the target ($ y_i $). Because these target parameters are held fixed while $ L_i(\theta_i) $ is optimized, the target is effectively stationary and training is smoother. They also feed a concatenation of four video frames as input to the CNN in order to ease the partial-observability constraints on learning: with four frames, the CNN can infer the movement direction and speed of the objects in the frames.
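A minimal sketch of an experience replay buffer and the sampling step described above (the buffer capacity, batch size, and transition format are illustrative, not the authors' exact implementation):

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size store of (state, action, reward, next_state, done) transitions."""

    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)   # oldest transitions are evicted

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size=32):
        # Uniform random sampling breaks the temporal correlation between
        # consecutive frames, giving SGD roughly independent mini-batches.
        return random.sample(list(self.buffer), batch_size)

# Usage: store transitions as the agent acts, then train on random batches.
buf = ReplayBuffer()
for t in range(1000):
    buf.push(state=t, action=0, reward=0.0, next_state=t + 1, done=False)
batch = buf.sample(32)
```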

DQN is trained on Atari 2600 games. The video frames from the emulator are the observations, produced in response to the agent's discrete actions (up, down, left, right, ...) in the environment. The network consists of two convolutional layers and two fully connected layers; the last layer outputs one predicted action value (Q-value) for each possible action.
