Understanding High Bias in Machine Learning with Real-World Example

High bias in machine learning results in underfitting, characterized by the model making oversimplified assumptions about the relationships within the data. This leads to subpar performance on both training and test datasets, demonstrating that the model does not possess the necessary complexity to capture the underlying patterns.

In this example, we predict house prices using two methods and compare the results:

  1. Linear Regression: A simple linear model cannot adequately capture the complex, non-linear relationships between the features and house prices.
  2. Polynomial Regression: The PolynomialFeatures step expands the dataset by generating polynomial and interaction terms from the original features. For instance, with two features x1 and x2 and a degree of 2, it produces a constant term, x1, x2, x1², x2², and the cross-term x1 * x2. This expansion enables a linear regression model to capture non-linear relationships.
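To make the expansion concrete, here is a minimal sketch in plain Python that builds the degree-2 terms for two features by hand. The helper name `polynomial_terms` and the sample values are assumptions made for illustration; this mimics the kind of terms PolynomialFeatures generates but is not scikit-learn's implementation.

```python
from itertools import combinations_with_replacement
from math import prod


def polynomial_terms(values, names, degree=2):
    """Return (term_names, term_values) for all monomials up to `degree`.

    Hypothetical helper for illustration only, not scikit-learn code.
    """
    term_names, term_values = ["1"], [1.0]  # constant (bias) term
    for d in range(1, degree + 1):
        # Every way of picking d features, repetition allowed (x1*x1, x1*x2, ...)
        for combo in combinations_with_replacement(range(len(values)), d):
            term_names.append(" * ".join(names[i] for i in combo))
            term_values.append(prod(values[i] for i in combo))
    return term_names, term_values


# Assumed sample point: x1 = 2, x2 = 3
names, values = polynomial_terms([2.0, 3.0], ["x1", "x2"], degree=2)
for name, value in zip(names, values):
    print(f"{name:>8} = {value}")
```

Two original features become six columns, which is why a model that is linear in the expanded columns can still bend to follow curved relationships in the original ones.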
In [31]:
import pandas as pd

import matplotlib.pyplot as plt

from sklearn.metrics import make_scorer, mean_squared_error
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
import sklearn.datasets

from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import Pipeline
In [21]:
house_price_dataset = sklearn.datasets.fetch_california_housing()
In [22]:
print(house_price_dataset.DESCR)
.. _california_housing_dataset:

California Housing dataset

**Data Set Characteristics:**

    :Number of Instances: 20640

    :Number of Attributes: 8 numeric, predictive attributes and the target

    :Attribute Information:
        - MedInc        median income in block group
        - HouseAge      median house age in block group
        - AveRooms      average number of rooms per household
        - AveBedrms     average number of bedrooms per household
        - Population    block group population
        - AveOccup      average number of household members
        - Latitude      block group latitude
        - Longitude     block group longitude

    :Missing Attribute Values: None

This dataset was obtained from the StatLib repository.

The target variable is the median house value for California districts,
expressed in hundreds of thousands of dollars ($100,000).

This dataset was derived from the 1990 U.S. census, using one row per census
block group. A block group is the smallest geographical unit for which the U.S.
Census Bureau publishes sample data (a block group typically has a population
of 600 to 3,000 people).

A household is a group of people residing within a home. Since the average
number of rooms and bedrooms in this dataset are provided per household, these
columns may take surprisingly large values for block groups with few households
and many empty houses, such as vacation resorts.

It can be downloaded/loaded using the
:func:`sklearn.datasets.fetch_california_housing` function.

.. topic:: References

    - Pace, R. Kelley and Ronald Barry, Sparse Spatial Autoregressions,
      Statistics and Probability Letters, 33 (1997) 291-297

In [23]:
# Loading the dataset to a pandas dataframe
df_house_data = pd.DataFrame(house_price_dataset.data, columns = house_price_dataset.feature_names)
In [24]:
df_house_data.head()
   MedInc  HouseAge  AveRooms  AveBedrms  Population  AveOccup  Latitude  Longitude
0  8.3252      41.0  6.984127   1.023810       322.0  2.555556     37.88    -122.23
1  8.3014      21.0  6.238137   0.971880      2401.0  2.109842     37.86    -122.22
2  7.2574      52.0  8.288136   1.073446       496.0  2.802260     37.85    -122.24
3  5.6431      52.0  5.817352   1.073059       558.0  2.547945     37.85    -122.25
4  3.8462      52.0  6.281853   1.081081       565.0  2.181467     37.85    -122.25
In [25]:
# add the target column
# target is median house value in block group (in $100,000s).
df_house_data['price'] = house_price_dataset.target
In [26]:
df_house_data.head()
   MedInc  HouseAge  AveRooms  AveBedrms  Population  AveOccup  Latitude  Longitude  price
0  8.3252      41.0  6.984127   1.023810       322.0  2.555556     37.88    -122.23  4.526
1  8.3014      21.0  6.238137   0.971880      2401.0  2.109842     37.86    -122.22  3.585
2  7.2574      52.0  8.288136   1.073446       496.0  2.802260     37.85    -122.24  3.521
3  5.6431      52.0  5.817352   1.073059       558.0  2.547945     37.85    -122.25  3.413
4  3.8462      52.0  6.281853   1.081081       565.0  2.181467     37.85    -122.25  3.422
House Price Prediction Using Linear Regression
In [34]:
# Prepare data
X = df_house_data.drop('price', axis=1)
y = df_house_data['price']

# Split and scale
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train linear model
model = LinearRegression()
model.fit(X_train_scaled, y_train)

# Evaluate
train_pred = model.predict(X_train_scaled)
test_pred = model.predict(X_test_scaled)

print("Training MSE:", mean_squared_error(y_train, train_pred))
print("Test MSE:", mean_squared_error(y_test, test_pred))

train_score = model.score(X_train_scaled, y_train)
test_score = model.score(X_test_scaled, y_test)

print(f"Training R² Score: {train_score:.4f}")
print(f"Test R² Score: {test_score:.4f}")
Training MSE: 0.5240457125963887
Test MSE: 0.5261093658365182
Training R² Score: 0.6081
Test R² Score: 0.5980
House Price Prediction Using Polynomial Regression
  • Observations:
    • Compared with linear regression, the polynomial model lowers the mean squared error on the test set (from 0.5261 to 0.4236).

    • The R² score on the test set increases (from 0.5980 to 0.6763).

      -- R² Interpretation: R² is the fraction of the variance in the target that the model explains. For example, an R² value of 0.75 would indicate that 75% of the variation in house prices is explained by the features included in the model.
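The R² definition can be checked by hand. The sketch below computes R² as 1 - SS_res / SS_tot in plain Python; the helper name `r2_score_manual` and the toy values are assumptions made for illustration, not data from the housing example.

```python
def r2_score_manual(y_true, y_pred):
    """R² = 1 - SS_res / SS_tot (hypothetical helper, plain Python)."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # unexplained variation
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)          # total variation
    return 1.0 - ss_res / ss_tot


# Assumed toy targets and predictions
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.5, 5.5, 6.5, 9.5]

r2_model = r2_score_manual(y_true, y_pred)        # close predictions
r2_baseline = r2_score_manual(y_true, [6.0] * 4)  # always predict the mean

print(f"Model R²:    {r2_model:.2f}")
print(f"Baseline R²: {r2_baseline:.2f}")
```

A model that always predicts the mean of the target scores 0, so R² measures how much better the model does than that trivial baseline.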

Why Use both PolynomialFeatures and LinearRegression in the Pipeline:

- The PolynomialFeatures transformation first creates a richer feature space of polynomial and interaction terms.
- LinearRegression then fits a model that is linear in these expanded, non-linear features.
- This effectively allows a linear model to approximate non-linear relationships.
In [35]:
def polynomial_regression_model(X_train, X_test, y_train, y_test, degree=2):
    # Create pipeline: scale, expand to polynomial features, then fit a linear model
    model = Pipeline([
        ('scaler', StandardScaler()),
        ('poly', PolynomialFeatures(degree=degree)),
        ('linear', LinearRegression())
    ])

    # Fit on the training split (the pipeline handles scaling internally)
    model.fit(X_train, y_train)

    # Evaluate on the same inputs the pipeline was fitted on
    train_pred = model.predict(X_train)
    test_pred = model.predict(X_test)

    print("Training MSE:", mean_squared_error(y_train, train_pred))
    print("Test MSE:", mean_squared_error(y_test, test_pred))

    train_score = model.score(X_train, y_train)
    test_score = model.score(X_test, y_test)
    print(f"Training R² Score: {train_score:.4f}")
    print(f"Test R² Score: {test_score:.4f}")

    # Expanded feature matrices as seen by the final LinearRegression step
    X_train_poly = model[:-1].transform(X_train)
    X_test_poly = model[:-1].transform(X_test)
    return model, X_train_poly, X_test_poly

# Example usage with housing data (unscaled inputs, since the pipeline scales them)
model_poly, X_train_poly, X_test_poly = polynomial_regression_model(
    X_train, X_test, y_train, y_test
)
Training MSE: 0.4219834148836872
Test MSE: 0.42363567392919027
Training R² Score: 0.6844
Test R² Score: 0.6763
How to combat overfitting and underfitting in Machine Learning ?

Machine Learning models learn the relationship between input (features) and output (target) using learnable parameters. The number of these parameters largely determines the complexity and flexibility of a given model.

There are two typical scenarios. When the flexibility of a model is insufficient to capture the underlying pattern in a training dataset, the model is called underfitted. Conversely, when the model is too flexible for the underlying pattern, it fits noise along with the signal; the model is said to have “memorized” the training data, resulting in an overfitted model.

Consider a system that can be explained by a quadratic function, but we use a straight line to represent it, i.e., a model with no term for curvature. Because the line lacks the complexity required to fit the data (it is missing the quadratic term), we end up with a poor predictor. In this case, the model has high bias, meaning its answers are consistent but consistently wrong. This is called an underfitted model.

Now imagine that the true system is a parabola, but we use a higher-order polynomial to fit it. Due to natural noise in the data used to fit (deviations from the perfect parabola), the overly complex model treats these fluctuations and noise as intrinsic properties of the system and attempts to fit them. The result is a model with high variance.
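The underfitting half of this picture is easy to reproduce. The sketch below fits a least-squares line to noise-free points from y = x² and compares its training error with a model of the correct form. The data and the helper names (`fit_line`, `mse`) are assumptions chosen for illustration.

```python
def fit_line(xs, ys):
    """Closed-form least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    slope = cov_xy / var_x
    return slope, mean_y - slope * mean_x


def mse(ys, preds):
    """Mean squared error between targets and predictions."""
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)


xs = [-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0]
ys = [x ** 2 for x in xs]  # assumed noise-free quadratic system

slope, intercept = fit_line(xs, ys)
line_mse = mse(ys, [slope * x + intercept for x in xs])
quadratic_mse = mse(ys, [x ** 2 for x in xs])  # model with the right form

print(f"Line fit: y = {slope:.2f} * x + {intercept:.2f}, training MSE = {line_mse:.2f}")
print(f"Quadratic model training MSE = {quadratic_mse:.2f}")
```

The best possible line here is flat, so even on its own training data it cannot do better than predicting the average. That is the signature of high bias: the error comes from the form of the model, not from a lack of data.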

More details:

What is the trade-off between bias and variance in machine learning ?

Short: A model with minimal parameters may exhibit high bias and low variance, while a model with numerous parameters may demonstrate high variance and low bias. Therefore, it is essential to achieve an optimal balance to avoid overfitting and underfitting the data. High bias arises from incorrect assumptions made by the learning algorithm, whereas variance arises from a model's sensitivity to minor variations in the training dataset.

Detail: All algorithms exhibit some degree of bias and variance. Models can be adjusted to reduce either one, but past a certain point reducing one tends to increase the other, and neither can be driven to zero without cost. This is the bias-variance trade-off.

Bias refers to the discrepancy between the average prediction of our model and the actual value being predicted, indicating the presence of systematic errors in the model. Every algorithm inherently possesses some level of bias due to assumptions made within the model to simplify learning the target function. High bias can lead to underfitting, where the algorithm fails to capture relevant relationships between features and target outputs. Simpler algorithms tend to introduce more bias, whereas nonlinear algorithms usually have lower bias. These errors can originate from various sources, including the selection of training data, feature choices, or the training algorithm itself.

Variance measures how much a model's predictions change when it is trained on different training sets, indicating the degree of over-specialization to a particular training set (overfitting). Overfitting arises when a model is overly complex and learns the noise in the data rather than the actual signal.

Together, bias and variance describe how far our model deviates from the best possible model. The ideal model keeps both low, achieving a balance that is neither too simple nor too complex, thereby yielding minimal error. Low-variance models typically have a simple structure and are less sophisticated, but they risk being highly biased. Examples include linear regression and Naive Bayes. Conversely, low-bias models generally have a more flexible and complex structure but are prone to high variance. Examples include Nearest Neighbors and Decision Trees.
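For squared-error loss the trade-off can be written down exactly. As a sketch, assume (these symbols do not appear in the text above) that the data follow y = f(x) + ε, where the noise ε has mean zero and variance σ², and that \hat{f} denotes the model fitted on a randomly drawn training set. The expected prediction error at a point x then splits into three parts:

```latex
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
  = \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}\big[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\big]}_{\text{variance}}
  + \underbrace{\sigma^2}_{\text{irreducible noise}}
```

The expectations are taken over training sets and noise. Making the model more flexible typically shrinks the first term and grows the second, while the third term cannot be reduced by any choice of model.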

What are different types of gradient descent algorithm in machine learning ?

There exist three distinct types of gradient descent learning algorithms: batch gradient descent, stochastic gradient descent, and mini-batch gradient descent.

Batch Gradient Descent (BGD)

In Batch Gradient Descent, the term 'batch' signifies that the entire training dataset is used in each iteration of the learning process. Because every update incorporates all training examples, Batch Gradient Descent produces stable error gradients and a consistent trajectory towards the optimal solution, at a significant computational cost per update. Processing the whole dataset in one pass can be computed efficiently, but it still results in long processing times for large training datasets, and all of the data must be held in memory. Although convergence is stable, on non-convex loss surfaces Batch Gradient Descent may converge to a local minimum rather than the global optimum.

Stochastic Gradient Descent (SGD)

Stochastic Gradient Descent (SGD) updates the parameters using a single training example at each iteration. Within each epoch it visits the examples one at a time and updates the parameters after every example, so memory requirements are minimal: only a single training example needs to be held at any given time. These frequent updates provide detailed and rapid adjustments, but they can be less computationally efficient than batch gradient descent, because many small updates replace one large one. The resulting gradients are noisy, yet this noise can help the optimizer escape local minima and move towards a global minimum. The term "stochastic" reflects the random selection of the example used for each update. Given sufficient iterations, SGD proves effective, albeit with inherent noisiness.

Mini Batch Gradient Descent

Mini-batch gradient descent combines ideas from both batch gradient descent and stochastic gradient descent. It partitions the training dataset into small batches and performs one update per batch. This strikes a balance between the computational efficiency of batch gradient descent and the speed of stochastic gradient descent. As with stochastic gradient descent, the cost fluctuates from update to update, because each gradient is averaged over only a limited number of examples.
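The three variants differ only in how many examples feed each update, which a single loop can show. The sketch below fits y = w * x + b on noise-free toy data. The function name `gradient_descent`, the data, the learning rate, and the epoch count are all assumptions chosen for illustration, not tuned recommendations.

```python
import random


def gradient_descent(xs, ys, batch_size, lr=0.1, epochs=2000, seed=0):
    """Fit y = w * x + b by minimizing mean squared error.

    batch_size == len(xs) -> batch gradient descent
    batch_size == 1       -> stochastic gradient descent
    otherwise             -> mini-batch gradient descent
    """
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)  # visit the examples in a random order
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            errors = [(w * xs[i] + b) - ys[i] for i in batch]
            # Gradient of the batch's mean squared error w.r.t. w and b
            grad_w = 2 * sum(e * xs[i] for e, i in zip(errors, batch)) / len(batch)
            grad_b = 2 * sum(errors) / len(batch)
            w -= lr * grad_w
            b -= lr * grad_b
    return w, b


xs = [i / 10 for i in range(11)]
ys = [3.0 * x + 1.0 for x in xs]  # assumed noise-free toy data

w_batch, b_batch = gradient_descent(xs, ys, batch_size=len(xs))
w_sgd, b_sgd = gradient_descent(xs, ys, batch_size=1)
w_mini, b_mini = gradient_descent(xs, ys, batch_size=4)

print(f"batch:      w={w_batch:.3f}, b={b_batch:.3f}")
print(f"stochastic: w={w_sgd:.3f}, b={b_sgd:.3f}")
print(f"mini-batch: w={w_mini:.3f}, b={b_mini:.3f}")
```

On this clean data all three settle near the same line. On noisy data the stochastic and mini-batch runs would keep jittering around the optimum unless the learning rate is decayed.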

What is underfitting in Machine Learning ?

Underfitting occurs in machine learning / data science when a model fails to accurately capture the relationship between input and output variables, resulting in high error rates on both the training set and unseen data. It can happen when the model is too simple, has not been trained long enough, or when the input variables are not informative enough to establish a meaningful relationship with the output. As the model learns, its bias diminishes, but its variance may increase, leading to overfitting. The objective in model fitting is to identify the optimal balance between underfitting and overfitting (i.e., finding the sweet spot), allowing the model to capture the dominant trend in the training data and generalize effectively to new datasets.

Important details:

  1. A high-bias (underfitted) model is not able to learn even the basic/important patterns in the training data.

  2. Adding more data or making your model simpler won't help to avoid underfitting.

  3. One should try more sophisticated models (e.g., a decision tree in comparison to kNN) or add complexity to the current model.

  4. Using more complex models (for example, polynomial regression rather than a linear one) may be useful to capture the relevant patterns in the training data.

  5. Adding more features (or features derived from existing ones) also increases the model capacity and helps to avoid underfitting.

  6. If you see unacceptably high training error and test error, the model is underfitted.

  7. High bias and low variance are the indicators of underfitting models.

  8. Underfitting is easier to track than overfitting since the performance can be measured during training phase.

What is overfitting in Machine Learning ?

Overfitting occurs when the model attempts to match the training set too closely. On fresh data, the overfitted model is unable to produce accurate predictions.

Important details:

  1. The model will attempt to match the data too closely and will pick up on noise in the data when the training data set is limited or the given model is complex.

  2. An overfitted model picks up patterns that are unique to the training set and overlooks the generic patterns.

  3. Regularization can reduce overfitting.

  4. Overfitting can also be decreased by training on a large and diverse set of training data points.

  5. Overfitting can be detected by high variance, i.e., when the test data has a high error rate while the training data has a low error rate.

  6. A high variance model will overfit the data and is flexible in capturing every detail—relevant or not—and noise in the data.

  7. A high variance model is also indicated as: Training error << Validation error.

  8. More training data improves the generalization of the given model and helps avoid overfitting.
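The error-based indicators for underfitting and overfitting can be condensed into a rough rule of thumb. In the sketch below, the function name `diagnose_fit` and both thresholds (`acceptable_error`, `gap_ratio`) are assumptions made for illustration; suitable values depend entirely on the problem.

```python
def diagnose_fit(train_error, val_error, acceptable_error, gap_ratio=1.5):
    """Rough rule of thumb; both thresholds are illustrative assumptions."""
    if train_error > acceptable_error:
        # The model cannot even fit the data it was trained on: high bias
        return "underfitting"
    if val_error > gap_ratio * train_error:
        # Training error << validation error: high variance
        return "overfitting"
    return "reasonable fit"


# Rounded MSE values from the linear regression run above; 0.3 is an arbitrary target
print(diagnose_fit(0.524, 0.526, acceptable_error=0.3))
# Hypothetical errors for a model that memorized its training set
print(diagnose_fit(0.05, 0.60, acceptable_error=0.3))
```

The training error is checked first because a large gap between training and validation error only signals overfitting once the model fits the training data acceptably well.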

What is gradient descent ?


Gradient descent is an optimization algorithm used to determine the coefficients / parameters of a function (f) that minimize a cost function.



Gradient descent is a widely used optimization approach for training machine learning models and neural networks. Optimization is the process of minimizing or maximizing an objective function. In gradient descent, this entails calculating the gradient (the partial derivatives) of the cost function with respect to each parameter (weights and biases). To do this, the model is given training data iteratively, and the gradient is evaluated at the current parameter values. The gradient always points in the direction of the steepest increase of the loss function, so the gradient descent algorithm takes a step in the direction of the negative gradient to reduce the loss as efficiently as possible.

  • The selection of the learning rate has a substantial influence on the effectiveness of gradient descent. An excessively high learning rate may cause the algorithm to overshoot the minimum, while an excessively low learning rate may result in prolonged convergence times.

  • Because it is non-convex, the loss function L(w) of a neural network can have multiple local minima. When multiple local minima are present, the algorithm may fail to converge to the global minimum. Local minima therefore pose a significant challenge, as they may cause the training process to stall rather than progress towards the global minimum.

  • Gradient descent, despite being a widely used optimization algorithm, does not guarantee convergence to an optimum in all scenarios. One factor that can hinder convergence is saddle points: in high-dimensional spaces, gradient descent may become stuck near saddle points, where the gradient is zero but the point is not a minimum.

  • Saddle points in a multivariable function are critical points where the function does not achieve either a local maximum or a local minimum value.

  • A common issue with both local minima and saddle points is the presence of plateaus with low curvature in the error landscape. Although gradient descent dynamics are repelled from a saddle point towards lower error by following directions of negative curvature, this repulsion can be slow due to the plateau.

  • Stochastic Gradient Descent (SGD) can occasionally escape simple saddle points if fluctuations occur in different directions and the step size is sufficiently large to overcome the flatness. However, saddle regions can sometimes be quite complex.

  • The gradient of the error, defined over the difference between the actual and predicted outputs, approaches zero at a local minimum. Progress stalls there because the weight correction steps are proportional to the gradient's magnitude, which is near zero at a minimum. Techniques such as random weight initialization can be employed to mitigate this issue.
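The update rule and the effect of the learning rate can be seen on a one-dimensional example. The sketch below runs gradient descent on f(w) = (w - 3)², an assumed toy loss with its minimum at w = 3. The function name `minimize_quadratic` and the three learning rates are chosen purely for illustration.

```python
def minimize_quadratic(lr, steps=50, w_start=0.0):
    """Gradient descent on f(w) = (w - 3)**2, whose minimum is at w = 3."""
    w = w_start
    for _ in range(steps):
        gradient = 2.0 * (w - 3.0)  # derivative f'(w)
        w -= lr * gradient          # step against the gradient
    return w


w_good = minimize_quadratic(lr=0.1)      # reasonable learning rate
w_slow = minimize_quadratic(lr=0.001)    # too small: barely moves in 50 steps
w_diverged = minimize_quadratic(lr=1.1)  # too large: overshoots further each step

print(f"lr=0.1   -> w = {w_good:.4f}")
print(f"lr=0.001 -> w = {w_slow:.4f}")
print(f"lr=1.1   -> w = {w_diverged:.1f}")
```

With the largest rate each step lands farther from the minimum than the previous one, which is the overshooting behaviour described in the first bullet above.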

What is Inductive Bias in Machine Learning ?

An explicit or implicit assumption or prior information about the model that permits it to generalize outside of the training set of data is known as inductive bias.

Examples of inductive bias:

  1. When it comes to decision trees, shorter trees are preferred over longer ones.

  2. In linear regression, the response variable (y) is assumed to vary linearly with the predictors (X).

  3. In general, the belief that the simplest hypothesis is more likely to be accurate than a more complicated one (Occam's razor).

What are model training steps in machine learning ?

There may exist many possible models to solve a given problem at hand. Based on your modeling decision there are usually two different ways to complete the machine learning lifecycle.

  • 1st scenario. Training a single model with a training dataset and final evaluation with the test set.

  • 2nd scenario. Training multiple models with training/validation dataset and final evaluation with the test set.

In case of (1st scenario), you will follow the following approach:

  • Divide the data into training and test sets. (Usually 70/30 splits)

  • Select your preferred model.

  • Train it with a training dataset.

  • Assess the trained model in the test set. (no need to perform validation in your trained model)

In case of (2nd scenario), you will follow the following approach:

  • Divide the data into training, validation, and test sets. (Usually 50/25/25 splits)

  • Select the initial model/architecture.

  • Train the model with a training dataset.

  • Evaluate the model using the validation dataset.

  • Repeat the previous three steps (select, train, evaluate) for different models or training parameters.

  • Select the best model based on evaluation and train the best model with combined (training + validation) datasets.

  • Assess the trained model in the test set.
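The second scenario can be sketched end to end in plain Python. Everything below is an assumption made for illustration: the toy data, the two candidate "models" (a mean predictor and a least-squares line), and the helper names. It is a minimal outline of the workflow, not a recommended implementation.

```python
import random


def fit_mean(xs, ys):
    """Candidate 1: always predict the mean of the training targets."""
    mean_y = sum(ys) / len(ys)
    return lambda x: mean_y


def fit_line(xs, ys):
    """Candidate 2: least-squares line y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    var_x = sum((x - mean_x) ** 2 for x in xs)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / var_x
    return lambda x: slope * x + (mean_y - slope * mean_x)


def mse(model, xs, ys):
    """Mean squared error of a fitted model on the given data."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Assumed toy data: y = 2x + 1, shuffled and split 50/25/25
xs = [float(i) for i in range(20)]
ys = [2.0 * x + 1.0 for x in xs]
order = list(range(20))
random.Random(0).shuffle(order)
train, val, test = order[:10], order[10:15], order[15:]


def pick(idx):
    """Return the (xs, ys) subset for a list of row indices."""
    return [xs[i] for i in idx], [ys[i] for i in idx]


candidates = {"mean": fit_mean, "line": fit_line}

# Train every candidate on the training set, score it on the validation set
val_errors = {
    name: mse(fit(*pick(train)), *pick(val)) for name, fit in candidates.items()
}
best_name = min(val_errors, key=val_errors.get)

# Retrain the winner on training + validation data, then touch the test set once
final_model = candidates[best_name](*pick(train + val))
test_mse = mse(final_model, *pick(test))

print("Selected model:", best_name)
print(f"Test MSE: {test_mse:.6f}")
```

The test set is used exactly once, after model selection is finished, so the reported test error is not biased by the choice between candidates.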

What is model training in machine learning ?

A machine learning model is represented by its parameters, which are learnable. Learning happens when these parameters are updated to suitable values so that the model is able to solve the given task. Training is the process of feeding a training dataset to your model. The training process uses an objective function (for example, MSE) to get feedback in each iteration. Since we are trying to improve the accuracy of the model on a given input and lower the error between the model's prediction and the actual output, the training process is also called model optimization.