Understanding High Bias in Machine Learning with Real-World Example

High bias in machine learning results in underfitting, characterized by the model making oversimplified assumptions about the relationships within the data. This leads to subpar performance on both training and test datasets, demonstrating that the model does not possess the necessary complexity to capture the underlying patterns.
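Before turning to the housing data, here is a minimal synthetic sketch of the symptom (data and numbers invented purely for illustration): a straight line fitted to data generated from a quadratic function underfits, so the error stays high, and similar, on both the training and test sets.

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = X[:, 0] ** 2 + rng.normal(0, 0.5, size=200)   # quadratic ground truth plus noise

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
line = LinearRegression().fit(X_tr, y_tr)
print("Train MSE:", mean_squared_error(y_tr, line.predict(X_tr)))
print("Test MSE:", mean_squared_error(y_te, line.predict(X_te)))
# Both errors are large and comparable: the signature of high bias (underfitting).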

In this example, we compare two approaches to house price prediction:

  1. Linear Regression: A simple linear model that cannot adequately capture the complex relationships between the features and house prices.
  2. Polynomial Regression: The PolynomialFeatures step enriches the dataset by generating polynomial and interaction terms from the original features. For instance, given two features x1 and x2 and a degree of 2, it produces x1, x2, x1², x2², and the cross-term x1·x2 (see the sketch after this list). This lets a linear regression model capture non-linear relationships.
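To make this concrete, here is a minimal sketch (a made-up single sample with two features) showing exactly which columns PolynomialFeatures generates at degree 2:

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.array([[2.0, 3.0]])                       # one sample: x1 = 2, x2 = 3
poly = PolynomialFeatures(degree=2)
print(poly.fit_transform(X))                     # [[1. 2. 3. 4. 6. 9.]]
print(poly.get_feature_names_out(['x1', 'x2']))
# ['1' 'x1' 'x2' 'x1^2' 'x1 x2' 'x2^2']: bias, linear, squared, and interaction terms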
In [31]:
import pandas as pd

from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
import sklearn.datasets
In [21]:
house_price_dataset = sklearn.datasets.fetch_california_housing()
In [22]:
print(house_price_dataset.DESCR)
.. _california_housing_dataset:

California Housing dataset
--------------------------

**Data Set Characteristics:**

    :Number of Instances: 20640

    :Number of Attributes: 8 numeric, predictive attributes and the target

    :Attribute Information:
        - MedInc        median income in block group
        - HouseAge      median house age in block group
        - AveRooms      average number of rooms per household
        - AveBedrms     average number of bedrooms per household
        - Population    block group population
        - AveOccup      average number of household members
        - Latitude      block group latitude
        - Longitude     block group longitude

    :Missing Attribute Values: None

This dataset was obtained from the StatLib repository.
https://www.dcc.fc.up.pt/~ltorgo/Regression/cal_housing.html

The target variable is the median house value for California districts,
expressed in hundreds of thousands of dollars ($100,000).

This dataset was derived from the 1990 U.S. census, using one row per census
block group. A block group is the smallest geographical unit for which the U.S.
Census Bureau publishes sample data (a block group typically has a population
of 600 to 3,000 people).

A household is a group of people residing within a home. Since the average
number of rooms and bedrooms in this dataset are provided per household, these
columns may take surprisingly large values for block groups with few households
and many empty houses, such as vacation resorts.

It can be downloaded/loaded using the
:func:`sklearn.datasets.fetch_california_housing` function.

.. topic:: References

    - Pace, R. Kelley and Ronald Barry, Sparse Spatial Autoregressions,
      Statistics and Probability Letters, 33 (1997) 291-297

In [23]:
# Loading the dataset to a pandas dataframe
df_house_data = pd.DataFrame(house_price_dataset.data, columns=house_price_dataset.feature_names)
In [24]:
df_house_data.head()
Out[24]:
   MedInc  HouseAge  AveRooms  AveBedrms  Population  AveOccup  Latitude  Longitude
0  8.3252      41.0  6.984127   1.023810       322.0  2.555556     37.88    -122.23
1  8.3014      21.0  6.238137   0.971880      2401.0  2.109842     37.86    -122.22
2  7.2574      52.0  8.288136   1.073446       496.0  2.802260     37.85    -122.24
3  5.6431      52.0  5.817352   1.073059       558.0  2.547945     37.85    -122.25
4  3.8462      52.0  6.281853   1.081081       565.0  2.181467     37.85    -122.25
In [25]:
# add the target column
# target is median house value in block group (in $100,000s).
df_house_data['price'] = house_price_dataset.target
In [26]:
df_house_data.head()
Out[26]:
   MedInc  HouseAge  AveRooms  AveBedrms  Population  AveOccup  Latitude  Longitude  price
0  8.3252      41.0  6.984127   1.023810       322.0  2.555556     37.88    -122.23  4.526
1  8.3014      21.0  6.238137   0.971880      2401.0  2.109842     37.86    -122.22  3.585
2  7.2574      52.0  8.288136   1.073446       496.0  2.802260     37.85    -122.24  3.521
3  5.6431      52.0  5.817352   1.073059       558.0  2.547945     37.85    -122.25  3.413
4  3.8462      52.0  6.281853   1.081081       565.0  2.181467     37.85    -122.25  3.422
House Price Prediction Using Linear Regression
In [34]:
# Prepare data
X = df_house_data.drop('price', axis=1)
y = df_house_data['price']

# Split and scale
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train linear model
model = LinearRegression()
model.fit(X_train_scaled, y_train)

# Evaluate
train_pred = model.predict(X_train_scaled)
test_pred = model.predict(X_test_scaled)

print("Training MSE:", mean_squared_error(y_train, train_pred))
print("Test MSE:", mean_squared_error(y_test, test_pred))

train_score = model.score(X_train_scaled, y_train)
test_score = model.score(X_test_scaled, y_test)

print(f"Training R² Score: {train_score:.4f}")
print(f"Test R² Score: {test_score:.4f}")
Training MSE: 0.5240457125963887
Test MSE: 0.5261093658365182
Training R² Score: 0.6081
Test R² Score: 0.5980
House Price Prediction Using Polynomial Regression
  • Observations (comparing the results below with the linear regression results above):
    • The mean squared error decreases on the test set.

    • The R² score increases on the test set.

      -- R² interpretation: a test R² of about 0.68 means that roughly 68% of the variation in house prices is explained by the model's features, such as median income, house age, and location (see the short hand computation below).
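To make the interpretation concrete, here is a minimal sketch computing R² by hand as 1 − SS_res / SS_tot; the numbers are made up purely for illustration:

import numpy as np

y_true = np.array([3.0, 2.5, 4.0, 3.5])         # illustrative target values
y_pred = np.array([2.8, 2.7, 3.9, 3.6])         # illustrative predictions

ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
print(f"R²: {1 - ss_res / ss_tot:.4f}")          # same definition sklearn's r2_score uses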

Why use both PolynomialFeatures and LinearRegression in the Pipeline:

- The PolynomialFeatures step creates a richer, non-linear feature space.
- LinearRegression then fits a linear model to these expanded features.
- Together they let a linear model approximate non-linear relationships (a sketch of the feature expansion follows this list).
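For a sense of scale, here is a small sketch (dummy zero-valued input, illustrative only) of how many columns the degree-2 expansion produces from this dataset's 8 features:

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

poly = PolynomialFeatures(degree=2)
expanded = poly.fit_transform(np.zeros((1, 8)))   # dummy row with 8 features
print(expanded.shape)   # (1, 45): 1 bias + 8 linear + 8 squared + 28 interaction columns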
In [35]:
def polynomial_regression_model(X_train, X_test, y_train, y_test, degree=2):
    # Create pipeline: scale, expand the features, then fit a linear model
    model = Pipeline([
        ('scaler', StandardScaler()),
        ('poly', PolynomialFeatures(degree=degree)),
        ('linear', LinearRegression())
    ])

    # Fit on the training data
    model.fit(X_train, y_train)

    # Evaluate
    train_pred = model.predict(X_train)
    test_pred = model.predict(X_test)

    print("Training MSE:", mean_squared_error(y_train, train_pred))
    print("Test MSE:", mean_squared_error(y_test, test_pred))

    train_score = model.score(X_train, y_train)
    test_score = model.score(X_test, y_test)
    print(f"Training R² Score: {train_score:.4f}")
    print(f"Test R² Score: {test_score:.4f}")

    return model

# Example usage with housing data (raw features; the pipeline handles scaling)
model_poly = polynomial_regression_model(X_train, X_test, y_train, y_test)
Training MSE: 0.4219834148836872
Test MSE: 0.42363567392919027
Training R² Score: 0.6844
Test R² Score: 0.6763