Understanding High Bias in Machine Learning with Real-World Example

High bias in machine learning results in underfitting, characterized by the model making oversimplified assumptions about the relationships within the data. This leads to subpar performance on both training and test datasets, demonstrating that the model does not possess the necessary complexity to capture the underlying patterns.
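Before turning to the housing data, here is a minimal synthetic sketch of the symptom (data and numbers invented purely for illustration): a straight line fitted to data generated from a quadratic function underfits, so the error stays high, and similar, on both the training and test sets.

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = X[:, 0] ** 2 + rng.normal(0, 0.5, size=200)   # quadratic ground truth plus noise

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
line = LinearRegression().fit(X_tr, y_tr)
print("Train MSE:", mean_squared_error(y_tr, line.predict(X_tr)))
print("Test MSE:", mean_squared_error(y_te, line.predict(X_te)))
# Both errors are large and comparable: the signature of high bias (underfitting).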

In this example, we compare two approaches to house price prediction:

  1. Linear Regression: A simple linear model that cannot adequately capture the complex relationships between the features and house prices.
  2. Polynomial Regression: The PolynomialFeatures step enriches the dataset by generating polynomial and interaction terms from the original features. For instance, given two features x1 and x2 and a degree of 2, it produces x1, x2, x1², x2², and the cross-term x1·x2 (see the sketch after this list). This lets a linear regression model capture non-linear relationships.
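To make this concrete, here is a minimal sketch (a made-up single sample with two features) showing exactly which columns PolynomialFeatures generates at degree 2:

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.array([[2.0, 3.0]])                       # one sample: x1 = 2, x2 = 3
poly = PolynomialFeatures(degree=2)
print(poly.fit_transform(X))                     # [[1. 2. 3. 4. 6. 9.]]
print(poly.get_feature_names_out(['x1', 'x2']))
# ['1' 'x1' 'x2' 'x1^2' 'x1 x2' 'x2^2']: bias, linear, squared, and interaction terms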
In [31]:
import pandas as pd

from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
import sklearn.datasets
In [21]:
house_price_dataset = sklearn.datasets.fetch_california_housing()
In [22]:
print(house_price_dataset.DESCR)
.. _california_housing_dataset:

California Housing dataset
--------------------------

**Data Set Characteristics:**

    :Number of Instances: 20640

    :Number of Attributes: 8 numeric, predictive attributes and the target

    :Attribute Information:
        - MedInc        median income in block group
        - HouseAge      median house age in block group
        - AveRooms      average number of rooms per household
        - AveBedrms     average number of bedrooms per household
        - Population    block group population
        - AveOccup      average number of household members
        - Latitude      block group latitude
        - Longitude     block group longitude

    :Missing Attribute Values: None

This dataset was obtained from the StatLib repository.
https://www.dcc.fc.up.pt/~ltorgo/Regression/cal_housing.html

The target variable is the median house value for California districts,
expressed in hundreds of thousands of dollars ($100,000).

This dataset was derived from the 1990 U.S. census, using one row per census
block group. A block group is the smallest geographical unit for which the U.S.
Census Bureau publishes sample data (a block group typically has a population
of 600 to 3,000 people).

A household is a group of people residing within a home. Since the average
number of rooms and bedrooms in this dataset are provided per household, these
columns may take surprisingly large values for block groups with few households
and many empty houses, such as vacation resorts.

It can be downloaded/loaded using the
:func:`sklearn.datasets.fetch_california_housing` function.

.. topic:: References

    - Pace, R. Kelley and Ronald Barry, Sparse Spatial Autoregressions,
      Statistics and Probability Letters, 33 (1997) 291-297

In [23]:
# Loading the dataset to a pandas dataframe
df_house_data = pd.DataFrame(house_price_dataset.data, columns=house_price_dataset.feature_names)
In [24]:
df_house_data.head()
Out[24]:
   MedInc  HouseAge  AveRooms  AveBedrms  Population  AveOccup  Latitude  Longitude
0  8.3252      41.0  6.984127   1.023810       322.0  2.555556     37.88    -122.23
1  8.3014      21.0  6.238137   0.971880      2401.0  2.109842     37.86    -122.22
2  7.2574      52.0  8.288136   1.073446       496.0  2.802260     37.85    -122.24
3  5.6431      52.0  5.817352   1.073059       558.0  2.547945     37.85    -122.25
4  3.8462      52.0  6.281853   1.081081       565.0  2.181467     37.85    -122.25
In [25]:
# add the target column
# target is median house value in block group (in $100,000s).
df_house_data['price'] = house_price_dataset.target
In [26]:
df_house_data.head()
Out[26]:
   MedInc  HouseAge  AveRooms  AveBedrms  Population  AveOccup  Latitude  Longitude  price
0  8.3252      41.0  6.984127   1.023810       322.0  2.555556     37.88    -122.23  4.526
1  8.3014      21.0  6.238137   0.971880      2401.0  2.109842     37.86    -122.22  3.585
2  7.2574      52.0  8.288136   1.073446       496.0  2.802260     37.85    -122.24  3.521
3  5.6431      52.0  5.817352   1.073059       558.0  2.547945     37.85    -122.25  3.413
4  3.8462      52.0  6.281853   1.081081       565.0  2.181467     37.85    -122.25  3.422
House Price Prediction Using Linear Regression
In [34]:
# Prepare data
X = df_house_data.drop('price', axis=1)
y = df_house_data['price']

# Split and scale
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train linear model
model = LinearRegression()
model.fit(X_train_scaled, y_train)

# Evaluate
train_pred = model.predict(X_train_scaled)
test_pred = model.predict(X_test_scaled)

print("Training MSE:", mean_squared_error(y_train, train_pred))
print("Test MSE:", mean_squared_error(y_test, test_pred))

train_score = model.score(X_train_scaled, y_train)
test_score = model.score(X_test_scaled, y_test)

print(f"Training R² Score: {train_score:.4f}")
print(f"Test R² Score: {test_score:.4f}")
Training MSE: 0.5240457125963887
Test MSE: 0.5261093658365182
Training R² Score: 0.6081
Test R² Score: 0.5980
House Price Prediction Using Polynomial Regression
  • Observations (comparing the results below with the linear regression results above):
    • The mean squared error decreases on the test set.

    • The R² score increases on the test set.

      -- R² interpretation: a test R² of about 0.68 means that roughly 68% of the variation in house prices is explained by the model's features, such as median income, house age, and location (see the short hand computation below).
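To make the interpretation concrete, here is a minimal sketch computing R² by hand as 1 − SS_res / SS_tot; the numbers are made up purely for illustration:

import numpy as np

y_true = np.array([3.0, 2.5, 4.0, 3.5])         # illustrative target values
y_pred = np.array([2.8, 2.7, 3.9, 3.6])         # illustrative predictions

ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
print(f"R²: {1 - ss_res / ss_tot:.4f}")          # same definition sklearn's r2_score uses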

Why use both PolynomialFeatures and LinearRegression in the Pipeline:

- The PolynomialFeatures step creates a richer, non-linear feature space.
- LinearRegression then fits a linear model to these expanded features.
- Together they let a linear model approximate non-linear relationships (a sketch of the feature expansion follows this list).
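For a sense of scale, here is a small sketch (dummy zero-valued input, illustrative only) of how many columns the degree-2 expansion produces from this dataset's 8 features:

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

poly = PolynomialFeatures(degree=2)
expanded = poly.fit_transform(np.zeros((1, 8)))   # dummy row with 8 features
print(expanded.shape)   # (1, 45): 1 bias + 8 linear + 8 squared + 28 interaction columns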
In [35]:
def polynomial_regression_model(X_train, X_test, y_train, y_test, degree=2):
    # Create pipeline: scale, expand the features, then fit a linear model
    model = Pipeline([
        ('scaler', StandardScaler()),
        ('poly', PolynomialFeatures(degree=degree)),
        ('linear', LinearRegression())
    ])

    # Fit on the training data
    model.fit(X_train, y_train)

    # Evaluate
    train_pred = model.predict(X_train)
    test_pred = model.predict(X_test)

    print("Training MSE:", mean_squared_error(y_train, train_pred))
    print("Test MSE:", mean_squared_error(y_test, test_pred))

    train_score = model.score(X_train, y_train)
    test_score = model.score(X_test, y_test)
    print(f"Training R² Score: {train_score:.4f}")
    print(f"Test R² Score: {test_score:.4f}")

    return model

# Example usage with housing data (raw features; the pipeline handles scaling)
model_poly = polynomial_regression_model(X_train, X_test, y_train, y_test)
Training MSE: 0.4219834148836872
Test MSE: 0.42363567392919027
Training R² Score: 0.6844
Test R² Score: 0.6763