High bias in machine learning results in underfitting, characterized by the model making oversimplified assumptions about the relationships within the data. This leads to subpar performance on both training and test datasets, demonstrating that the model does not possess the necessary complexity to capture the underlying patterns.
In this example, we will compare two approaches to house price prediction:
- Linear Regression: The simple linear model does not adequately capture the complex relationships between features and house prices.
- Polynomial Regression: The PolynomialFeatures step expands the dataset by generating polynomial and interaction terms from the original features. For instance, given two features x1 and x2 with degree 2, it produces x1, x2, x1², x2², and the interaction term x1 * x2. This expansion lets a linear regression model capture non-linear relationships.
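To make this concrete, here is a minimal sketch of the transformation on a toy array (X_toy is illustrative only, not part of the housing data):

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Two toy samples with features x1 and x2 (illustrative values only)
X_toy = np.array([[2.0, 3.0],
                  [4.0, 5.0]])
poly = PolynomialFeatures(degree=2, include_bias=False)
print(poly.fit_transform(X_toy))
# Column order: x1, x2, x1^2, x1*x2, x2^2
print(poly.get_feature_names_out(['x1', 'x2']))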
import pandas as pd
import matplotlib.pyplot as plt
import sklearn.datasets
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
house_price_dataset = sklearn.datasets.fetch_california_housing()
print(house_price_dataset.DESCR)
# Load the dataset into a pandas DataFrame
df_house_data = pd.DataFrame(house_price_dataset.data, columns=house_price_dataset.feature_names)
df_house_data.head()
# Add the target column: median house value per block group (in $100,000s)
df_house_data['price'] = house_price_dataset.target
df_house_data.head()
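As a quick, optional sanity check before modeling, we can confirm the DataFrame's shape (the California housing dataset has 20,640 rows and 8 features, plus the price column we just added) and verify there are no missing values:

# Expect (20640, 9): 8 features plus the price target, with no missing values
print(df_house_data.shape)
print(df_house_data.isnull().sum())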
House Price Prediction Using Linear Regression
# Prepare data
X = df_house_data.drop('price', axis=1)
y = df_house_data['price']
# Split and scale
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)  # random_state fixed for reproducibility
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
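As a quick check, the scaled training features should have approximately zero mean and unit standard deviation; the test set will be close but not exact, since it is scaled with the training statistics:

import numpy as np

# Training features after StandardScaler: mean ≈ 0, std ≈ 1 per column
print(np.round(X_train_scaled.mean(axis=0), 3))
print(np.round(X_train_scaled.std(axis=0), 3))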
# Train linear model
model = LinearRegression()
model.fit(X_train_scaled, y_train)
# Evaluate
train_pred = model.predict(X_train_scaled)
test_pred = model.predict(X_test_scaled)
print("Training MSE:", mean_squared_error(y_train, train_pred))
print("Test MSE:", mean_squared_error(y_test, test_pred))
train_score = model.score(X_train_scaled, y_train)
test_score = model.score(X_test_scaled, y_test)
print(f"Training R² Score: {train_score:.4f}")
print(f"Test R² Score: {test_score:.4f}")
House Price Prediction Using Polynomial Regression
Observations (comparing the polynomial model below with the linear model above):

- The mean squared error decreases on the test set.
- The R² score increases on the test set.
- R² interpretation: an R-squared value of 0.75 indicates that 75% of the variation in house prices is explained by the features included in the model (e.g., square footage, location, and amenities); a manual check of this calculation follows the list below.

Why use both PolynomialFeatures and LinearRegression in the Pipeline:

- The PolynomialFeatures transformation first creates a richer, non-linear feature space.
- LinearRegression then fits a linear model to these expanded features.
- This effectively allows a linear model to approximate non-linear relationships.
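To make the R² interpretation concrete, here is a minimal sketch (reusing y_test, test_pred, and model from the linear model above) that computes R² by hand as 1 - SS_res / SS_tot and compares it with model.score:

import numpy as np

# R² = 1 - SS_res / SS_tot: the fraction of variance in prices the model explains
ss_res = np.sum((y_test - test_pred) ** 2)
ss_tot = np.sum((y_test - y_test.mean()) ** 2)
print("Manual R²:", 1 - ss_res / ss_tot)
print("model.score:", model.score(X_test_scaled, y_test))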
def polynomial_regression_model(X_train, X_test, y_train, y_test, degree=2):
    # Create pipeline: scale, expand to polynomial features, then fit a linear model
    model = Pipeline([
        ('scaler', StandardScaler()),
        ('poly', PolynomialFeatures(degree=degree)),
        ('linear', LinearRegression())
    ])
    # Fit on the training split
    model.fit(X_train, y_train)
    # Evaluate on both splits
    train_pred = model.predict(X_train)
    test_pred = model.predict(X_test)
    print("Training MSE:", mean_squared_error(y_train, train_pred))
    print("Test MSE:", mean_squared_error(y_test, test_pred))
    train_score = model.score(X_train, y_train)
    test_score = model.score(X_test, y_test)
    print(f"Training R² Score: {train_score:.4f}")
    print(f"Test R² Score: {test_score:.4f}")
    return model
# Example usage with housing data; the pipeline handles scaling internally,
# so we pass the unscaled features
model_poly = polynomial_regression_model(X_train, X_test, y_train, y_test)
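To see how bias trades off against variance, we can sweep the polynomial degree. This loop is a sketch assuming the function signature above; at degree 1 it reproduces the underfit linear model, while sufficiently high degrees should eventually overfit (training error keeps dropping while test error rises):

# Compare degrees: more capacity reduces bias, but too much adds variance
for d in [1, 2, 3]:
    print(f"--- degree={d} ---")
    polynomial_regression_model(X_train, X_test, y_train, y_test, degree=d)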