Out-of-Sample Accuracy Estimation Using Cross-Validation in Python and Scikit-Learn
In this post, we continue from our previous post:
K-Nearest Neighbors Algorithm using Python and Scikit-Learn
Before starting with the implementation, let's discuss a few important points about cross-validation.
- In cross-validation (CV), we split the dataset into k folds (k is generally chosen by the developer).
- Once the k folds are created, each fold in turn serves as the test set while the remaining k-1 folds form the training set.
- Cross-validation can be used to assess average model performance (this post), to select hyperparameters (for example, the optimal number of neighbors k in kNN), or to choose good feature combinations from the available features.
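The fold-splitting described above can be sketched with scikit-learn's `KFold`; the 10-sample toy array below is just an illustration, not the wine data used later in this post:

```python
import numpy as np
from sklearn.model_selection import KFold

# Toy dataset of 10 samples, split into 5 folds of 2 samples each
X = np.arange(10)

kf = KFold(n_splits=5, shuffle=True, random_state=42)
for fold, (train_idx, test_idx) in enumerate(kf.split(X)):
    # Across all iterations, every sample lands in the test set exactly once
    print(f"Fold {fold}: train={train_idx}, test={test_idx}")
```

Note that `shuffle=True` randomizes which samples land in which fold, and `random_state` makes the split reproducible across runs.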
In [1]:
import math
from collections import Counter
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
# making results reproducible
np.random.seed(42)
In [2]:
df = pd.read_csv(
'https://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data', header=None, sep=',')
df.columns = ['CLASS', 'ALCOHOL_LEVEL', 'MALIC_ACID', 'ASH', 'ALCALINITY','MAGNESIUM', 'PHENOLS',
'FLAVANOIDS', 'NON_FLAVANOID_PHENOL', 'PROANTHOCYANINS', 'COLOR_INTENSITY',
'HUE', 'OD280/OD315_DILUTED','PROLINE']
# Let us use only two features : 'ALCOHOL_LEVEL', 'MALIC_ACID' for this problem
df = df[['CLASS', 'ALCOHOL_LEVEL', 'MALIC_ACID']]
df.head()
Out[2]:
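As a preview of where this is heading, the whole procedure (split into folds, train on k-1 folds, test on the held-out fold, average the per-fold accuracies) can be sketched with scikit-learn's `cross_val_score`. The synthetic two-feature, three-class data below is a stand-in for the wine features, used only to keep the sketch self-contained:

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(42)
# Synthetic stand-in for two features: 3 classes of 30 samples each,
# with class means shifted apart so the classes are separable
X = np.vstack([rng.normal(loc=mu, scale=0.5, size=(30, 2)) for mu in (0, 2, 4)])
y = np.repeat([1, 2, 3], 30)

knn = KNeighborsClassifier(n_neighbors=5)
# cv=5 runs 5-fold cross-validation and returns one accuracy per fold
scores = cross_val_score(knn, X, y, cv=5)
print("per-fold accuracy:", scores)
print("mean accuracy:", scores.mean())
```

The mean of the five fold accuracies is the out-of-sample accuracy estimate; the spread across folds gives a rough sense of how stable that estimate is.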