Let me start with an example. Suppose you have built a supervised ML model that recognizes the faces of your family members. You simply feed all your family photos to the model, and at the end you would like to know how much the model has learned from the training process.
After training, the model gives you a function that maps an input (an image) to an output label (a family member). If we represent the ML model with a function 'f', we can write this process as:
y = f(x), where x = input (your family photos) and y = output (the model's prediction of whether the photo shows your father, mother or sister).
Suppose you use 50 images for training and then use the same 50 images to check whether the model predicts the labels correctly. If you find that 10 out of 50 images are predicted incorrectly, you can say that the model has a training error of 10/50 = 20%.
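To make the arithmetic concrete, here is a minimal sketch of how such a training error could be computed. The function name `error_rate` and the label lists are hypothetical and made up for illustration; they only reproduce the 10-out-of-50 scenario described above.

```python
def error_rate(predicted, actual):
    """Return the fraction of examples whose predicted label is wrong."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not actual:
        raise ValueError("need at least one example")
    wrong = sum(p != a for p, a in zip(predicted, actual))
    return wrong / len(actual)


# Hypothetical true labels for the 50 training photos.
actual = ["father"] * 20 + ["mother"] * 15 + ["sister"] * 15

# Hypothetical model output: identical to the truth, except that the
# first 10 photos (all of the father) are mislabeled as the mother.
predicted = list(actual)
for i in range(10):
    predicted[i] = "mother"

training_error = error_rate(predicted, actual)
print(f"training error: {training_error:.0%}")
```

Because the predictions are compared against the same photos the model was trained on, this number only tells you how well the model fits the data it has already seen.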
The intuition behind this number is that the model is not doing well even on the data it was trained on. That means it underfits the data (high bias): the model is too simple to capture the information available in your data.
*NOTE: If your model 'f' is a linear model, then it will be really hard for it to classify the photos, because a linear function is usually too simple for this task.
Let us consider a different scenario. After training, you find that the training error is only 2%, and you are happy that only 1 image out of 50 is wrongly predicted. But you are still suspicious of your model and try a different strategy this time: you capture 10 new photos of your family and give them to the model. This time, all 10 photos are wrongly predicted, which is a 100% error on the new data.
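The two numbers in this scenario can be put side by side with a short sketch. The helper `error_rate` is a hypothetical name, and the error on the new photos is what is usually called the test error; the counts are the ones from the scenario above.

```python
def error_rate(num_wrong, num_total):
    """Return the fraction of examples that were predicted incorrectly."""
    if num_total <= 0:
        raise ValueError("num_total must be positive")
    if not 0 <= num_wrong <= num_total:
        raise ValueError("num_wrong must be between 0 and num_total")
    return num_wrong / num_total


train_error = error_rate(1, 50)   # photos the model was trained on
test_error = error_rate(10, 10)   # new photos the model has never seen
gap = test_error - train_error    # how much worse the model is on new data

print(f"training error: {train_error:.0%}")
print(f"test error:     {test_error:.0%}")
print(f"gap:            {gap:.0%}")
```

A small training error combined with a large gap is the pattern to watch for: the model looks good only on the photos it has already seen.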
What we can see is that the model learned everything from the training data, including details specific to those particular photos, without generalizing to new data (increased variance). We can also say that the model overfits your training data. Learning everything from the training data reduces the bias, but it increases the variance of the model: its predictions depend too heavily on the particular training set it saw.
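The two scenarios can be summarized as a rough rule of thumb, sketched below. The function `diagnose` and its thresholds (`max_train_error`, `max_gap`, both set to 10%) are assumptions chosen only for illustration; what counts as "too high" depends on the problem.

```python
def diagnose(train_error, new_data_error=None,
             max_train_error=0.10, max_gap=0.10):
    """Rough diagnosis from error rates; the thresholds are illustrative only."""
    # High error on the training data itself: the model is too simple.
    if train_error > max_train_error:
        return "underfitting (high bias)"
    # Low training error but much higher error on new data.
    if new_data_error is not None and new_data_error - train_error > max_gap:
        return "overfitting (high variance)"
    return "no obvious problem at these thresholds"


# First scenario: 20% training error, no new photos tried yet.
print(diagnose(0.20))

# Second scenario: 2% training error, 100% error on 10 new photos.
print(diagnose(0.02, 1.00))
```

The check for underfitting only needs the training error, while the check for overfitting needs photos the model was not trained on, which is why the second argument is optional.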