Overfitting Problem in Machine Learning and Regularization in Linear and Logistic Regression

Well, on ResearchGate and a few other platforms, the topic of the overfitting problem and the process of reducing it in Linear Regression and Logistic Regression is ubiquitous.

To the best of my knowledge, I will try to write about this topic here...

Before writing anything about the reduction process, it is a must to know what overfitting is.

Okay, let's come to the point with an example first.

Suppose you, unfortunately, buy a pair of jeans that is larger than your size, and the shop you bought it from has no exchange or return options. So, finding no other option available, you have to wear that pant and try to fit yourself into that oversized pair of jeans. This can be considered as Overfitting.

Now I would like to describe Overfitting in Machine Learning, and here it is...

When a statistical model is too complex for the data it is trained on (for example, it has too many features), it starts to memorize the training examples, including their noise, instead of learning the general pattern. When that happens, overfitting occurs in the model, or in other words, the model is considered an overfitted model.

Let's look at a mathematical expression:

If we consider one second-order (quadratic) polynomial and one fourth-order polynomial expression,

hθ(x) = θ0 + θ1x + θ2x²         ........................................ (i)
hθ(x) = θ0 + θ1x + θ2x² + θ3x³ + θ4x⁴  ................................ (ii)

Where,
y = output or target variable
x = input variable (feature)
θ = parameters of the hypothesis hθ(x)


Let's draw this 👇

[Figure: training data with the quadratic fit from equation (i) on the left and the fourth-order polynomial fit from equation (ii) on the right]
Here, the left-hand graph is for equation (i) and the right-hand graph is for equation (ii). Looking at the two graphs, the second-order (quadratic) curve is a better fit to the underlying trend of the data than the fourth-order curve, even though the fourth-order curve passes through the training points perfectly.

Now coming to the explanation of why...

Because if we have too many features, the learned hypothesis (h) may fit the training set very well but fail to generalize to new examples. Overfitting is usually caused by an overly complicated function that creates many unnecessary curves and angles unrelated to the underlying data.
In other words,
While a model is being trained, it may start learning from the noise and inaccuracies in the data set. The model then cannot capture the true pattern correctly because it has absorbed too many irrelevant details and too much noise.

In a nutshell, an overfitted model performs well on the training data, but it has poor generalization quality when predicting new data.
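
To see this concretely, here is a minimal sketch (my own addition, not from the original post) that fits a quadratic and a ninth-order polynomial to a small noisy dataset with NumPy; the data, random seed, and degree values are arbitrary choices for illustration:

```python
# Illustrative sketch of overfitting: compare a 2nd-order and a 9th-order
# polynomial fit on a small noisy training set, then check error on new data.
import numpy as np

rng = np.random.default_rng(0)

# Small training set: a quadratic trend plus noise
x_train = np.linspace(-3, 3, 10)
y_train = 1.0 + 0.5 * x_train + 0.8 * x_train**2 + rng.normal(0, 2.0, size=x_train.shape)

# Fresh test points drawn from the same underlying trend
x_test = np.linspace(-3, 3, 50)
y_test = 1.0 + 0.5 * x_test + 0.8 * x_test**2 + rng.normal(0, 2.0, size=x_test.shape)

for degree in (2, 9):
    coeffs = np.polyfit(x_train, y_train, deg=degree)   # fit polynomial of this degree
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree {degree}: train MSE = {train_mse:.2f}, test MSE = {test_mse:.2f}")

# Typically the 9th-order fit has near-zero training error but a much larger
# test error than the 2nd-order fit: good memorization, poor generalization.
```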


Hope the brief description of overfitting above helps you understand what overfitting is and how and why it occurs.

Here are the two main options to address the overfitting or high variance issue:

1) Reduce the Number of Features:

· You can manually select which features to keep.

· Use a model selection algorithm.

2) Regularization:

· Here, you keep all the features, but you reduce the magnitudes/values of the parameters θj.

· This works well when you have a lot of slightly useful features, meaning each of those features can contribute a bit to predicting the target variable y (a small sketch contrasting the two options follows this list).
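
As a rough sketch of what these two options can look like in practice (my own addition, assuming scikit-learn is available; the dataset and hyperparameter values are arbitrary):

```python
# Option 1 vs. Option 2 on a synthetic dataset with many weak features.
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.pipeline import make_pipeline

X, y = make_regression(n_samples=60, n_features=40, n_informative=5,
                       noise=10.0, random_state=0)

# Option 1: reduce the number of features, then fit plain linear regression
option1 = make_pipeline(SelectKBest(f_regression, k=5), LinearRegression())

# Option 2: keep all features but shrink the parameters with L2 regularization
option2 = Ridge(alpha=1.0)   # alpha plays the role of the regularization parameter λ

option1.fit(X, y)
option2.fit(X, y)
```

In practice you would compare the two with cross-validation rather than training error alone.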


Okay, a bit tough to understand? No problem, have patience... 😉

I am here to explain how regularization works in the cost function. Hopefully you can then understand it clearly.

Cost Function:

If we get an overfitting problem from our hypothesis function (h), we can reduce the influence of some of the terms in our function by increasing their cost in the cost function.

So, (ii) can be made closer to quadratic if we dismiss the influence of the θ3x³ and θ4x⁴ terms:

θ0 + θ1x + θ2x² + θ3x³ + θ4x⁴  ................................ (ii)



But the point is: how is that possible?

If we penalize the parameters θ3 and θ4 so that their values become very small (near zero),
then the equation effectively becomes θ0 + θ1x + θ2x², which is essentially quadratic.

Again,
Instead of removing the features or changing the form of our hypothesis, we can modify the cost function.


If we add large penalty terms for θ3 and θ4 to the cost function, then minimizing that modified cost function forces θ3 and θ4 to end up near zero.
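
To make the idea concrete, here is a hedged sketch of the kind of modified cost this refers to (in the spirit of the course material; the factor 1000 is just an arbitrarily large illustrative number, not something from the original post):

```latex
\min_{\theta}\;
\frac{1}{2m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^{2}
\;+\; 1000\,\theta_{3}^{2} \;+\; 1000\,\theta_{4}^{2}
```

Because the two penalty terms are so heavily weighted, the only way to keep the overall cost small is to make θ3 and θ4 close to zero, which effectively removes the cubic and quartic terms from the hypothesis.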


Also, we can regularize all of our theta parameters at once by adding a single summation term to the cost function:

J(θ) = (1/2m) [ Σi=1..m (hθ(x(i)) − y(i))²  +  λ Σj=1..n θj² ]

Here, λ (lambda) is the regularization parameter; it determines how strongly the cost of large theta parameters is inflated. The extra summation term smooths the output of our hypothesis function and reduces overfitting by shrinking the parameter values θj.
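
Here is a minimal NumPy sketch (my own addition) of this regularized cost function for linear regression; it assumes X already has a leading column of ones for the intercept, and the variable names are mine:

```python
# Regularized squared-error cost J(θ) for linear regression.
import numpy as np

def regularized_cost(theta, X, y, lam):
    """Squared-error cost plus an L2 penalty that skips theta[0]."""
    m = len(y)
    errors = X @ theta - y                    # h_theta(x) - y for every example
    penalty = lam * np.sum(theta[1:] ** 2)    # do not penalize the bias term θ0
    return (np.sum(errors ** 2) + penalty) / (2 * m)
```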

Advantages of small values for parameters:
 ― “Simpler” hypothesis
 ― Less prone to overfitting


Let's see how this works in Linear and Logistic Regression. First up is Linear Regression. Have a look 👇


Regularized Linear Regression:

To keep θ0 separate from the other parameters, we modify our gradient descent update, because we do not want to penalize θ0:

Repeat {
    θ0 := θ0 − α (1/m) Σi=1..m (hθ(x(i)) − y(i)) x0(i)
    θj := θj − α [ (1/m) Σi=1..m (hθ(x(i)) − y(i)) xj(i) + (λ/m) θj ]        (j = 1, 2, ..., n)
}

Here, hθ(x) = θᵀx.



The term (λ/m)θj performs the regularization here. We can also rewrite the update for j ≥ 1 as:

θj := θj (1 − α λ/m) − α (1/m) Σi=1..m (hθ(x(i)) − y(i)) xj(i)

N.B.: 1 − α(λ/m) is always less than one, so each update shrinks the value of θj by a small amount before applying the usual gradient step.
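
As a quick illustration (my own addition), here is one simultaneous gradient descent update implementing the rules above in NumPy; again, X is assumed to carry a leading column of ones and the names are mine:

```python
# One gradient descent step for regularized linear regression.
import numpy as np

def gradient_descent_step(theta, X, y, alpha, lam):
    """One simultaneous update of all parameters; θ0 is not regularized."""
    m = len(y)
    errors = X @ theta - y                           # shape (m,)
    grad = (X.T @ errors) / m                        # unregularized gradient for every θj
    new_theta = theta - alpha * grad
    new_theta[1:] -= alpha * (lam / m) * theta[1:]   # extra shrinkage term (λ/m)·θj, j ≥ 1
    return new_theta
```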


Regularized Logistic Regression:

To avoid overfitting, we can follow the same process as regularized linear regression to regularize logistic regression.

The equation of the cost function for Logistic Regression:

J(θ) = −(1/m) Σi=1..m [ y(i) log(hθ(x(i))) + (1 − y(i)) log(1 − hθ(x(i))) ]


To regularize this cost function, we add an extra term:

J(θ) = −(1/m) Σi=1..m [ y(i) log(hθ(x(i))) + (1 − y(i)) log(1 − hθ(x(i))) ] + (λ/2m) Σj=1..n θj²

The second sum (the last sum in the equation) explicitly excludes the bias term θ0.
N.B.: the vector θ is indexed from 0 to n, but the regularization sum skips index 0 and runs from 1 to n.


And finally, we get update equations of the same form as in linear regression 👇

θ0 := θ0 − α (1/m) Σi=1..m (hθ(x(i)) − y(i)) x0(i)
θj := θj − α [ (1/m) Σi=1..m (hθ(x(i)) − y(i)) xj(i) + (λ/m) θj ]        (j = 1, 2, ..., n)

Here, hθ(x) = g(θᵀx), where g is the sigmoid function.
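
For completeness, here is a minimal NumPy sketch (my own addition) of the regularized logistic regression cost and gradient described above, with the sigmoid playing the role of g:

```python
# Regularized cross-entropy cost and gradient for logistic regression.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost_and_gradient(theta, X, y, lam):
    """Cross-entropy cost with an L2 penalty that excludes theta[0]."""
    m = len(y)
    h = sigmoid(X @ theta)                                    # hθ(x) = g(θᵀx)
    cost = -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))
    cost += (lam / (2 * m)) * np.sum(theta[1:] ** 2)          # penalty skips θ0
    grad = (X.T @ (h - y)) / m
    grad[1:] += (lam / m) * theta[1:]                         # regularize every θj except the bias
    return cost, grad
```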


Hope this helps you guys!


References:

1. Machine Learning, Coursera course offered by Stanford University (taught by Andrew Ng).


