Author – Bhavesh Kumar
The word regularize means to make things regular or acceptable. Regularization is a technique used to fit a function appropriately on the training set and avoid overfitting by introducing additional information into the cost function. Regularization can be applied to objective functions in optimization problems: the regularization term, or penalty, imposes a cost on the objective function to make the optimal solution unique.
Table of Contents:-
Problem with Overfitting and Underfitting.
Introduction to Regularization.
How can we address and reduce Overfitting?
Different Regularization techniques.
L1 (Lasso) Regularization.
L2 (Ridge) Regularization.
Regularization in Linear Regression.
Regularization in Logistic Regression.
Github URL
References
Problem with Overfitting and Underfitting:-
The situation where a model performs very well on the training data but its performance drops significantly on the test set is called overfitting.
On the other hand, if the model performs poorly on both the training and the test set, we call it an underfitting model.
Bias, sometimes called "error due to squared bias," is the amount by which a model's predictions differ from the target values on the training data. Bias error results from the simplifying assumptions a model makes so that the target function is easier to approximate.
Variance is the variability of the model's prediction for a given data point; it tells us how spread out our predictions are. A model with high variance pays a lot of attention to the training data and does not generalize to data it has not seen before. As a result, such models perform very well on the training data but have high error rates on test data.
![](https://static.wixstatic.com/media/6e3b57_7eeb1b4949bf478c82ca4cecb3ecc8dc~mv2.jpg/v1/fill/w_756,h_307,al_c,q_80,enc_auto/6e3b57_7eeb1b4949bf478c82ca4cecb3ecc8dc~mv2.jpg)
Figure 1
In supervised learning, underfitting happens when a model is unable to capture the underlying pattern of the data. These models usually have high bias and low variance. Underfitting occurs when we have too little data to build an accurate model or when we try to fit a linear model to nonlinear data.
In supervised learning, overfitting happens when our model captures the noise along with the underlying pattern in the data. It occurs when we train the model too much on a noisy dataset. These models have low bias and high variance, and they tend to be very complex.
To build a good model, we need to find a good balance between bias and variance such that it minimizes the total error.
![](https://static.wixstatic.com/media/6e3b57_69f2393d0a5f43278bddad965fff7130~mv2.jpg/v1/fill/w_616,h_354,al_c,q_80,enc_auto/6e3b57_69f2393d0a5f43278bddad965fff7130~mv2.jpg)
Figure 2
An optimal balance of bias and variance would never overfit or underfit the model.
There are two main options to address the issue of overfitting:
1) Reduce the number of features
2) Regularization
How can we address and reduce Overfitting?
To understand how we address and reduce overfitting, let us take an example. Suppose we have two functions, represented by a green curve and a blue curve respectively. Both curves fit the red points so well that we can consider them to have zero loss. From our intuition about overfitting, we may guess that the green curve is the one that overfits.
![](https://static.wixstatic.com/media/6e3b57_08187378cd3e46d493998cf13d24de2b~mv2.jpg/v1/fill/w_701,h_364,al_c,q_80,enc_auto/6e3b57_08187378cd3e46d493998cf13d24de2b~mv2.jpg)
Figure 3
But a question arises: why is the green curve (or any curve similar to it) overfitting the data?
To understand that in a bit more mathematical way, let’s assume the two functions that are used to draw the graph above:
![](https://static.wixstatic.com/media/6e3b57_c5312712ed094ed5895a7f19511a0138~mv2.jpg/v1/fill/w_359,h_105,al_c,q_80,enc_auto/6e3b57_c5312712ed094ed5895a7f19511a0138~mv2.jpg)
The green curve:
The blue curve:
If we look at each function's equation, we will find that the green curve has much larger coefficients, and that is the primary cause of overfitting. As mentioned before, overfitting can be interpreted as your model fitting the dataset so well that it seems to memorize the data we showed it rather than actually learn from it. Intuitively, large coefficients can be seen as evidence of memorizing the data.
Another case is adding too many features: with too many features the model is again punished with overfitting, so we also have to avoid selecting too many features.
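To see this intuition in a quick experiment, here is a minimal sketch (the sample points and degrees are made up for illustration, not taken from the figures above) that fits a low-degree and a high-degree polynomial with NumPy and prints their coefficients; the high-degree fit passes almost exactly through every point but tends to have much larger coefficients.

```python
import numpy as np

# A few noisy sample points (illustrative values, not from the figure above)
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([0.1, 0.9, 2.2, 2.8, 4.1, 5.2])

# Low-degree fit: small coefficients, captures the general trend
low = np.polyfit(x, y, deg=1)

# High-degree fit: passes (almost) exactly through every point,
# but typically at the cost of much larger coefficients
# (NumPy may warn that such a fit is poorly conditioned)
high = np.polyfit(x, y, deg=5)

print("degree-1 coefficients:", low)
print("degree-5 coefficients:", high)
```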
Regularization Term or Cost Function:-
Here we take the mean squared error cost function of our hypothesis, which we use in regression, and add a penalty to it; this penalty term is the regularization that helps us overcome overfitting.
Mathematically,
![](https://static.wixstatic.com/media/6e3b57_04ce5749eea14f0daabe378f4e3cf67f~mv2.jpg/v1/fill/w_286,h_94,al_c,q_80,enc_auto/6e3b57_04ce5749eea14f0daabe378f4e3cf67f~mv2.jpg)
Where
![](https://static.wixstatic.com/media/6e3b57_34f39be5af014ee3a4c53caf064266f8~mv2.jpg/v1/fill/w_214,h_118,al_c,q_80,enc_auto/6e3b57_34f39be5af014ee3a4c53caf064266f8~mv2.jpg)
is the learned prediction for the i-th input x^(i).
y^(i) is the actual target value.
m is the total number of input samples.
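As a small illustration, here is a sketch in NumPy of such a regularized cost: the plain mean squared error plus a penalty term (shown here with the L2 penalty used later in the article; the function and argument names such as `lam` are just placeholders).

```python
import numpy as np

def regularized_cost(theta, X, y, lam):
    """Mean squared error cost plus an L2 penalty on the parameters.

    theta: parameter vector, X: (m, n) input matrix, y: (m,) targets,
    lam: regularization strength (placeholder name for lambda).
    """
    m = len(y)
    predictions = X @ theta                            # h_theta(x^(i)) for each sample
    mse = np.sum((predictions - y) ** 2) / (2 * m)     # unregularized cost
    penalty = lam * np.sum(theta[1:] ** 2) / (2 * m)   # penalty term, theta_0 excluded
    return mse + penalty
```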
Different Regularization techniques:-
There are two main types of regularization techniques, as follows.
L1(Lasso) Regularization.
L2 (Ridge) Regularization.
L1 (Lasso) Regularization:-
"LASSO" stands for Least Absolute Shrinkage and Selection Operator. It is used on top of regression methods for more accurate prediction. This model uses shrinkage, where data values are shrunk towards a central point, such as the mean.
Lasso regression uses the L1 regularization technique. It is useful when we have a large number of features, because it automatically performs feature selection.
Mathematical equation of Lasso:-
Residual sum of squares (RSS) + λ * (Sum of the absolute value of the magnitude of coefficients)
![](https://static.wixstatic.com/media/6e3b57_b4d683f7eb904a31bc0efa5c4456d961~mv2.jpg/v1/fill/w_430,h_95,al_c,q_80,enc_auto/6e3b57_b4d683f7eb904a31bc0efa5c4456d961~mv2.jpg)
Where,
λ denotes the amount of shrinkage.
λ = 0 implies all features are considered; this is equivalent to plain linear regression, where only the residual sum of squares is used to build the predictive model.
λ = ∞ implies no feature is considered, i.e., as λ approaches infinity it eliminates more and more features.
Bias increases as λ increases.
Variance increases as λ decreases.
Scikit learn implementation of Lasso:-
![](https://static.wixstatic.com/media/6e3b57_927b94a56cd44d38b49d3daa9af45b65~mv2.jpg/v1/fill/w_949,h_60,al_c,q_80,enc_auto/6e3b57_927b94a56cd44d38b49d3daa9af45b65~mv2.jpg)
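In case the screenshot above is hard to read, here is a minimal, self-contained sketch of fitting a Lasso model with scikit-learn (the synthetic data and the `alpha` value are illustrative, not the ones used in the screenshot):

```python
import numpy as np
from sklearn.linear_model import Lasso

# Illustrative data: 100 samples, 5 features, only the first two are informative
rng = np.random.RandomState(0)
X = rng.randn(100, 5)
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 0.1 * rng.randn(100)

# alpha plays the role of the shrinkage parameter lambda above
lasso = Lasso(alpha=0.1)
lasso.fit(X, y)

print("coefficients:", lasso.coef_)   # uninformative features shrink toward (or to) 0
print("intercept:", lasso.intercept_)
```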
L2 (Ridge) Regularization:-
L2 regularization adds the "squared magnitude" of the coefficients as a penalty term to the loss function and scales it by some number λ. This is called the Ridge penalty.
L2 regularization is also known as weight decay as it forces the weights to decay towards zero (but not exactly zero).
Mathematically,
![](https://static.wixstatic.com/media/6e3b57_4899b21894364a92b35023b59850bbd3~mv2.jpg/v1/fill/w_557,h_127,al_c,q_80,enc_auto/6e3b57_4899b21894364a92b35023b59850bbd3~mv2.jpg)
where
J(θ) = Cost Function
θ = Parameter
x(i) = Features
hθ(x(i)) = Hypothesis function
y(i) = Actual or target values
λ = Regularization constant
The difference between Ridge and Lasso is that Lasso tends to shrink coefficients to exactly zero, whereas Ridge never sets the value of a coefficient to exactly zero.
Scikit learn implementation of Ridge:-
![](https://static.wixstatic.com/media/6e3b57_9702ead1b25e42ee8ee1c09668d25216~mv2.jpg/v1/fill/w_819,h_44,al_c,q_80,enc_auto/6e3b57_9702ead1b25e42ee8ee1c09668d25216~mv2.jpg)
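Again, as a hedge against the screenshot being unreadable, here is a minimal sketch of fitting a Ridge model with scikit-learn (the same illustrative data as in the Lasso sketch; `alpha` corresponds to the constant λ above):

```python
import numpy as np
from sklearn.linear_model import Ridge

# Same illustrative setup as in the Lasso sketch above
rng = np.random.RandomState(0)
X = rng.randn(100, 5)
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 0.1 * rng.randn(100)

# alpha again controls the strength of the (here squared) penalty
ridge = Ridge(alpha=1.0)
ridge.fit(X, y)

# Unlike Lasso, Ridge shrinks coefficients toward zero but not exactly to zero
print("coefficients:", ridge.coef_)
print("intercept:", ridge.intercept_)
```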
Regularization in Linear Regression:-
First of all, we introduce gradient descent.
Mathematically,
Repeat until convergence to reach the extremum:
![](https://static.wixstatic.com/media/6e3b57_c54ecfd974304e2286370e349bab01d9~mv2.jpg/v1/fill/w_272,h_66,al_c,q_80,enc_auto/6e3b57_c54ecfd974304e2286370e349bab01d9~mv2.jpg)
We can apply regularization to both linear regression and logistic regression. We will approach linear regression first.
We will modify our gradient descent function to separate out
![](https://static.wixstatic.com/media/6e3b57_febbec720eef4c2b9d5d33af99d9963c~mv2.png/v1/fill/w_77,h_86,al_c,q_85,enc_auto/6e3b57_febbec720eef4c2b9d5d33af99d9963c~mv2.png)
from the rest of the parameters, because we do not want to penalize
![](https://static.wixstatic.com/media/6e3b57_febbec720eef4c2b9d5d33af99d9963c~mv2.png/v1/fill/w_77,h_86,al_c,q_85,enc_auto/6e3b57_febbec720eef4c2b9d5d33af99d9963c~mv2.png)
![](https://static.wixstatic.com/media/6e3b57_c193438d9af24470a749334586c6824f~mv2.jpg/v1/fill/w_677,h_189,al_c,q_80,enc_auto/6e3b57_c193438d9af24470a749334586c6824f~mv2.jpg)
Here, in this particular gradient descent, our hypothesis function is the best-fit line, i.e.
![](https://static.wixstatic.com/media/6e3b57_e1c30199991d4cd78da38645e57a4cf7~mv2.jpg/v1/fill/w_158,h_48,al_c,q_80,enc_auto/6e3b57_e1c30199991d4cd78da38645e57a4cf7~mv2.jpg)
where,
k = slope of the best-fit line
x = input data point
c = intercept of the best-fit line
α = Learning Rate
The term
![](https://static.wixstatic.com/media/6e3b57_178d85b15e674b6da967fb57fff9d88c~mv2.jpg/v1/fill/w_54,h_44,al_c,q_80,enc_auto/6e3b57_178d85b15e674b6da967fb57fff9d88c~mv2.jpg)
performs our regularization
By calculation,
![](https://static.wixstatic.com/media/6e3b57_fd8217edf62b4223a6aca98a92fb74ca~mv2.jpg/v1/fill/w_551,h_72,al_c,q_80,enc_auto/6e3b57_fd8217edf62b4223a6aca98a92fb74ca~mv2.jpg)
The first term in the above equation
![](https://static.wixstatic.com/media/6e3b57_8c67b33069a843d4b5ad25ceff5b8674~mv2.jpg/v1/fill/w_95,h_56,al_c,q_80,enc_auto/6e3b57_8c67b33069a843d4b5ad25ceff5b8674~mv2.jpg)
will always be less than 1.
We can see this as reducing the value of θj by some amount on every update. The second term is now exactly the same as it was before.
After each iteration, we converge toward the extremum.
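The regularized update described above can be sketched in NumPy as follows (vectorized over the parameters, with θ0 left unpenalized; `alpha` and `lam` are placeholder names for the learning rate and the regularization constant):

```python
import numpy as np

def gradient_descent_step(theta, X, y, alpha, lam):
    """One regularized gradient descent update for linear regression.

    theta_0 (index 0) is not penalized; alpha is the learning rate and
    lam the regularization constant (illustrative names).
    """
    m = len(y)
    error = X @ theta - y                # h_theta(x^(i)) - y^(i)
    grad = (X.T @ error) / m             # unregularized gradient
    theta_new = theta - alpha * grad
    # shrink every parameter except theta_0 by the factor (1 - alpha*lam/m)
    theta_new[1:] -= alpha * (lam / m) * theta[1:]
    return theta_new
```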
Regularization in Logistic Regression:-
We regularize logistic regression in a similar way to how we regularized linear regression. In the figure below we have the underfitting, optimal, and overfitting cases.
![](https://static.wixstatic.com/media/6e3b57_4e84c38a39624ae2a27d93b8377e4adf~mv2.jpg/v1/fill/w_738,h_363,al_c,q_80,enc_auto/6e3b57_4e84c38a39624ae2a27d93b8377e4adf~mv2.jpg)
Figure 4
Here, the hypothesis is slightly different from the hypothesis of linear regression, because here we classify the data points into binary or multiple classes. So our hypothesis must output values between 0 and 1, which is the necessary condition for classification in logistic regression.
Mathematically, our hypothesis will be as follows.
Here we use the "Sigmoid Function," also called the "Logistic Function"
![](https://static.wixstatic.com/media/6e3b57_38ff74d84d2a4751acd6592e3d252f83~mv2.jpg/v1/fill/w_202,h_134,al_c,q_80,enc_auto/6e3b57_38ff74d84d2a4751acd6592e3d252f83~mv2.jpg)
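For reference, here is a minimal NumPy sketch of the sigmoid and the resulting logistic-regression hypothesis:

```python
import numpy as np

def sigmoid(z):
    """Logistic (sigmoid) function: maps any real value into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def hypothesis(theta, X):
    """Logistic regression hypothesis h_theta(x) = sigmoid(theta^T x)."""
    return sigmoid(X @ theta)
```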
Cost Function in Logistic Regression:-
![](https://static.wixstatic.com/media/6e3b57_7eb3ed60348d4acc82e1eb1e50133ca8~mv2.jpg/v1/fill/w_496,h_131,al_c,q_80,enc_auto/6e3b57_7eb3ed60348d4acc82e1eb1e50133ca8~mv2.jpg)
![](https://static.wixstatic.com/media/6e3b57_8abd0b06c2da43b2978067dd7e1b31f6~mv2.jpg/v1/fill/w_707,h_40,al_c,q_80,enc_auto/6e3b57_8abd0b06c2da43b2978067dd7e1b31f6~mv2.jpg)
We can regularize this equation by adding a term to the end:
![](https://static.wixstatic.com/media/6e3b57_4426cf43504d451ca86362ef7ebd2ac7~mv2.jpg/v1/fill/w_745,h_39,al_c,q_80,enc_auto/6e3b57_4426cf43504d451ca86362ef7ebd2ac7~mv2.jpg)
In the second sum,
![](https://static.wixstatic.com/media/6e3b57_43a68c13754e4af39945dd8490980326~mv2.jpg/v1/fill/w_110,h_37,al_c,q_80,enc_auto/6e3b57_43a68c13754e4af39945dd8490980326~mv2.jpg)
we exclude the term θ0, and then we update the complete equation as usual.
Gradient Descent:-
The derivative of the cost function changes as follows:
![](https://static.wixstatic.com/media/6e3b57_8342119aea6c4f9baed65f01dbfcf775~mv2.jpg/v1/fill/w_692,h_130,al_c,q_80,enc_auto/6e3b57_8342119aea6c4f9baed65f01dbfcf775~mv2.jpg)
The first term of the cost function remains the same, and so does the first term of the derivative. Taking the derivative of the second (penalty) term adds one more term.
![](https://static.wixstatic.com/media/6e3b57_27e1a978c5d140a39179277959b0066c~mv2.jpg/v1/fill/w_711,h_118,al_c,q_80,enc_auto/6e3b57_27e1a978c5d140a39179277959b0066c~mv2.jpg)
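Putting the pieces together, one regularized gradient descent step for logistic regression can be sketched in NumPy as follows; it mirrors the linear-regression update above, except that the hypothesis is the sigmoid of θᵀx (again, `alpha` and `lam` are placeholder names):

```python
import numpy as np

def logistic_gradient_step(theta, X, y, alpha, lam):
    """One regularized gradient descent update for logistic regression.

    Identical in form to the linear-regression update, except that the
    hypothesis is the sigmoid of theta^T x. theta_0 is not penalized.
    """
    m = len(y)
    h = 1.0 / (1.0 + np.exp(-(X @ theta)))   # sigmoid hypothesis
    grad = (X.T @ (h - y)) / m               # unregularized gradient
    theta_new = theta - alpha * grad
    theta_new[1:] -= alpha * (lam / m) * theta[1:]   # penalty term, skipping theta_0
    return theta_new
```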
Github URL:-
References:-
https://www.skillbasics.com/courses/machine-learning-for-beginners
https://www.coursera.org/learn/machine-learning/resources/Zi29t
https://machinelearningmedium.com/2017/09/15/regularized-logistic-regression/
https://towardsdatascience.com/regularization-in-machine-learning-76441ddcf99a
https://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat841f11#Over-fitting_and_Under-fitting